
Dataproc MCP

@dipseth · 9 months ago · MIT License
Free · Community · AI Systems
Private MCP Dataproc server repository

Overview

What is dataproc-mcp

dataproc-mcp is a production-ready Model Context Protocol (MCP) server designed for Google Cloud Dataproc operations. It features intelligent parameter injection, enterprise-grade security, and comprehensive tooling, making it suitable for integration with development environments like Roo (VS Code).

Use cases

Use cases for dataproc-mcp include running data processing jobs, managing cloud resources efficiently, and integrating with development tools for enhanced productivity in data-centric applications.

How to use

To use dataproc-mcp, you can integrate it with Roo (VS Code) by adding specific configurations to your MCP settings. Alternatively, you can install it globally using npm and start the server directly from the command line.

Key features

Key features of dataproc-mcp include intelligent parameter injection, enterprise-grade security, compatibility with MCP, and support for TypeScript. It also provides comprehensive tooling for efficient operation management.

Where to use

dataproc-mcp is primarily used in cloud computing environments, particularly for data processing tasks on Google Cloud Dataproc. It is suitable for enterprises that require robust data handling and processing capabilities.

Content

Dataproc MCP Server

[Badges: npm version · npm downloads · Build Status · Coverage Status · License: MIT · Node.js Version · TypeScript · MCP Compatible]

A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).

🚀 Quick Start

Recommended: Roo (VS Code) Integration

Add this to your Roo MCP settings:

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": [
        "@dipseth/dataproc-mcp-server@latest"
      ],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

With Custom Config File

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": [
        "@dipseth/dataproc-mcp-server@latest"
      ],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}

Alternative: Global Installation

# Install globally
npm install -g @dipseth/dataproc-mcp-server

# Start the server
dataproc-mcp-server

# Or run directly
npx @dipseth/dataproc-mcp-server@latest

5-Minute Setup

  1. Install the package:

    npm install -g @dipseth/dataproc-mcp-server@latest
    
  2. Run the setup:

    dataproc-mcp --setup
    
  3. Configure authentication:

    # Edit the generated config file
    nano config/server.json
    
  4. Start the server:

    dataproc-mcp
    

✨ Features

🎯 Core Capabilities

  • 22 Production-Ready MCP Tools - Complete Dataproc management suite
  • 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
  • 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
  • 🔄 Generic Type Conversion System - Automatic, type-safe data transformations
  • 60-80% Parameter Reduction - Intelligent default injection
  • Multi-Environment Support - Dev/staging/production configurations
  • Service Account Impersonation - Enterprise authentication
  • Real-time Job Monitoring - Comprehensive status tracking

🚀 Response Optimization

  • 96.2% Token Reduction - list_clusters: 7,651 → 292 tokens
  • Automatic Qdrant Storage - Full data preserved and searchable
  • Resource URI Access - dataproc://responses/clusters/list/abc123
  • Graceful Fallback - Works without Qdrant, falls back to full responses
  • 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
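
The idea behind the optimizer can be sketched in a few lines: summarize the verbose response, stash the full payload under a resource URI, and degrade gracefully when storage is unavailable. This is an illustrative sketch only; the names (`optimizeResponse`, the in-memory `store`) are hypothetical, not the server's actual API, and a `Map` stands in for Qdrant.

```typescript
// Hypothetical sketch of response optimization: compact summary + resource
// URI, with graceful fallback. A Map stands in for Qdrant storage.

interface ClusterSummary { name: string; status: string }

// Stand-in for Qdrant: an in-memory store keyed by resource id.
const store = new Map<string, unknown>();

function optimizeResponse(
  clusters: Array<{ clusterName: string; status: { state: string }; config: unknown }>,
  storageAvailable = true
): { summary: ClusterSummary[]; resourceUri?: string } {
  // The compact summary keeps only the fields a client usually needs.
  const summary = clusters.map((c) => ({ name: c.clusterName, status: c.status.state }));
  if (!storageAvailable) {
    // Graceful fallback: without storage there is no URI to hand back
    // (the real server falls back to returning the full response).
    return { summary };
  }
  const id = Math.random().toString(36).slice(2, 8);
  store.set(id, clusters); // full data preserved and retrievable later
  return { summary, resourceUri: `dataproc://responses/clusters/list/${id}` };
}
```

The token savings come from the summary being a small, fixed projection of the response, while the resource URI keeps the full data one lookup away.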

🔄 Generic Type Conversion System

  • 75% Code Reduction - Eliminates manual conversion logic across services
  • Type-Safe Transformations - Automatic field detection and mapping
  • Intelligent Compression - Field-level compression with configurable thresholds
  • 0.50ms Conversion Times - Lightning-fast processing with 100% compression ratios
  • Zero-Configuration - Works automatically with existing TypeScript types
  • Backward Compatible - Seamless integration with existing functionality
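
The conversion pattern described above can be sketched as follows: walk an object's fields, compress any string field above a size threshold, and record which fields were compressed so the transform stays reversible. The function names and payload shape here are hypothetical, chosen only to illustrate the field-detection and compression idea.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical sketch of threshold-based field compression with automatic
// field detection, in the spirit of the generic type conversion system.

interface ConvertedPayload {
  fields: Record<string, unknown>;
  compressedFields: string[]; // which fields need decompression on read
}

function convert(source: Record<string, unknown>, thresholdBytes = 1024): ConvertedPayload {
  const fields: Record<string, unknown> = {};
  const compressedFields: string[] = [];
  for (const [key, value] of Object.entries(source)) {
    if (typeof value === "string" && Buffer.byteLength(value) > thresholdBytes) {
      // Large string fields are gzip-compressed and base64-encoded.
      fields[key] = gzipSync(value).toString("base64");
      compressedFields.push(key);
    } else {
      fields[key] = value; // small or non-string fields pass through untouched
    }
  }
  return { fields, compressedFields };
}

function restore(payload: ConvertedPayload): Record<string, unknown> {
  const out: Record<string, unknown> = { ...payload.fields };
  for (const key of payload.compressedFields) {
    out[key] = gunzipSync(Buffer.from(out[key] as string, "base64")).toString();
  }
  return out;
}
```

Recording the compressed field names in the payload is what makes the transform backward compatible: readers that know the list can restore the original object exactly, and small fields are never touched.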

🔒 Enterprise Security

  • Input Validation - Zod schemas for all 22 tools
  • Rate Limiting - Configurable abuse prevention
  • Credential Management - Secure handling and rotation
  • Audit Logging - Comprehensive security event tracking
  • Threat Detection - Injection attack prevention
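
The server validates tool input with Zod; as a dependency-free sketch, the same shape of check for a hypothetical `delete_cluster` input looks like this (the field names and regex are illustrative assumptions, not the server's actual schema):

```typescript
// Hypothetical sketch of strict input validation: every field must be a
// non-empty string matching a Dataproc-style naming allow-list.

interface DeleteClusterInput { projectId: string; region: string; clusterName: string }

const NAME_RE = /^[a-z][a-z0-9-]{0,51}$/; // lowercase letter, then letters/digits/hyphens

function validateDeleteCluster(input: unknown): DeleteClusterInput {
  if (typeof input !== "object" || input === null) throw new Error("input must be an object");
  const { projectId, region, clusterName } = input as Record<string, unknown>;
  for (const [field, value] of Object.entries({ projectId, region, clusterName })) {
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`${field} must be a non-empty string`);
    }
    if (!NAME_RE.test(value)) {
      // Rejecting anything outside a strict allow-list also blocks
      // injection payloads (quotes, semicolons, path traversal, ...).
      throw new Error(`${field} contains disallowed characters`);
    }
  }
  return { projectId: projectId as string, region: region as string, clusterName: clusterName as string };
}
```

Allow-list validation like this is why schema checks double as injection prevention: malicious metacharacters never reach the layer that builds API calls.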

📊 Quality Assurance

  • 90%+ Test Coverage - Comprehensive test suite
  • Performance Monitoring - Configurable thresholds
  • Multi-Environment Testing - Cross-platform validation
  • Automated Quality Gates - CI/CD integration
  • Security Scanning - Vulnerability management

🚀 Developer Experience

  • 5-Minute Setup - Quick start guide
  • Interactive Documentation - HTML docs with examples
  • Comprehensive Examples - Multi-environment configs
  • Troubleshooting Guides - Common issues and solutions
  • IDE Integration - TypeScript support

🛠️ Complete MCP Tools Suite (22 Tools)

🔄 Enhanced with Generic Type Conversion: All tools now benefit from automatic, type-safe data transformations with intelligent compression and field mapping.

🚀 Cluster Management (8 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| start_dataproc_cluster | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
| create_cluster_from_yaml | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
| create_cluster_from_profile | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
| list_clusters | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
| list_tracked_clusters | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
| get_cluster | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
| delete_cluster | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
| get_zeppelin_url | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |

💼 Job Management (7 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| submit_hive_query | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
| submit_dataproc_job | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, local file staging |
| cancel_dataproc_job | Cancel running or pending jobs | ✅ Job ID only needed | Emergency cancellation, cost control |
| get_job_status | Get job execution status | ✅ Job ID only needed | Real-time monitoring |
| get_job_results | Get job outputs and results | ✅ Auto-pagination | Result formatting |
| get_query_status | Get Hive query status | ✅ Minimal params | Query tracking |
| get_query_results | Get Hive query results | ✅ Smart pagination | Enhanced async support |

📋 Configuration & Profiles (3 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| list_profiles | List available cluster profiles | ✅ Category filtering | 8 production profiles |
| get_profile | Get detailed profile configuration | ✅ Profile ID only | Template access |
| query_cluster_data | Query stored cluster data | ✅ Natural language | Semantic search |

📊 Analytics & Insights (4 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| check_active_jobs | Quick status of all active jobs | ✅ No params needed | Multi-project view |
| get_cluster_insights | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
| get_job_analytics | Job performance analytics | ✅ Success rates | Error patterns, metrics |
| query_knowledge | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |
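
Over the wire, each of these tools is invoked with a standard MCP `tools/call` request (this request envelope is defined by the MCP specification; the empty `arguments` object reflects that `list_clusters` needs no parameters thanks to smart defaults):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "list_clusters",
    "arguments": {}
  }
}
```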

🎯 Key Capabilities

  • 🧠 Semantic Search: Natural language queries with Qdrant integration
  • ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
  • 📊 Response Optimization: 96% token reduction with full data preservation
  • 🔄 Async Support: Non-blocking job submission and monitoring
  • 🏷️ Profile System: 8 production-ready cluster templates
  • 📈 Analytics: Comprehensive insights and performance tracking

📋 Configuration

Project-Based Configuration

The server supports a project-based configuration format:

# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration

Authentication Methods

  1. Service Account Impersonation (Recommended)
  2. Direct Service Account Key
  3. Application Default Credentials
  4. Hybrid Authentication with fallbacks
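
As a sketch, a config file pointed to by `DATAPROC_CONFIG_PATH` that enables impersonation with an ADC fallback might look like the following. The field names here are hypothetical, shown only to illustrate the layering of methods 1 and 3; consult the generated `config/server.json` for the actual schema:

```json
{
  "authentication": {
    "method": "service_account_impersonation",
    "impersonateServiceAccount": "dataproc-runner@my-project.iam.gserviceaccount.com",
    "fallback": "application_default_credentials"
  }
}
```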

📚 Documentation

🔧 MCP Client Integration

Claude Desktop

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": [
        "@dataproc/mcp-server"
      ],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

Roo (VS Code)

{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": [
        "@dataproc/mcp-server"
      ],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│  Dataproc MCP    │────│  Google Cloud   │
│  (Claude/Roo)   │    │     Server       │    │    Dataproc     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │
                       ┌──────┴──────┐
                       │   Features  │
                       ├─────────────┤
                       │ • Security  │
                       │ • Profiles  │
                       │ • Validation│
                       │ • Monitoring│
                       │ • Generic   │
                       │   Converter │
                       └─────────────┘

🔄 Generic Type Conversion System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Source Types   │────│ Generic Converter│────│ Qdrant Payloads │
│ • ClusterData   │    │    System        │    │ • Compressed    │
│ • QueryResults  │    │                  │    │ • Type-Safe     │
│ • JobData       │    │ ┌──────────────┐ │    │ • Optimized     │
└─────────────────┘    │ │Field Analyzer│ │    └─────────────────┘
                       │ │Transformation│ │
                       │ │Engine        │ │
                       │ │Compression   │ │
                       │ │Service       │ │
                       │ └──────────────┘ │
                       └──────────────────┘

🚦 Performance

Response Time Achievements

  • Schema Validation: ~2ms (target: <5ms) ✅
  • Parameter Injection: ~1ms (target: <2ms) ✅
  • Generic Type Conversion: ~0.50ms (target: <2ms) ✅
  • Credential Validation: ~25ms (target: <50ms) ✅
  • MCP Tool Call: ~50ms (target: <100ms) ✅

Throughput Achievements

  • Schema Validation: ~2000 ops/sec ✅
  • Parameter Injection: ~5000 ops/sec ✅
  • Generic Type Conversion: ~2000 ops/sec ✅
  • Credential Validation: ~200 ops/sec ✅
  • MCP Tool Call: ~100 ops/sec ✅

Compression Achievements

  • Field-Level Compression: Up to 100% compression ratios ✅
  • Memory Optimization: 30-60% reduction in memory usage ✅
  • Type Safety: Zero runtime type errors with automatic validation ✅

🧪 Testing

# Run all tests
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance

# Run with coverage
npm run test:coverage

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Start development server
npm run dev

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🏆 Acknowledgments


Made with ❤️ for the MCP and Google Cloud communities
