CodebaseMCP

4 MIT

CodebaseMCP analyzes Python code using AST and stores data in Weaviate.

What is CodebaseMCP?

CodebaseMCP is a Python code analysis system. It parses source files into Abstract Syntax Trees (AST), extracts information about code elements, and stores it in a Weaviate vector database. It exposes querying and code-understanding tools through a Model Context Protocol (MCP) server and uses Google's Gemini models to generate embeddings and natural language descriptions.

Use cases

Typical uses are documenting large codebases, answering specific questions about what code does, visualizing call relationships, and helping new developers or stakeholders understand an unfamiliar codebase.

How to use

To use CodebaseMCP, start Weaviate, add your Gemini API key to the .env file, and use the same file to enable or disable the optional LLM features. Start the MCP server, then run the scan_codebase tool to analyze Python files and populate the Weaviate database with the extracted code elements. Once the scan finishes, use the server's tools to search the codebase or ask natural language questions about it.

Key features

Key features include code scanning for identifying code elements and relationships, vector storage in Weaviate, optional LLM enrichment for generating semantic descriptions, automatic refinement of descriptions, Retrieval-Augmented Generation (RAG) for Q&A, user clarifications for manual notes, and visualization through MermaidJS call graphs.

Where to use

CodebaseMCP is suitable for software development teams, code review processes, educational purposes, and any scenario requiring in-depth analysis and understanding of Python codebases.

Clients Supporting MCP

The following are the main clients that support the Model Context Protocol. Each entry lists the official website.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

Content

Python Codebase Analysis RAG System

This system analyzes Python code using Abstract Syntax Trees (AST), stores the extracted information (functions, classes, calls, variables, etc.) in a Weaviate vector database, and provides tools for querying and understanding the codebase via a Model Context Protocol (MCP) server. It leverages Google’s Gemini models for generating embeddings and natural language descriptions/answers.

Features

Code Scanning: Parses Python files to identify code elements (functions, classes, imports, calls, assignments) and their relationships. Extracts:
- Basic info: Name, type, file path, line numbers, code snippet, docstring.
- Function/Method details: Parameters, return type, signature, decorators.
- Scope info: Parent scope (class/function) UUID, readable ID (e.g., file:type:name:line), base class names.
- Usage info: Attribute accesses within scopes, call relationships (partially tracked).
Vector Storage: Uses Weaviate to store code elements and their vector embeddings (when LLM generation is enabled).
LLM Enrichment (Optional & Background): Generates semantic descriptions and embeddings for functions and classes using Gemini. Generation runs as background tasks, triggered automatically after a scan or started manually. Can be enabled/disabled via the .env file.
Automatic Refinement (Optional & Background): When LLM generation is enabled, automatically refines descriptions for new/updated functions using context (callers, callees, siblings, related variables) as part of the background processing.
RAG Q&A: Answers natural language questions about the codebase using Retrieval-Augmented Generation (requires LLM features enabled and background processing completed).
User Clarifications: Allows users to add manual notes to specific code elements.
Visualization: Generates MermaidJS call graphs based on stored relationships.
MCP Server: Exposes analysis and querying capabilities through MCP tools, managing codebases and an active codebase context.
File Watcher (Integrated): Automatically starts when a codebase is scanned (scan_codebase) and stops when another codebase is selected (select_codebase) or the codebase is deleted (delete_codebase). Triggers re-analysis and database updates for the active codebase when its files change. Can also be manually controlled via start_watcher and stop_watcher tools.
Codebase Dependencies: Allows defining dependencies between scanned codebases (add_codebase_dependency, remove_codebase_dependency).
Cross-Codebase Querying: Enables searching (find_element) and asking questions (ask_question) across the active codebase and its declared dependencies.
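
The Code Scanning feature above can be illustrated with Python's standard `ast` module. This is a minimal sketch, not the project's `code_scanner.py`: the function name `scan_source`, the dictionary keys, and the choice to label methods as `function` are assumptions made for the example. Only the readable ID shape (`file:type:name:line`) comes from the feature list.

```python
import ast


def scan_source(source: str, file_path: str) -> list[dict]:
    """Collect basic facts about the functions and classes in one file.

    Hypothetical helper: the field names are illustrative, not the
    project's actual Weaviate schema.
    """
    tree = ast.parse(source, filename=file_path)
    elements = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"  # assumption: methods are labeled "function" too
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        elements.append({
            # Readable ID in the file:type:name:line form described above.
            "readable_id": f"{file_path}:{kind}:{node.name}:{node.lineno}",
            "name": node.name,
            "type": kind,
            "start_line": node.lineno,
            "end_line": node.end_lineno,
            "docstring": ast.get_docstring(node),
            "parameters": [a.arg for a in node.args.args] if kind == "function" else [],
            "decorators": [ast.unparse(d) for d in node.decorator_list],
            "bases": [ast.unparse(b) for b in node.bases] if kind == "class" else [],
        })
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(elements, key=lambda e: e["start_line"])


SAMPLE = '''
class Greeter(Base):
    def greet(self, name):
        """Say hello."""
        return f"hi {name}"
'''

if __name__ == "__main__":
    for element in scan_source(SAMPLE, "demo.py"):
        print(element["readable_id"], element["parameters"])
```

A real scanner would also record parent scopes, imports, assignments, and calls; this sketch stops at the per-element basics so the AST traversal stays easy to follow.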

Setup

Environment: Ensure Python 3.10+ and Docker are installed.
Weaviate: Start the Weaviate instance using Docker Compose:
```
docker-compose up -d
```
Dependencies: Install Python packages:
```
pip install -r requirements.txt
```

API Key & Configuration: Create a .env file in the project root and add your Gemini API key. You can also configure other settings:

```
# --- Required ---
GEMINI_API_KEY=YOUR_API_KEY_HERE

# --- Optional ---
# Set to true to enable background LLM description generation and refinement
GENERATE_LLM_DESCRIPTIONS=true
# Max concurrent background LLM tasks (embeddings/descriptions/refinements)
LLM_CONCURRENCY=5
# ANALYZE_ON_STARTUP is no longer used. Scanning is done via the scan_codebase tool.

# Specify Weaviate connection details if not using defaults
# WEAVIATE_HOST=localhost
# WEAVIATE_PORT=8080
# WEAVIATE_GRPC_PORT=50051

# Specify alternative Gemini models if desired
# GENERATION_MODEL_NAME="models/gemini-pro"
# EMBEDDING_MODEL_NAME="models/embedding-001"

# Adjust Weaviate batch size
# WEAVIATE_BATCH_SIZE=100

# SEMANTIC_SEARCH_LIMIT=5
# SEMANTIC_SEARCH_DISTANCE=0.7

# Watcher polling interval (seconds)
# WATCHER_POLLING_INTERVAL=5
```
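
To show how settings like these might be consumed, here is a small sketch that reads them from an environment mapping. It is not the project's loader. The `Settings` class and `load_settings` function are hypothetical, the fallback values assume the commented-out values above are the defaults, and treating `GENERATE_LLM_DESCRIPTIONS` as off when unset is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the .env values."""
    gemini_api_key: str
    generate_llm_descriptions: bool
    llm_concurrency: int
    weaviate_host: str
    weaviate_port: int
    weaviate_grpc_port: int
    semantic_search_limit: int
    semantic_search_distance: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing fast if the API key is missing."""
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is required")
    return Settings(
        gemini_api_key=api_key,
        # Assumption: the feature is off unless explicitly set to "true".
        generate_llm_descriptions=env.get("GENERATE_LLM_DESCRIPTIONS", "false").strip().lower() == "true",
        llm_concurrency=int(env.get("LLM_CONCURRENCY", "5")),
        # Assumed defaults, taken from the commented-out example values.
        weaviate_host=env.get("WEAVIATE_HOST", "localhost"),
        weaviate_port=int(env.get("WEAVIATE_PORT", "8080")),
        weaviate_grpc_port=int(env.get("WEAVIATE_GRPC_PORT", "50051")),
        semantic_search_limit=int(env.get("SEMANTIC_SEARCH_LIMIT", "5")),
        semantic_search_distance=float(env.get("SEMANTIC_SEARCH_DISTANCE", "0.7")),
    )


if __name__ == "__main__":
    print(load_settings({"GEMINI_API_KEY": "example-key", "WEAVIATE_PORT": "9090"}))
```

Passing the mapping explicitly keeps the function easy to test; in practice the values would first be loaded from the .env file into the process environment.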

Run MCP Server: Start the server in a separate terminal:
```
python src/code_analysis_mcp/mcp_server.py
```
(Ensure this terminal stays running for the tools to be available)

Architecture Overview

This system analyzes Python code, stores the extracted information in a Weaviate vector database, and provides tools for querying and understanding the codebase via a Model Context Protocol (MCP) server. It leverages Google’s Gemini models for generating embeddings and natural language descriptions/answers.
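
The retrieval half of the question-answering flow can be sketched in plain Python. In the real system Weaviate performs the vector search and Gemini produces the embeddings and the answer; the code below replaces both with toy vectors so the logic is visible. The names `cosine_distance`, `retrieve`, and `build_prompt`, the element fields, and the use of cosine distance are all assumptions for illustration. The `limit` and `max_distance` parameters mirror the `SEMANTIC_SEARCH_LIMIT` and `SEMANTIC_SEARCH_DISTANCE` settings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def retrieve(query_vec, elements, limit=5, max_distance=0.7):
    """Keep elements within max_distance of the query, closest first."""
    scored = [(cosine_distance(query_vec, e["vector"]), e) for e in elements]
    kept = sorted((s for s in scored if s[0] <= max_distance), key=lambda s: s[0])
    return [element for _, element in kept[:limit]]


def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble the context an LLM would receive (hypothetical prompt wording)."""
    context = "\n\n".join(f"{e['readable_id']}\n{e['description']}" for e in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy data: two-dimensional vectors stand in for real embeddings.
ELEMENTS = [
    {"readable_id": "a", "description": "loads config", "vector": [1.0, 0.0]},
    {"readable_id": "b", "description": "renders graph", "vector": [0.0, 1.0]},
    {"readable_id": "c", "description": "parses config", "vector": [1.0, 1.0]},
]

if __name__ == "__main__":
    hits = retrieve([1.0, 0.0], ELEMENTS)
    print(build_prompt("How is config loaded?", hits))
```

The distance threshold matters as much as the limit: without it, a question unrelated to the codebase would still pull in the nearest elements and invite a confident but unfounded answer.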

The main modules are:

code_scanner.py: Finds Python files, parses them using AST, extracts structural elements (functions, classes, imports, calls, etc.), and prepares data for Weaviate.
weaviate_client.py: Manages the connection to Weaviate, defines the data schema (CodeFile, CodeElement, CodebaseRegistry), and provides functions for batch uploading, querying, updating, and deleting data.
rag.py: Implements Retrieval-Augmented Generation (RAG) for answering questions about the codebase. It uses semantic search to find relevant code elements and an LLM to synthesize an answer.
mcp_server.py: Sets up the FastMCP server, manages codebases in a CodebaseRegistry collection, handles the active codebase context (ACTIVE_CODEBASE_NAME), integrates file watching logic (including automatic start/stop), manages codebase dependencies, and exposes analysis functionalities as MCP tools with detailed argument descriptions.
visualization.py: Generates MermaidJS call graphs based on stored relationships.
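
As an illustration of the visualization step, the sketch below turns a list of caller/callee pairs into MermaidJS flowchart text. It is an assumed, simplified stand-in for `visualization.py`: the function name `to_mermaid` and the input shape are hypothetical, and element names are assumed not to contain double quotes.

```python
def to_mermaid(calls: list[tuple[str, str]]) -> str:
    """Render (caller, callee) pairs as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    ids: dict[str, str] = {}

    def node_id(name: str) -> str:
        # Declare each node once with a short, syntax-safe identifier.
        if name not in ids:
            ids[name] = f"n{len(ids)}"
            lines.append(f'    {ids[name]}["{name}"]')
        return ids[name]

    for caller, callee in calls:
        src, dst = node_id(caller), node_id(callee)
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid([("main", "load"), ("main", "run")]))
```

Using generated identifiers (`n0`, `n1`, ...) instead of raw names avoids clashes with Mermaid keywords and with characters such as dots or colons that appear in qualified names.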

The system uses Weaviate’s multi-tenancy feature for CodeFile and CodeElement collections, where the tenant ID is the user-defined codebase_name. A separate, non-multi-tenant CodebaseRegistry collection tracks codebase metadata (name, directory, status, summary, watcher status, dependencies). The ACTIVE_CODEBASE_NAME global variable in the server determines the primary codebase tenant for queries. Query tools (find_element, ask_question) can optionally search across the active codebase and its declared dependencies stored in the registry. The list_codebases tool can be used to view the status and dependencies of all codebases.
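
The way a query fans out across tenants can be sketched as follows. The registry is modeled here as a plain dictionary rather than a Weaviate collection, and the function name `tenants_to_query` is hypothetical. The sketch also assumes that only directly declared dependencies are searched, not dependencies of dependencies; the source does not say which applies.

```python
def tenants_to_query(registry: dict[str, dict], active: str,
                     include_dependencies: bool = True) -> list[str]:
    """Return the tenant IDs a query should cover, active codebase first."""
    if active not in registry:
        raise KeyError(f"unknown codebase: {active}")
    tenants = [active]
    if include_dependencies:
        for dep in registry[active].get("dependencies", []):
            # Skip dependencies that were never scanned, and avoid duplicates.
            if dep in registry and dep not in tenants:
                tenants.append(dep)
    return tenants


# Toy registry: keys are codebase names, which double as tenant IDs.
REGISTRY = {
    "app": {"directory": "/src/app", "dependencies": ["core", "missing"]},
    "core": {"directory": "/src/core", "dependencies": []},
}

if __name__ == "__main__":
    print(tenants_to_query(REGISTRY, "app"))
```

Each returned name would then be used as the tenant for a separate search against the CodeFile and CodeElement collections, with the results merged afterwards.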

Background LLM processing is used to generate semantic descriptions and embeddings for code elements. This is an optional feature that can be enabled/disabled via the .env file.
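
One common way to cap concurrent background LLM calls, as the `LLM_CONCURRENCY` setting suggests, is an `asyncio.Semaphore`. The sketch below shows that pattern only; whether the project uses asyncio or this exact mechanism is an assumption. `enrich_all` and `fake_describe` are hypothetical names, and `fake_describe` stands in for a real Gemini request.

```python
import asyncio


async def enrich_all(elements, describe, concurrency: int = 5):
    """Run describe() on every element with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(element):
        async with semaphore:
            return await describe(element)

    # gather preserves input order, so results line up with elements.
    return await asyncio.gather(*(worker(e) for e in elements))


async def fake_describe(name: str) -> str:
    """Placeholder for an LLM call; a real one would hit the network."""
    await asyncio.sleep(0)
    return f"Description of {name}"


if __name__ == "__main__":
    print(asyncio.run(enrich_all(["load", "run"], fake_describe, concurrency=2)))
```

Bounding concurrency this way keeps a large scan from exceeding API rate limits while still overlapping the slow network calls.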

Detailed information on the available tools and their arguments can be retrieved directly from the MCP server using standard MCP introspection methods once the server is running.

Dev Tools Supporting MCP

The following are the main code editors that support the Model Context Protocol. Click the link to visit the official website for more information.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai
