
Doctor

Doctor is a tool for discovering, crawling, and indexing websites so they can be exposed as an MCP server for LLM agents.

Overview

What is Doctor

Doctor is a tool designed for discovering, crawling, and indexing websites so they can be exposed as an MCP server for LLM agents, improving the agents' reasoning and code generation.

Use cases

Use cases for Doctor include building search engines, enhancing LLMs with real-time web data, and creating applications that require dynamic content generation based on the latest web information.

How to use

To use Doctor, clone the repository, set up your environment variables including your OpenAI API key, and run the stack using Docker Compose. This will start the necessary services for crawling and indexing.

Key features

Key features of Doctor include web crawling with crawl4ai, text chunking with LangChain, creating embeddings with OpenAI, storing data in DuckDB with vector search support, and exposing a FastAPI web service for search functionality.

Where to use

Doctor can be utilized in various fields such as web development, data analysis, and artificial intelligence, particularly for applications requiring up-to-date information retrieval and processing.

Content


🩺 Doctor


A tool for discovering, crawling, and indexing websites, exposed as an MCP server to give LLM agents better, more up-to-date reasoning and code generation.


🔍 Overview

Doctor provides a complete stack for:

  • Crawling web pages using crawl4ai with hierarchy tracking
  • Chunking text with LangChain
  • Creating embeddings with OpenAI via litellm
  • Storing data in DuckDB with vector search support
  • Exposing search functionality via a FastAPI web service
  • Making these capabilities available to LLMs through an MCP server
  • Navigating crawled sites with hierarchical site maps

🏗️ Core Infrastructure

🗄️ DuckDB

  • Database for storing document data and embeddings with vector search capabilities
  • Managed by unified Database class

📨 Redis

  • Message broker for asynchronous task processing

🕸️ Crawl Worker

  • Processes crawl jobs
  • Chunks text
  • Creates embeddings

🌐 Web Server

  • FastAPI service exposing endpoints for fetching, searching, and viewing data
  • Exposes the MCP server

💻 Setup

⚙️ Prerequisites

  • Docker and Docker Compose
  • Python 3.10+
  • uv (Python package manager)
  • OpenAI API key

📦 Installation

  1. Clone this repository
  2. Set up environment variables:
    export OPENAI_API_KEY=your-openai-key
    
  3. Run the stack:
    docker compose up
    

👁 Usage

  1. Go to http://localhost:9111/docs to see the OpenAPI docs
  2. Look for the /fetch_url endpoint and start a crawl job by providing a URL (see the example after this list)
  3. Use /job_progress to see the current job status
  4. Configure your editor to use http://localhost:9111/mcp as an MCP server
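
A quick way to exercise these steps from the command line, sketched with curl. The request body field `url` and the query parameter `job_id` are assumptions here; confirm the exact schema in the OpenAPI docs at /docs.

# Start a crawl job (confirm the request body schema at /docs)
curl -X POST http://localhost:9111/fetch_url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Check progress, using the job id returned by the call above
curl "http://localhost:9111/job_progress?job_id=<JOB_ID>"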

☁️ Web API

Core Endpoints

  • POST /fetch_url: Start crawling a URL
  • GET /search_docs: Search indexed documents
  • GET /job_progress: Check crawl job progress
  • GET /list_doc_pages: List indexed pages
  • GET /get_doc_page: Get full text of a page
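
For example, a search followed by a page fetch might look like the sketch below; the parameter names `query` and `page_id` are assumptions, so check the OpenAPI docs for the exact names.

# Search indexed documents
curl "http://localhost:9111/search_docs?query=vector+search"

# Fetch the full text of a page found in the search results
curl "http://localhost:9111/get_doc_page?page_id=<PAGE_ID>"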

Site Map Feature

The Maps feature provides a hierarchical view of crawled websites, making it easy to navigate and explore the structure of indexed sites.

Endpoints:

  • GET /map: View an index of all crawled sites
  • GET /map/site/{root_page_id}: View the hierarchical tree structure of a specific site
  • GET /map/page/{page_id}: View a specific page with navigation (parent, siblings, children)
  • GET /map/page/{page_id}/raw: Get the raw markdown content of a page
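
These endpoints return HTML and are easiest to explore in a browser, but they can also be hit from the command line; `<PAGE_ID>` below is a placeholder for an id taken from the map index.

# View the index of all crawled sites
curl http://localhost:9111/map

# Get the raw markdown content of one page
curl "http://localhost:9111/map/page/<PAGE_ID>/raw"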

Features:

  • Hierarchical Navigation: Pages maintain parent-child relationships, allowing you to navigate through the site structure
  • Domain Grouping: Pages crawled individually from the same domain are automatically grouped together
  • Automatic Title Extraction: Page titles are extracted from HTML or markdown content
  • Breadcrumb Navigation: Easy navigation with breadcrumbs showing the path from root to current page
  • Sibling Navigation: Quick access to pages at the same level in the hierarchy
  • Legacy Page Support: Pages crawled before hierarchy tracking are grouped by domain for easy access
  • No JavaScript Required: All navigation works with pure HTML and CSS for maximum compatibility

Usage Example:

  1. Crawl a website using the /fetch_url endpoint
  2. Visit /map to see all crawled sites
  3. Click on a site to view its hierarchical structure
  4. Navigate through pages using the provided links

🔧 MCP Integration

Ensure that your Docker Compose stack is up, and then add to your Cursor or VSCode MCP Servers configuration:
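
A minimal sketch of such a configuration (Cursor-style mcp.json; the server name "doctor" is arbitrary, and the URL assumes the stack's default port of 9111):

{
  "mcpServers": {
    "doctor": {
      "url": "http://localhost:9111/mcp"
    }
  }
}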


🧪 Testing

Running Tests

To run all tests:

# Run all tests with coverage report
pytest

To run specific test categories:

# Run only unit tests
pytest -m unit

# Run only async tests
pytest -m async_test

# Run tests for a specific component
pytest tests/lib/test_crawler.py

Test Coverage

The project is configured to generate coverage reports automatically:

# Run tests with detailed coverage report
pytest --cov=src --cov-report=term-missing

Test Structure

  • tests/conftest.py: Common fixtures for all tests
  • tests/lib/: Tests for library components
    • test_crawler.py: Tests for the crawler module
    • test_crawler_enhanced.py: Tests for enhanced crawler with hierarchy tracking
    • test_chunker.py: Tests for the chunker module
    • test_embedder.py: Tests for the embedder module
    • test_database.py: Tests for the unified Database class
    • test_database_hierarchy.py: Tests for database hierarchy operations
  • tests/common/: Tests for common modules
  • tests/services/: Tests for service layer
    • test_map_service.py: Tests for the map service
  • tests/api/: Tests for API endpoints
    • test_map_api.py: Tests for map API endpoints
  • tests/integration/: Integration tests
    • test_processor_enhanced.py: Tests for enhanced processor with hierarchy

🐞 Code Quality

Pre-commit Hooks

The project is configured with pre-commit hooks that run automatically before each commit:

  • ruff check --fix: Lints code and automatically fixes issues
  • ruff format: Formats code according to project style
  • Trailing whitespace removal
  • End-of-file fixing
  • YAML validation
  • Large file checks

Setup Pre-commit

To set up pre-commit hooks:

# Install pre-commit
uv pip install pre-commit

# Install the git hooks
pre-commit install

Running Pre-commit Manually

You can run the pre-commit hooks manually on all files:

# Run all pre-commit hooks
pre-commit run --all-files

Or on staged files only:

# Run on staged files
pre-commit run

⚖️ License

This project is licensed under the MIT License - see the LICENSE.md file for details.
