Crawl4Claude

2 MIT

Free, Community

AI Systems

A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.

What is Crawl4Claude

Crawl4Claude is a comprehensive, domain-agnostic documentation scraping and AI integration toolkit that allows users to scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP for AI-powered documentation assistance.

Use cases

Use cases include scraping documentation for software libraries, creating searchable databases for technical manuals, integrating AI assistance in documentation workflows, and enabling quick access to information across multiple documentation sites.

How to use

To use Crawl4Claude, clone the repository, install the dependencies, configure your target documentation site in the config.py file, run the scraper using the provided command, and query the scraped documentation with the query interface.

Key features

Key features include a universal documentation scraper, structured SQLite database with full-text search, native MCP server integration, LLM-optimized output, a command-line query interface, comprehensive debugging tools, automatic MCP setup generation, progress tracking, and resumable crawls.

Where to use

Crawl4Claude can be used in various fields such as software development, technical writing, research, and any domain that requires efficient documentation scraping and AI integration.

Clients Supporting MCP

The following are the main clients that support the Model Context Protocol. Each entry lists the client's official website.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app



Documentation Scraper & MCP Server

🚀 Features

Core Functionality

🌐 Universal Documentation Scraper: Works with any documentation website
📊 Structured Database: SQLite database with full-text search capabilities
🤖 MCP Server Integration: Native Claude Desktop integration via Model Context Protocol
📝 LLM-Optimized Output: Ready-to-use context files for AI applications
⚙️ Configuration-Driven: Single config file controls all settings

Advanced Tools

🔍 Query Interface: Command-line tool for searching and analyzing scraped content
🛠️ Debug Suite: Comprehensive debugging tools for testing and validation
📋 Auto-Configuration: Automatic MCP setup file generation
📈 Progress Tracking: Detailed logging and error handling
💾 Resumable Crawls: Smart caching for interrupted crawls

📋 Prerequisites

Python 3.8 or higher
Internet connection
~500MB free disk space per documentation site

🛠️ Quick Start

1. Installation

# Clone the repository
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt

2. Configure Your Target

Edit config.py to set your documentation site:

SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",  # Your documentation site
    "output_dir": "docs_db",
    "max_pages": 200,
    # ... other settings
}

3. Run the Scraper

python docs_scraper.py

4. Query Your Documentation

# Search for content
python query_docs.py --search "tutorial"

# Browse by section
python query_docs.py --section "getting-started"

# Get statistics
python query_docs.py --stats

5. Set Up Claude Integration

# Generate MCP configuration files
python utils/gen_mcp.py

# Follow the instructions to add to Claude Desktop

🏗️ Project Structure

📁 documentation-scraper/
├── 📄 config.py                    # Central configuration file
├── 🕷️ docs_scraper.py              # Main scraper script
├── 🔍 query_docs.py                # Query and analysis tool
├── 🤖 mcp_docs_server.py           # MCP server for Claude integration
├── 📋 requirements.txt             # Python dependencies
├── 📁 utils/                       # Debug and utility tools
│   ├── 🛠️ gen_mcp.py               # Generate MCP config files
│   ├── 🧪 debug_scraper.py         # Test scraper functionality
│   ├── 🔧 debug_mcp_server.py      # Debug MCP server
│   ├── 🎯 debug_mcp_client.py      # Test MCP tools directly
│   ├── 📡 debug_mcp_server_protocol.py # Test MCP via JSON-RPC
│   └── 🌐 debug_site_content.py    # Debug content extraction
├── 📁 docs_db/                     # Generated documentation database
│   ├── 📊 documentation.db         # SQLite database
│   ├── 📄 documentation.json       # JSON export
│   ├── 📋 scrape_summary.json      # Statistics
│   └── 📁 llm_context/             # LLM-ready context files
└── 📁 mcp/                         # Generated MCP configuration
    ├── 🔧 run_mcp_server.bat       # Windows launcher script
    └── ⚙️ claude_mcp_config.json   # Claude Desktop config

⚙️ Configuration

Main Configuration (config.py)

The entire system is controlled by a single configuration file:

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}

Environment Overrides

Settings such as the database path and base URL can be overridden with environment variables:

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
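
One way such overrides could be resolved is sketched below. The variable names `DOCS_DB_PATH` and `DOCS_BASE_URL` come from the example above; the `resolve_settings` helper, the setting keys, and the default values are hypothetical and not taken from the project's code.

```python
import os

# Illustrative defaults mirroring the config.py examples (assumed values).
DEFAULTS = {
    "db_path": "docs_db/documentation.db",
    "base_url": "https://docs.example.com/",
}

# Maps each setting to the environment variable that may override it.
ENV_VARS = {
    "db_path": "DOCS_DB_PATH",
    "base_url": "DOCS_BASE_URL",
}


def resolve_settings(environ=None):
    """Return settings, preferring non-empty environment values over defaults."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for key, var in ENV_VARS.items():
        value = environ.get(var)
        if value:  # ignore unset or empty variables
            settings[key] = value
    return settings
```

Passing the environment as an argument keeps the lookup easy to test without touching the real process environment.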

🤖 Claude Desktop Integration

Automatic Setup

1. Generate configuration files:
```
python utils/gen_mcp.py
```
2. Copy the generated config to Claude Desktop:
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
3. Restart Claude Desktop.

Manual Setup

If you prefer manual setup, add this to your Claude Desktop config:

{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": [
        "path/to/mcp_docs_server.py"
      ],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
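
If the Claude Desktop config already lists other servers, the entry above should be merged in rather than pasted over the whole file. The following is a minimal sketch of such a merge, not the logic of `utils/gen_mcp.py`; the `add_docs_server` helper and its parameters are hypothetical.

```python
import json
from pathlib import Path


def add_docs_server(config_path, script_path, project_dir, db_path, name="docs"):
    """Add or replace one entry under "mcpServers", keeping other entries."""
    config_path = Path(config_path)
    config = {}
    if config_path.exists():
        raw = config_path.read_text(encoding="utf-8")
        if raw.strip():  # tolerate an empty file
            config = json.loads(raw)
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "command": "python",
        "args": [str(script_path)],
        "cwd": str(project_dir),
        "env": {"DOCS_DB_PATH": str(db_path)},
    }
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Absolute paths are usually the safer choice for `script_path`, `project_dir`, and `db_path`, since the client may launch the server from a different working directory.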

Available MCP Tools

Once connected, Claude can use these tools:

🔍 search_documentation: Search for content across all documentation
📚 get_documentation_sections: List all available sections
📄 get_page_content: Get full content of specific pages
🗂️ browse_section: Browse pages within a section
📊 get_documentation_stats: Get database statistics
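
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the `tools/call` method. The sketch below builds such a request for `search_documentation`; the argument names `query` and `limit` are assumptions, since the tool's actual input schema is not shown here.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects a unique id per call.
_request_ids = count(1)


def build_tool_call(tool_name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: the real parameter names depend on the server's schema.
request = build_tool_call("search_documentation", {"query": "authentication", "limit": 5})
print(json.dumps(request, indent=2))
```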

🔧 Command Line Tools

Documentation Scraper

# Run the scraper (all settings are read from config.py)
python docs_scraper.py

Query Tool

# Search for content
python query_docs.py --search "authentication guide"

# Browse specific sections  
python query_docs.py --section "api-reference"

# Get database statistics
python query_docs.py --stats

# List all sections
python query_docs.py --list-sections

# Export section to file
python query_docs.py --export-section "tutorials" --format markdown > tutorials.md

# Use custom database
python query_docs.py --db "custom/path/docs.db" --search "example"

Debug Tools

# Test scraper functionality
python utils/debug_scraper.py

# Test MCP server
python utils/debug_mcp_server.py

# Test MCP tools directly
python utils/debug_mcp_client.py

# Test MCP protocol
python utils/debug_mcp_server_protocol.py

# Debug content extraction
python utils/debug_site_content.py

# Generate MCP config files
python utils/gen_mcp.py

📊 Database Schema

Pages Table

CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    markdown TEXT,
    word_count INTEGER,
    section TEXT,
    subsection TEXT,
    scraped_at TIMESTAMP,
    metadata TEXT
);

Full-Text Search

-- Search using FTS5
SELECT * FROM pages_fts WHERE pages_fts MATCH 'your search term';

-- Or use the query tool
python query_docs.py --search "your search term"
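
To illustrate how an FTS5 index such as `pages_fts` can sit alongside the `pages` table, here is a self-contained sketch using an in-memory database. The `pages_fts` column layout, the manual index sync, and the sample rows are assumptions; the project's real schema may differ. It also assumes your Python build ships SQLite with FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE NOT NULL,
        title TEXT,
        content TEXT
    );
    -- Assumed layout: a full-text index over title and content.
    CREATE VIRTUAL TABLE pages_fts USING fts5(title, content);
""")

rows = [
    ("https://docs.example.com/auth", "Authentication guide", "How to sign in with tokens."),
    ("https://docs.example.com/install", "Installation", "Install the package with pip."),
]
for url, title, content in rows:
    cur = conn.execute(
        "INSERT INTO pages (url, title, content) VALUES (?, ?, ?)", (url, title, content)
    )
    # Reuse the page id as the FTS rowid so the two tables can be joined.
    conn.execute(
        "INSERT INTO pages_fts (rowid, title, content) VALUES (?, ?, ?)",
        (cur.lastrowid, title, content),
    )


def search(term, limit=10):
    """Return (url, title) pairs ranked by BM25 relevance."""
    return conn.execute(
        """
        SELECT p.url, p.title
        FROM pages_fts
        JOIN pages AS p ON p.id = pages_fts.rowid
        WHERE pages_fts MATCH ?
        ORDER BY bm25(pages_fts)
        LIMIT ?
        """,
        (term, limit),
    ).fetchall()
```

Note that the `MATCH` argument is parsed as an FTS5 query, so user input containing quotes or operators may need escaping before it is passed in.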

🎯 Example Use Cases

1. Documentation Analysis

# Get overview of documentation
python query_docs.py --stats

# Find all tutorial content
python query_docs.py --search "tutorial guide example"

# Export specific sections
python query_docs.py --export-section "getting-started" > onboarding.md

2. AI Integration with Claude

# Once MCP is set up, ask Claude:
# "Search the documentation for authentication examples"
# "What sections are available in the documentation?"
# "Show me the content for the API reference page"

3. Custom Applications

import sqlite3

# Connect to your scraped documentation
conn = sqlite3.connect('docs_db/documentation.db')

# Query for specific content
results = conn.execute("""
    SELECT title, url, markdown 
    FROM pages 
    WHERE section = 'tutorials' 
    AND word_count > 500
    ORDER BY word_count DESC
""").fetchall()

# Build your own tools on top of the structured data
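
As one example of such a tool, the sketch below assembles pages from a section into a single context string while staying under a word budget. The `build_context` helper and the `max_words` parameter are hypothetical; only the `pages` columns come from the schema above.

```python
import sqlite3


def build_context(conn, section, max_words=2000):
    """Concatenate pages from one section until the word budget is used up."""
    rows = conn.execute(
        "SELECT title, url, markdown, word_count FROM pages "
        "WHERE section = ? ORDER BY word_count DESC",
        (section,),
    )
    parts, used = [], 0
    for title, url, markdown, word_count in rows:
        words = word_count or 0
        if used + words > max_words:
            continue  # skip pages that would exceed the budget
        parts.append(f"# {title}\nSource: {url}\n\n{markdown}")
        used += words
    return "\n\n---\n\n".join(parts), used
```

Calling `build_context(sqlite3.connect('docs_db/documentation.db'), 'tutorials')` would return the combined text and the number of words used.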

🔍 Debugging and Testing

Test Scraper Before Full Run

python utils/debug_scraper.py

Validate Content Extraction

python utils/debug_site_content.py

Test MCP Integration

# Test server functionality
python utils/debug_mcp_server.py

# Test tools directly
python utils/debug_mcp_client.py

# Test JSON-RPC protocol
python utils/debug_mcp_server_protocol.py

📈 Performance and Optimization

Scraping Performance

Start small: Use max_pages=50 for testing
Adjust depth: max_depth=2 covers most content efficiently
Rate limiting: Increase delay_between_requests if getting blocked
Caching: Enabled by default for resumable crawls
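
To see why `max_depth` and `max_pages` bound the size of a crawl, consider this sketch of a breadth-first traversal over an in-memory link graph. It is not the project's crawler (which builds on Crawl4AI); `plan_crawl`, `get_links`, and the `SITE` graph are hypothetical, and the exact depth semantics of the real scraper may differ.

```python
from collections import deque


def plan_crawl(start_url, get_links, max_depth=2, max_pages=50):
    """Breadth-first traversal that stops at max_depth and max_pages."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A tiny in-memory link graph stands in for real HTTP fetches.
SITE = {
    "/": ["/guide", "/api"],
    "/guide": ["/guide/install", "/guide/usage"],
    "/api": ["/api/auth"],
    "/guide/install": ["/guide/install/linux"],
}
pages = plan_crawl("/", lambda url: SITE.get(url, []), max_depth=2, max_pages=50)
```

With `max_depth=2` the page three links away from the start (`/guide/install/linux`) is never visited, which is the trade-off the depth setting controls.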

Database Performance

Full-text search: Automatic FTS5 index for fast searching
Indexing: Optimized indexes on URL and section columns
Word counts: Pre-calculated for quick statistics

MCP Performance

Configurable limits: Set appropriate search and section limits
Snippet length: Adjust snippet size for optimal response times
Connection pooling: Efficient database connections
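
The limit and snippet settings can be thought of as simple guards applied before results are returned. The sketch below shows one possible shape; the defaults of 10 and 50 mirror `default_search_limit` and `max_search_limit` from the configuration example, while the helper names and the 200-character snippet size are assumptions.

```python
def clamp_limit(requested, default=10, maximum=50):
    """Fall back to the default and keep the limit within [1, maximum]."""
    if requested is None:
        return default
    return max(1, min(int(requested), maximum))


def make_snippet(text, max_chars=200):
    """Collapse whitespace and cut the text at a word boundary."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars].rsplit(" ", 1)[0]
    return cut + "..."
```

Smaller limits and shorter snippets reduce the amount of text sent back to the model per tool call, at the cost of less context per result.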

🌐 Supported Documentation Sites

This scraper works with most documentation websites including:

Static sites: Hugo, Jekyll, MkDocs, Docusaurus
Documentation platforms: GitBook, Notion, Confluence
API docs: Swagger/OpenAPI documentation
Wiki-style: MediaWiki, TiddlyWiki
Custom sites: Any site with consistent HTML structure

Site-Specific Configuration

Customize URL filtering and content extraction for your target site:

URL_FILTER_CONFIG = {
    "skip_patterns": [
        r'/api/',           # Skip API endpoint docs
        r'/edit/',          # Skip edit pages  
        r'\.pdf$',          # Skip PDF files
    ],
    "allowed_domains": ["docs.yoursite.com"],
}

CONTENT_FILTER_CONFIG = {
    "remove_patterns": [
        r'Edit this page.*?\n',      # Remove edit links
        r'Was this helpful\?.*?\n',  # Remove feedback sections
    ],
}
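
A rough sketch of how these two configs might be applied is shown below. The `should_crawl` and `clean_content` helpers are hypothetical, and matching `skip_patterns` against the URL path (rather than the full URL) is an assumption about the scraper's behavior.

```python
import re
from urllib.parse import urlparse

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'/edit/', r'\.pdf$'],
    "allowed_domains": ["docs.yoursite.com"],
}
CONTENT_FILTER_CONFIG = {
    "remove_patterns": [r'Edit this page.*?\n', r'Was this helpful\?.*?\n'],
}


def should_crawl(url, config=URL_FILTER_CONFIG):
    """Accept a URL only if its host is allowed and no skip pattern matches."""
    parsed = urlparse(url)
    if parsed.hostname not in config["allowed_domains"]:
        return False
    return not any(re.search(p, parsed.path) for p in config["skip_patterns"])


def clean_content(text, config=CONTENT_FILTER_CONFIG):
    """Strip each configured noise pattern from the extracted text."""
    for pattern in config["remove_patterns"]:
        text = re.sub(pattern, "", text)
    return text
```

Because the removal patterns end in `\n`, they delete a whole line starting at the matched phrase, so a pattern that is too broad can also remove legitimate content on that line.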

🤝 Contributing

We welcome contributions! Here are some areas where you can help:

New export formats: PDF, EPUB, Word documents
Enhanced content filtering: Better noise removal
Additional debug tools: More comprehensive testing
Documentation: Improve guides and examples
Performance optimizations: Faster scraping and querying

⚠️ Responsible Usage

Respect robots.txt: Check the target site’s robots.txt file
Rate limiting: Use appropriate delays between requests
Terms of service: Respect the documentation site’s terms
Fair use: Use for educational, research, or personal purposes
Attribution: Credit the original documentation source
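
The robots.txt check can be done with Python's standard library before starting a crawl. The sketch below parses a sample file offline; the user agent name `docs-scraper` and the sample rules are made up for illustration, and this is not necessarily how the scraper itself handles robots.txt.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("docs-scraper", "https://docs.example.com/guide/")
blocked = rules.can_fetch("docs-scraper", "https://docs.example.com/private/notes")
delay = rules.crawl_delay("docs-scraper")  # seconds requested by the site, if any
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file instead, and a reported crawl delay is a reasonable lower bound for `delay_between_requests`.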

📄 License

This project is provided as-is for educational and research purposes. Please respect the terms of service and licensing of the documentation sites you scrape.

🎉 Getting Started Examples

Example 1: Scrape Python Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://docs.python.org/3/",
    "max_pages": 500,
    "max_depth": 3,
}

Example 2: Scrape API Documentation

# config.py  
SCRAPER_CONFIG = {
    "base_url": "https://api-docs.example.com/",
    "max_pages": 200,
}

URL_FILTER_CONFIG = {
    "skip_patterns": [r'/changelog/', r'/releases/'],
}

Example 3: Corporate Documentation

# config.py
SCRAPER_CONFIG = {
    "base_url": "https://internal-docs.company.com/",
    "output_dir": "company_docs",
}

MCP_CONFIG = {
    "server_name": "company-docs-server",
    "docs_display_name": "Company Internal Docs",
}

Happy Documenting! 📚✨

For questions, issues, or feature requests, please check the debug logs first, then create an issue with relevant details.

🙏 Attribution

This project is powered by Crawl4AI - an amazing open-source LLM-friendly web crawler and scraper.

Crawl4AI enables the intelligent web scraping capabilities that make this documentation toolkit possible. A huge thanks to @unclecode and the Crawl4AI community for building such an incredible tool! 🚀

Check out Crawl4AI:

Repository: https://github.com/unclecode/crawl4ai
Documentation: https://crawl4ai.com
Discord Community: https://discord.gg/jP8KfhDhyN

Dev Tools Supporting MCP

The following are the main code editors that support the Model Context Protocol. Click the link to visit the official website for more information.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai

