
Docs Scraper Mcp

Extract clean, focused documentation for AI and human use from any library.

Overview

What is Docs Scraper Mcp?

docs_scraper_mcp is a toolkit for extracting clean, focused documentation from framework and library websites, making the output suitable for both human readers and large language models (LLMs).

Use cases

Use cases include extracting documentation for libraries and frameworks, preparing content for LLM training, and maintaining up-to-date documentation without irrelevant information.

How to use

To use docs_scraper_mcp, choose a crawling strategy (single page, multi-page, sitemap-based, or menu-based) and run the corresponding crawler to extract documentation in Markdown or JSON format.

Key features

Key features include clean documentation output in Markdown and JSON formats, smart content extraction that removes unnecessary elements, flexible crawling strategies for various documentation needs, and readiness for LLM and RAG systems.

Where to use

docs_scraper_mcp can be used in software development environments, documentation sites, wikis, and knowledge bases where clean and structured documentation is required.

Content

Crawl4AI Documentation Scraper

Keep your dependency documentation lean, current, and AI-ready. This toolkit helps you extract clean, focused documentation from any framework or library website, perfect for both human readers and LLM consumption.

Why This Tool?

In today’s fast-paced development environment, you need:

  • 📚 Quick access to dependency documentation without the bloat
  • 🤖 Documentation in a format that’s ready for RAG systems and LLMs
  • 🎯 Focused content without navigation elements, ads, or irrelevant sections
  • ⚡ Fast, efficient way to keep documentation up-to-date
  • 🧹 Clean Markdown output for easy integration with documentation tools

Traditional web scraping often captures everything, including navigation menus, footers, ads, and other noise. This toolkit is designed to extract only what matters: the actual documentation content.

Key Benefits

  1. Clean Documentation Output

    • Markdown format for content-focused documentation
    • JSON format for structured menu data
    • Perfect for documentation sites, wikis, and knowledge bases
    • Ideal format for LLM training and RAG systems
  2. Smart Content Extraction

    • Automatically identifies main content areas
    • Strips away navigation, ads, and irrelevant sections
    • Preserves code blocks and technical formatting
    • Maintains proper Markdown structure
  3. Flexible Crawling Strategies

    • Single page for quick reference docs
    • Multi-page for comprehensive library documentation
    • Sitemap-based for complete framework coverage
    • Menu-based for structured documentation hierarchies
  4. LLM and RAG Ready

    • Clean Markdown text suitable for embeddings
    • Preserved code blocks for technical accuracy
    • Structured menu data in JSON format
    • Consistent formatting for reliable processing
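
Because the output is plain Markdown with headings preserved, preparing it for a RAG pipeline can be as simple as splitting on headings before computing embeddings. The sketch below is purely illustrative; the chunk_markdown helper and the example file path are assumptions, not part of this toolkit:

```python
# Illustrative sketch: split scraped Markdown into heading-delimited chunks
# for embedding in a RAG pipeline. Not part of docs_scraper_mcp itself.
import re
from typing import List

def chunk_markdown(markdown: str) -> List[str]:
    """Split a Markdown document into chunks, one per H1/H2 heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

# Hypothetical output file from the scraper; each chunk can then be embedded.
with open("scraped_docs/example_docs_content.md", encoding="utf-8") as f:
    for chunk in chunk_markdown(f.read()):
        print(len(chunk), chunk.splitlines()[0])
```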

A comprehensive Python toolkit for scraping documentation websites with different crawling strategies, built on the Crawl4AI library for efficient web crawling.

Powered by Crawl4AI
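
For context, the core extraction step relies on Crawl4AI. A minimal sketch of driving it directly, based on the Crawl4AI quickstart API (AsyncWebCrawler, arun, result.markdown) rather than this toolkit's exact code, looks roughly like this:

```python
# Minimal sketch of driving Crawl4AI directly; based on its documented
# quickstart API, not this toolkit's exact code.
import asyncio
from crawl4ai import AsyncWebCrawler

async def fetch_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)  # crawl a single page
        return result.markdown                # cleaned Markdown content

if __name__ == "__main__":
    print(asyncio.run(fetch_markdown("https://docs.example.com/page")))
```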

Features

Core Features

  • 🚀 Multiple crawling strategies
  • 📑 Automatic nested menu expansion
  • 🔄 Handles dynamic content and lazy-loaded elements
  • 🎯 Configurable selectors
  • 📝 Clean Markdown output for documentation
  • 📊 JSON output for menu structure
  • 🎨 Colorful terminal feedback
  • 🔍 Smart URL processing
  • ⚡ Asynchronous execution

Available Crawlers

  1. Single URL Crawler (single_url_crawler.py)

    • Extracts content from a single documentation page
    • Outputs clean Markdown format
    • Perfect for targeted content extraction
    • Configurable content selectors
  2. Multi URL Crawler (multi_url_crawler.py)

    • Processes multiple URLs in parallel
    • Generates individual Markdown files per page
    • Efficient batch processing
    • Shared browser session for better performance
  3. Sitemap Crawler (sitemap_crawler.py)

    • Automatically discovers and crawls sitemap.xml
    • Creates Markdown files for each page
    • Supports recursive sitemap parsing
    • Handles gzipped sitemaps
  4. Menu Crawler (menu_crawler.py)

    • Extracts all menu links from documentation
    • Outputs structured JSON format
    • Handles nested and dynamic menus
    • Smart menu expansion

Requirements

  • Python 3.7+
  • Virtual Environment (recommended)

Installation

  1. Clone the repository:
git clone https://github.com/felores/crawl4ai_docs_scraper.git
cd crawl4ai_docs_scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Usage

1. Single URL Crawler

python single_url_crawler.py https://docs.example.com/page

Arguments:

  • URL: Target documentation URL (required, first argument)

Note: Use quotes only if your URL contains special characters or spaces.

Output format (Markdown):

# Page Title

## Section 1
Content with preserved formatting, including:
- Lists
- Links
- Tables

### Code Examples
```python
def example():
    return "Code blocks are preserved"
```

2. Multi URL Crawler

# Using a text file with URLs
python multi_url_crawler.py urls.txt

# Using JSON output from menu crawler
python multi_url_crawler.py menu_links.json

# Using custom output prefix
python multi_url_crawler.py menu_links.json --output-prefix custom_name

Arguments:

  • URLs file: Path to file containing URLs (required, first argument)
    • Can be .txt with one URL per line
    • Or .json from menu crawler output
  • --output-prefix: Custom prefix for output markdown file (optional)

Note: Use quotes only if your file path contains spaces.

Output filename format:

  • Without --output-prefix: domain_path_docs_content_timestamp.md (e.g., cloudflare_agents_docs_content_20240323_223656.md)
  • With --output-prefix: custom_prefix_docs_content_timestamp.md (e.g., custom_name_docs_content_20240323_223656.md)
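
The naming scheme can be reproduced with the standard library alone. The sketch below shows one plausible way the domain_path prefix and timestamp could be derived; the output_filename helper and its slug rules are illustrative assumptions, not the crawler's actual code:

```python
# Illustrative reconstruction of the default output filename pattern;
# the real crawler may derive its URL slug differently.
from datetime import datetime
from urllib.parse import urlparse

def output_filename(first_url, output_prefix=None):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    if output_prefix:
        prefix = output_prefix
    else:
        parsed = urlparse(first_url)
        labels = parsed.netloc.split(".")
        domain = labels[-2] if len(labels) > 1 else labels[0]  # e.g. "cloudflare"
        first_path = parsed.path.strip("/").split("/")[0]      # e.g. "agents"
        prefix = f"{domain}_{first_path}" if first_path else domain
    return f"{prefix}_docs_content_{timestamp}.md"

print(output_filename("https://developers.cloudflare.com/agents/"))
# e.g. cloudflare_agents_docs_content_20240323_223656.md (timestamp varies)
```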

The crawler accepts two types of input files:

  1. Text file with one URL per line:
https://docs.example.com/page1
https://docs.example.com/page2
https://docs.example.com/page3
  2. JSON file (compatible with menu crawler output):
{
  "menu_links": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ]
}
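
Both formats carry the same information, a flat list of URLs. A loader that accepts either could look like the sketch below; load_urls is an illustrative helper, not the crawler's internal function:

```python
# Illustrative loader for the two supported input formats (.txt and .json).
import json

def load_urls(path):
    with open(path, encoding="utf-8") as f:
        if path.endswith(".json"):
            return json.load(f)["menu_links"]   # menu crawler output format
        # plain text: one URL per line, blank lines ignored
        return [line.strip() for line in f if line.strip()]

urls = load_urls("input_files/menu_links.json")
print(f"{len(urls)} URLs to crawl")
```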

3. Sitemap Crawler

python sitemap_crawler.py https://docs.example.com/sitemap.xml

Options:

  • --max-depth: Maximum sitemap recursion depth (optional)
  • --patterns: URL patterns to include (optional)
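
Recursive sitemap handling, including sitemap indexes and gzipped files, can be done with the standard library. The following is a minimal sketch of the idea, not the crawler's actual implementation:

```python
# Illustrative sketch of recursive sitemap parsing with gzip support;
# the actual crawler may differ.
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_sitemap_urls(sitemap_url, max_depth=3):
    if max_depth < 0:
        return []
    with urllib.request.urlopen(sitemap_url) as resp:
        raw = resp.read()
    if sitemap_url.endswith(".gz") or raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)  # handle gzipped sitemaps
    root = ET.fromstring(raw)
    urls = []
    # <sitemapindex> entries point to nested sitemaps; recurse into them.
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        urls += collect_sitemap_urls(loc.text.strip(), max_depth - 1)
    # <urlset> entries are the actual documentation pages.
    for loc in root.findall("sm:url/sm:loc", NS):
        urls.append(loc.text.strip())
    return urls

print(collect_sitemap_urls("https://docs.example.com/sitemap.xml"))
```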

4. Menu Crawler

python menu_crawler.py https://docs.example.com

Options:

  • --selectors: Custom menu selectors (optional)

The menu crawler now saves its output to the input_files directory, making it ready for immediate use with the multi-url crawler. The output JSON has this format:

{
  "start_url": "https://docs.example.com/",
  "total_links_found": 42,
  "menu_links": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ]
}

After running the menu crawler, you’ll get a command to run the multi-url crawler with the generated file.
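
Put together, the JSON shape and the hand-off to the multi URL crawler can be summarized by the sketch below; the save_menu_links helper and the deduplication step are assumptions for illustration, not the menu crawler's actual code:

```python
# Illustrative sketch of the menu crawler's output shape and its hand-off
# to the multi URL crawler.
import json
import os

def save_menu_links(start_url, links, out_dir="input_files"):
    os.makedirs(out_dir, exist_ok=True)
    unique_links = sorted(set(links))
    payload = {
        "start_url": start_url,
        "total_links_found": len(unique_links),
        "menu_links": unique_links,
    }
    out_path = os.path.join(out_dir, "menu_links.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    # Suggested follow-up, similar to the command printed after a real run:
    print(f"python multi_url_crawler.py {out_path}")
    return out_path
```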

Directory Structure

crawl4ai_docs_scraper/
├── input_files/           # Input files for URL processing
│   ├── urls.txt          # Text file with URLs
│   └── menu_links.json   # JSON output from menu crawler
├── scraped_docs/         # Output directory for markdown files
│   └── docs_timestamp.md # Generated documentation
├── multi_url_crawler.py
├── menu_crawler.py
└── requirements.txt

Error Handling

All crawlers include comprehensive error handling with colored terminal output:

  • 🟢 Green: Success messages
  • 🔵 Cyan: Processing status
  • 🟡 Yellow: Warnings
  • 🔴 Red: Error messages
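
The status colors map naturally onto ANSI escape codes. The following is a minimal sketch of that style of feedback; the toolkit itself may rely on a helper library rather than raw escape codes:

```python
# Illustrative ANSI-color status helpers; not the toolkit's own implementation.
GREEN, CYAN, YELLOW, RED, RESET = "\033[92m", "\033[96m", "\033[93m", "\033[91m", "\033[0m"

def success(msg): print(f"{GREEN}✓ {msg}{RESET}")
def status(msg):  print(f"{CYAN}… {msg}{RESET}")
def warn(msg):    print(f"{YELLOW}! {msg}{RESET}")
def error(msg):   print(f"{RED}✗ {msg}{RESET}")

status("Crawling https://docs.example.com/page1")
success("Saved scraped_docs/page1.md")
warn("Skipped duplicate URL")
error("Failed to fetch sitemap.xml")
```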

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Attribution

This project uses Crawl4AI for web data extraction.
