Web Scraping MCP

6 MIT

MCP Server leveraging crawl4ai for web scraping and LLM-based content extraction (Markdown, text snippets, smart extraction). Designed for AI agent integration.

What is Web Scraping MCP

WEB-SCRAPING-MCP is an MCP Server that utilizes the crawl4ai library for web scraping and LLM-based content extraction. It is designed for AI agent integration, allowing interaction with web pages to retrieve content and perform intelligent data extraction.

Use cases

Use cases include extracting product information from e-commerce sites, gathering news articles for sentiment analysis, retrieving academic papers, and automating data collection for market research.

How to use

To use WEB-SCRAPING-MCP, deploy the server and interact with its exposed MCP tools. You can scrape web pages using the scrape_url function, extract specific text snippets with extract_text_by_query, or utilize the smart_extract feature for structured information extraction based on natural language instructions.

Key features

Key features include web interaction tools such as scrape_url for full webpage content in Markdown, extract_text_by_query for specific text snippets, and smart_extract for LLM-based structured extraction. The server is configurable via environment variables and supports Docker for easy deployment.

Where to use

WEB-SCRAPING-MCP can be used in various fields including data analysis, content aggregation, research, and any application requiring automated web data extraction and processing.

Clients Supporting MCP

The following clients support the Model Context Protocol. Each entry lists the client's official website.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

Crawl4AI Web Scraper MCP Server

This project provides an MCP (Model Context Protocol) server that uses the crawl4ai library to perform web scraping and intelligent content extraction tasks. It allows AI agents (like Claude, or agents built with LangChain/LangGraph) to interact with web pages, retrieve content, search for specific text, and perform LLM-based extraction based on natural language instructions.

This server uses:

FastMCP: For creating the MCP server endpoint.
crawl4ai: For the core web crawling and extraction logic.
dotenv: For managing API keys via a .env file.
(Optional) Docker: For containerized deployment, bundling Python and dependencies.

Features

Exposes MCP tools for web interaction:
- scrape_url: Get the full content of a webpage in Markdown format.
- extract_text_by_query: Find specific text snippets on a page based on a query.
- smart_extract: Use an LLM (currently Google Gemini) to extract structured information based on instructions.
Configurable via environment variables (API keys).
Includes Docker configuration (Dockerfile) for easy, self-contained deployment.
Communicates over Server-Sent Events (SSE) on port 8002 by default.
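For readers unfamiliar with SSE, the sketch below parses a Server-Sent Events stream into events, following the standard field rules (`event:`, `data:`, comment lines starting with `:`, blank line as terminator). The function name `parse_sse` and the sample stream are illustrative assumptions, not taken from this project; in practice an MCP client library handles this for you.

```python
def parse_sse(stream_text: str) -> list[dict]:
    """Split raw SSE text into a list of {"event", "data"} dicts."""
    events = []
    event_type, data_lines = "message", []  # "message" is the SSE default type
    for line in stream_text.splitlines():
        if line == "":
            # A blank line ends the current event; events without data are dropped.
            if data_lines:
                events.append({"event": event_type, "data": "\n".join(data_lines)})
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    return events


# Made-up sample stream; the endpoint path and payload are placeholders.
sample = (
    "event: endpoint\n"
    "data: /messages/?session_id=example\n"
    "\n"
    ": keep-alive comment\n"
    "event: message\n"
    'data: {"jsonrpc": "2.0", "id": 1, "result": {}}\n'
    "\n"
)
for event in parse_sse(sample):
    print(event["event"], "->", event["data"])
```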

Exposed MCP Tools

scrape_url

Scrape a webpage and return its content in Markdown format.

Arguments:

url (str, required): The URL of the webpage to scrape.

Returns:

(str): The webpage content in Markdown format, or an error message.
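As a rough illustration of what a client sends when it invokes this tool, the sketch below builds the JSON-RPC 2.0 message for an MCP `tools/call` request. The helper name `build_tool_call` is hypothetical, and an MCP client library would normally construct and send this message for you.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC ids only need to be unique within a session


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# The only argument scrape_url takes is the required `url`.
print(build_tool_call("scrape_url", {"url": "https://example.com"}))
```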

extract_text_by_query

Extract relevant text snippets from a webpage that contain a specific search query. Returns up to the first 5 matches found.

Arguments:

url (str, required): The URL of the webpage to search within.
query (str, required): The text query to search for (case-insensitive).
context_size (int, optional): The number of characters to include before and after the matched query text in each snippet. Defaults to 300.

Returns:

(str): A formatted string containing the matching text snippets, a message indicating that no matches were found, or an error message.
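The behavior described above (case-insensitive search, `context_size` characters on each side, at most 5 matches) can be sketched as follows. This is an assumption about how such a search might work, not the server's actual implementation; `find_snippets` and `max_matches` are hypothetical names.

```python
import re


def find_snippets(text: str, query: str, context_size: int = 300,
                  max_matches: int = 5) -> list[str]:
    """Return up to max_matches snippets of text surrounding each query hit."""
    snippets = []
    # re.escape makes the query match literally; IGNORECASE mirrors the tool's docs.
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start = max(0, match.start() - context_size)
        end = min(len(text), match.end() + context_size)
        snippets.append(text[start:end])
        if len(snippets) == max_matches:
            break
    return snippets


page = "Intro. Web Scraping raises questions. Later, web scraping tools appear."
for number, snippet in enumerate(find_snippets(page, "web scraping", context_size=3), 1):
    print(f"Match {number}: ...{snippet}...")
```

A smaller `context_size` keeps responses short when the page is long and the query is common.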

smart_extract

Intelligently extract specific information from a webpage using the configured LLM (currently requires a Google Gemini API key) based on a natural language instruction.

Arguments:

url (str, required): The URL of the webpage to analyze and extract from.
instruction (str, required): Natural language instruction specifying what information to extract (e.g., “List all the speakers mentioned on this page”, “Extract the main contact email address”, “Summarize the key findings”).

Returns:

(str): The extracted information (often formatted as JSON or structured text, depending on the instruction), a message indicating that no relevant information was found, or an error message (for example, when the required API key is missing).
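Because the return value may be JSON or plain text, a caller may want to handle both. The sketch below is one way to do that; `parse_extraction` is a hypothetical helper, and the plain-text message in the usage line is a made-up example rather than the server's actual wording.

```python
import json
from typing import Any


def parse_extraction(result: str) -> Any:
    """Return parsed JSON when the tool output is valid JSON, else the stripped text."""
    try:
        return json.loads(result)
    except json.JSONDecodeError:
        # Plain-text answers, "nothing found" messages, and errors end up here.
        return result.strip()


print(parse_extraction('{"main_points": ["a", "b"]}'))
print(parse_extraction("No relevant information found."))  # placeholder message
```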

Setup and Running

You can run this server either locally or using the provided Docker configuration.

Option 1: Running with Docker (Recommended for Deployment)

This method bundles Python and all necessary libraries. You only need Docker installed on the host machine.

Install Docker: Download and install Docker Desktop for your OS. Start Docker Desktop.

Clone Repository:

git clone https://github.com/your-username/your-repo-name.git # Replace with your repo URL
cd your-repo-name

Create .env File: Create a file named .env in the project root directory and add your API keys:

# Required for the smart_extract tool
GOOGLE_API_KEY=your_google_ai_api_key_here

# Optional, checked by server but not currently used by tools
# OPENAI_API_KEY=your_openai_key_here
# MISTRAL_API_KEY=your_mistral_key_here

Build the Docker Image:
```
docker build -t crawl4ai-mcp-server .
```
Run the Container: This starts the server, making port 8002 available on your host machine. It uses --env-file to securely pass the API keys from your local .env file into the container’s environment.
```
docker run -it --rm -p 8002:8002 --env-file .env crawl4ai-mcp-server
```
- -it: Runs interactively.
- --rm: Removes container on exit.
- -p 8002:8002: Maps host port 8002 to container port 8002.
- --env-file .env: Loads environment variables from your local .env file into the container. Crucial for API keys.
- crawl4ai-mcp-server: The name of the image you built.
Server is Running: Logs will appear, indicating the server is listening on SSE (http://0.0.0.0:8002).
Connecting Client: Configure your MCP client (e.g., LangChain agent) to connect to http://127.0.0.1:8002/sse with transport: "sse".
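The connection settings above can be expressed as a small configuration dictionary. This is a minimal sketch: the helper `sse_connection` and the server label `web_scraper` are hypothetical, and whether your client accepts a dictionary of exactly this shape is an assumption to check against its documentation.

```python
def sse_connection(host: str = "127.0.0.1", port: int = 8002) -> dict:
    """Build connection settings for an MCP server reachable over SSE."""
    return {"url": f"http://{host}:{port}/sse", "transport": "sse"}


# Keyed by an arbitrary label so one client can hold several servers.
connections = {"web_scraper": sse_connection()}
print(connections)
```

Keeping host and port as parameters makes it easy to point the same client at a container published on a different host port.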

Option 2: Running Locally

This requires Python and manual installation of dependencies on your host machine.

Install Python: Use Python 3.9 or later (3.10+ recommended; check the crawl4ai requirements if needed).

Clone Repository:

git clone https://github.com/your-username/your-repo-name.git # Replace with your repo URL
cd your-repo-name

Create Virtual Environment (Recommended):
```
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
```
(Or use Conda: conda create --name crawl4ai-env python=3.11 -y && conda activate crawl4ai-env)
Install Dependencies:
```
pip install -r requirements.txt
```
Create .env File: Create a file named .env in the project root directory and add your API keys (same content as the .env file in the Docker setup above).

Run the Server:

python your_server_script_name.py # e.g., python webcrawl_mcp_server.py

Server is Running: It will listen on http://127.0.0.1:8002/sse.
Connecting Client: Configure your MCP client to connect to http://127.0.0.1:8002/sse.
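Before configuring a client, it can help to confirm that something is actually listening on the port. The sketch below only checks that a TCP connection can be opened; it does not verify that the listener is this MCP server. The helper name `is_listening` is hypothetical.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8002,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


print("server reachable:", is_listening())
```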

Environment Variables

The server uses the following environment variables, typically loaded from an .env file:

GOOGLE_API_KEY: Required for the smart_extract tool to function (uses Google Gemini). Get one from Google AI Studio.
OPENAI_API_KEY: Checked for existence but not currently used by any tool in this version.
MISTRAL_API_KEY: Checked for existence but not currently used by any tool in this version.
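A quick check along these lines can catch a missing key before the first smart_extract call fails. It is a sketch using only the variable names listed above; `check_api_keys` is a hypothetical helper and not part of the server.

```python
import os

REQUIRED_KEYS = ("GOOGLE_API_KEY",)                      # needed by smart_extract
OPTIONAL_KEYS = ("OPENAI_API_KEY", "MISTRAL_API_KEY")    # checked but unused


def check_api_keys(env=os.environ) -> dict:
    """Report which required keys are missing and which optional keys are set."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    present_optional = [key for key in OPTIONAL_KEYS if env.get(key)]
    return {"missing_required": missing, "present_optional": present_optional}


report = check_api_keys()
if report["missing_required"]:
    print("Missing required keys:", ", ".join(report["missing_required"]))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.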

Example Agent Interaction

# Example using the agent CLI from the previous setup

You: scrape_url https://example.com
Agent: Thinking...
[Agent calls scrape_url tool]
Agent: [Markdown content of example.com]
------------------------------
You: extract text from https://en.wikipedia.org/wiki/Web_scraping using the query "ethical considerations"
Agent: Thinking...
[Agent calls extract_text_by_query tool]
Agent: Found X matches for 'ethical considerations' on the page. Here are the relevant sections:
Match 1:
... text snippet ...
---
Match 2:
... text snippet ...
------------------------------
You: Use smart_extract on https://blog.google/technology/ai/google-gemini-ai/ to get the main points about Gemini models
Agent: Thinking...
[Agent calls smart_extract tool with Google API Key]
Agent: Successfully extracted information based on your instruction:
{
  "main_points": [
    "Gemini is Google's most capable AI model family (Ultra, Pro, Nano).",
    "Designed to be multimodal, understanding text, code, audio, image, video.",
    "Outperforms previous models on various benchmarks.",
    "Being integrated into Google products like Bard and Pixel."
  ]
}

Files

your_server_script_name.py: The main Python script for the MCP server (e.g., webcrawl_mcp_server.py).
Dockerfile: Instructions for building the Docker container image.
requirements.txt: Python dependencies.
.env.example: (Recommended) An example environment file showing needed keys. Do not commit your actual .env file.
.gitignore: Specifies intentionally untracked files for Git (should include .env).
README.md: This file.

Contributing

(Add contribution guidelines if desired)

License

(Specify your license, e.g., MIT License)

Dev Tools Supporting MCP

The following code editors support the Model Context Protocol. Each entry lists the editor's official website.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai

Recommend MCP Servers

Tavily MCP Server The Tavily MCP server provides search, extract, map, and crawl tools: real-time web search through the tavily-search tool, intelligent data extraction from web pages via the tavily-extract tool, a web mapping tool that creates a structured map of a website, and a web crawler that systematically explores websites.

MCP Server Chart This is a TypeScript-based MCP server that provides chart generation capabilities. It allows you to create various types of charts through MCP tools. You can also use it in Dify.

GitHub MCP Server MCP Server for the GitHub API, enabling file operations, repository management, search functionality, and more.

Brave Search MCP Server Web and local search using Brave's Search API

Firecrawl MCP Server Advanced web scraping with JavaScript rendering, PDF support, and smart rate limiting

Context7 MCP LLMs rely on outdated or generic information about the libraries you use.

Slack MCP server Channel management and messaging capabilities

Sequential Thinking MCP Server Dynamic and reflective problem-solving through thought sequences

Fetch MCP Server A Model Context Protocol server that provides web content fetching capabilities.

Playwright MCP A Model Context Protocol (MCP) server that provides browser automation capabilities using [Playwright](https://playwright.dev). This server enables LLMs to interact with web pages through structured accessibility snapshots, bypassing the need for screenshots or visually-tuned models.
