Mcp4html2md

# MCP Tool

The MCP tool converts HTML to Markdown (MD).

What is Mcp4html2md

mcp4html2md is an MCP tool that converts HTML web pages into well-formatted Markdown documents, using Playwright-based web scraping and automatic main-content extraction.

Use cases

Use cases include converting blog posts from HTML to Markdown, scraping content from online articles, managing documentation, and processing web pages for offline use.

How to use

To use mcp4html2md, install it from PyPI with 'pip install htmlcmd', install the required Playwright browsers with 'playwright install', then run a command such as 'htmlcmd https://example.com' to convert a web page to Markdown. You can specify an output file with '-o' and enable plugins with '--plugins'.

Key features

Key features include smart web scraping with Playwright, intelligent content parsing, Markdown conversion, an extensible plugin system, automatic image processing, configurable templates, and a user-friendly command line interface.

Where to use

mcp4html2md can be used in various fields such as web content management, digital publishing, academic research, and anywhere that requires converting HTML content into Markdown format.

Clients Supporting MCP

The following are the main clients that support the Model Context Protocol. See each client's official website for more information.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

HTML to Markdown Converter (HTML to Markdown MCP Tool)

A powerful web content scraping and processing tool that converts web pages to well-formatted Markdown documents.

Features

- Smart Web Scraping: Uses Playwright for reliable content extraction, even from JavaScript-heavy websites
- Intelligent Content Parsing: Automatically identifies and extracts main content from web pages
- Markdown Conversion: Converts HTML content to clean, well-formatted Markdown
- Plugin System: Extensible architecture supporting custom content processing plugins
- Image Processing: Automatically downloads and manages images with local references
- Configurable: Supports custom templates and configuration options
- Command Line Interface: Easy-to-use CLI for quick content processing
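
To make the Markdown conversion step concrete, here is a minimal sketch built only on Python's standard-library html.parser. It is not the tool's converter: the names MiniMarkdownConverter and html_to_markdown are hypothetical, and only headings (h1 to h3), paragraphs, links, and bold/italic emphasis are handled.

```python
import re
from html.parser import HTMLParser


class MiniMarkdownConverter(HTMLParser):
    """Illustrative converter: h1-h3, p, a, strong/b, em/i only."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}

    def __init__(self):
        super().__init__()
        self.parts = []    # output fragments, joined at the end
        self.href = None   # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")  # block elements end with a blank line
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a" and self.href is not None:
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs (including newlines) into single spaces.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown (hypothetical helper)."""
    converter = MiniMarkdownConverter()
    converter.feed(html)
    converter.close()
    blocks = (block.strip() for block in "".join(converter.parts).split("\n\n"))
    return "\n\n".join(block for block in blocks if block)


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1>\n<p>Hello <strong>world</strong>.</p>"))
```

A real converter also has to deal with lists, tables, code blocks, and images, which is why the tool pairs conversion with a separate content-parsing step.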

Installation

Option 1: Install from PyPI (Recommended)

```
# Install from PyPI
pip install htmlcmd

# Install Playwright browsers (required)
playwright install
```

Option 2: Install from Source

```
# Clone the repository
git clone https://github.com/yourusername/mcp4html2md.git
cd mcp4html2md

# Install the package
pip install -e .

# Install Playwright browsers (required)
playwright install
```

Quick Start

Basic usage:

```
# Convert a webpage to Markdown
htmlcmd https://example.com

# Specify output file
htmlcmd https://example.com -o output.md

# Use image processing plugin
htmlcmd https://example.com --plugins image_downloader

# List available plugins
htmlcmd --list-plugins
```
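
If you want to drive the CLI from a script, the commands above can be assembled programmatically. The following is a rough sketch, not part of the package: build_htmlcmd_args and convert are hypothetical helpers, and only the '-o' and '--plugins' options shown in the Quick Start are used. How several plugins are passed at once is not documented here, so the sketch accepts a single plugin name.

```python
import subprocess
from typing import List, Optional


def build_htmlcmd_args(url: str, output: Optional[str] = None,
                       plugin: Optional[str] = None) -> List[str]:
    """Build the argument list for one htmlcmd invocation (hypothetical helper)."""
    args = ["htmlcmd", url]
    if output:
        args += ["-o", output]          # write to a specific file
    if plugin:
        args += ["--plugins", plugin]   # e.g. "image_downloader"
    return args


def convert(url: str, output: Optional[str] = None,
            plugin: Optional[str] = None) -> int:
    """Run htmlcmd and return its exit code. Requires htmlcmd on PATH."""
    return subprocess.run(build_htmlcmd_args(url, output, plugin)).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without htmlcmd.
    print(" ".join(build_htmlcmd_args("https://example.com", output="output.md")))
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as '&' or '?'.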

Configuration

MCP uses YAML configuration files. The default configuration is included in the package at src/convert/default_config.yaml. On first run, this configuration will be automatically copied to ~/.convert/config.yaml.

Default Configuration

```
fetcher:
  headless: true
  timeout: 30
  user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

parser:
  rules_path: ~/.convert/rules
  default_format: markdown
  default_rules:
    title: 'h1'
    content: 'article'
    author: '.author'
    date: '.date'
    tags: '.tags'

converter:
  template_path: ~/.convert/templates
  default_template: default.md
  image_path: images
  link_style: relative

output:
  path: ~/Documents/mcp-output
  filename_template: '{title}-{date}'
  create_date_dirs: true
  file_exists_action: increment  # increment, overwrite, or skip

plugins:
  enabled: []
  image_downloader:
    download_path: images
    skip_data_urls: true
    timeout: 30
    max_retries: 3

logging:
  console_level: INFO
  file_level: DEBUG
  log_dir: ~/.convert/logs
  max_file_size: 10MB
  backup_count: 5
```

Customizing Configuration

You can customize the configuration in two ways:

Global Configuration:
- Edit ~/.convert/config.yaml
- Changes will apply to all future conversions
```
# Open the config in a text editor (nano shown here)
nano ~/.convert/config.yaml
```
Project-specific Configuration:
- Create a convert_config.yaml in your project directory
- This will override the global configuration for this project
```
# Copy default config to current directory
cp ~/.convert/config.yaml ./convert_config.yaml
```
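
One way to picture how a project-specific file overrides the global one is a recursive dictionary merge. The sketch below is an assumption about how the layering might work, not the tool's actual loader: merge_config is a hypothetical helper, and whether the real tool merges nested keys or replaces whole sections is not stated in this document.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with values from override applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the base value outright.
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    # Values taken from the default configuration; the project file changes one key.
    global_config = {
        "fetcher": {"headless": True, "timeout": 30},
        "output": {"file_exists_action": "increment"},
    }
    project_config = {"fetcher": {"timeout": 60}}
    effective = merge_config(global_config, project_config)
    print(effective["fetcher"])  # {'headless': True, 'timeout': 60}
```

With this kind of merge, a project file only needs to list the keys it changes rather than a full copy of the global file.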

Configuration Options

fetcher: Controls web page fetching
- headless: Run browser in headless mode
- timeout: Page load timeout in seconds
- user_agent: Browser user agent string
parser: Content parsing settings
- rules_path: Directory for custom parsing rules
- default_format: Output format (markdown/html)
- default_rules: CSS selectors for content extraction
converter: Markdown conversion settings
- template_path: Directory for custom templates
- default_template: Default template file
- image_path: Local path for downloaded images
- link_style: URL style in output (relative/absolute)
output: Output file settings
- path: Default output directory
- filename_template: Template for output filenames
- create_date_dirs: Create date-based directories
- file_exists_action: Action when the output file already exists (increment, overwrite, or skip)
plugins: Plugin settings
- enabled: List of enabled plugins
- Plugin-specific configurations
logging: Logging settings
- console_level: Console output level
- file_level: File logging level
- log_dir: Log file directory
- max_file_size: Maximum log file size
- backup_count: Number of backup log files
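
As an illustration of how the output options interact, the sketch below expands filename_template and applies file_exists_action. It is a guess at the behaviour, not the tool's code: resolve_output_path is a hypothetical function, the '.md' extension is assumed, and the '-1', '-2' suffix scheme used for 'increment' is an assumption.

```python
from pathlib import Path
from typing import Optional

VALID_ACTIONS = ("increment", "overwrite", "skip")


def resolve_output_path(directory: Path, template: str, title: str, date: str,
                        file_exists_action: str = "increment") -> Optional[Path]:
    """Return the path to write to, or None when the file should be skipped."""
    if file_exists_action not in VALID_ACTIONS:
        raise ValueError(f"unknown file_exists_action: {file_exists_action!r}")

    stem = template.format(title=title, date=date)  # e.g. '{title}-{date}'
    candidate = directory / f"{stem}.md"

    if not candidate.exists() or file_exists_action == "overwrite":
        return candidate
    if file_exists_action == "skip":
        return None

    # increment: append the first unused numeric suffix (assumed naming scheme)
    counter = 1
    while (directory / f"{stem}-{counter}.md").exists():
        counter += 1
    return directory / f"{stem}-{counter}.md"
```

Returning None for 'skip' lets the caller distinguish "nothing to write" from a real path without raising an exception.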

Plugin System

MCP supports a plugin system for custom content processing. Available plugins:

Image Downloader: Downloads images to local storage and updates references
```
htmlcmd https://example.com --plugins image_downloader
```
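
The "updates references" part of an image downloader can be sketched as a rewrite of Markdown image links to local paths. This is only an illustration under stated assumptions: localize_image_links is a hypothetical function, the actual download step is omitted, and non-HTTP URLs (including data: URLs, in the spirit of skip_data_urls) are left untouched.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches Markdown images of the form ![alt](url)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def localize_image_links(markdown: str, image_dir: str = "images") -> str:
    """Point http(s) image links at files inside image_dir (hypothetical helper)."""

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return match.group(0)  # data: URLs and relative paths stay as they are
        filename = PurePosixPath(parsed.path).name
        if not filename:
            return match.group(0)  # nothing usable as a local file name
        return f"![{alt}]({image_dir}/{filename})"

    return IMAGE_PATTERN.sub(replace, markdown)


if __name__ == "__main__":
    print(localize_image_links("![logo](https://example.com/static/logo.png)"))
```

A complete plugin would also download each file, handle name collisions, and retry failed requests, which is what the timeout and max_retries settings suggest.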

Creating Custom Plugins

1. Create a new Python file in the plugins directory
2. Inherit from the Plugin base class
3. Implement the process_content method

Example:

```
from mcp.plugin import Plugin


class CustomPlugin(Plugin):
    """Example plugin that returns the content unchanged."""

    def __init__(self, name: str, description: str):
        super().__init__(name, description)

    def process_content(self, content: dict) -> dict:
        # Modify and return the content dictionary here
        return content
```
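
To show how several plugins might be applied in sequence, here is a self-contained sketch. Because only part of the real Plugin base class is shown above, the sketch defines a stand-in with the same constructor and process_content signature. The names StripTrailingWhitespacePlugin, WordCountPlugin, and run_chain, as well as the 'markdown' key of the content dictionary, are assumptions made for illustration.

```python
from typing import Iterable


class Plugin:
    """Stand-in for the real base class (assumed interface)."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    def process_content(self, content: dict) -> dict:
        return content


class StripTrailingWhitespacePlugin(Plugin):
    def __init__(self):
        super().__init__("strip_whitespace", "Remove trailing whitespace from each line")

    def process_content(self, content: dict) -> dict:
        lines = content.get("markdown", "").splitlines()
        return {**content, "markdown": "\n".join(line.rstrip() for line in lines)}


class WordCountPlugin(Plugin):
    def __init__(self):
        super().__init__("word_count", "Record the number of words in the document")

    def process_content(self, content: dict) -> dict:
        return {**content, "word_count": len(content.get("markdown", "").split())}


def run_chain(plugins: Iterable[Plugin], content: dict) -> dict:
    """Feed the output of each plugin into the next one, in order."""
    for plugin in plugins:
        content = plugin.process_content(content)
    return content


if __name__ == "__main__":
    chain = [StripTrailingWhitespacePlugin(), WordCountPlugin()]
    print(run_chain(chain, {"markdown": "# Title  \nHello world \n"}))
```

Each plugin returns a new dictionary rather than mutating its input, which keeps the order of the chain the only thing that affects the result.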

Logging

MCP includes a comprehensive logging system:

- Console output: INFO level and above
- File logging: DEBUG level and above
- Log file location: ~/.convert/logs/
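
A comparable setup can be reproduced with Python's standard logging module. The sketch below mirrors the defaults from the configuration (INFO on the console, DEBUG in a rotating file, 10 MB per file, 5 backups). It is not the project's logger: build_logger, the logger name 'convert', and the file name convert.log are assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(log_dir: Path, name: str = "convert") -> logging.Logger:
    """Create (or reuse) a logger with a console handler and a rotating file handler."""
    logger = logging.getLogger(name)
    if logger.handlers:
        return logger  # already configured, avoid duplicate handlers

    log_dir.mkdir(parents=True, exist_ok=True)
    logger.setLevel(logging.DEBUG)  # let the handlers decide what to keep
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)  # console_level: INFO
    console.setFormatter(formatter)

    file_handler = RotatingFileHandler(
        log_dir / "convert.log",        # file name is an assumption
        maxBytes=10 * 1024 * 1024,      # max_file_size: 10MB
        backupCount=5,                  # backup_count: 5
        encoding="utf-8",
    )
    file_handler.setLevel(logging.DEBUG)  # file_level: DEBUG
    file_handler.setFormatter(formatter)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger


if __name__ == "__main__":
    log = build_logger(Path.home() / ".convert" / "logs")
    log.info("shown on the console and written to the file")
    log.debug("written to the file only")
```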

Development

```
# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run specific test file
pytest tests/test_logger.py
```

Output:

```
$ pytest -v
============================= test session starts ==============================
platform darwin -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0 -- /Users/cgw/miniconda3/envs/media_env/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/cgw/workflow-script/mcp4html2md
configfile: pytest.ini
plugins: anyio-4.9.0, asyncio-0.26.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 51 items

tests/test_cli.py::test_cli_initialization PASSED                        [  1%]
tests/test_cli.py::test_create_parser PASSED                             [  3%]
tests/test_cli.py::test_process_url PASSED                               [  5%]
tests/test_cli.py::test_convert_to_markdown PASSED                       [  7%]
tests/test_cli.py::test_get_output_path PASSED                           [  9%]
tests/test_cli.py::test_run PASSED                                       [ 11%]
tests/test_cli.py::test_run_with_output_file PASSED                      [ 13%]
tests/test_cli.py::test_run_with_plugins PASSED                          [ 15%]
tests/test_cli.py::test_run_list_plugins PASSED                          [ 17%]
tests/test_cli.py::test_save_markdown PASSED                             [ 19%]
tests/test_cli.py::test_list_available_plugins PASSED                    [ 21%]
tests/test_cli.py::test_run_with_stdout PASSED                           [ 23%]
tests/test_config.py::test_config_initialization PASSED                  [ 25%]
tests/test_config.py::test_config_get_value PASSED                       [ 27%]
tests/test_config.py::test_config_set_value PASSED                       [ 29%]
tests/test_config.py::test_config_save_and_load PASSED                   [ 31%]
tests/test_config.py::test_default_config_creation PASSED                [ 33%]
tests/test_content_parser.py::test_content_parser_initialization PASSED  [ 35%]
tests/test_content_parser.py::test_parse_github_content PASSED           [ 37%]
tests/test_content_parser.py::test_parse_zhihu_content PASSED            [ 39%]
tests/test_content_parser.py::test_xpath_to_css_conversion PASSED        [ 41%]
tests/test_image_downloader.py::test_image_downloader_initialization PASSED [ 43%]
tests/test_image_downloader.py::test_extract_image_urls PASSED           [ 45%]
tests/test_image_downloader.py::test_extract_markdown_image_urls PASSED  [ 47%]
tests/test_image_downloader.py::test_normalize_urls PASSED               [ 49%]
tests/test_image_downloader.py::test_get_extension PASSED                [ 50%]
tests/test_image_downloader.py::test_replace_image_urls PASSED           [ 52%]
tests/test_image_downloader.py::test_replace_markdown_image_urls PASSED  [ 54%]
tests/test_image_downloader.py::test_download_images PASSED              [ 56%]
tests/test_image_downloader.py::test_process_content PASSED              [ 58%]
tests/test_image_processor.py::test_image_processor_initialization PASSED [ 60%]
tests/test_image_processor.py::test_process_html_images PASSED           [ 62%]
tests/test_image_processor.py::test_process_markdown_images PASSED       [ 64%]
tests/test_image_processor.py::test_process_mixed_content PASSED         [ 66%]
tests/test_image_processor.py::test_handle_empty_content PASSED          [ 68%]
tests/test_image_processor.py::test_handle_invalid_content PASSED        [ 70%]
tests/test_logger.py::test_logger_initialization PASSED                  [ 72%]
tests/test_logger.py::test_logger_with_custom_file PASSED                [ 74%]
tests/test_logger.py::test_logger_reuse PASSED                           [ 76%]
tests/test_logger.py::test_logger_formatting PASSED                      [ 78%]
tests/test_markdown_converter.py::test_markdown_converter_initialization PASSED [ 80%]
tests/test_markdown_converter.py::test_convert_basic_data PASSED         [ 82%]
tests/test_markdown_converter.py::test_convert_with_metadata PASSED      [ 84%]
tests/test_markdown_converter.py::test_format_content_blocks PASSED      [ 86%]
tests/test_markdown_converter.py::test_extract_domain PASSED             [ 88%]
tests/test_plugin.py::test_plugin_manager_initialization PASSED          [ 90%]
tests/test_plugin.py::test_plugin_loading PASSED                         [ 92%]
tests/test_plugin.py::test_plugin_list PASSED                            [ 94%]
tests/test_plugin.py::test_plugin_processing PASSED                      [ 96%]
tests/test_plugin.py::test_plugin_chain_processing PASSED                [ 98%]
tests/test_plugin.py::test_invalid_plugin PASSED                         [100%]

=============================== warnings summary ===============================
tests/test_plugins/test_plugin.py:3
  /Users/cgw/workflow-script/mcp4html2md/tests/test_plugins/test_plugin.py:3: PytestCollectionWarning: cannot collect test class 'TestPlugin' because it has a __init__ constructor (from: tests/test_plugins/test_plugin.py)
    class TestPlugin(Plugin):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 51 passed, 1 warning in 0.57s =========================
```

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Dev Tools Supporting MCP

The following are the main code editors that support the Model Context Protocol. See each tool's official website for more information.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai
