Mcp4html2md

By @guowei1003, 10 months ago
An MCP tool that converts HTML to Markdown (MD).

Overview

What is Mcp4html2md

mcp4html2md is an MCP tool that converts HTML web pages into well-formatted Markdown documents. It fetches pages with Playwright, extracts the main content, and converts that content to Markdown.

Use cases

Use cases include converting blog posts from HTML to Markdown, scraping content from online articles, managing documentation, and processing web pages for offline use.

How to use

To use mcp4html2md, install it from PyPI with ‘pip install htmlcmd’ and install the Playwright browsers with ‘playwright install’. Then run a command such as ‘htmlcmd https://example.com’ to convert a web page to Markdown. The -o option sets the output file, and the --plugins option enables plugins for additional processing.

Key features

Key features include smart web scraping with Playwright, intelligent content parsing, Markdown conversion, an extensible plugin system, automatic image processing, configurable templates, and a user-friendly command line interface.

Where to use

mcp4html2md can be used in fields such as web content management, digital publishing, and academic research, and in any other workflow that requires converting HTML content into Markdown.

Content

HTML to Markdown Converter (MCP Tool)

A powerful web content scraping and processing tool that converts web pages to well-formatted Markdown documents.

Features

  • Smart Web Scraping: Uses Playwright for reliable content extraction, even from JavaScript-heavy websites
  • Intelligent Content Parsing: Automatically identifies and extracts main content from web pages
  • Markdown Conversion: Converts HTML content to clean, well-formatted Markdown
  • Plugin System: Extensible architecture supporting custom content processing plugins
  • Image Processing: Automatically downloads and manages images with local references
  • Configurable: Supports custom templates and configuration options
  • Command Line Interface: Easy to use CLI for quick content processing

Installation

Option 1: Install from PyPI (Recommended)

# Install from PyPI
pip install htmlcmd

# Install Playwright browsers (required)
playwright install

Option 2: Install from Source

# Clone the repository
git clone https://github.com/yourusername/mcp4html2md.git
cd mcp4html2md

# Install the package
pip install -e .

# Install Playwright browsers (required)
playwright install

Quick Start

Basic usage:

# Convert a webpage to Markdown
htmlcmd https://example.com

# Specify output file
htmlcmd https://example.com -o output.md

# Use image processing plugin
htmlcmd https://example.com --plugins image_downloader

# List available plugins
htmlcmd --list-plugins

Configuration

MCP uses YAML configuration files. The default configuration is included in the package at src/convert/default_config.yaml. On first run, this configuration will be automatically copied to ~/.convert/config.yaml.
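The first-run behaviour described above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the project's actual code: the function name `ensure_user_config` is hypothetical, and the only behaviour taken from the text is "copy the packaged default to the user path if no user config exists yet".

```python
import shutil
from pathlib import Path


def ensure_user_config(default_config: Path, user_config: Path) -> bool:
    """Copy default_config to user_config unless user_config already exists.

    Hypothetical helper. Returns True when a copy was made and False when an
    existing user file was left untouched.
    """
    if user_config.exists():
        # Never overwrite a config the user may already have edited.
        return False
    # Create ~/.convert (or whichever parent directory is used) on demand.
    user_config.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(default_config, user_config)
    return True
```

With the paths named above, a call would look like `ensure_user_config(Path("src/convert/default_config.yaml"), Path.home() / ".convert" / "config.yaml")`.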

Default Configuration

fetcher:
  headless: true
  timeout: 30
  user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

parser:
  rules_path: ~/.convert/rules
  default_format: markdown
  default_rules:
    title: 'h1'
    content: 'article'
    author: '.author'
    date: '.date'
    tags: '.tags'

converter:
  template_path: ~/.convert/templates
  default_template: default.md
  image_path: images
  link_style: relative

output:
  path: ~/Documents/mcp-output
  filename_template: '{title}-{date}'
  create_date_dirs: true
  file_exists_action: increment  # increment, overwrite, or skip

plugins:
  enabled: []
  image_downloader:
    download_path: images
    skip_data_urls: true
    timeout: 30
    max_retries: 3

logging:
  console_level: INFO
  file_level: DEBUG
  log_dir: ~/.convert/logs
  max_file_size: 10MB
  backup_count: 5

Customizing Configuration

You can customize the configuration in two ways:

  1. Global Configuration:

    • Edit ~/.convert/config.yaml
    • Changes will apply to all future conversions
    # Open the global config in an editor (nano shown here)
    nano ~/.convert/config.yaml
    
  2. Project-specific Configuration:

    • Create a convert_config.yaml in your project directory
    • This will override the global configuration for this project
    # Copy the global config to the current directory as a starting point
    cp ~/.convert/config.yaml ./convert_config.yaml
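
One way to picture "project configuration overrides global configuration" is a recursive dictionary merge, sketched below. Whether the tool merges key by key or replaces the global file wholesale is not stated here, so treat the key-by-key behaviour and the name `merge_config` as assumptions. The dictionaries stand in for parsed YAML.

```python
from copy import deepcopy
from typing import Any


def merge_config(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from override win over base.

    Hypothetical helper: nested sections are merged key by key, so a project
    file only needs to list the settings it changes.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for the parsed global and project-specific YAML files.
global_cfg = {
    "fetcher": {"headless": True, "timeout": 30},
    "output": {"file_exists_action": "increment"},
}
project_cfg = {"fetcher": {"timeout": 60}}

effective = merge_config(global_cfg, project_cfg)
# effective["fetcher"] keeps headless from the global file and takes timeout
# from the project file; the global dict itself is not modified.
```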
    

Configuration Options

  • fetcher: Controls web page fetching

    • headless: Run browser in headless mode
    • timeout: Page load timeout in seconds
    • user_agent: Browser user agent string
  • parser: Content parsing settings

    • rules_path: Directory for custom parsing rules
    • default_format: Output format (markdown/html)
    • default_rules: CSS selectors for content extraction
  • converter: Markdown conversion settings

    • template_path: Directory for custom templates
    • default_template: Default template file
    • image_path: Local path for downloaded images
    • link_style: URL style in output (relative/absolute)
  • output: Output file settings

    • path: Default output directory
    • filename_template: Template for output filenames
    • create_date_dirs: Create date-based directories
    • file_exists_action: Action when file exists
  • plugins: Plugin settings

    • enabled: List of enabled plugins
    • Plugin-specific configurations
  • logging: Logging settings

    • console_level: Console output level
    • file_level: File logging level
    • log_dir: Log file directory
    • max_file_size: Maximum log file size
    • backup_count: Number of backup log files
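
The three file_exists_action values (increment, overwrite, skip) could be handled along the following lines. This is a sketch only: the function name `resolve_output_path`, the `.md` extension handling, and the `-1`, `-2` suffix scheme for increment are assumptions, since the naming the tool actually uses is not documented here.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(directory: Path, stem: str, action: str = "increment") -> Optional[Path]:
    """Choose where to write stem.md, or return None when the write is skipped.

    Hypothetical helper illustrating increment / overwrite / skip.
    """
    if action not in {"increment", "overwrite", "skip"}:
        raise ValueError(f"unknown file_exists_action: {action!r}")

    candidate = directory / f"{stem}.md"
    if not candidate.exists() or action == "overwrite":
        return candidate
    if action == "skip":
        return None

    # increment: try stem-1.md, stem-2.md, ... until a free name is found.
    counter = 1
    while True:
        candidate = directory / f"{stem}-{counter}.md"
        if not candidate.exists():
            return candidate
        counter += 1
```

The stem itself would come from filename_template, for example `"{title}-{date}".format(title="Example", date="2025-01-01")`.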

Plugin System

MCP supports a plugin system for custom content processing. Available plugins:

  • Image Downloader: Downloads images to local storage and updates references
    htmlcmd https://example.com --plugins image_downloader
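
The "updates references" half of the Image Downloader can be illustrated with a small sketch that rewrites remote Markdown image links to local paths. It does not download anything, and it is not the plugin's real implementation: `localize_image_links` and its regular expression are hypothetical, while `download_path` and the skipping of data URLs mirror the `download_path` and `skip_data_urls` settings shown in the default configuration.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches Markdown images of the form ![alt](url) without a title part.
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def localize_image_links(markdown: str, download_path: str = "images") -> str:
    """Point remote image links at download_path/<filename>.

    Hypothetical helper: data URLs and links that are already relative are
    left unchanged.
    """

    def _replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        parsed = urlparse(url)
        if url.startswith("data:") or not parsed.scheme:
            return match.group(0)
        filename = PurePosixPath(parsed.path).name or "image"
        return f"![{alt}]({download_path}/{filename})"

    return IMAGE_PATTERN.sub(_replace, markdown)


text = "![logo](https://example.com/static/logo.png) and ![local](pics/a.png)"
rewritten = localize_image_links(text)
# rewritten == "![logo](images/logo.png) and ![local](pics/a.png)"
```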
    

Creating Custom Plugins

  1. Create a new Python file in the plugins directory
  2. Inherit from the Plugin base class
  3. Implement the process_content method

Example:

from mcp.plugin import Plugin

class CustomPlugin(Plugin):
    def __init__(self, name: str, description: str):
        super().__init__(name, description)
    
    def process_content(self, content: dict) -> dict:
        # Process content here
        return content
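
To show how several plugins might be applied in sequence, here is a self-contained sketch. The `Plugin` class below is a stand-in with the same constructor and `process_content` signature as the example above, not the project's real base class. The two plugins, the `run_plugins` helper, and the `title` and `content` keys of the dict are all assumptions made for illustration.

```python
class Plugin:
    """Stand-in for the project's Plugin base class (assumed shape)."""

    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description

    def process_content(self, content: dict) -> dict:
        raise NotImplementedError


class StripTitlePlugin(Plugin):
    """Hypothetical plugin: trims whitespace around the title."""

    def process_content(self, content: dict) -> dict:
        return {**content, "title": content.get("title", "").strip()}


class WordCountPlugin(Plugin):
    """Hypothetical plugin: adds a word_count field."""

    def process_content(self, content: dict) -> dict:
        return {**content, "word_count": len(content.get("content", "").split())}


def run_plugins(plugins: list[Plugin], content: dict) -> dict:
    """Feed the output of each plugin into the next one, in order."""
    for plugin in plugins:
        content = plugin.process_content(content)
    return content


result = run_plugins(
    [
        StripTitlePlugin("strip_title", "Trim the title"),
        WordCountPlugin("word_count", "Count words"),
    ],
    {"title": "  Hello ", "content": "one two three"},
)
# result == {"title": "Hello", "content": "one two three", "word_count": 3}
```

Returning a new dict from each plugin, rather than mutating the input, keeps the steps of the chain independent of one another.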

Logging

MCP logs to both the console and log files:

  • Console output: INFO level and above
  • File logging: DEBUG level and above
  • Log files location: ~/.convert/logs/
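
A setup with these levels can be reproduced with Python's standard logging module, as sketched below. This is not the project's logger: `build_logger` is a hypothetical name, and reading max_file_size: 10MB as 10 * 1024 * 1024 bytes is an assumption. The levels and the backup count of 5 follow the default configuration.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(name: str, log_dir: Path) -> logging.Logger:
    """Create a logger with an INFO console handler and a DEBUG file handler.

    Hypothetical helper. Calling it again with the same name returns the same
    logger without adding duplicate handlers.
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # Let the handlers do the filtering.
    if logger.handlers:
        return logger

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = RotatingFileHandler(
        log_dir / f"{name}.log",
        maxBytes=10 * 1024 * 1024,  # Assumed meaning of "10MB".
        backupCount=5,
        encoding="utf-8",
    )
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

For the location listed above, the call would be `build_logger("convert", Path.home() / ".convert" / "logs")`.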

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run specific test file
pytest tests/test_logger.py

Output:

$ pytest -v
========================================================================== test session starts ===========================================================================
platform darwin -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0 -- /Users/cgw/miniconda3/envs/media_env/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/cgw/workflow-script/mcp4html2md
configfile: pytest.ini
plugins: anyio-4.9.0, asyncio-0.26.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 51 items                                                                                                                                                       

tests/test_cli.py::test_cli_initialization PASSED                                                                                                                  [  1%]
tests/test_cli.py::test_create_parser PASSED                                                                                                                       [  3%]
tests/test_cli.py::test_process_url PASSED                                                                                                                         [  5%]
tests/test_cli.py::test_convert_to_markdown PASSED                                                                                                                 [  7%]
tests/test_cli.py::test_get_output_path PASSED                                                                                                                     [  9%]
tests/test_cli.py::test_run PASSED                                                                                                                                 [ 11%]
tests/test_cli.py::test_run_with_output_file PASSED                                                                                                                [ 13%]
tests/test_cli.py::test_run_with_plugins PASSED                                                                                                                    [ 15%]
tests/test_cli.py::test_run_list_plugins PASSED                                                                                                                    [ 17%]
tests/test_cli.py::test_save_markdown PASSED                                                                                                                       [ 19%]
tests/test_cli.py::test_list_available_plugins PASSED                                                                                                              [ 21%]
tests/test_cli.py::test_run_with_stdout PASSED                                                                                                                     [ 23%]
tests/test_config.py::test_config_initialization PASSED                                                                                                            [ 25%]
tests/test_config.py::test_config_get_value PASSED                                                                                                                 [ 27%]
tests/test_config.py::test_config_set_value PASSED                                                                                                                 [ 29%]
tests/test_config.py::test_config_save_and_load PASSED                                                                                                             [ 31%]
tests/test_config.py::test_default_config_creation PASSED                                                                                                          [ 33%]
tests/test_content_parser.py::test_content_parser_initialization PASSED                                                                                            [ 35%]
tests/test_content_parser.py::test_parse_github_content PASSED                                                                                                     [ 37%]
tests/test_content_parser.py::test_parse_zhihu_content PASSED                                                                                                      [ 39%]
tests/test_content_parser.py::test_xpath_to_css_conversion PASSED                                                                                                  [ 41%]
tests/test_image_downloader.py::test_image_downloader_initialization PASSED                                                                                        [ 43%]
tests/test_image_downloader.py::test_extract_image_urls PASSED                                                                                                     [ 45%]
tests/test_image_downloader.py::test_extract_markdown_image_urls PASSED                                                                                            [ 47%]
tests/test_image_downloader.py::test_normalize_urls PASSED                                                                                                         [ 49%]
tests/test_image_downloader.py::test_get_extension PASSED                                                                                                          [ 50%]
tests/test_image_downloader.py::test_replace_image_urls PASSED                                                                                                     [ 52%]
tests/test_image_downloader.py::test_replace_markdown_image_urls PASSED                                                                                            [ 54%]
tests/test_image_downloader.py::test_download_images PASSED                                                                                                        [ 56%]
tests/test_image_downloader.py::test_process_content PASSED                                                                                                        [ 58%]
tests/test_image_processor.py::test_image_processor_initialization PASSED                                                                                          [ 60%]
tests/test_image_processor.py::test_process_html_images PASSED                                                                                                     [ 62%]
tests/test_image_processor.py::test_process_markdown_images PASSED                                                                                                 [ 64%]
tests/test_image_processor.py::test_process_mixed_content PASSED                                                                                                   [ 66%]
tests/test_image_processor.py::test_handle_empty_content PASSED                                                                                                    [ 68%]
tests/test_image_processor.py::test_handle_invalid_content PASSED                                                                                                  [ 70%]
tests/test_logger.py::test_logger_initialization PASSED                                                                                                            [ 72%]
tests/test_logger.py::test_logger_with_custom_file PASSED                                                                                                          [ 74%]
tests/test_logger.py::test_logger_reuse PASSED                                                                                                                     [ 76%]
tests/test_logger.py::test_logger_formatting PASSED                                                                                                                [ 78%]
tests/test_markdown_converter.py::test_markdown_converter_initialization PASSED                                                                                    [ 80%]
tests/test_markdown_converter.py::test_convert_basic_data PASSED                                                                                                   [ 82%]
tests/test_markdown_converter.py::test_convert_with_metadata PASSED                                                                                                [ 84%]
tests/test_markdown_converter.py::test_format_content_blocks PASSED                                                                                                [ 86%]
tests/test_markdown_converter.py::test_extract_domain PASSED                                                                                                       [ 88%]
tests/test_plugin.py::test_plugin_manager_initialization PASSED                                                                                                    [ 90%]
tests/test_plugin.py::test_plugin_loading PASSED                                                                                                                   [ 92%]
tests/test_plugin.py::test_plugin_list PASSED                                                                                                                      [ 94%]
tests/test_plugin.py::test_plugin_processing PASSED                                                                                                                [ 96%]
tests/test_plugin.py::test_plugin_chain_processing PASSED                                                                                                          [ 98%]
tests/test_plugin.py::test_invalid_plugin PASSED                                                                                                                   [100%]

============================================================================ warnings summary ============================================================================
tests/test_plugins/test_plugin.py:3
  /Users/cgw/workflow-script/mcp4html2md/tests/test_plugins/test_plugin.py:3: PytestCollectionWarning: cannot collect test class 'TestPlugin' because it has a __init__ constructor (from: tests/test_plugins/test_plugin.py)
    class TestPlugin(Plugin):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 51 passed, 1 warning in 0.57s ======================================================================

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
