Dataset Viewer

@privetin, 12 days ago
18 MIT
Free, Community
Analytics
#Hugging Face #datasets #data analysis
Browse and analyze Hugging Face datasets with features like search, filtering, statistics, and data export

Overview

What is Dataset Viewer

Dataset Viewer is an MCP server that wraps the Hugging Face Dataset Viewer API, letting users browse and analyze datasets hosted on the Hugging Face Hub.

Use cases

Users can validate the existence of datasets, retrieve detailed information, access paginated dataset contents, perform searches, filter data using SQL-like conditions, and download datasets in various formats, including Parquet.

How to use

To use the server, clone the repository, set up a virtual environment using ‘uv’, activate it, and install the package in development mode. Configuration involves setting environment variables for authentication and integrating the MCP server into Claude Desktop.

Key features

Key features include support for private datasets, pagination, dataset exploration with configurations and splits, searching and filtering capabilities, as well as obtaining statistics and downloading datasets in different formats.

Where to use

This server can be used for any application or research scenario that requires dataset access and analysis, especially in machine learning and data science, where Hugging Face datasets are commonly utilized.

Content

Dataset Viewer MCP Server

An MCP server for interacting with the Hugging Face Dataset Viewer API, providing capabilities to browse and analyze datasets hosted on the Hugging Face Hub.

Features

Resources

  • Uses dataset:// URI scheme for accessing Hugging Face datasets
  • Supports dataset configurations and splits
  • Provides paginated access to dataset contents
  • Handles authentication for private datasets
  • Supports searching and filtering dataset contents
  • Provides dataset statistics and analysis

Tools

The server provides the following tools:

  1. validate

    • Check if a dataset exists and is accessible
    • Parameters:
      • dataset: Dataset identifier (e.g. ‘stanfordnlp/imdb’)
      • auth_token (optional): For private datasets
  2. get_info

    • Get detailed information about a dataset
    • Parameters:
      • dataset: Dataset identifier
      • auth_token (optional): For private datasets
  3. get_rows

    • Get paginated contents of a dataset
    • Parameters:
      • dataset: Dataset identifier
      • config: Configuration name
      • split: Split name
      • page (optional): Page number (0-based)
      • auth_token (optional): For private datasets
  4. get_first_rows

    • Get first rows from a dataset split
    • Parameters:
      • dataset: Dataset identifier
      • config: Configuration name
      • split: Split name
      • auth_token (optional): For private datasets
  5. get_statistics

    • Get statistics about a dataset split
    • Parameters:
      • dataset: Dataset identifier
      • config: Configuration name
      • split: Split name
      • auth_token (optional): For private datasets
  6. search_dataset

    • Search for text within a dataset
    • Parameters:
      • dataset: Dataset identifier
      • config: Configuration name
      • split: Split name
      • query: Text to search for
      • auth_token (optional): For private datasets
  7. filter

    • Filter rows using SQL-like conditions
    • Parameters:
      • dataset: Dataset identifier
      • config: Configuration name
      • split: Split name
      • where: SQL WHERE clause (e.g. “score > 0.5”)
      • orderby (optional): SQL ORDER BY clause
      • page (optional): Page number (0-based)
      • auth_token (optional): For private datasets
  8. get_parquet

    • Download entire dataset in Parquet format
    • Parameters:
      • dataset: Dataset identifier
      • auth_token (optional): For private datasets
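
As a rough illustration of how these tools could map onto the public Hugging Face Dataset Viewer API, the sketch below builds request URLs with the standard library only. The base URL and endpoint paths are those of the public Dataset Viewer API. The helper name `build_url`, the `ENDPOINTS` table, and the page size of 100 rows are assumptions made for this example and are not taken from this server's source code.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"

# Assumed mapping from the tool names above to Dataset Viewer API endpoints.
ENDPOINTS = {
    "validate": "/is-valid",
    "get_info": "/info",
    "get_rows": "/rows",
    "get_first_rows": "/first-rows",
    "get_statistics": "/statistics",
    "search_dataset": "/search",
    "filter": "/filter",
    "get_parquet": "/parquet",
}

PAGE_SIZE = 100  # assumption: rows per page used to turn a page number into an offset


def build_url(tool, page=None, **params):
    """Build a request URL for a tool; parameters set to None are dropped."""
    query = {key: value for key, value in params.items() if value is not None}
    if page is not None:
        # Translate the 0-based page number into offset/length query parameters.
        query["offset"] = page * PAGE_SIZE
        query["length"] = PAGE_SIZE
    return f"{BASE_URL}{ENDPOINTS[tool]}?{urlencode(query)}"


if __name__ == "__main__":
    print(build_url("validate", dataset="stanfordnlp/imdb"))
    # "plain_text" and "train" are placeholder config and split names.
    print(build_url("get_rows", page=2, dataset="stanfordnlp/imdb",
                    config="plain_text", split="train"))
```

The optional `auth_token` is left out of the URL on purpose: a token would normally travel in a request header rather than in the query string.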

Installation

Prerequisites

  • Python 3.12 or higher
  • uv - Fast Python package installer and resolver

Setup

  1. Clone the repository:
git clone https://github.com/privetin/dataset-viewer.git
cd dataset-viewer
  2. Create a virtual environment and install:
# Create virtual environment
uv venv

# Activate virtual environment
# On Unix:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install in development mode
uv pip install -e .

Configuration

Environment Variables

  • HUGGINGFACE_TOKEN: Your Hugging Face API token for accessing private datasets
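
A minimal sketch of how a token might be resolved and turned into a request header. It assumes that an explicit `auth_token` argument takes precedence over the `HUGGINGFACE_TOKEN` environment variable; that precedence, and the helper name `auth_headers`, are assumptions for illustration. The `Authorization: Bearer` header is the standard way to pass a Hugging Face token over HTTP.

```python
import os


def auth_headers(auth_token=None):
    """Return HTTP headers for an authenticated request, or {} if no token is set."""
    # Assumption: a per-call token overrides the environment variable.
    token = auth_token or os.environ.get("HUGGINGFACE_TOKEN")
    if not token:
        return {}
    return {"Authorization": f"Bearer {token}"}
```

With no token available the function returns an empty dictionary, which is sufficient for public datasets.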

Claude Desktop Integration

Add the following to your Claude Desktop config file:

On Windows: %APPDATA%\Claude\claude_desktop_config.json

On macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "dataset-viewer": {
      "command": "uv",
      "args": [
        "--directory",
        "parent_to_repo/dataset-viewer",
        "run",
        "dataset-viewer"
      ]
    }
  }
}
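
If the config file already lists other servers, the entry has to be merged in rather than pasted over the whole file. The sketch below shows one way to do that on an in-memory dictionary. The helper name `add_server` is hypothetical, and passing the token through an `env` block is an assumption about how you may want to supply `HUGGINGFACE_TOKEN`, not something this README prescribes.

```python
import json
from typing import Optional


def add_server(config: dict, repo_dir: str, token: Optional[str] = None) -> dict:
    """Add or overwrite the dataset-viewer entry, keeping other servers intact."""
    entry = {
        "command": "uv",
        "args": ["--directory", repo_dir, "run", "dataset-viewer"],
    }
    if token:
        # Assumption: the token is handed to the server process via its environment.
        entry["env"] = {"HUGGINGFACE_TOKEN": token}
    config.setdefault("mcpServers", {})["dataset-viewer"] = entry
    return config


if __name__ == "__main__":
    # "other-server" stands in for whatever is already configured.
    existing = {"mcpServers": {"other-server": {"command": "node", "args": []}}}
    print(json.dumps(add_server(existing, "parent_to_repo/dataset-viewer"), indent=2))
```

To apply it to the real file, load the JSON from the path for your platform, pass the resulting dictionary through `add_server`, and write it back.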

License

MIT License - see LICENSE for details

Tools

  • get_info: Get detailed information about a Hugging Face dataset including description, features, splits, and statistics. Run validate first to check if the dataset exists and is accessible.
  • get_rows: Get paginated rows from a Hugging Face dataset
  • get_first_rows: Get first rows from a Hugging Face dataset split
  • search_dataset: Search for text within a Hugging Face dataset
  • filter: Filter rows in a Hugging Face dataset using SQL-like conditions
  • get_statistics: Get statistics about a Hugging Face dataset
  • get_parquet: Export Hugging Face dataset split as Parquet file
  • validate: Check if a Hugging Face dataset exists and is accessible
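
MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests. The sketch below shows roughly what such a request could look like for the `filter` tool. The helper name `tool_call_request` is hypothetical, and the config, split, and column names in the arguments are placeholders rather than values confirmed for any particular dataset.

```python
import json


def tool_call_request(request_id, name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    request = tool_call_request(
        1,
        "filter",
        {
            "dataset": "stanfordnlp/imdb",
            "config": "plain_text",   # placeholder configuration name
            "split": "train",         # placeholder split name
            "where": "score > 0.5",   # placeholder column and condition
            "page": 0,
        },
    )
    print(json.dumps(request, indent=2))
```

In practice an MCP client such as Claude Desktop builds and sends these requests itself; the payload is shown only to make the parameter lists concrete.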
