Scientific Papers MCP

MCP Server for real-time access to scientific papers from arXiv and OpenAlex.

What is Scientific Papers MCP?

Scientific-Papers-MCP is a Model Context Protocol (MCP) server that provides large language models (LLMs) with real-time access to scientific papers from arXiv and OpenAlex.

Use cases

Typical uses include retrieving recent AI research papers, analyzing citation trends, and extracting the full text of specific papers for downstream processing.

How to use

To use Scientific-Papers-MCP, install the server via npm and add it to your MCP client configuration. The CLI can then fetch papers, list categories, and run citation analysis.

Key features

Key features include paper fetching by category, full text extraction from HTML sources, citation analysis, paper metadata lookup, category listing, per-source rate limiting for API usage, a dual MCP and CLI interface, and full type safety with TypeScript.

Where to use

Scientific-Papers-MCP can be used in academic research, data analysis, and any field requiring access to the latest scientific literature.

Clients Supporting MCP

The following are the main clients that support the Model Context Protocol. Visit each project's official website for more information.

Claude Desktop: Official desktop application from Anthropic, natively supports MCP protocol. claude.ai

Cherry Studio: Cross-platform desktop client supporting multiple LLM providers, built-in MCP server support. cherry-ai.com

LobeChat: Modern open-source ChatGPT/LLMs UI, supports MCP protocol integration. lobehub.com

DeepChat: Cross-platform desktop AI assistant, compatible with MCP protocol, focusing on privacy and efficiency. deepchat.thinkinai.xyz

5ire: Cross-platform open-source desktop intelligent assistant MCP client, supports local knowledge base and MCP server. 5ire.app

Content

Scientific Paper Harvester MCP Server

A Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from arXiv and OpenAlex.

Features

Paper Fetching: Get latest papers from arXiv and OpenAlex by category/concept
Text Extraction: Full text content extraction from HTML sources (arXiv and OpenAlex)
Citation Analysis: Find top cited papers from OpenAlex since a specific date
Paper Lookup: Retrieve full metadata for specific papers by ID
Category Listing: Browse available categories from arXiv and OpenAlex
Rate Limiting: Respectful API usage with per-source rate limiting (5 req/min arXiv, 10 req/min OpenAlex)
Dual Interface: Both MCP protocol and CLI access
TypeScript: Full type safety with ESM modules

Installation

npm install
npm run build

MCP Client Configuration

To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:

For published package (available on npm):

{
  "mcpServers": {
    "scientific-papers": {
      "command": "npx",
      "args": [
        "-y",
        "@futurelab-studio/latest-science-mcp"
      ]
    }
  }
}

For local development:

{
  "mcpServers": {
    "scientific-papers": {
      "command": "node",
      "args": [
        "dist/server.js"
      ],
      "cwd": "/path/to/your/MCP-tutorial"
    }
  }
}

Note: Replace /path/to/your/MCP-tutorial with the actual path to your project directory.

Usage

CLI Interface

List Categories

# List arXiv categories
node dist/cli.js list-categories --source=arxiv

# List OpenAlex concepts
node dist/cli.js list-categories --source=openalex

Fetch Latest Papers

# Get latest 10 AI papers from arXiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10

# Get latest 5 computer science papers from OpenAlex
node dist/cli.js fetch-latest --source=openalex --category=C41008148 --count=5

# Search by concept name (OpenAlex)
node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3

Fetch Top Cited Papers

# Get top 20 cited papers in machine learning since 2024
node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20

# Get top cited papers by concept ID
node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10

Fetch Specific Paper

# Get arXiv paper by ID
node dist/cli.js fetch-content --source=arxiv --id=2401.12345

# Get OpenAlex paper by Work ID
node dist/cli.js fetch-content --source=openalex --id=W2741809807

# Show text content with preview
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500

# Show full text content
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=2 --show-text

CLI Text Display Options

# Show text extraction status (default)
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=3

# Display full text content
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=2 --show-text

# Display text preview (first 500 characters)
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500

MCP Server

Start the MCP server:

node dist/server.js

The server accepts MCP protocol calls via stdio transport.
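
As a rough illustration of what travels over stdio: MCP is built on JSON-RPC 2.0, and a tool invocation is a `tools/call` request whose `params` carry the tool name and its arguments. The helper names `buildToolCall` and `frame` below are hypothetical, and the sketch omits the `initialize` handshake that a real client must complete first.

```typescript
// Minimal sketch: build the JSON-RPC 2.0 message an MCP client sends to invoke a tool.
// `buildToolCall` and `frame` are illustrative helpers, not part of this server.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Over the stdio transport, each message is written as a single line of JSON.
function frame(request: ToolCallRequest): string {
  return JSON.stringify(request) + "\n";
}

const line = frame(buildToolCall(1, "list_categories", { source: "arxiv" }));
console.log(line.trim());
```

In practice an MCP client library handles this framing for you; the sketch only shows how the tool examples in this document map onto the wire format.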

Available Tools

list_categories

Lists available categories/concepts from a data source.

Parameters:

source: "arxiv" or "openalex"

Returns:

Array of category objects with id, name, and optional description

Example:

{
  "name": "list_categories",
  "arguments": {
    "source": "arxiv"
  }
}

fetch_latest

Fetches the latest papers from arXiv or OpenAlex for a given category with metadata only (no text extraction).

Parameters:

source: "arxiv" or "openalex"
category: Category ID (e.g., "cs.AI" for arXiv, "C41008148" for OpenAlex) or concept name
count: Number of papers to fetch (default: 50, max: 200)

Returns:

Array of paper objects with metadata (id, title, authors, date, pdf_url)
Text field: Empty string (text: "") - use fetch_content for full text

Workflow:
This tool is designed for browsing and discovery. Use it to find interesting papers, then call fetch_content for specific papers you want to read in full.

Examples:

{
  "name": "fetch_latest",
  "arguments": {
    "source": "arxiv",
    "category": "cs.AI",
    "count": 10
  }
}

{
  "name": "fetch_latest",
  "arguments": {
    "source": "openalex",
    "category": "artificial intelligence",
    "count": 5
  }
}
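
The browse-then-read workflow described above could be sketched as follows. The `CallTool` type stands in for whatever call function your MCP client exposes and is assumed to return already-parsed results; it is an assumption for illustration, not this server's API. Requests are issued one at a time to stay friendly to the rate limits.

```typescript
// Sketch of the discovery workflow: list metadata first, then fetch full text
// only for the papers of interest. `CallTool` and `readMatching` are hypothetical.
interface Paper {
  id: string;
  title: string;
  text: string;
}

type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

async function readMatching(callTool: CallTool, keyword: string): Promise<Paper[]> {
  // Step 1: metadata only (text is an empty string here).
  const latest = (await callTool("fetch_latest", {
    source: "arxiv",
    category: "cs.AI",
    count: 10,
  })) as Paper[];

  // Step 2: full text only for papers whose title mentions the keyword.
  const wanted = latest.filter((p) =>
    p.title.toLowerCase().includes(keyword.toLowerCase())
  );

  const results: Paper[] = [];
  for (const p of wanted) {
    // Sequential calls avoid bursting past the per-source rate limit.
    results.push((await callTool("fetch_content", { source: "arxiv", id: p.id })) as Paper);
  }
  return results;
}
```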

fetch_top_cited

Fetches the top cited papers from OpenAlex for a given concept since a specific date with metadata only (no text extraction).

Parameters:

concept: Concept name or OpenAlex concept ID (e.g., "machine learning", "C41008148")
since: Start date in YYYY-MM-DD format
count: Number of papers to fetch (default: 50, max: 200)

Returns:

Array of paper objects sorted by citation count (descending) with metadata only
Text field: Empty string (text: "") - use fetch_content for full text

Workflow:
Use this to discover influential papers by citation count, then call fetch_content for papers you want to read.

Example:

{
  "name": "fetch_top_cited",
  "arguments": {
    "concept": "machine learning",
    "since": "2024-01-01",
    "count": 20
  }
}
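
Before sending a `fetch_top_cited` call, a client may want to check its arguments against the documented parameters. The sketch below is one way to do that; the function name and error messages are illustrative, and whether the server itself rejects or clamps an out-of-range `count` is not stated here, so this version simply rejects it.

```typescript
// Sketch of client-side argument checks for fetch_top_cited.
// `normalizeTopCitedArgs` is a hypothetical helper, not part of the server.
interface TopCitedArgs {
  concept: string;
  since: string; // YYYY-MM-DD
  count: number;
}

function normalizeTopCitedArgs(input: {
  concept: string;
  since: string;
  count?: number;
}): TopCitedArgs {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(input.since)) {
    throw new Error(`since must be YYYY-MM-DD, got "${input.since}"`);
  }
  // Reject strings such as 2024-02-31 that match the pattern but are not real dates.
  const parsed = new Date(`${input.since}T00:00:00Z`);
  if (
    Number.isNaN(parsed.getTime()) ||
    parsed.toISOString().slice(0, 10) !== input.since
  ) {
    throw new Error(`since is not a valid calendar date: "${input.since}"`);
  }
  const count = input.count ?? 50; // documented default
  if (!Number.isInteger(count) || count < 1 || count > 200) {
    throw new Error("count must be an integer between 1 and 200");
  }
  return { concept: input.concept, since: input.since, count };
}
```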

fetch_content

Fetches full metadata and text content for a specific paper by ID from arXiv or OpenAlex with complete text extraction.

Parameters:

source: "arxiv" or "openalex"
id: Paper ID (arXiv ID like "2401.12345" or OpenAlex Work ID like "W2741809807")
- Flexible ID Format: Accepts both strings and numbers
- Auto-normalization: Numeric IDs are automatically converted to proper format (e.g., 2741809807 → "W2741809807" for OpenAlex)

Returns:

Single paper object with full metadata and extracted text content

Text Extraction:

arXiv: Extracts text from HTML version at arxiv.org/html/{id} with fallback to ar5iv.labs.arxiv.org
OpenAlex: Extracts text from HTML sources when source_type="html" is available
Graceful degradation: Returns metadata even if text extraction fails

Examples:

{
  "name": "fetch_content",
  "arguments": {
    "source": "arxiv",
    "id": "2401.12345"
  }
}

{
  "name": "fetch_content",
  "arguments": {
    "source": "openalex",
    "id": "W2741809807"
  }
}

{
  "name": "fetch_content",
  "arguments": {
    "source": "openalex",
    "id": 2741809807
  }
}
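
The ID auto-normalization described in the parameters could look roughly like this. It is a minimal sketch, not the server's actual code, and the function name `normalizePaperId` is hypothetical.

```typescript
// Sketch of ID normalization: accept strings or numbers, and add the "W"
// prefix to bare numeric OpenAlex Work IDs.
type Source = "arxiv" | "openalex";

function normalizePaperId(source: Source, id: string | number): string {
  const raw = String(id).trim();
  if (source === "openalex") {
    // Bare digits get the "W" prefix that OpenAlex Work IDs use.
    return /^\d+$/.test(raw) ? `W${raw}` : raw;
  }
  return raw;
}

console.log(normalizePaperId("openalex", 2741809807));
console.log(normalizePaperId("arxiv", "2401.12345"));
```

One caveat follows from how JavaScript numbers work: an arXiv ID such as 2401.10000 passed as a number loses its trailing zeros, so arXiv IDs are safest passed as strings.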

Paper Metadata Format

All tools return paper objects with the following structure:

{
  id: string;                    // Paper ID
  title: string;                 // Paper title
  authors: string[];             // List of author names
  date: string;                  // Publication date (ISO format)
  pdf_url?: string;              // PDF URL (if available)
  text: string;                  // Extracted full text content
  textTruncated?: boolean;       // Warning: text was truncated due to size limits
  textExtractionFailed?: boolean; // Warning: text extraction failed
}
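
A consumer of these objects has to distinguish between text that was never requested, text that could not be extracted, and text that was cut short. The sketch below shows one way to read the flags; the `Paper` interface mirrors the structure above, while `describeText` and its return labels are assumptions made for illustration.

```typescript
// Sketch: interpret the text-related fields of a returned paper object.
interface Paper {
  id: string;
  title: string;
  authors: string[];
  date: string;
  pdf_url?: string;
  text: string;
  textTruncated?: boolean;
  textExtractionFailed?: boolean;
}

type TextStatus = "unavailable" | "not fetched" | "truncated" | "complete";

function describeText(paper: Paper): TextStatus {
  if (paper.textExtractionFailed) return "unavailable"; // extraction was attempted and failed
  if (paper.text === "") return "not fetched"; // e.g. results from fetch_latest
  return paper.textTruncated ? "truncated" : "complete";
}
```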

Text Extraction Details

Text Content: The text field contains the full extracted text from HTML sources
Size Limits: Text is limited to 6MB to fit within 8MB response limits
Truncation: When text exceeds limits, it’s truncated at word boundaries with textTruncated: true
Extraction Failures: When text extraction fails, textExtractionFailed: true is set and text is empty
Graceful Degradation: Papers are always returned with metadata even if text extraction fails
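
Truncation at a word boundary can be sketched as below. For simplicity the limit here counts string length (UTF-16 code units), whereas the 6MB limit above is presumably measured in bytes, so treat this as an approximation of the idea and not the server's implementation. The function name is hypothetical.

```typescript
// Sketch: cap text at maxLength and cut at the last whitespace before the limit.
interface TextResult {
  text: string;
  textTruncated?: boolean;
}

function truncateAtWordBoundary(text: string, maxLength: number): TextResult {
  if (text.length <= maxLength) {
    return { text };
  }
  const head = text.slice(0, maxLength);
  // If the cut already falls between two words, keep the head as is.
  if (/\s/.test(text.charAt(maxLength))) {
    return { text: head.trimEnd(), textTruncated: true };
  }
  // Otherwise drop the partial word at the end.
  const lastBreak = Math.max(head.lastIndexOf(" "), head.lastIndexOf("\n"));
  const cut = lastBreak > 0 ? head.slice(0, lastBreak) : head;
  return { text: cut.trimEnd(), textTruncated: true };
}
```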

Development

Build

npm run build

Test

# Test CLI commands
node dist/cli.js list-categories --source=arxiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=3
node dist/cli.js fetch-top-cited --concept="artificial intelligence" --since=2024-01-01 --count=5
node dist/cli.js fetch-content --source=arxiv --id=2401.12345

# Test MCP server
node test-mcp.js

Exploratory Testing

Test the MCP server with the proxy:

npx @srbhptl39/mcp-superassistant-proxy@latest --config ./mcpconfig.json

Architecture

TypeScript + ESM: Modern JavaScript with full type safety
Text Extraction Pipeline: HTML parsing and cleaning using cheerio with fallback mechanisms
Rate Limiting: Token bucket algorithm per data source (5 req/min arXiv, 10 req/min OpenAlex)
Modular Design: Clean separation between drivers, extractors, tools, and core services
Error Handling: Structured error responses with actionable suggestions
Graceful Degradation: Always returns metadata even when text extraction fails
Response Size Management: Automatic truncation and warnings for large content

API Sources

arXiv: Papers and categories from arXiv API
- Search by category (e.g., cs.AI, physics.gen-ph)
- Sorted by submission date (latest first)
- Individual paper lookup by arXiv ID
OpenAlex: Papers and concepts from OpenAlex API
- Search by concept ID or name
- Citation data and sorting by citation count
- Individual paper lookup by Work ID
- Rich metadata including author affiliations

Text Extraction Sources

arXiv Text Extraction

Primary Source: https://arxiv.org/html/{paper_id}
Fallback Source: https://ar5iv.labs.arxiv.org/html/{paper_id} (when primary fails)
Content: LaTeX-rendered HTML with mathematical formulas and structured content
Success Rate: ~90% for papers with HTML versions available
Limitations: Some older papers may not have HTML versions
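
The primary-then-fallback lookup can be sketched as a loop over candidate URLs. The `FetchHtml` function is injected so the logic can be exercised without network access; that type and the function names are assumptions made for this example, not the server's internals.

```typescript
// Sketch: try the primary arXiv HTML URL, then the ar5iv fallback.
// `FetchHtml` resolves to the page HTML, or null when the page is unavailable.
type FetchHtml = (url: string) => Promise<string | null>;

function arxivHtmlUrls(paperId: string): string[] {
  return [
    `https://arxiv.org/html/${paperId}`,
    `https://ar5iv.labs.arxiv.org/html/${paperId}`,
  ];
}

async function fetchArxivHtml(
  paperId: string,
  fetchHtml: FetchHtml
): Promise<{ html: string; url: string } | null> {
  for (const url of arxivHtmlUrls(paperId)) {
    try {
      const html = await fetchHtml(url);
      if (html !== null) return { html, url };
    } catch {
      // Treat a network error like a miss and try the next source.
    }
  }
  // Nothing worked: the caller would return metadata with textExtractionFailed set.
  return null;
}
```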

OpenAlex Text Extraction

Source: Papers with primary_location.source_type == "html"
Content: Full-text HTML from publisher websites and repositories
Success Rate: Varies by publisher and access policies
Limitations:
- Only extracts from HTML sources (PDF extraction not included in MVP)
- Depends on publisher providing HTML access
- Some papers may be behind paywalls

Text Processing

HTML Cleaning: Removes navigation, headers, footers, and non-content elements
Text Normalization: Standardizes whitespace, line breaks, and formatting
Content Extraction: Focuses on main article content using academic paper selectors
Size Management: Automatic truncation at 6MB with word boundary preservation
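
The text normalization step might look something like the sketch below, applied after the HTML has been reduced to plain text. The exact rules the server uses are not documented here, so the function name and these particular replacements are assumptions.

```typescript
// Sketch: standardize whitespace and line breaks in extracted text.
function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/[ \t\u00a0]+/g, " ") // collapse runs of spaces, tabs, and non-breaking spaces
    .replace(/ *\n */g, "\n") // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line between paragraphs
    .trim();
}
```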

Rate Limiting

The server implements respectful rate limiting:

arXiv: 5 requests per minute (a conservative limit, well within arXiv's API guidelines)
OpenAlex: 10 requests per minute (conservative limit)

Rate limits are enforced per source and shared across all tools.
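
A per-source token bucket, the algorithm named in the Architecture section, can be sketched as follows. This is a minimal illustration with hypothetical class and method names; the injectable clock exists only to make the behavior easy to verify.

```typescript
// Sketch of a token bucket: tokens refill continuously up to `capacity`,
// and each request consumes one token.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerMs: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerMs: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerMs = refillPerMs;
    this.now = now;
    this.tokens = capacity; // start full, so a short burst is allowed
    this.lastRefill = now();
  }

  // Returns true and consumes a token if one is available.
  tryAcquire(): boolean {
    const current = this.now();
    const elapsed = current - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per source, shared by every tool that talks to that source.
const limiters: Record<"arxiv" | "openalex", TokenBucket> = {
  arxiv: new TokenBucket(5, 5 / 60_000),
  openalex: new TokenBucket(10, 10 / 60_000),
};
```

Because the bucket starts full, up to five arXiv requests can go out back to back; after that, one more becomes available roughly every twelve seconds.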

Error Handling

The server provides detailed error messages for common issues:

Invalid paper IDs
Rate limiting (with retry-after information)
API timeouts and server errors
Invalid date formats
Missing required parameters

License

MIT

Dev Tools Supporting MCP

The following are the main code editors that support the Model Context Protocol. Click the link to visit the official website for more information.

Zed: High-performance collaborative code editor, supports MCP protocol, providing a smooth programming experience. zed.dev

Cursor: AI code editor built on VS Code, supports MCP protocol for context-aware programming. cursor.com

Windsurf: AI code editor from Codeium, integrates MCP protocol to provide intelligent code assistance. windsurf.com

Continue: Open-source AI programming assistant plugin, supports VS Code and JetBrains, compatible with MCP protocol. continue.dev

Trae: AI-driven code editor, supports MCP protocol, focusing on enhancing developer programming experience. trae.ai

