MCPWorld

@SAAgent · 10 months ago
MCPWorld is an open-source platform for evaluating Computer-Using Agents via GUI, API, or Hybrid methods.

Overview

What is MCPWorld

MCPWorld is an open-source benchmarking framework designed for evaluating Computer-Using Agents (CUAs). It allows agents to interact with software applications through GUI, API (Model Context Protocol – MCP), or Hybrid methods.

Use cases

Use cases for MCPWorld include evaluating the performance of AI agents in software applications, automating repetitive tasks in development environments, and conducting research on agent interactions with software through different modalities.

How to use

To use MCPWorld, clone the repository from GitHub, install the necessary dependencies, and run the interactive agent demo within a Docker container. Follow the quick setup and installation instructions provided in the README.

Key features

MCPWorld offers a comprehensive task suite (~170 tasks across 10+ open-source applications); supports GUI, API, and Hybrid interaction; provides white-box evaluation for precise task verification; ensures cross-platform compatibility via Docker; and includes an extensible framework for adding new tasks and agents.

Where to use

MCPWorld can be used in research and development environments focused on artificial intelligence, software automation, and testing of computer-using agents across various applications.

Content

MCPWorld: A Multi-Modal Test Platform for Computer-Using Agents (CUA)


MCPWorld is an open-source benchmarking framework designed for evaluating Computer-Using Agents (CUAs). It supports agents that interact with software applications via GUI, API (Model Context Protocol – MCP), or Hybrid methods.


🚀 Key Features

  • Comprehensive Task Suite

    • ~170 tasks across 10+ open-source applications (VSCode, OBS, Zotero, etc.).
  • GUI, API, and Hybrid Interaction

    • Integrated MCP support enables robust mixed-mode control, letting agents fall back to GUI when APIs are unavailable.
  • White-Box Evaluation

    • Built-in evaluators inspect internal app signals or outputs for precise, reproducible task verification.
  • Cross-Platform via Docker

    • Containerized environments ensure consistent setups on Linux, macOS, and Windows.
  • Extensible Framework

    • Easily add new tasks, applications, or custom agents via clear folder structure and interfaces.

📦 Installation

Prerequisites

  • Docker
  • (Optional) VS Code + DevContainers extension

Quick Setup

git clone https://github.com/SAAgent/MCPWorld.git
cd MCPWorld
git submodule update --init PC-Canary

Then open the folder in VS Code and choose "Reopen in Container", or manually build the image from the Dockerfile provided by PC-Canary.


🚩 Quickstart

🚀 Running the Interactive Agent Demo with Evaluation

These instructions assume you are running commands inside the DevContainer.

  1. Install Dependencies:

    First, ensure all Python dependencies for the agent demo are installed:

    pip install -r computer-use-demo/computer_use_demo/requirements.txt
    
  2. Start Required Services:

    You’ll need to start several services. It’s recommended to run each in a separate terminal session within the container, or run them in the background.

    • VNC Server: This provides the graphical desktop environment for the agent. The xstartup script configured in the Dockerfile will prepare an XFCE session.

      vncserver -xstartup ~/.vnc/xstartup -geometry 1024x768 :4
      

      This typically makes VNC available on port 5904: VNC display :N listens on TCP port 5900 + N, so display :4 maps to 5904.

    • noVNC Proxy: This allows you to access the VNC session via a web browser.

      /opt/noVNC/utils/novnc_proxy \
          --vnc localhost:5904 \
          --listen 0.0.0.0:6080 \
          --web /opt/noVNC > /tmp/novnc.log 2>&1 &
      
    • Main Page HTTP Server: This server provides a unified entry point to access both VNC and the Streamlit UI.

      python computer-use-demo/image/http_server.py > /tmp/http_server.log 2>&1 &
      
    • Agent Demo & Evaluator UI (Streamlit App): This application serves as the control panel for running tasks with the agent and viewing evaluation results.

      cd computer-use-demo
      STREAMLIT_SERVER_PORT=8501 python -m streamlit run computer_use_demo/streamlit.py > /tmp/streamlit.log 2>&1 &
      
  3. Accessing the Demo:

    • Unified Interface: Access the main entry page via your web browser at http://localhost:8081. This page should provide links to the VNC desktop and the Agent/Evaluator Streamlit UI.
    • VNC Desktop (Direct): Access the agent’s desktop environment directly via http://localhost:6080.
    • Agent & Evaluator UI (Direct): Open http://localhost:8501 directly to interact with the Streamlit application.

    Through the Streamlit UI (or by direct interaction if using the headless mode below), you can assign tasks to the agent. The agent will then interact with applications within the VNC desktop environment. The Evaluator will monitor and report on the agent’s performance.
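Once the services above are running, each should be listening on its port: 5904 (VNC), 6080 (noVNC), 8081 (main page), and 8501 (Streamlit). Before opening a browser, you can sanity-check them with a small port probe. This is a sketch, not part of MCPWorld; the port list is taken from the commands above and will differ if you changed the display number, `--listen` address, or `STREAMLIT_SERVER_PORT`:

```python
import socket

# Ports used by the demo services, per the startup commands above.
SERVICES = {
    "VNC server": 5904,
    "noVNC proxy": 6080,
    "main page HTTP server": 8081,
    "Streamlit UI": 8501,
}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in SERVICES.items():
        status = "up" if port_open("localhost", port) else "DOWN"
        print(f"{name:25s} (port {port}): {status}")
```

Run this inside the container; any service reported DOWN can be diagnosed from its log file under /tmp.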

🧪 Headless Agent & Evaluator Execution (CLI-Only)

For scenarios where a UI is not needed or desired (e.g., automated batch testing), you can run the agent and evaluator directly from the command line using the run_pure_computer_use_with_eval.py script. This script handles the interaction loop and evaluation process without launching the Streamlit web interface.

Prerequisites:

  • Ensure the VNC server is running as described in the “Interactive Agent Demo” section if your tasks require GUI interaction. The VNC server provides the environment for the agent to operate in.
  • Ensure you have set your Anthropic API key, either via the --api_key argument or the ANTHROPIC_API_KEY environment variable.
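If you rely on the environment variable rather than `--api_key`, it helps to fail fast when the key is missing. A minimal sketch of such a guard (the helper name is ours, not part of MCPWorld; `ANTHROPIC_API_KEY` is the variable named above):

```python
import os
import sys

def anthropic_key_or_exit() -> str:
    """Return the Anthropic API key from the environment, or exit with a hint.

    Mirrors the fallback described above: pass --api_key explicitly,
    or export ANTHROPIC_API_KEY before launching the headless run.
    """
    key = os.environ.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        sys.exit("ANTHROPIC_API_KEY is not set; export it or pass --api_key.")
    return key
```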

Example Command:

python computer-use-demo/run_pure_computer_use_with_eval.py \
  --api_key <YOUR_ANTHROPIC_API_KEY> \
  --model claude-3-7-sonnet-20250219 \
  --task_id telegram/task01_search \
  --log_dir logs_computer_use_eval \
  --exec_mode mixed

This script will output agent interactions and evaluation events directly to the console. Final results and detailed logs will be saved in the directory specified by --log_dir.
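For automated batch testing, the example command generalizes naturally to a loop over task IDs. A sketch of such a driver, assuming the script path and flags shown above (only telegram/task01_search comes from the example; any further task IDs you add are your own):

```python
import os
import subprocess
import sys

SCRIPT = "computer-use-demo/run_pure_computer_use_with_eval.py"

def build_cmd(task_id: str, api_key: str,
              model: str = "claude-3-7-sonnet-20250219",
              log_dir: str = "logs_computer_use_eval",
              exec_mode: str = "mixed") -> list[str]:
    """Assemble the argv list for one headless evaluation run."""
    return [
        sys.executable, SCRIPT,
        "--api_key", api_key,
        "--model", model,
        "--task_id", task_id,
        "--log_dir", log_dir,
        "--exec_mode", exec_mode,
    ]

if __name__ == "__main__":
    tasks = ["telegram/task01_search"]  # extend with more task IDs for a batch
    for task in tasks:
        cmd = build_cmd(task, api_key="<YOUR_ANTHROPIC_API_KEY>")
        print("running:", " ".join(cmd))
        if os.path.exists(SCRIPT):  # only execute inside the repo checkout
            subprocess.run(cmd, check=False)
```

Each run writes its logs under the directory passed via `--log_dir`, so per-task results can be collected there afterwards.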


📚 Documentation

  • Tasks: See PC-Canary/tests/tasks/ for JSON/JS/Python configs.
  • Agents: Reference implementations in computer-use-demo/.
  • Extension: Add new apps/tasks/agents as described in docs (Update in progress).
  • Evaluation: White-box evaluators guarantee objective metrics.

📝 License

Released under the MIT License.
