
Arize Phoenix MCP Server

@Arize-ai · 12 days ago
5905 · Apache 2.0
Free · Community
AI Systems
#MCP #Model Context Protocol #Prompts Management #Datasets #Experiments #LLM #Phoenix #Arize
Phoenix MCP Server is an implementation of the Model Context Protocol that provides a unified interface to the capabilities of the Arize Phoenix platform, enabling prompt management, dataset exploration and synthesis, and experiment visualization with the help of a language model.

Overview

What is Arize Phoenix MCP Server

Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting of machine learning models, particularly focusing on large language models (LLMs). It supports tracing, evaluation, and dataset management, making it a comprehensive tool for developers and data scientists to analyze and optimize their AI applications.

Use cases

Phoenix is used for a variety of AI-related tasks, including tracing LLM applications to monitor their performance in real time, evaluating model outputs through response and retrieval assessments, managing datasets for fine-tuning and experimentation, and conducting structured experiments to compare model performance. Users can also optimize prompts in a controlled environment.
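
For example, the dataset and experiment workflow can be driven from Python. The following is a minimal sketch, assuming a Phoenix instance is reachable at its default local endpoint; the dataset name, task, and evaluator are purely illustrative.

import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# Upload a small dataset of input/expected-output pairs to Phoenix
# (assumes Phoenix is running locally at http://localhost:6006).
client = px.Client()
dataset = client.upload_dataset(
    dataset_name="qa-smoke-test",  # illustrative name
    dataframe=pd.DataFrame(
        {"question": ["What is Phoenix?"], "answer": ["An AI observability platform"]}
    ),
    input_keys=["question"],
    output_keys=["answer"],
)

# A task maps a dataset example to an output; a real task would call an LLM.
def task(example):
    return "An AI observability platform"

# An evaluator scores each run; exact match is the simplest possible check.
def exact_match(output, expected):
    return float(output == expected["answer"])

experiment = run_experiment(dataset, task, evaluators=[exact_match])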

How to use

To install Phoenix, use a package manager such as pip or conda; with pip, run pip install arize-phoenix. Alternatively, Phoenix can be deployed using Docker containers or Kubernetes. Once installed, you can access features such as tracing, evaluation, and prompt management through its APIs and framework integrations.
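
As a quick start, the sketch below (assuming arize-phoenix is installed in a notebook or script) launches a local Phoenix instance whose UI can then be opened in the browser:

import phoenix as px

# Start a local Phoenix instance; the UI is served at http://localhost:6006 by default.
session = px.launch_app()
print(session.url)  # open this URL to explore traces, datasets, prompts, and experiments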

Key features

Key features of Phoenix include OpenTelemetry-based tracing for monitoring and debugging, evaluation tools for assessing model performance, version-controlled datasets for experimentation, systematic management of prompts, and integrations with popular frameworks. It is also vendor and language agnostic, allowing broad applicability across various development environments.
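
As an illustration of the evaluation tooling, the sketch below runs an LLM-as-a-judge relevance check with arize-phoenix-evals; it assumes an OpenAI API key is set, the dataframe contents are illustrative, and exact parameter names may vary between package versions.

import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# A toy dataframe with the columns the relevance template expects ("input" and "reference").
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source AI observability platform."],
    }
)

# Classify each row as relevant or unrelated using an LLM as the judge.
results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
)
print(results["label"])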

Where to use

Phoenix can be employed in diverse environments such as local setups, Jupyter notebooks, or cloud-based applications. It is compatible with containerized deployments using Docker or Kubernetes, making it suitable for both individual developers and enterprise-level integrations.

Content


Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. It provides:

  • Tracing - Trace your LLM application’s runtime using OpenTelemetry-based instrumentation.
  • Evaluation - Leverage LLMs to benchmark your application’s performance using response and retrieval evals.
  • Datasets - Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
  • Experiments - Track and evaluate changes to prompts, LLMs, and retrieval.
  • Playground - Optimize prompts, compare models, adjust parameters, and replay traced LLM calls.
  • Prompt Management - Manage and test prompt changes systematically using version control, tagging, and experimentation.

Phoenix is vendor and language agnostic with out-of-the-box support for popular frameworks (🦙LlamaIndex, 🦜⛓LangChain, Haystack, 🧩DSPy, 🤗smolagents) and LLM providers (OpenAI, Bedrock, MistralAI, VertexAI, LiteLLM, Google GenAI and more). For details on auto-instrumentation, check out the OpenInference project.

Phoenix runs practically anywhere, including your local machine, a Jupyter notebook, a containerized deployment, or in the cloud.

Installation

Install Phoenix via pip or conda

pip install arize-phoenix

Phoenix container images are available via Docker Hub and can be deployed using Docker or Kubernetes.

Packages

The arize-phoenix package includes the entire Phoenix platform. However, if you have deployed the Phoenix platform, there are lightweight Python sub-packages and TypeScript packages that can be used in conjunction with the platform.

Subpackages

  • arize-phoenix-otel (Python) - Provides a lightweight wrapper around OpenTelemetry primitives with Phoenix-aware defaults
  • arize-phoenix-client (Python) - Lightweight client for interacting with the Phoenix server via its OpenAPI REST interface
  • arize-phoenix-evals (Python) - Tooling to evaluate LLM applications, including RAG relevance, answer relevance, and more
  • @arizeai/phoenix-client (JavaScript) - Client for the Arize Phoenix API
  • @arizeai/phoenix-mcp (JavaScript) - MCP server implementation for Arize Phoenix providing a unified interface to Phoenix’s capabilities
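
As an example of the lightweight client, the sketch below pulls a prompt from a running Phoenix server; the base URL and prompt name are assumptions, and the exact client methods should be checked against the arize-phoenix-client documentation.

from phoenix.client import Client

# Connect to a running Phoenix server (adjust base_url and credentials as needed).
client = Client(base_url="http://localhost:6006")

# Fetch the latest version of a prompt and render its messages with template variables.
prompt = client.prompts.get(prompt_identifier="article-summarizer")  # illustrative name
messages = prompt.format(variables={"topic": "science", "article": "..."})
print(messages)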

Tracing Integrations

Phoenix is built on top of OpenTelemetry and is vendor, language, and framework agnostic. For details about tracing integrations and example applications, see the OpenInference project.
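
For instance, tracing an OpenAI-backed application typically involves registering a Phoenix-aware tracer provider and enabling the matching OpenInference instrumentor. The sketch below assumes arize-phoenix-otel and openinference-instrumentation-openai are installed and a Phoenix collector is running locally:

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Register a Phoenix-aware OpenTelemetry tracer provider
# (defaults to a local collector at http://localhost:6006).
tracer_provider = register(project_name="my-llm-app")

# Auto-instrument OpenAI calls so each request is exported to Phoenix as a trace.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Any subsequent openai client usage in the application is now traced into Phoenix.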

Python Integrations

  • OpenAI - openinference-instrumentation-openai
  • OpenAI Agents - openinference-instrumentation-openai-agents
  • LlamaIndex - openinference-instrumentation-llama-index
  • DSPy - openinference-instrumentation-dspy
  • AWS Bedrock - openinference-instrumentation-bedrock
  • LangChain - openinference-instrumentation-langchain
  • MistralAI - openinference-instrumentation-mistralai
  • Google GenAI - openinference-instrumentation-google-genai
  • Google ADK - openinference-instrumentation-google-adk
  • Guardrails - openinference-instrumentation-guardrails
  • VertexAI - openinference-instrumentation-vertexai
  • CrewAI - openinference-instrumentation-crewai
  • Haystack - openinference-instrumentation-haystack
  • LiteLLM - openinference-instrumentation-litellm
  • Groq - openinference-instrumentation-groq
  • Instructor - openinference-instrumentation-instructor
  • Anthropic - openinference-instrumentation-anthropic
  • Smolagents - openinference-instrumentation-smolagents
  • Agno - openinference-instrumentation-agno
  • MCP - openinference-instrumentation-mcp
  • Pydantic AI - openinference-instrumentation-pydantic-ai
  • Autogen AgentChat - openinference-instrumentation-autogen-agentchat
  • Portkey - openinference-instrumentation-portkey

JavaScript Integrations

  • OpenAI - @arizeai/openinference-instrumentation-openai
  • LangChain.js - @arizeai/openinference-instrumentation-langchain
  • Vercel AI SDK - @arizeai/openinference-vercel
  • BeeAI - @arizeai/openinference-instrumentation-beeai
  • Mastra - @arizeai/openinference-mastra

Platforms

Phoenix has native integrations with LangFlow, LiteLLM Proxy, and BeeAI.

Community

Join our community to connect with thousands of AI builders.

  • 🌍 Join our Slack community.
  • 📚 Read our documentation.
  • 💡 Ask questions and provide feedback in the #phoenix-support channel.
  • 🌟 Leave a star on our GitHub.
  • 🐞 Report bugs with GitHub Issues.
  • 𝕏 Follow us on 𝕏.
  • 🗺️ Check out our roadmap to see where we’re heading next.
  • 🧑‍🏫 Deep dive into everything Agents and LLM Evaluations on Arize’s Learning Hubs.

Breaking Changes

See the migration guide for a list of breaking changes.

Copyright, Patent, and License

Copyright 2025 Arize AI, Inc. All Rights Reserved.

Portions of this code are patent protected by one or more U.S. Patents. See the IP_NOTICE.

This software is licensed under the terms of the Elastic License 2.0 (ELv2). See LICENSE.

Tools

list-prompts
Get a list of all the prompts. Prompts (templates, prompt templates) are versioned templates for input messages to an LLM. Each prompt includes not only the input messages but also the model and invocation parameters to use when generating outputs. Returns a list of prompt objects with their IDs, names, and descriptions. Example usage: List all available prompts Expected return: Array of prompt objects with metadata. Example: [{ "name": "article-summarizer", "description": "Summarizes an article into concise bullet points", "source_prompt_id": null, "id": "promptid1234" }]
get-latest-prompt
Get the latest version of a prompt. Returns the prompt version with its template, model configuration, and invocation parameters. Example usage: Get the latest version of a prompt named 'article-summarizer' Expected return: Prompt version object with template and configuration. Example: { "description": "Initial version", "model_provider": "OPENAI", "model_name": "gpt-3.5-turbo", "template": { "type": "chat", "messages": [ { "role": "system", "content": "You are an expert summarizer. Create clear, concise bullet points highlighting the key information." }, { "role": "user", "content": "Please summarize the following {{topic}} article: {{article}}" } ] }, "template_type": "CHAT", "template_format": "MUSTACHE", "invocation_parameters": { "type": "openai", "openai": {} }, "id": "promptversionid1234" }
get-prompt-by-identifier
Get a prompt's latest version by its identifier (name or ID). Returns the prompt version with its template, model configuration, and invocation parameters. Example usage: Get the latest version of a prompt with name 'article-summarizer' Expected return: Prompt version object with template and configuration. Example: { "description": "Initial version", "model_provider": "OPENAI", "model_name": "gpt-3.5-turbo", "template": { "type": "chat", "messages": [ { "role": "system", "content": "You are an expert summarizer. Create clear, concise bullet points highlighting the key information." }, { "role": "user", "content": "Please summarize the following {{topic}} article: {{article}}" } ] }, "template_type": "CHAT", "template_format": "MUSTACHE", "invocation_parameters": { "type": "openai", "openai": {} }, "id": "promptversionid1234" }
get-prompt-version
Get a specific version of a prompt using its version ID. Returns the prompt version with its template, model configuration, and invocation parameters. Example usage: Get a specific prompt version with ID 'promptversionid1234' Expected return: Prompt version object with template and configuration. Example: { "description": "Initial version", "model_provider": "OPENAI", "model_name": "gpt-3.5-turbo", "template": { "type": "chat", "messages": [ { "role": "system", "content": "You are an expert summarizer. Create clear, concise bullet points highlighting the key information." }, { "role": "user", "content": "Please summarize the following {{topic}} article: {{article}}" } ] }, "template_type": "CHAT", "template_format": "MUSTACHE", "invocation_parameters": { "type": "openai", "openai": {} }, "id": "promptversionid1234" }
upsert-prompt
Create or update a prompt with its template and configuration. Creates a new prompt and its initial version with specified model settings. Example usage: Create a new prompt named 'email_generator' with a template for generating emails Expected return: A confirmation message of successful prompt creation
list-prompt-versions
Get a list of all versions for a specific prompt. Returns versions with pagination support. Example usage: List all versions of a prompt named 'article-summarizer' Expected return: Array of prompt version objects with IDs and configuration. Example: [ { "description": "Initial version", "model_provider": "OPENAI", "model_name": "gpt-3.5-turbo", "template": { "type": "chat", "messages": [ { "role": "system", "content": "You are an expert summarizer. Create clear, concise bullet points highlighting the key information." }, { "role": "user", "content": "Please summarize the following {{topic}} article: {{article}}" } ] }, "template_type": "CHAT", "template_format": "MUSTACHE", "invocation_parameters": { "type": "openai", "openai": {} }, "id": "promptversionid1234" } ]
get-prompt-version-by-tag
Get a prompt version by its tag name. Returns the prompt version with its template, model configuration, and invocation parameters. Example usage: Get the 'production' tagged version of prompt 'article-summarizer' Expected return: Prompt version object with template and configuration. Example: { "description": "Initial version", "model_provider": "OPENAI", "model_name": "gpt-3.5-turbo", "template": { "type": "chat", "messages": [ { "role": "system", "content": "You are an expert summarizer. Create clear, concise bullet points highlighting the key information." }, { "role": "user", "content": "Please summarize the following {{topic}} article: {{article}}" } ] }, "template_type": "CHAT", "template_format": "MUSTACHE", "invocation_parameters": { "type": "openai", "openai": {} }, "id": "promptversionid1234" }
list-prompt-version-tags
Get a list of all tags for a specific prompt version. Returns tag objects with pagination support. Example usage: List all tags associated with prompt version 'promptversionid1234' Expected return: Array of tag objects with names and IDs. Example: [ { "name": "staging", "description": "The version deployed to staging", "id": "promptversionid1234" }, { "name": "development", "description": "The version deployed for development", "id": "promptversionid1234" } ]
add-prompt-version-tag
Add a tag to a specific prompt version. The operation returns no content on success (204 status code). Example usage: Tag prompt version 'promptversionid1234' with the name 'production' Expected return: Confirmation message of successful tag addition
list-experiments-for-dataset
Get a list of all the experiments run on a given dataset. Experiments are collections of experiment runs; each experiment run corresponds to a single dataset example. The dataset example is passed to an implied `task`, which in turn produces an output. Example usage: Show me all the experiments I've run on dataset RGF0YXNldDox Expected return: Array of experiment objects with metadata. Example: [ { "id": "experimentid1234", "dataset_id": "datasetid1234", "dataset_version_id": "datasetversionid1234", "repetitions": 1, "metadata": {}, "project_name": "Experiment-abc123", "created_at": "YYYY-MM-DDTHH:mm:ssZ", "updated_at": "YYYY-MM-DDTHH:mm:ssZ" } ]
get-experiment-by-id
Get an experiment by its ID. The tool returns experiment metadata in the first content block and a JSON object with the experiment data in the second. The experiment data contains both the results of each experiment run and the annotations made by an evaluator to score or label the results, for example, comparing the output of an experiment run to the expected output from the dataset example. Example usage: Show me the experiment results for experiment RXhwZXJpbWVudDo4 Expected return: Object containing experiment metadata and results. Example: { "metadata": { "id": "experimentid1234", "dataset_id": "datasetid1234", "dataset_version_id": "datasetversionid1234", "repetitions": 1, "metadata": {}, "project_name": "Experiment-abc123", "created_at": "YYYY-MM-DDTHH:mm:ssZ", "updated_at": "YYYY-MM-DDTHH:mm:ssZ" }, "experimentResult": [ { "example_id": "exampleid1234", "repetition_number": 0, "input": "Sample input text", "reference_output": "Expected output text", "output": "Actual output text", "error": null, "latency_ms": 1000, "start_time": "2025-03-20T12:00:00Z", "end_time": "2025-03-20T12:00:01Z", "trace_id": "trace-123", "prompt_token_count": 10, "completion_token_count": 20, "annotations": [ { "name": "quality", "annotator_kind": "HUMAN", "label": "good", "score": 0.9, "explanation": "Output matches expected format", "trace_id": "trace-456", "error": null, "metadata": {}, "start_time": "YYYY-MM-DDTHH:mm:ssZ", "end_time": "YYYY-MM-DDTHH:mm:ssZ" } ] } ] }
list-datasets
Get a list of all datasets. Datasets are collections of 'dataset examples'; each example includes an input, an (expected) output, and optional metadata. They are primarily used as inputs for experiments. Example usage: Show me all available datasets Expected return: Array of dataset objects with metadata. Example: [ { "id": "RGF0YXNldDox", "name": "my-dataset", "description": "A dataset for testing", "metadata": {}, "created_at": "2024-03-20T12:00:00Z", "updated_at": "2024-03-20T12:00:00Z" } ]
get-dataset-examples
Get examples from a dataset. Dataset examples are an array of objects that each include an input, (expected) output, and optional metadata. These examples are typically used to represent input to an application or model (e.g. prompt template variables, a code file, or image) and used to test or benchmark changes. Example usage: Show me all examples from dataset RGF0YXNldDox Expected return: Object containing dataset ID, version ID, and array of examples. Example: { "dataset_id": "datasetid1234", "version_id": "datasetversionid1234", "examples": [ { "id": "exampleid1234", "input": { "text": "Sample input text" }, "output": { "text": "Expected output text" }, "metadata": {}, "updated_at": "YYYY-MM-DDTHH:mm:ssZ" } ] }
get-dataset-experiments
List experiments run on a dataset. Example usage: Show me all experiments run on dataset RGF0YXNldDox Expected return: Array of experiment objects with metadata. Example: [ { "id": "experimentid1234", "dataset_id": "datasetid1234", "dataset_version_id": "datasetversionid1234", "repetitions": 1, "metadata": {}, "project_name": "Experiment-abc123", "created_at": "YYYY-MM-DDTHH:mm:ssZ", "updated_at": "YYYY-MM-DDTHH:mm:ssZ" } ]
add-dataset-examples
Add one or more examples to an existing dataset. Each example includes an input, output, and metadata. The metadata will automatically include information indicating that these examples were synthetically generated via MCP. When calling this tool, check existing examples using the "get-dataset-examples" tool to ensure that you are not adding duplicate examples and that you are following existing patterns for how data should be structured. Example usage: Analyze the examples in "my-dataset" and augment them with new examples to cover relevant edge cases Expected return: Confirmation of successful addition of examples to the dataset. Example: { "dataset_name": "my-dataset", "message": "Successfully added examples to dataset" }
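
To show how these tools are reached over MCP, here is a minimal sketch using the MCP Python SDK to spawn the server and call list-prompts. The command-line flags passed to @arizeai/phoenix-mcp (--baseUrl, and optionally --apiKey) are assumptions about a typical local setup and should be checked against the package documentation.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Launch the Phoenix MCP server over stdio (flags below are assumed, not authoritative).
server = StdioServerParameters(
    command="npx",
    args=["-y", "@arizeai/phoenix-mcp@latest", "--baseUrl", "http://localhost:6006"],
)

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # e.g. "list-prompts", "list-datasets"
            result = await session.call_tool("list-prompts", arguments={})
            print(result.content)

asyncio.run(main())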
