Minions
What is Minions
Minions is a communication protocol designed for collaboration between small on-device language models and larger frontier models hosted in the cloud. It allows for efficient processing by reading long contexts locally, thereby reducing cloud costs without significant quality loss.
Use cases
Use cases for Minions include applications in chatbots, virtual assistants, and other interactive AI systems that require real-time processing of large amounts of data while minimizing cloud resource usage.
How to use
To use Minions, clone the repository from GitHub, install the Python package, and set up a local model server such as ollama or tokasaurus. Follow the setup instructions in the README for a smooth installation; a minimal sketch of the flow is shown below.
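A minimal sketch of that flow, assuming ollama as the local model server (the Setup section below has the full instructions):
git clone https://github.com/HazyResearch/minions.git
cd minions
pip install -e .
ollama serve   # or start tokasaurus/lemonade instead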
Key features
Key features of Minions include cost-efficient collaboration between on-device and cloud models, minimal quality degradation when processing long contexts, and support for multiple local model servers. It also provides a demonstration of the protocol for users.
Where to use
Minions can be used in various fields including natural language processing, machine learning applications, and any scenario where efficient collaboration between local and cloud-based models is required.
Content
Where On-Device and Cloud LLMs Meet
What is this? Minions is a communication protocol that enables small on-device models to collaborate with frontier models in the cloud. By only reading long contexts locally, we can reduce cloud costs with minimal or no quality degradation. This repository provides a demonstration of the protocol. Get started below, or see our paper and blogpost for more information.
Paper: Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
Minions Blogpost: https://hazyresearch.stanford.edu/blog/2025-02-24-minions
Secure Minions Chat Blogpost: https://hazyresearch.stanford.edu/blog/2025-05-12-security
Table of Contents
Looking for Secure Minions Chat? If you're interested in our end-to-end encrypted chat system, please see the Secure Minions Chat README for detailed setup and usage instructions.
- Setup
- Minions Demo Application
- Minions WebGPU App
- Example Code
- Python Notebook
- Docker Support
- Command Line Interface
- Secure Minions Local-Remote Protocol
- Secure Minions Chat
- Inference Estimator
- Miscellaneous Setup
- Maintainers
Setup
We have tested the following setup on Mac and Ubuntu with Python 3.10-3.11 (note: Python 3.13 is not supported).
Optional: Create a virtual environment with your favorite package manager (e.g. conda, venv, uv)
conda create -n minions python=3.11
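If you prefer venv or uv over conda, an equivalent environment setup might look like this (a sketch, not taken from the repo's instructions):
python3.11 -m venv .venv && source .venv/bin/activate
# or, with uv
uv venv --python 3.11 && source .venv/bin/activate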
Step 1: Clone the repository and install the Python package.
git clone https://github.com/HazyResearch/minions.git
cd minions
pip install -e . # installs the minions package in editable mode
note: for optional MLX-LM support, install the package with the following command:
pip install -e ".[mlx]"
note: for secure minions chat, install the package with the following command:
pip install -e ".[secure]"
note: for optional Cartesia-MLX install, pip install the basic package and then follow the instructions below.
Step 2: Install a server for running the local model.
We support three servers for running local models: lemonade, ollama, and tokasaurus. You need to install at least one of these.
- You should use ollama if you do not have access to NVIDIA/AMD GPUs. Install ollama following the instructions here. To enable Flash Attention, run launchctl setenv OLLAMA_FLASH_ATTENTION 1 and, if on a Mac, restart the ollama app.
- You should use lemonade if you have access to local AMD CPUs/GPUs/NPUs. Install lemonade following the instructions here.
  - See the following for supported APU configurations: https://ryzenai.docs.amd.com/en/latest/llm/overview.html#supported-configurations
  - After installing lemonade, make sure to launch the lemonade server. This can be done via the one-click Windows GUI installer, which installs the Lemonade Server as a standalone tool.
  - Note: Lemonade support is currently experimental and only supports the Minion protocol at this time.
- You should use tokasaurus if you have access to NVIDIA GPUs and you are running the Minions protocol, which benefits from the high throughput of tokasaurus. Install tokasaurus with the following command:
pip install tokasaurus
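If you go the ollama route, you will also need local model weights matching the examples further down; for instance (assuming ollama is installed and running):
ollama pull llama3.2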
Optional: Install Cartesia-MLX (only available on Apple Silicon)
- Download XCode
- Install the command line tools by running
xcode-select --install
- Install nanobind:
pip install nanobind@git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
- Install the Cartesia Metal backend by running the following command:
pip install git+https://github.com/cartesia-ai/edge.git#subdirectory=cartesia-metal
- Install the Cartesia-MLX package by running the following command:
pip install git+https://github.com/cartesia-ai/edge.git#subdirectory=cartesia-mlx
Optional: Install llama-cpp-python
Installation
First, install the llama-cpp-python package:
# CPU-only installation
pip install llama-cpp-python
# For Metal on Mac (Apple Silicon/Intel)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
# For CUDA on NVIDIA GPUs
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
# For OpenBLAS CPU optimizations
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
For more installation options, see the llama-cpp-python documentation.
Basic Usage
The client follows the basic pattern from the llama-cpp-python library:
from minions.clients import LlamaCppClient
# Initialize the client with a local model
client = LlamaCppClient(
    model_path="/path/to/model.gguf",
    chat_format="chatml",  # Most modern models use "chatml" format
    n_gpu_layers=35        # Set to 0 for CPU-only inference
)
# Run a chat completion
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"}
]
responses, usage, done_reasons = client.chat(messages)
print(responses[0]) # The generated response
Loading Models from Hugging Face
You can easily load models directly from Hugging Face:
client = LlamaCppClient(
    model_path="dummy",  # Will be replaced by downloaded model
    model_repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    model_file_pattern="*Q4_K_M.gguf",  # Optional - specify quantization
    chat_format="chatml",
    n_gpu_layers=35  # Offload 35 layers to GPU
)
Step 3: Set your API key for at least one of the following cloud LLM providers.
If needed, create an OpenAI API Key or TogetherAI API key or DeepSeek API key for the cloud model.
# OpenAI
export OPENAI_API_KEY=<your-openai-api-key>
export OPENAI_BASE_URL=<your-openai-base-url> # Optional: Use a different OpenAI API endpoint
# Together AI
export TOGETHER_API_KEY=<your-together-api-key>
# OpenRouter
export OPENROUTER_API_KEY=<your-openrouter-api-key>
export OPENROUTER_BASE_URL=<your-openrouter-base-url> # Optional: Use a different OpenRouter API endpoint
# Perplexity
export PERPLEXITY_API_KEY=<your-perplexity-api-key>
export PERPLEXITY_BASE_URL=<your-perplexity-base-url> # Optional: Use a different Perplexity API endpoint
# Tokasaurus
export TOKASAURUS_BASE_URL=<your-tokasaurus-base-url> # Optional: Use a different Tokasaurus API endpoint
# DeepSeek
export DEEPSEEK_API_KEY=<your-deepseek-api-key>
# Anthropic
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
# Mistral AI
export MISTRAL_API_KEY=<your-mistral-api-key>
Minions Demo Application
To try the Minion or Minions protocol, run the following commands:
pip install torch transformers
streamlit run app.py
If you are seeing the following error about the ollama client:
An error occurred: Failed to connect to Ollama. Please check that Ollama is downloaded, running and accessible. https://ollama.com/download
try running the following command:
OLLAMA_FLASH_ATTENTION=1 ollama serve
Minions WebGPU App
The Minions WebGPU app demonstrates the Minions protocol running entirely in the browser using WebGPU for local model inference and cloud APIs for supervision. This approach eliminates the need for local server setup while providing a user-friendly web interface.
Features
- Browser-based: Runs entirely in your web browser with no local server required
- WebGPU acceleration: Uses WebGPU for fast local model inference
- Model selection: Choose from multiple pre-optimized models from MLC AI
- Real-time progress: See model loading progress and conversation logs in real-time
- Privacy-focused: Your API key and data never leave your browser
Quick Start
- Navigate to the WebGPU app directory:
cd apps/minions-webgpu
- Install dependencies:
npm install
- Start the development server:
npm start
- Open your browser and navigate to the URL shown in the terminal (typically http://localhost:5173)
Example code: Minion (singular)
The following example is for an ollama local client and an openai remote client. The protocol is minion.
from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minion import Minion
local_client = OllamaClient(
    model_name="llama3.2",
)
remote_client = OpenAIClient(
    model_name="gpt-4o",
)
# Instantiate the Minion object with both clients
minion = Minion(local_client, remote_client)
context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""
task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."
# Execute the minion protocol for up to two communication rounds
output = minion(
    task=task,
    context=[context],
    max_rounds=2
)
Example Code: Minions (plural)
The following example is for an ollama local client and an openai remote client. The protocol is minions.
from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minions import Minions
from pydantic import BaseModel
class StructuredLocalOutput(BaseModel):
    explanation: str
    citation: str | None
    answer: str | None

local_client = OllamaClient(
    model_name="llama3.2",
    temperature=0.0,
    structured_output_schema=StructuredLocalOutput
)
remote_client = OpenAIClient(
    model_name="gpt-4o",
)
# Instantiate the Minions object with both clients
minion = Minions(local_client, remote_client)
context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""
task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."
# Execute the minion protocol for up to two communication rounds
output = minion(
    task=task,
    doc_metadata="Medical Report",
    context=[context],
    max_rounds=2
)
Python Notebook
To run Minion/Minions in a notebook, check out minions.ipynb.
Docker support
Build the Docker Image
docker build -t minions .
Run the container
# Without GPU support
docker run -p 8501:8501 --env OPENAI_API_KEY=<your-openai-api-key> --env DEEPSEEK_API_KEY=<your-deepseek-api-key> minions
# With GPU support
docker run --gpus all -p 8501:8501 --env OPENAI_API_KEY=<your-openai-api-key> --env DEEPSEEK_API_KEY=<your-deepseek-api-key> minions
CLI
To run Minion/Minions from the CLI, check out minions_cli.py.
Set your choice of local and remote models by running the following commands. The format is <provider>/<model_name>. Choices of provider are ollama, openai, anthropic, together, perplexity, openrouter, groq, and mlx.
export MINIONS_LOCAL=ollama/llama3.2
export MINIONS_REMOTE=openai/gpt-4o
minions --help
minions --context <path_to_context> --protocol <minion|minions>
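For example, with the environment variables above, a concrete invocation might look like this (report.txt is a hypothetical context file):
minions --context report.txt --protocol minions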
Secure Minions Local-Remote Protocol
The Secure Minions Local-Remote Protocol (secure/minions_secure.py) provides an end-to-end encrypted implementation of the Minions protocol that enables secure communication between a local worker model and a remote supervisor server. This protocol includes attestation verification, perfect forward secrecy, and replay protection.
Prerequisites
Install the secure dependencies:
pip install -e ".[secure]"
Basic Usage
Python API
from minions.clients import OllamaClient
from secure.minions_secure import SecureMinionProtocol
# Initialize local client
local_client = OllamaClient(model_name="llama3.2")
# Create secure protocol instance
protocol = SecureMinionProtocol(
    supervisor_url="https://your-supervisor-server.com",
    local_client=local_client,
    max_rounds=3,
    system_prompt="You are a helpful AI assistant."
)
# Run a secure task
result = protocol(
    task="Analyze this document for key insights",
    context=["Your document content here"],
    max_rounds=2
)
print(f"Final Answer: {result['final_answer']}")
print(f"Session ID: {result['session_id']}")
print(f"Log saved to: {result['log_file']}")
# Clean up the session
protocol.end_session()
Command Line Interface
python secure/minions_secure.py \
    --supervisor_url https://your-supervisor-server.com \
    --local_client_type ollama \
    --local_model llama3.2 \
    --max_rounds 3
Secure Minions Chat
To install secure minions chat, install the package with the following command:
pip install -e ".[secure]"
See the Secure Minions Chat README for additional details on how to set up clients and run the secure chat protocol.
Inference Estimator
Minions provides a utility to estimate LLM inference speed on your hardware. The inference estimator helps you:
- Analyze your hardware capabilities (GPU, MPS, or CPU)
- Calculate peak performance for your models
- Estimate tokens per second and completion time
Command Line Usage
Run the estimator directly from the command line to check how fast a model will run:
python -m minions.utils.inference_estimator --model llama3.2 --tokens 1000 --describe
Arguments:
- --model: Model name from the supported model list (e.g., llama3.2, mistral7b)
- --tokens: Number of tokens to estimate generation time for
- --describe: Show detailed hardware and model performance statistics
- --quantized: Specify that the model is quantized
- --quant-bits: Quantization bit-width (4, 8, or 16)
Python API Usage
You can also use the inference estimator in your Python code:
from minions.utils.inference_estimator import InferenceEstimator
# Initialize the estimator for a specific model
estimator = InferenceEstimator(
    model_name="llama3.2",  # Model name
    is_quant=True,          # Is the model quantized?
    quant_bits=4            # Quantization level (4, 8, 16)
)
# Estimate performance for 1000 tokens
tokens_per_second, estimated_time = estimator.estimate(1000)
print(f"Estimated speed: {tokens_per_second:.1f} tokens/sec")
print(f"Estimated time: {estimated_time:.2f} seconds")
# Get detailed stats
detailed_info = estimator.describe(1000)
print(detailed_info)
# Calibrate with your actual model client for better accuracy
# (requires a model client that implements a chat() method)
estimator.calibrate(my_model_client, sample_tokens=32, prompt="Hello")
The estimator uses a roofline model that considers both compute and memory bandwidth limitations and applies empirical calibration to improve accuracy. The calibration data is cached at ~/.cache/ie_calib.json for future use.
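To make the roofline idea concrete, here is a minimal back-of-the-envelope sketch. It is not the library's implementation; the function name, hardware numbers, and model sizes are illustrative assumptions.
# Roofline estimate: generation speed is capped by the slower of compute
# throughput and memory bandwidth (the weights are re-read for every token).
# All numbers below are illustrative assumptions, not measured values.

def roofline_tokens_per_sec(
    params_billion: float,     # model size in billions of parameters
    bytes_per_param: float,    # ~2 for fp16, ~0.5 for 4-bit quantization
    mem_bandwidth_gbs: float,  # device memory bandwidth in GB/s
    compute_tflops: float,     # achievable compute in TFLOPs
) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    flops_per_token = 2 * params_billion * 1e9   # ~2 FLOPs per parameter per token
    memory_bound = mem_bandwidth_gbs * 1e9 / weight_bytes
    compute_bound = compute_tflops * 1e12 / flops_per_token
    return min(memory_bound, compute_bound)

# Example: a 3B-parameter model quantized to 4 bits on hardware with
# ~100 GB/s of memory bandwidth and ~10 TFLOPs of usable compute.
tps = roofline_tokens_per_sec(3, 0.5, 100, 10)
print(f"~{tps:.0f} tokens/sec, so 1000 tokens take ~{1000 / tps:.1f} s")
In this example decoding is memory-bound, which is the typical regime for local LLM inference; calibration (as done by InferenceEstimator.calibrate) then scales such an estimate toward observed speeds.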
Miscellaneous Setup
Using Azure OpenAI with Minions
Set Environment Variables
export AZURE_OPENAI_API_KEY=your-api-key
export AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
export AZURE_OPENAI_API_VERSION=2024-02-15-preview
Example Code
Here’s an example of how to use Azure OpenAI with the Minions protocol in your own code:
from minions.clients.ollama import OllamaClient
from minions.clients.azure_openai import AzureOpenAIClient
from minions.minion import Minion
local_client = OllamaClient(
    model_name="llama3.2",
)
remote_client = AzureOpenAIClient(
    model_name="gpt-4o",  # This should match your deployment name
    api_key="your-api-key",
    azure_endpoint="https://your-resource-name.openai.azure.com/",
    api_version="2024-02-15-preview",
)
# Instantiate the Minion object with both clients
minion = Minion(local_client, remote_client)
Maintainers
- Avanika Narayan (contact: [email protected])
- Dan Biderman (contact: [email protected])
- Sabri Eyuboglu (contact: [email protected])