Tinyrl

@NoakLiu, a year ago
1 MIT
Free, Community
AI Systems
Multi-model LLM agent framework with sandbox execution and MCP design

Overview

What is TinyRL

TinyRL is a lightweight framework for building LLM agents that execute code in isolated sandbox environments. It supports multiple LLM backends and collaborative multi-agent workflows.

Use cases

Use cases for TinyRL include single LLM execution for basic tasks, coordinated multi-environment execution for distributed computing, and multi-agent collaboration in shared environments for complex problem-solving.

How to use

To use TinyRL, install the framework via pip, set up your desired LLM models, and create agents that can execute tasks within sandboxed environments. The framework provides a quick start guide for installation and usage.

Key features

Key features of TinyRL include multi-model LLM support, isolated sandbox execution for safe code execution, multi-agent collaboration capabilities, dynamic tool creation through Automatic Model Context Protocol (MCP), web integration for information retrieval, async/parallel processing for high performance, and robust error recovery mechanisms.

Where to use

TinyRL can be used in various fields such as artificial intelligence, software development, data analysis, and any domain requiring intelligent automation and collaborative problem-solving.

Content

EchoRL: Learning to Plan through Experience for Bandwidth-Efficient Reinforcement Learning

EchoRL is a system framework that bridges reaction and planning in real-time reinforcement learning through experience-grounded infrastructure. It introduces three key innovations for bandwidth-efficient LLM-based reinforcement learning:

  1. Latent Planning Optimization - structured rollout with continuation-based reasoning
  2. Asynchronous Execution Engine - KV-cache sharing, bandwidth-aware scheduling, and token-level dispatch
  3. Prioritized Replay Buffer - stratified hot/cold buffers for improved RL training efficiency

Key Features

  • Latent Planning: Trajectory-conditioned policy with KL regularization
  • Bandwidth-Efficient Execution: KV-cache sharing with effective bandwidth b_eff(s_{1:t}) and η_bw tracking
  • Async Execution: 78% KV reuse rate with bandwidth-aware priority scheduling
  • Prioritized Replay: Hot/cold buffer stratification with surprise-weighted sampling
  • Comprehensive Evaluation: Benchmarks across ALFWorld, WebShop, CRUXEval, ARC, and MiniGrid
  • Multi-Backbone Support: GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-4, Qwen, DeepSeek-R1
  • Performance Monitoring: Real-time metrics, system monitoring, and statistical analysis

Installation

Prerequisites

  • Python 3.9+
  • PyTorch 2.0+
  • CUDA 11.8+ (for GPU acceleration)

Install EchoRL

# Clone the repository
git clone https://github.com/your-org/Echo-RL.git
cd Echo-RL

# Create virtual environment
conda create -n echo_rl python=3.10 -y
conda activate echo_rl

# Install dependencies
pip install -r requirements.txt

# Install EchoRL in development mode
pip install -e .

# Build C++ performance kernels (optional but recommended)
pip install pybind11
pip install -e ".[dev]"  # or: python setup.py build_ext --inplace

Optional Dependencies

For specific tasks and backbones, install additional dependencies:

# LLM API clients
pip install openai anthropic google-generativeai mistralai

# Local model support
pip install transformers accelerate bitsandbytes

# Environment-specific
pip install alfworld selenium  # For ALFWorld and WebShop tasks

Quick Start

Basic Training

Train EchoRL on the ALFWorld task with a GPT-4o backbone:

python examples/train_echo_rl.py \
    --task alfworld \
    --backbone gpt-4o \
    --timesteps 100000 \
    --num-actors 128 \
    --batch-size 256

Comprehensive Benchmarking

Run the full benchmark, comparing EchoRL against the baselines:

python examples/benchmark_echo_rl.py \
    --tasks alfworld webshop cruxeval \
    --backbones gpt-4o claude-3.5-sonnet \
    --baselines react tot ppo-rlhf \
    --num-seeds 10 \
    --num-episodes 100

Python API Usage

import asyncio
from echo_rl import EchoRLTrainer, TrainingConfig

async def main():
    # Create training configuration
    config = TrainingConfig(
        env_name="alfworld",
        total_timesteps=100000,
        num_actors=128,
        device="cuda"
    )
    
    # Initialize trainer
    trainer = EchoRLTrainer(config)
    
    # Run training
    metrics = await trainer.train()
    
    print(f"Success rate: {metrics.evaluation_results['success_rate']:.3f}")
    print(f"Avg reward: {metrics.evaluation_results['avg_reward']:.3f}")

asyncio.run(main())

Architecture (Components)

EchoRL coordinates three modules through one shared latent plan τ̄:

Latent Plan τ_t = F_φ(s_{t-k:t})
        │
        ├──► Soft-prefix policy π_θ(a_t | s_t, τ_t)
        ├──► Bandwidth-aware scheduling: priority = r / (b_eff + q + ε)
        └──► Planning-aware replay: score = ||τ_t - τ̄||² + α|r_t|
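
The scheduling rule in the diagram, priority = r / (b_eff + q + ε), can be illustrated with a small stand-alone sketch. This is not the EchoRL scheduler: `Rollout`, `priority`, and `schedule` are hypothetical names, and reading r as expected reward and q as queue time is an assumption taken from the parameter names used elsewhere in this README.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical pending rollout request (not an echo_rl type)."""
    name: str
    reward: float      # r: expected reward of the rollout
    b_eff: float       # effective bandwidth cost after KV prefix reuse
    queue_time: float  # q: time already spent waiting


def priority(rollout: Rollout, eps: float = 1e-8) -> float:
    # priority = r / (b_eff + q + eps), as in the diagram above
    return rollout.reward / (rollout.b_eff + rollout.queue_time + eps)


def schedule(rollouts: list[Rollout]) -> list[str]:
    """Return rollout names from highest to lowest priority."""
    # heapq is a min-heap, so push negated priorities; the index breaks ties.
    heap = [(-priority(r), i, r.name) for i, r in enumerate(rollouts)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pending = [
    Rollout("long_no_reuse", reward=1.0, b_eff=8256.0, queue_time=0.5),
    Rollout("long_with_reuse", reward=1.0, b_eff=3600.0, queue_time=0.5),
    Rollout("short", reward=0.2, b_eff=10.0, queue_time=0.1),
]
order = schedule(pending)
print(order)
```

With equal reward and queue time, the rollout that reuses a cached prefix has the smaller b_eff and is scheduled ahead of the one that does not.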

Bandwidth Efficiency

EchoRL optimizes the bandwidth efficiency metric:

η_bw(π) = E[Σ r_t] / (E[Σ b_eff(s_{1:t})] + E_B[w|ℓ_PG|])

where effective rollout bandwidth accounts for KV prefix reuse:

b_eff(s_{1:t}, t') = b(s_{1:t}) - b(s_{1:t'})   # t' = reused prefix length
b(s_{1:t}) = scale · t(t+1)/2                     # quadratic attention cost
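
As a worked example of the two formulas above, the following pure-Python sketch restates them directly. The function names mirror the kernel names used in this README, but the bodies are an independent restatement of the formulas, not the library's code; the input check on `reuse_len` is an added assumption.

```python
def attention_bandwidth_cost(seq_len: int, scale: float = 1.0) -> float:
    """b(s_{1:t}) = scale * t(t+1)/2, the quadratic attention cost."""
    return scale * seq_len * (seq_len + 1) / 2


def effective_bandwidth_cost(seq_len: int, reuse_len: int, scale: float = 1.0) -> float:
    """b_eff = b(s_{1:t}) - b(s_{1:t'}), where t' is the reused prefix length."""
    if not 0 <= reuse_len <= seq_len:
        raise ValueError("reuse_len must satisfy 0 <= reuse_len <= seq_len")
    return (attention_bandwidth_cost(seq_len, scale)
            - attention_bandwidth_cost(reuse_len, scale))


full = attention_bandwidth_cost(128)        # 128 * 129 / 2
reused = effective_bandwidth_cost(128, 96)  # minus 96 * 97 / 2
saving = 1 - reused / full
print(full, reused, round(saving, 3))
```

For t = 128 and t' = 96 this gives b = 8256 and b_eff = 3600, so reusing three quarters of the sequence removes a little over half of the cost, because the reused prefix covers the cheaper early positions.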

C++ Performance Kernels

Performance-critical paths are implemented in C++ (echo_rl/kernels/) with Python fallbacks:

| Kernel | Paper reference |
| --- | --- |
| EMAPlanTracker | Shared EMA plan τ̄ for replay scoring |
| plan_surprise | ‖τ_t - τ̄‖² + α\|r_t\| |
| prefix_match | KV prefix reuse: KV(s_{1:t}) = KV_frozen ∪ KV_rolling |
| priority_sample | Softmax replay sampling + importance weights |
| attention_bandwidth_cost | Rollout bandwidth b(s_{1:t}) |
| effective_bandwidth_cost | KV-aware effective bandwidth b_eff(s_{1:t}) |
| bandwidth_aware_priorities | Scheduling priority r / (b + q + ε) |
| bandwidth_efficiency | η_bw, learning return per bandwidth unit |

Build kernels:

pip install pybind11
python setup.py build_ext --inplace
python -c "from echo_rl.kernels import kernels_available; print(kernels_available())"
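
To make the EMAPlanTracker and plan_surprise entries concrete, here is a minimal pure-Python sketch of an exponential-moving-average plan estimate and the surprise score ‖τ_t - τ̄‖² + α|r_t|. `EMAPlanSketch` is a hypothetical stand-in, not the echo_rl kernel or its Python fallback, and the decay value is an assumption since this README does not state one.

```python
class EMAPlanSketch:
    """Hypothetical stand-in for an EMA plan tracker (not the echo_rl kernel)."""

    def __init__(self, dim: int, decay: float = 0.9):
        self.decay = decay       # assumed EMA decay; the README gives no value
        self.mean = [0.0] * dim  # running estimate of the shared plan

    def update(self, plan: list[float]) -> None:
        # mean <- decay * mean + (1 - decay) * plan
        self.mean = [self.decay * m + (1 - self.decay) * p
                     for m, p in zip(self.mean, plan)]

    def surprise(self, plan: list[float], reward: float, alpha: float = 1.0) -> float:
        # score = ||plan - mean||^2 + alpha * |reward|
        sq_dist = sum((p - m) ** 2 for p, m in zip(plan, self.mean))
        return sq_dist + alpha * abs(reward)


ema = EMAPlanSketch(dim=2, decay=0.5)
ema.update([2.0, 0.0])                               # mean becomes [1.0, 0.0]
familiar = ema.surprise([1.0, 0.0], reward=0.0)      # plan equals the mean
novel = ema.surprise([1.0, 2.0], reward=-0.5)        # far from the mean
print(familiar, novel)
```

A plan that matches the running mean and carries no reward scores zero, while a distant plan or a large reward of either sign raises the score.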

The three core components, together with the bandwidth-efficient scheduling layer, are used as follows:

4. Bandwidth-Efficient Scheduling

from echo_rl.core.bandwidth import (
    BandwidthConfig,
    BandwidthEfficiencyTracker,
    BandwidthAwareScheduler,
)
from echo_rl.kernels import effective_bandwidth_cost, bandwidth_efficiency

# Effective bandwidth with KV prefix reuse
b_eff = effective_bandwidth_cost(seq_len=128, reuse_len=96, scale=1.0)

# Bandwidth-aware rollout scheduling
scheduler = BandwidthAwareScheduler(BandwidthConfig(bandwidth_weight=1.0))
priority = scheduler.compute_priority(reward=1.0, seq_len=128, queue_time=0.5, reuse_len=96)

# Track η_bw during training
tracker = BandwidthEfficiencyTracker()
tracker.record_rollout_step(reward=0.5, seq_len=64, reuse_len=48)
tracker.record_learner_update(weighted_pg_loss=0.02)
metrics = tracker.snapshot()
print(f"η_bw = {metrics.eta_bw:.4f}, saved = {metrics.total_bandwidth_saved:.2f}")
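
The bookkeeping behind η_bw can be sketched without the library. The accumulator below is a hypothetical stand-in, not `BandwidthEfficiencyTracker`: it assumes scale = 1, treats `weighted_pg_loss` as the already-weighted term w·ℓ_PG with w ≥ 0, and interprets "bandwidth saved" as the cost of the reused prefix.

```python
from dataclasses import dataclass


def b(t: int, scale: float = 1.0) -> float:
    # Quadratic attention cost: scale * t(t+1)/2
    return scale * t * (t + 1) / 2


@dataclass
class EtaBwSketch:
    """Hypothetical accumulator for eta_bw (not BandwidthEfficiencyTracker)."""
    total_reward: float = 0.0
    total_b_eff: float = 0.0
    total_saved: float = 0.0
    total_learner_cost: float = 0.0

    def record_rollout_step(self, reward: float, seq_len: int, reuse_len: int) -> None:
        self.total_reward += reward
        self.total_b_eff += b(seq_len) - b(reuse_len)  # bandwidth actually spent
        self.total_saved += b(reuse_len)               # bandwidth avoided by reuse

    def record_learner_update(self, weighted_pg_loss: float) -> None:
        self.total_learner_cost += abs(weighted_pg_loss)  # w * |l_PG|

    @property
    def eta_bw(self) -> float:
        # eta_bw = sum of rewards / (sum of b_eff + sum of w * |l_PG|)
        denom = self.total_b_eff + self.total_learner_cost
        return self.total_reward / denom if denom > 0 else 0.0


acc = EtaBwSketch()
acc.record_rollout_step(reward=0.5, seq_len=64, reuse_len=48)
acc.record_learner_update(weighted_pg_loss=0.02)
print(f"eta_bw = {acc.eta_bw:.6f}, saved = {acc.total_saved:.2f}")
```

With the same inputs as the snippet above, the rollout step costs b(64) - b(48) = 2080 - 1176 = 904 units, so under these assumptions η_bw is 0.5 / 904.02.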

1. Latent Planning Optimization

from echo_rl.core import PlanningConfig
from echo_rl.core.latent_planning import LatentPlanningOptimizer, TrajectoryEncoder
# PolicyNetwork is used below, but its import path is not shown in this README.

# Trajectory encoder: τ_t = F_φ(s_{t-k:t})
encoder = TrajectoryEncoder(state_dim=512, config=PlanningConfig())

# Policy conditioning: π_θ(a_t | s_t, τ_t)
policy = PolicyNetwork(state_dim=512, action_dim=20, latent_dim=512)

# KL regularization: L_KL = D_KL[p_φ(τ_t | s_{1:t}) || p_φ(τ_{t-1} | s_{1:t-1})]
optimizer = LatentPlanningOptimizer(state_dim=512, action_dim=20, config=PlanningConfig())
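
The KL term above penalizes abrupt changes between consecutive plan distributions. This README does not say which distribution family p_φ uses, so the sketch below assumes diagonal Gaussians purely for illustration and uses the standard closed-form KL divergence; `diag_gaussian_kl` is a hypothetical helper, not part of echo_rl.

```python
import math


def diag_gaussian_kl(mu_p, sigma_p, mu_q, sigma_q) -> float:
    """KL[N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)] for diagonal Gaussians."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        # Per-dimension closed form: log(sq/sp) + (sp^2 + (mp-mq)^2)/(2 sq^2) - 1/2
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return kl


# Previous plan distribution p(tau_{t-1} | s_{1:t-1}): assumed mean and std.
prev_mu, prev_sigma = [0.0, 0.0], [1.0, 1.0]

same = diag_gaussian_kl([0.0, 0.0], [1.0, 1.0], prev_mu, prev_sigma)
shifted = diag_gaussian_kl([1.0, 0.0], [1.0, 1.0], prev_mu, prev_sigma)

kl_weight = 0.1  # matches kl_weight in PlanningConfig elsewhere in this README
print(same, shifted, kl_weight * shifted)
```

An unchanged plan incurs no penalty, while shifting one mean by a unit adds a KL of 0.5, which the weight of 0.1 scales to 0.05 in the loss.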

2. Asynchronous Execution Engine

from echo_rl.core import ExecutionConfig
from echo_rl.core.async_execution import AsyncExecutionEngine, KVCacheManager

# KV-cache sharing: KV(s_{1:t}) = KV_frozen(s_{1:t'}) ∪ KV_rolling(s_{t'+1:t})
cache_manager = KVCacheManager(config=ExecutionConfig())

# Priority scheduling: priority(i) = r_i / (b_eff,i + q_i + ε)
execution_engine = AsyncExecutionEngine(
    config=ExecutionConfig(),
    model=policy_network,  # a policy model created beforehand
    device="cuda"
)

# Submit async rollout (call from inside an async function)
request_id = await execution_engine.submit_rollout(
    state_sequence=state_window,  # recent states s_{t-k:t}
    priority=1.0
)
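
The frozen/rolling split in the KV-cache comment can be illustrated on token IDs. The sketch below is an assumption about how prefix reuse could be decided, using a plain longest-common-prefix scan; `prefix_match` borrows the kernel's name but is not its implementation, and `split_kv` is hypothetical.

```python
def prefix_match(cached: list[int], incoming: list[int]) -> int:
    """Length t' of the longest shared prefix between two token sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n


def split_kv(cached: list[int], incoming: list[int]) -> tuple[list[int], list[int]]:
    """Split `incoming` into a reusable (frozen) prefix and a (rolling) suffix."""
    t_prime = prefix_match(cached, incoming)
    return incoming[:t_prime], incoming[t_prime:]


cached_tokens = [5, 8, 13, 21, 34]
new_tokens = [5, 8, 13, 99, 100, 101]
frozen, rolling = split_kv(cached_tokens, new_tokens)
print(frozen, rolling)
```

Only the rolling part needs fresh key/value computation; the length of the frozen part is the reuse length t' that enters b_eff.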

3. Prioritized Replay Buffer

from echo_rl.core import ReplayConfig
from echo_rl.core.prioritized_replay import PrioritizedReplayBuffer, HotColdBuffer

# Hot/cold stratification
replay_buffer = PrioritizedReplayBuffer(config=ReplayConfig())

# Surprise-weighted sampling: score(t) = ||τ_t - τ̄||² + α|r_t|
experiences, weights = replay_buffer.sample_batch(
    batch_size=256,
    temperature=1.0
)
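
Softmax sampling over surprise scores, with importance weights to correct for the non-uniform draw, can be sketched as follows. This is not the `sample_batch` implementation: the weight formula w_i = 1 / (N · p_i), normalized by the largest weight in the batch, is borrowed from common prioritized-replay practice and is an assumption here, as are the helper names.

```python
import math
import random


def softmax_probs(scores: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over surprise scores; subtract the max for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(scores, batch_size, temperature=1.0, rng=None):
    """Sample indices with replacement and return importance weights.

    Assumed weight form: w_i = 1 / (N * p_i), divided by the largest
    weight in the batch so that all weights lie in (0, 1].
    """
    rng = rng or random.Random()
    probs = softmax_probs(scores, temperature)
    idx = rng.choices(range(len(scores)), weights=probs, k=batch_size)
    raw = [1.0 / (len(scores) * probs[i]) for i in idx]
    top = max(raw)
    return idx, [w / top for w in raw]


scores = [0.1, 0.1, 4.5, 0.2]  # illustrative surprise scores
probs = softmax_probs(scores)
indices, weights = sample_batch(scores, batch_size=8, rng=random.Random(0))
print(probs)
print(indices, weights)
```

A higher temperature flattens the distribution toward uniform sampling, and a lower one concentrates the batch on the most surprising transitions.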

Performance Results

EchoRL improves on the ReAct baseline in success rate, throughput (ETPS), and cost per success on each task listed below:

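| Task | Method | Success@1 (%) | ETPS | Cost/Success |
| --- | --- | --- | --- | --- |
| ALFWorld | ReAct | 58.3 | 1,234 | $0.041 |
| ALFWorld | EchoRL | 73.1 | 2,721 | $0.027 |
| WebShop | ReAct | 58.3 | 1,234 | $0.041 |
| WebShop | EchoRL | 73.1 | 2,721 | $0.027 |
| CRUXEval | ReAct | 58.3 | 1,234 | $0.041 |
| CRUXEval | EchoRL | 73.1 | 2,721 | $0.027 |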

Key Improvements

  • 30-55% fewer environment steps through trajectory-conditioned actions
  • 1.5-2.3× ETPS increase via KV-cache sharing and token-level dispatch
  • 22-41% cost reduction through the prioritized replay system
  • 78% KV reuse rate with the prefix caching strategy
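
As a quick arithmetic check, the ratios implied by the results table can be recomputed and compared with the ranges quoted above. The numbers are copied from the ReAct and EchoRL rows, which are identical for all three tasks in the table; the variable names are illustrative.

```python
# Figures copied from the results table (ReAct vs. EchoRL rows).
react = {"success": 58.3, "etps": 1234, "cost": 0.041}
echo = {"success": 73.1, "etps": 2721, "cost": 0.027}

success_gain_pts = echo["success"] - react["success"]  # percentage points
etps_ratio = echo["etps"] / react["etps"]              # throughput multiplier
cost_reduction = 1 - echo["cost"] / react["cost"]      # fraction of cost saved

print(f"Success@1 gain: {success_gain_pts:.1f} points")
print(f"ETPS ratio:     {etps_ratio:.2f}x")
print(f"Cost reduction: {cost_reduction:.1%}")
```

The table rows work out to a gain of about 14.8 percentage points, a throughput ratio of about 2.2×, and a cost reduction of about 34%, which fall inside the 1.5-2.3× and 22-41% ranges listed above.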

Supported Tasks

ALFWorld

Text-world control tasks requiring object manipulation and navigation.

from echo_rl.environments.alfworld import ALFWorldEnvironment, ALFWorldConfig

config = ALFWorldConfig(task_type="pick_and_place", max_objects=10)
env = ALFWorldEnvironment(config)

WebShop

Web-based shopping agent tasks with product search and purchase completion.

from echo_rl.environments.webshop import WebShopEnvironment, WebShopConfig

config = WebShopConfig(website_type="electronics", budget_limit=1000.0)
env = WebShopEnvironment(config)

CRUXEval

Code repair and debugging tasks requiring bug identification and fixing.

from echo_rl.environments.cruxeval import CRUXEvalEnvironment, CRUXEvalConfig

config = CRUXEvalConfig(language="python", max_code_length=1000)
env = CRUXEvalEnvironment(config)

ARC

Abstract reasoning tasks with grid-based puzzles requiring pattern recognition.

from echo_rl.environments.arc import ARCEnvironment, ARCConfig

config = ARCConfig(grid_size=10, task_type="pattern_completion")
env = ARCEnvironment(config)

MiniGrid

Grid-world planning tasks with navigation, object manipulation, and goal completion.

from echo_rl.environments.minigrid import MiniGridEnvironment, MiniGridConfig

config = MiniGridConfig(grid_size=8, task_type="key_door")
env = MiniGridEnvironment(config)

Monitoring and Evaluation

Performance Monitoring

from echo_rl.utils.monitoring import PerformanceMonitor, MetricsCollector

# Real-time performance tracking
monitor = PerformanceMonitor()
monitor.start_monitoring()

# Comprehensive metrics collection
collector = MetricsCollector()
# performance_metrics must be supplied by the caller; its type is not shown in this README
collector.collect_metrics(performance_metrics)

Benchmarking

import asyncio
from echo_rl.evaluation.benchmark import EchoRLBenchmark, BenchmarkConfig

config = BenchmarkConfig(
    tasks=["alfworld", "webshop", "cruxeval"],
    backbones=["gpt-4o", "claude-3.5-sonnet"],
    baselines=["react", "tot", "ppo-rlhf"],
    num_seeds=10
)

async def main():
    # run_benchmark() is a coroutine, so it must be awaited inside an async function
    benchmark = EchoRLBenchmark(config)
    return await benchmark.run_benchmark()

results = asyncio.run(main())

Configuration

Training Configuration

from echo_rl.training.trainer import TrainingConfig

config = TrainingConfig(
    env_name="alfworld",
    total_timesteps=1000000,
    learning_starts=10000,
    train_frequency=4,
    evaluation_frequency=10000,
    save_frequency=50000,
    num_actors=128,
    num_learners=2,
    batch_size=256,
    device="cuda"
)

Component Configurations

from echo_rl.core import PlanningConfig, ExecutionConfig, ReplayConfig, PPOConfig

# Latent planning
planning_config = PlanningConfig(
    embedding_dim=512,
    state_window_size=8,
    kl_weight=0.1,
    learning_rate=3e-4
)

# Async execution
execution_config = ExecutionConfig(
    max_concurrent_rollouts=128,
    max_cache_size=10000,
    timeout=30.0
)

# Prioritized replay
replay_config = ReplayConfig(
    hot_buffer_size=1000000,
    cold_buffer_size=10000000,
    age_threshold=1000,
    temperature=1.0
)

# PPO learner
ppo_config = PPOConfig(
    learning_rate=3e-4,
    clip_epsilon=0.2,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    kl_coef=0.1,
    gae_lambda=0.95,
    gamma=0.99
)
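
The roles of `gamma`, `gae_lambda`, and `clip_epsilon` in the PPO configuration can be shown with the textbook formulas for Generalized Advantage Estimation and the clipped surrogate objective. This is a generic sketch of those standard techniques under the default values above, not EchoRL's learner; both function names are hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation for one finished episode.

    `values` holds V(s_0..s_{T-1}); the value after the final step is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """PPO clipped objective for one sample (to be maximized)."""
    clipped = max(1 - clip_epsilon, min(ratio, 1 + clip_epsilon))
    return min(ratio * advantage, clipped * advantage)


adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.6, 0.7])
obj = clipped_surrogate(ratio=1.5, advantage=adv[-1])
print(adv, obj)
```

In the example the final-step advantage is 0.3, and a probability ratio of 1.5 is clipped to 1.2, so the objective is 0.36 rather than 0.45; this is how `clip_epsilon` limits the size of a single policy update.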

Examples

Training Examples

Component Examples

Testing

Run the test suite:

# Run all tests
pytest tests/

# Run specific test categories
pytest tests/test_core/          # Core components
pytest tests/test_environments/ # Environment interfaces
pytest tests/test_training/     # Training infrastructure
pytest tests/test_evaluation/   # Evaluation and benchmarking

Benchmarks

Reproducing Paper Results

To reproduce the results from the EchoRL paper:

# Full benchmark across all tasks and backbones
python examples/benchmark_echo_rl.py \
    --tasks alfworld webshop cruxeval arc minigrid \
    --backbones gpt-4o claude-3.5-sonnet gemini-1.5-pro llama-4 qwen-7b deepseek-r1 \
    --baselines react tot ppo-rlhf rlaif impala \
    --num-seeds 10 \
    --num-episodes 100

Custom Benchmarks

Create custom benchmark configurations:

from echo_rl.evaluation.benchmark import BenchmarkConfig

config = BenchmarkConfig(
    tasks=["custom_task"],
    backbones=["custom_backbone"],
    baselines=["custom_baseline"],
    num_seeds=5,
    num_episodes=50,
    echo_rl_configs={
        "total_timesteps": 50000,
        "num_actors": 64
    }
)

License

This project is licensed under the MIT License - see the LICENSE file for details.
