A benchmarking harness for evaluating tool-using LLM agents. Multi-turn agent/user loops, sandboxed execution, deterministic grading, and rich telemetry — across any provider via LiteLLM.
- Agent + User Loop – Multi-turn conversations where both agent and user models call tools.
- Sandboxed Execution – Tool calls proxy into Dockerized services with no external network access.
- MCP-Compatible Tooling – Tasks declare tools via Model Context Protocol or built-ins.
- Deterministic Grading – JSONPath assertions, state hashes, transcript rules, optional LLM judges (sketched below).
- Rich Metrics – pass@k, cost/token estimates, latency percentiles, failure attribution.
- Distributed Runner – SQLite for local runs, Postgres for multi-machine execution.
- Bring-Your-Own Models – Any provider supported by LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, OpenRouter, and more).
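To make the grading idea concrete, here is a minimal, hypothetical sketch of a JSONPath assertion checked against a captured environment state. It uses the jsonpath-ng library; the `final_state` and `assertion` shapes are illustrative only, not the tolokaforge grading schema (see docs/GRADING.md for the real one).

```python
# Illustrative only -- not the tolokaforge grading schema (see docs/GRADING.md).
# Requires: pip install jsonpath-ng
from jsonpath_ng import parse

# Hypothetical final environment state captured after an agent run.
final_state = {"orders": [{"id": "A-1", "status": "shipped", "total": 42.0}]}

# A deterministic assertion: the JSONPath must resolve to the expected value.
assertion = {"path": "$.orders[0].status", "expected": "shipped"}

matches = [m.value for m in parse(assertion["path"]).find(final_state)]
print("PASS" if matches == [assertion["expected"]] else f"FAIL: got {matches}")
```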
```bash
pip install tolokaforge              # core
pip install "tolokaforge[browser]"   # + Playwright
pip install "tolokaforge[all]"       # everything
```

Dev install:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

See the Python Package Guide for all extras and programmatic API usage.
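Until you read that guide, one hedged way to script a run from Python is to drive the CLI shown in this README via subprocess; the native import-based API lives in docs/PYTHON_PACKAGE.md and examples/package_api/, so treat this as a placeholder, not the package API.

```python
# Drives the documented CLI from Python. The native import-based API is
# described in docs/PYTHON_PACKAGE.md; this sketch avoids unverified imports.
import subprocess
from pathlib import Path

config = Path("examples/custom_grading/run_config.yaml")
subprocess.run(
    ["uv", "run", "tolokaforge", "run", "--config", str(config)],
    check=True,  # raise if the benchmark run exits non-zero
)
```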
```bash
# 1. Configure provider keys
cp .env.example .env

# 2. Run an example benchmark (Docker services auto-start)
scripts/with_env.sh uv run tolokaforge run --config examples/custom_grading/run_config.yaml

# 3. Check results
uv run tolokaforge status --run-dir results/custom_grading_<timestamp>
```

Docker services start automatically via `auto_start_services` (default: `true`), so no manual `docker compose up` is needed.
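If a run fails immediately, a common cause is missing provider keys. A quick sanity check, assuming python-dotenv is installed and that your providers use LiteLLM's standard variable names (adjust the key list to match your run config):

```python
# Sanity-check that .env provider keys are visible before a run.
# Requires: pip install python-dotenv. Key names follow LiteLLM conventions;
# which ones you need depends on the providers in your run config.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```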
For distributed execution, task packs, and advanced workflows see the Runner Guide.
```
tolokaforge/   # Installable Python package
├── cli/       # CLI commands (run, validate, status, analyze)
├── core/      # Orchestration, grading, metrics, queue
├── tools/     # Built-in + MCP tool system
└── env/       # Environment services (JSON DB, mock web, RAG)
examples/      # Example tasks and run configurations
```
| Topic | Link |
|---|---|
| Getting started | docs/GETTING_STARTED.md |
| Task authoring | docs/TASKS.md |
| Grading system | docs/GRADING.md |
| Tool reference | docs/TOOLS.md |
| Browser/mobile tools | docs/BROWSER_TOOLS.md |
| Runner & distributed execution | docs/RUNNER.md |
| Python package API | docs/PYTHON_PACKAGE.md |
| Task packs | docs/TASK_PACKS.md |
| Configuration reference | docs/REFERENCE.md |
| Security model | docs/SECURITY.md |
| Adapter architecture | docs/ADAPTER_ARCHITECTURE.md |
| Benchmark types | docs/BENCHMARK_TYPES.md |
| Testing guide | tests/README.md |
| Example | Description |
|---|---|
| examples/package_api/ | Programmatic run via Python imports |
| examples/analyze_results/ | Metrics and failure attribution analysis (pass@k sketch below) |
| examples/distributed_run/ | Queue-backed distributed execution |
| examples/custom_grading/ | Custom scoring patterns |
| examples/browser_task/ | Browser task authoring workflow |
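For reference when reading the analyze_results metrics: the standard unbiased pass@k estimator from the literature is pass@k = 1 - C(n-c, k) / C(n, k). This is background math, not necessarily the exact implementation here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    from n samples (c of them correct) is correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 10 attempts per task, 3 passed: pass@1 = 0.3, pass@5 ~= 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```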
```bash
make test             # all tests
make test-unit        # fast, isolated
make test-functional  # mocked externals
```

See tests/README.md for integration/E2E tests and contribution guidelines.
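As a rough illustration of the "mocked externals" pattern, here is a sketch that stubs the LiteLLM provider call so a test never touches the network. The patch target and fake payload shape are assumptions for illustration; the real fixtures and helpers live in tests/.

```python
# Sketch of a network-free functional test: stub the LiteLLM provider call.
# The patch target and fake payload shape are assumptions, not tolokaforge APIs.
import litellm

def fake_completion(*args, **kwargs):
    return {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}

def test_agent_turn_without_network(monkeypatch):  # pytest's monkeypatch fixture
    monkeypatch.setattr(litellm, "completion", fake_completion)
    resp = litellm.completion(model="gpt-4o", messages=[{"role": "user", "content": "hi"}])
    assert resp["choices"][0]["message"]["content"] == "ok"
```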
Apache-2.0 — see LICENSE.
See CONTRIBUTING.md.
Use CITATION.cff or CITATION.bib when referencing Tolokaforge in papers or reports.