A benchmarking harness for evaluating tool-using LLM agents. Multi-turn agent/user loops, sandboxed execution, deterministic grading, and rich telemetry — across any provider via LiteLLM.
- Agent + User Loop – Multi-turn conversations where both agent and user models call tools.
- Sandboxed Execution – Tool calls proxy into Dockerized services with no external network access.
- MCP-Compatible Tooling – Tasks declare tools via Model Context Protocol or built-ins.
- Deterministic Grading – JSONPath assertions, state hashes, transcript rules, optional LLM judges (sketched below).
- Rich Metrics – pass@k, cost/token estimates, latency percentiles, failure attribution.
- Distributed Runner – SQLite for local runs, Postgres for multi-machine execution.
- Bring-Your-Own Models – Any provider supported by LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, OpenRouter, and more).
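To make the grading idea concrete, here is a minimal, hypothetical sketch of a JSONPath assertion checked against a captured environment state. It uses the jsonpath-ng library; the `final_state` and `assertion` shapes are illustrative only, not the tolokaforge grading schema (see docs/GRADING.md for the real one).

```python
# Illustrative only -- not the tolokaforge grading schema (see docs/GRADING.md).
# Requires: pip install jsonpath-ng
from jsonpath_ng import parse

# Hypothetical final environment state captured after an agent run.
final_state = {"orders": [{"id": "A-1", "status": "shipped", "total": 42.0}]}

# A deterministic assertion: the JSONPath must resolve to the expected value.
assertion = {"path": "$.orders[0].status", "expected": "shipped"}

matches = [m.value for m in parse(assertion["path"]).find(final_state)]
print("PASS" if matches == [assertion["expected"]] else f"FAIL: got {matches}")
```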
```bash
pip install tolokaforge              # core
pip install "tolokaforge[browser]"   # + Playwright
pip install "tolokaforge[all]"       # everything
```

Dev install:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
```

See the Python Package Guide for all extras and programmatic API usage.
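Until you read that guide, one hedged way to script a run from Python is to drive the CLI shown in this README via subprocess; the native import-based API lives in docs/PYTHON_PACKAGE.md and examples/package_api/, so treat this as a placeholder, not the package API.

```python
# Drives the documented CLI from Python. The native import-based API is
# described in docs/PYTHON_PACKAGE.md; this sketch avoids unverified imports.
import subprocess
from pathlib import Path

config = Path("examples/custom_grading/run_config.yaml")
subprocess.run(
    ["uv", "run", "tolokaforge", "run", "--config", str(config)],
    check=True,  # raise if the benchmark run exits non-zero
)
```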
```bash
# 1. Configure provider keys
cp .env.example .env

# 2. Run an example benchmark (Docker services auto-start)
scripts/with_env.sh uv run tolokaforge run --config examples/custom_grading/run_config.yaml

# 3. Check results
uv run tolokaforge status --run-dir results/custom_grading_<timestamp>
```

Docker services start automatically via `auto_start_services` (default: `true`), so no manual `docker compose up` is needed.
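If a run fails immediately, a common cause is missing provider keys. A quick sanity check, assuming python-dotenv is installed and that your providers use LiteLLM's standard variable names (adjust the key list to match your run config):

```python
# Sanity-check that .env provider keys are visible before a run.
# Requires: pip install python-dotenv. Key names follow LiteLLM conventions;
# which ones you need depends on the providers in your run config.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```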
For distributed execution, task packs, and advanced workflows see the Runner Guide.
```
tolokaforge/   # Installable Python package
├── cli/       # CLI commands (run, validate, status, analyze)
├── core/      # Orchestration, grading, metrics, queue
├── tools/     # Built-in + MCP tool system
└── env/       # Environment services (JSON DB, mock web, RAG)
examples/      # Example tasks and run configurations
```
| Topic | Link |
|---|---|
| Getting started | docs/GETTING_STARTED.md |
| Task authoring | docs/TASKS.md |
| Grading system | docs/GRADING.md |
| Tool reference | docs/TOOLS.md |
| Browser/mobile tools | docs/BROWSER_TOOLS.md |
| Runner & distributed execution | docs/RUNNER.md |
| Python package API | docs/PYTHON_PACKAGE.md |
| Task packs | docs/TASK_PACKS.md |
| Configuration reference | docs/REFERENCE.md |
| Security model | docs/SECURITY.md |
| Adapter architecture | docs/ADAPTER_ARCHITECTURE.md |
| Benchmark types | docs/BENCHMARK_TYPES.md |
| Testing guide | tests/README.md |
| Example | Description |
|---|---|
| examples/package_api/ | Programmatic run via Python imports |
| examples/analyze_results/ | Metrics and failure attribution analysis (pass@k sketch below) |
| examples/distributed_run/ | Queue-backed distributed execution |
| examples/custom_grading/ | Custom scoring patterns |
| examples/browser_task/ | Browser task authoring workflow |
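For reference when reading the analyze_results metrics: the standard unbiased pass@k estimator from the literature is pass@k = 1 - C(n-c, k) / C(n, k). This is background math, not necessarily the exact implementation here.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    from n samples (c of them correct) is correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 10 attempts per task, 3 passed: pass@1 = 0.3, pass@5 ~= 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```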
```bash
make test             # all tests
make test-unit        # fast, isolated
make test-functional  # mocked externals
```

See tests/README.md for integration/E2E tests and contribution guidelines.
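As a rough illustration of the "mocked externals" pattern, here is a sketch that stubs the LiteLLM provider call so a test never touches the network. The patch target and fake payload shape are assumptions for illustration; the real fixtures and helpers live in tests/.

```python
# Sketch of a network-free functional test: stub the LiteLLM provider call.
# The patch target and fake payload shape are assumptions, not tolokaforge APIs.
import litellm

def fake_completion(*args, **kwargs):
    return {"choices": [{"message": {"role": "assistant", "content": "ok"}}]}

def test_agent_turn_without_network(monkeypatch):  # pytest's monkeypatch fixture
    monkeypatch.setattr(litellm, "completion", fake_completion)
    resp = litellm.completion(model="gpt-4o", messages=[{"role": "user", "content": "hi"}])
    assert resp["choices"][0]["message"]["content"] == "ok"
```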
Apache-2.0 — see LICENSE.
See CONTRIBUTING.md.
Use CITATION.cff or CITATION.bib when referencing Tolokaforge in papers or reports.