Tolokaforge

A benchmarking harness for evaluating tool-using LLM agents. Multi-turn agent/user loops, sandboxed execution, deterministic grading, and rich telemetry — across any provider via LiteLLM.

Highlights

  • Agent + User Loop – Multi-turn conversations where both agent and user models call tools.
  • Sandboxed Execution – Tool calls proxy into Dockerized services with no external network access.
  • MCP-Compatible Tooling – Tasks declare tools via Model Context Protocol or built-ins.
  • Deterministic Grading – JSONPath assertions, state hashes, transcript rules, optional LLM judges.
  • Rich Metrics – pass@k, cost/token estimates, latency percentiles, failure attribution. (A short sketch of a JSONPath check and the pass@k estimator follows this list.)
  • Distributed Runner – SQLite for local runs, Postgres for multi-machine execution.
  • Bring-Your-Own Models – Any provider supported by LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, OpenRouter, and more).
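
To make the grading and metrics bullets concrete, here is a minimal Python sketch. The JSONPath assertion uses the jsonpath-ng library against an invented state shape; it illustrates the kind of deterministic check involved, not Tolokaforge's actual grading schema. The pass_at_k function is the standard unbiased estimator from Chen et al. (2021): with n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k).

import math
from jsonpath_ng import parse  # pip install jsonpath-ng

# Deterministic check in the spirit of a JSONPath assertion.
# The state shape and expression are illustrative, not Tolokaforge's schema.
final_state = {"orders": [{"id": 7, "status": "shipped"}]}
matches = [m.value for m in parse("$.orders[0].status").find(final_state)]
assert matches == ["shipped"], f"grading assertion failed, got {matches}"

# Unbiased pass@k estimator (Chen et al., 2021): given n samples with c
# passes, pass@k = 1 - C(n-c, k) / C(n, k).
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every k-subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=3, k=5))  # 0.9166...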

Installation

pip install tolokaforge                # core
pip install "tolokaforge[browser]"     # + Playwright
pip install "tolokaforge[all]"         # everything

Dev install:

curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

See Python Package Guide for all extras and programmatic API usage.
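
A minimal sketch of programmatic use, assuming a top-level entry point; every name below (RunConfig, run_benchmark, pass_rate) is an illustrative assumption rather than the confirmed API, so consult docs/PYTHON_PACKAGE.md for the real one.

# Hypothetical programmatic entry point: names are illustrative,
# not the confirmed API. See docs/PYTHON_PACKAGE.md.
from tolokaforge import RunConfig, run_benchmark  # assumed imports

config = RunConfig.from_yaml("examples/custom_grading/run_config.yaml")
results = run_benchmark(config)   # assumed helper
print(results.pass_rate)          # assumed attribute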

Quick Start

# 1. Configure provider keys
cp .env.example .env

# 2. Run an example benchmark (Docker services auto-start)
scripts/with_env.sh uv run tolokaforge run --config examples/custom_grading/run_config.yaml

# 3. Check results
uv run tolokaforge status --run-dir results/custom_grading_<timestamp>

Docker services start automatically via auto_start_services (default: true). No manual docker compose up needed.
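
To manage services yourself, disable the flag in your run config. A minimal sketch, assuming auto_start_services sits at the top level of run_config.yaml (the placement is an assumption; docs/REFERENCE.md has the authoritative schema):

# run_config.yaml
auto_start_services: false   # default: true; when false, start Docker services manually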

For distributed execution, task packs, and advanced workflows see the Runner Guide.

Project Structure

tolokaforge/          # Installable Python package
├── cli/              # CLI commands (run, validate, status, analyze)
├── core/             # Orchestration, grading, metrics, queue
├── tools/            # Built-in + MCP tool system
└── env/              # Environment services (JSON DB, mock web, RAG)
examples/             # Example tasks and run configurations

Documentation

Topic                             Link
Getting started                   docs/GETTING_STARTED.md
Task authoring                    docs/TASKS.md
Grading system                    docs/GRADING.md
Tool reference                    docs/TOOLS.md
Browser/mobile tools              docs/BROWSER_TOOLS.md
Runner & distributed execution    docs/RUNNER.md
Python package API                docs/PYTHON_PACKAGE.md
Task packs                        docs/TASK_PACKS.md
Configuration reference           docs/REFERENCE.md
Security model                    docs/SECURITY.md
Adapter architecture              docs/ADAPTER_ARCHITECTURE.md
Benchmark types                   docs/BENCHMARK_TYPES.md
Testing guide                     tests/README.md

Examples

Example                     Description
examples/package_api/       Programmatic run via Python imports
examples/analyze_results/   Metrics and failure attribution analysis
examples/distributed_run/   Queue-backed distributed execution
examples/custom_grading/    Custom scoring patterns
examples/browser_task/      Browser task authoring workflow

Testing

make test              # all tests
make test-unit         # fast, isolated
make test-functional   # mocked externals

See tests/README.md for integration/E2E tests and contribution guidelines.

License

Apache-2.0 — see LICENSE.

Contributing

See CONTRIBUTING.md.

Citation

Use CITATION.cff or CITATION.bib when referencing Tolokaforge in papers or reports.
