Add additional support for autonomous ML experiments #521
Conversation
…ers, run status, structured alerts
Adds agent-facing query APIs and tooling to make Trackio the definitive tracker
for autonomous ML experiments driven by AI coding agents.
New CLI commands:
- `trackio best` — rank runs by metric, return winner + leaderboard
- `trackio compare` — side-by-side run comparison across metrics
- `trackio summary` — full experiment overview with status/configs/metrics
New features:
- Run status tracking (running/finished/failed) with automatic lifecycle mgmt
- Structured alert data (`data={}` param on `trackio.alert()`)
- Metric watchers (`trackio.watch()`) for auto NaN/spike/stagnation detection
- `trackio.should_stop()` for training loop early stopping
- Python API: `run.metrics()`, `run.history()`, `run.summary`, `run.status`
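For orientation, a minimal sketch of these accessors in use; it assumes `trackio.init()` returns the `Run` object (as the existing API does) and uses placeholder project and metric names:

```python
import trackio

# Placeholder project/metric names; the accessors below are the ones listed
# above (run.metrics(), run.history(), run.summary, run.status).
run = trackio.init(project="agent-experiments")
trackio.log({"loss": 0.42})
trackio.finish()

print(run.status)     # lifecycle status: "running", "finished", or "failed"
print(run.summary)    # summary of logged metrics
print(run.metrics())  # logged metrics (exact return shape not specified here)
print(run.history())  # step-by-step metric history
```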
Test infrastructure:
- Synthetic training simulator (no ML deps, runs in seconds)
- Agent test runner with 5 experiment types
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Resolve merge conflicts integrating autonomous ML features with main branch changes (run_id support, RemoteClient, query command, OAuth, error handling). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
🪼 branch checks and previews
Install Trackio from this PR (includes built frontend): `pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/e58200b25c87922e98bbc1241009e5f2c26848eb/trackio-0.25.1-py3-none-any.whl"`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fixes:
- Fix 1: Run status "failed" no longer overwritten by "finished" —
finish() now accepts a status parameter, _cleanup_current_run passes
status="failed" directly
- Fix 2: Watcher patience supports both min and max mode via new
mode parameter (was hardcoded to minimization only)
- Fix 3: Replace dead --minimize/--maximize flags with --direction
{min,max} on trackio best
- Fix 5: Watcher _values now uses deque(maxlen=window) to bound memory;
  alert dedup via per-condition flags that reset on return to normal
  (see the sketch after this list)
- Fix 6: Warn if watchers exist when init() clears them; docstring
documents ordering requirement
- Fix 7: Drop Api.Run.summary cache — recompute on each access
- Fix 9: set_run_status/get_run_status now accept and use run_id,
INSERT handles run_id NOT NULL column in new schema
- Fix 12: Remove unused enumerate variable in agent_runner
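For context on Fix 5, a generic sketch of the bounded-window plus per-condition dedup pattern it describes; this is illustrative only, not Trackio's actual watcher code, and the class and attribute names are invented:

```python
from collections import deque

class WindowedWatcher:
    """Illustrative only: bounded history plus alert dedup via a flag
    that resets once the metric returns to normal."""

    def __init__(self, threshold: float, window: int = 50):
        self.threshold = threshold
        self._values = deque(maxlen=window)  # memory bounded by the window size
        self._alerted = False                # per-condition dedup flag

    def update(self, value: float) -> bool:
        """Return True exactly once per excursion above the threshold."""
        self._values.append(value)
        if value > self.threshold:
            if not self._alerted:
                self._alerted = True
                return True          # fire a single alert for this excursion
            return False             # still above threshold: deduplicated
        self._alerted = False        # back to normal: allow future alerts
        return False
```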
Tests:
- test_watchers.py: 18 tests covering nan, spike, max/min threshold,
dedup, patience min/max mode, window bounds, manager propagation
- test_run_status.py: 6 tests covering running→finished, failed status,
idempotent finish, Api.Run.status, multi-run status
- test_cli_agent_commands.py: 10 tests covering best/compare/summary
in JSON and human-readable modes, error cases
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ests
- Watchers: 18 → 9 tests by merging nan+inf, combining dedup with trigger tests, merging patience modes
- Run status: 6 → 4 tests by dropping idempotent (covered by overwrite)
- CLI: 10 → 4 tests by seeding project once via module fixture, dropping human-readable format tests, merging maximize/subset into main tests
Total: 34 → 17 tests, 494 → 336 lines, same code path coverage. CLI tests 3x faster (7.5s vs 25s) from shared fixture.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
I had Claude analyze it and write up some actionable feedback. At first I was going to go through and leave comments for each item in the source code, but I realized it's probably more efficient to just include the Markdown file here. Note that Claude considers two issues blockers:
Short description
Apologies for the large PR. About half the additions are from tests and docs.
This PR adds features designed for use in AI-driven training loops where an agent or automated script needs to monitor metric health, compare runs, and query results
programmatically without human supervision.
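As a rough sketch of that workflow, the loop below registers a watcher, logs a synthetic loss, and polls `trackio.should_stop()`. The custom-callback argument (`values`) and the project name are assumptions; `watch("loss", fn=...)`, the `{"stop": True}` convention, and `should_stop()` follow the descriptions below.

```python
import random
import trackio

def my_check(values):
    # Assumption: the callback receives recent metric values. Returning a dict
    # containing {"stop": True} signals early stopping, per the description below.
    if values and values[-1] > 10.0:
        return {"stop": True}
    return None

trackio.init(project="agent-experiments")   # placeholder project name
trackio.watch("loss", fn=my_check)          # custom condition callback

for step in range(1000):
    loss = 1.0 / (step + 1) + random.random() * 0.01   # synthetic stand-in loss
    trackio.log({"loss": loss})
    if trackio.should_stop():               # set when any watcher condition fires
        break

trackio.finish()
```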
Metric watchers (`trackio.watch()` / `trackio.should_stop()`): Register rules upfront and have Trackio fire alerts automatically on every `trackio.log()` call. Supported conditions: `NaN`/`Inf` detection, max/min threshold breaches, stagnation (no improvement) over `N` steps, and a custom `fn` callback: `trackio.watch("loss", fn=my_check)`. Any condition that warrants stopping sets `trackio.should_stop()` to `True`, which training loops can poll to exit early. Custom conditions can signal early stopping by including `"stop": True` in a returned alert dict.

Run status tracking: Runs now record a lifecycle status (`running` → `finished`/`failed`) in SQLite. Status is set to `running` on `init()`, `finished` on `finish()`, and `failed` on unexpected process exit. The `best`, `compare`, and `summary` commands default to finished runs only; `--include-all` overrides this.

New CLI commands:
- `trackio best --project X --metric Y [--direction min|max] [--mode last|min|max]` — find the best run for a metric
- `trackio compare --project X [--runs ...] [--metrics ...]` — side-by-side run comparison
- `trackio summary --project X [--runs ...]` — per-run metric summary table

Python API additions on `Run`: `status`, `final_metrics`, `metrics()`, `history()`.

`AlertReason` constants: Every watcher-generated alert includes a `data["reason"]` field matching one of the `trackio.AlertReason` constants, allowing agents to identify alert types programmatically. Custom conditions that return `True` use `AlertReason.CUSTOM`.

AI Disclosure
Type of Change
Related Issues
Closes:
Testing and linting
python -m pytest
ruff check --fix --select I && ruff format