# PR #521 Code Review — Autonomous ML Experiments

## Context

PR #521 (`saba/auto-ml`) adds an autonomous-experiment toolkit on top of trackio: metric watchers (`trackio.watch()` / `trackio.should_stop()`), run-status tracking (`running` → `finished` / `failed`), three new CLI commands (`best`, `compare`, `summary`), Python API additions (`Run.status`, `Run.final_metrics`, `Run.metrics()`, `Run.history()`), `AlertReason` constants, and a persisted `data` column on alerts. ~2,100 added lines across 17 files (about half tests/docs).

This document is a review-only artifact: it lists every issue found, ranked by severity, with file:line citations and concrete fix suggestions. The user will manually triage these onto the PR.

---

## Blockers

### B1. Run-status writes hit the local SQLite DB unconditionally — even for Space / self-hosted runs

`trackio/run.py:169` (in `Run.__init__`) and `trackio/run.py:1032` (end of `Run.finish`) call `SQLiteStorage.set_run_status(...)` directly, with no branch on `self._is_local`. `set_run_status` calls `SQLiteStorage.init_db(project)` (`trackio/sqlite_storage.py:2314`), which creates a local `~/.cache/huggingface/trackio/{project}.db` even for Space-only or self-hosted-only users.

Consequences:

1. Every Space-mode run leaves a stub local DB containing only a `configs` row with `status='running'/'finished'`. The remote Space DB never receives the status update.
2. `Run.status`, `Run.final_metrics`, `Run.metrics()`, `Run.history()` (all in `trackio/api.py:25-63`) and `Api.runs()` (`api.py:121-124`) read from local SQLite directly. For Space-hosted projects they return whatever the local stub happens to contain — not what the dashboard shows. No warning, no fallback to the remote client.
3. `cli.py` `list runs --space` (`cli.py:941`) calls `SQLiteStorage.get_run_statuses(args.project)` against the local DB to annotate remote runs — this only "works" when the command runs on the same machine as the runs, and only because of side effect 1.

**Fix sketch:** Route status writes through the remote client when `space_id` or `server_base_url` is set, mirroring how alerts/logs are sent (`run.py:430-536`). Add `bulk_run_status` (or piggy-back on `bulk_alert`) to `server.py`. For reads, `Api`/`Run` need a remote variant or should at minimum raise/warn when the project is remote.
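
A minimal sketch of the write-path split, assuming a `_is_local` flag on the run and a hypothetical `bulk_run_status` client method (neither the method name nor its signature exists yet):

```python
# Sketch only: `self._client` and `bulk_run_status` are hypothetical names;
# the real split should mirror how alerts are sent (run.py:430-536).
def _set_status(self, status: str) -> None:
    if self._is_local:
        SQLiteStorage.set_run_status(self.project, self.name, status)
    else:
        # Space / self-hosted: route through the remote client so the
        # dashboard's DB gets the update and no local stub is created.
        self._client.bulk_run_status(self.project, self.name, status)
```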

### B2. Remote-mode `Run.alert(..., data=...)` silently drops the `data` field

`trackio/server.py:611-642` (`bulk_alert` HTTP handler) builds `alerts_by_run` and the outgoing `payload` *without* `data_list`. The local-only `SQLiteStorage.bulk_alert` was updated to take and persist `data_list` (`sqlite_storage.py:1420-1488`), and `Run._write_alerts_to_sqlite` passes it correctly (`run.py:399-422`). But the remote send path goes through `server.bulk_alert`, which ignores the field.

End result: when a Space-hosted run does `trackio.alert(..., data={...})`, the alert lands on the remote DB with `data=NULL`. `AlertReason` filtering (the headline use case in `docs/source/alerts.md:135-146`) only works locally.

**Fix:** Mirror the field through `server.bulk_alert` — add `"data_list"` to the per-run accumulator and the outgoing payload. (Also consider adding it as `entry.get("data")` and guarding with `has_data`.)
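
A sketch of the mirrored field, assuming the handler's accumulator looks roughly like the review describes (dict keys approximate, not copied from `server.py`):

```python
# Approximate shape of the fix; actual field names in server.bulk_alert may differ.
for entry in alerts:
    bucket = alerts_by_run.setdefault(
        entry["run_name"], {"titles": [], "texts": [], "data_list": []}
    )
    bucket["titles"].append(entry["title"])
    bucket["texts"].append(entry["text"])
    # Keep positions aligned across the three lists; None when no data was sent.
    bucket["data_list"].append(entry.get("data"))
```

`data_list` then goes into the outgoing payload alongside `titles`/`texts`.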

---

## Serious

### S1. New CLI commands silently ignore `--space`

`cli.py:890-895` only rejects `--space` for `show / status / sync / freeze / skills`. The new `best` (line 1409), `compare` (1477), and `summary` (1555) read directly from `SQLiteStorage`. Running `trackio best --space user/space --project foo --metric loss` reports against the local stub DB or errors with "Project not found" — neither is what a user would expect.

**Fix:** At minimum, add the three new commands to the `--space` deny list at `cli.py:890`. Better: implement remote variants via `_get_remote(args)` like `list` and `get` do (`cli.py:925-1107`).
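
The deny-list version is a one-liner; a sketch with assumed variable names (the set name and `args` attributes are not the file's actual identifiers):

```python
# cli.py:890 today rejects --space for a fixed set of commands; extend it.
SPACE_UNSUPPORTED = {"show", "status", "sync", "freeze", "skills",
                     "best", "compare", "summary"}
if args.space and args.command in SPACE_UNSUPPORTED:
    raise SystemExit(f"`trackio {args.command}` does not support --space yet")
```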

### S2. Watchers are not exercised against the production singleton in any test

`tests/e2e-local/test_watchers.py` constructs `MetricWatcher` / `WatcherManager` instances directly and never touches `trackio._watcher_manager` (the module-level singleton at `trackio/__init__.py:838`). Untested:

- `init()` clearing the singleton (`__init__.py:685-689`);
- `log()` invoking the singleton and re-firing alerts via `alert()` (`__init__.py:720-728`);
- The "watchers cleared" warning emitted on re-init.

If two tests in the same pytest session register watchers via `trackio.watch()` without an intervening `init()`, state leaks between tests.

**Fix:** Add a small e2e test that does `init() → watch() → log(NaN)` and then asserts an alert was fired and `should_stop() == True`, plus a second test that asserts the singleton is cleared by a subsequent `init()`.
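
Sketched against the public API described in the PR (the `nan=True` kwarg is an assumption from the PR description):

```python
import math

import trackio

def test_watch_fires_through_log(temp_dir):
    # Goes through trackio._watcher_manager via the public API, not a
    # hand-constructed WatcherManager.
    trackio.init(project="watcher-e2e")
    trackio.watch("loss", nan=True)  # kwarg assumed from the PR description
    trackio.log({"loss": math.nan})
    assert trackio.should_stop()
    trackio.finish()

def test_init_clears_watchers(temp_dir):
    trackio.init(project="watcher-e2e")
    trackio.watch("loss", nan=True)
    trackio.init(project="watcher-e2e")  # re-init should clear the singleton (and warn)
    trackio.log({"loss": math.nan})
    assert not trackio.should_stop()
    trackio.finish()
```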

### S3. `_cleanup_current_run` atexit path is never tested

The marquee bit of run-status tracking — marking abandoned runs as `failed` from the atexit hook (`__init__.py:103-110`) — has zero test coverage. `tests/e2e-local/test_run_status.py:22` (`test_status_failed_not_overwritten_by_finish`) calls `run.finish(status="failed")` directly; that's the explicit branch, not the atexit branch. If `_cleanup_current_run` were deleted, the test suite would still pass.

**Fix:** Spawn a subprocess that does `init() → log() → sys.exit(0)` (no `finish()`), then in the parent re-open the DB and assert `status == 'failed'`.
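
A sketch, assuming `get_run_statuses` returns a name→status mapping (shape inferred from its use at `cli.py:941`) and that the CLI-test pattern of passing `TRACKIO_DIR` via env applies here too:

```python
import os
import subprocess
import sys
import textwrap

from trackio.sqlite_storage import SQLiteStorage

def test_abandoned_run_marked_failed(temp_dir):
    child = textwrap.dedent("""
        import sys
        import trackio
        trackio.init(project="atexit-e2e", name="abandoned")
        trackio.log({"loss": 1.0})
        sys.exit(0)  # no finish(): the atexit hook must mark the run failed
    """)
    env = {**os.environ, "TRACKIO_DIR": str(temp_dir)}  # mirrors the CLI tests
    subprocess.run([sys.executable, "-c", child], check=True, env=env)
    statuses = SQLiteStorage.get_run_statuses("atexit-e2e")
    assert statuses["abandoned"] == "failed"
```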

### S4. `get_final_metric_for_runs` is O(rows × runs × metrics) Python JSON parsing

`trackio/sqlite_storage.py:3194-3260`: per-run query that fetches *every* metrics row, `orjson.loads`-es each entire `metrics` blob, then `deserialize_values`-es it, just to extract one scalar. Used by:

- `trackio best` — once total (acceptable);
- `trackio summary` — once per run (acceptable for small projects, painful otherwise);
- `trackio compare` — once **per (run × metric) pair** (`cli.py:1513-1519`), with `run_names=[run]`;
- `Api.Run.final_metrics` — once per metric per run (`api.py:30-40`).

For a project with R runs × M metrics × N steps, `compare` does R×M×N JSON parses.

**Fix:** Use SQLite's built-in `json_extract(metrics, '$."<metric>"')` plus `MIN`/`MAX`/last-step ordering in pure SQL, ideally in a single query that returns one row per (run, metric). Also accept multiple metrics in one call so `compare` can avoid the N+1 pattern.
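
A sketch of the single-query shape, assuming a `metrics(run_name, step, metrics)` table where the `metrics` column is a JSON object (inferred from the review, not verified against the actual schema):

```python
import sqlite3

def final_metric(conn: sqlite3.Connection, run_name: str, metric: str) -> float | None:
    # json_extract accepts a bound path parameter; quoting the key handles
    # metric names like "train/loss".
    path = f'$."{metric}"'
    row = conn.execute(
        "SELECT json_extract(metrics, ?) FROM metrics "
        "WHERE run_name = ? AND json_extract(metrics, ?) IS NOT NULL "
        "ORDER BY step DESC LIMIT 1",
        (path, run_name, path),
    ).fetchone()
    return row[0] if row else None
```

Extending this to `MIN`/`MAX` modes and to one query over all (run, metric) pairs removes the N+1 pattern in `compare`.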

Minor adjacent issue: the `WHERE run_name = ?` query at `sqlite_storage.py:3219` does not disambiguate by `run_id`, so duplicate-named runs across IDs are silently merged — inconsistent with the rest of the new code (e.g. `cli.py:973-977` already handles dupes).

### S5. Documentation contradicts code on spike-factor semantics

`docs/source/alerts.md:76` (parameter table) says `spike_factor=3.0` means "3× the average". `docs/source/alerts.md:105` (Conditions section) says it triggers when the value deviates by more than `(spike_factor - 1) × avg`. The actual code (`watchers.py:155-157`):

```python
abs(value - recent_avg) > abs_avg * (spike_factor - 1)
```

So `spike_factor=3.0` triggers when the **deviation** is 2× |avg|, not when the value is 3× avg. The two doc sentences contradict each other; the table is wrong, the prose-section formula is right (modulo `avg` vs `|avg|`). Also the prose says "exceeds" but the code is symmetric (drops trigger too — confirmed by `test_spike_detection_works_for_negative_metrics`).

**Fix:** Rewrite line 76 to match line 105 exactly, and replace "exceeds" with "deviates from" wherever it matters. Add a worked example: `spike_factor=3.0` with `avg=1.0` triggers at `|value - 1.0| > 2.0`.

### S6. `simulate_training.py` doesn't use the new watchers API

`autonomous-experiments/test_harness/simulate_training.py:109-162` manually calls `trackio.alert()` for NaN/Inf, divergence, stagnation, and overfitting. The PR's whole premise is that `trackio.watch()` replaces this boilerplate. As shipped, the harness:

- Does not exercise the production code paths it advertises;
- Reads as a *negative example* for users who copy from it.

**Fix:** Rewrite the simulator to register watchers (`trackio.watch("train/loss", nan=True, max_value=20.0, spike_factor=3.0)` etc.) and poll `trackio.should_stop()`. Keep the explicit `alert()` calls only where they demonstrate behavior watchers can't model.
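
A sketch of the rewritten loop, using the `watch()`/`should_stop()` surface described in the PR (exact kwargs assumed; `training_step` is a hypothetical stand-in for the simulator's existing loss computation):

```python
import trackio

trackio.init(project="harness-demo")
trackio.watch("train/loss", nan=True, max_value=20.0, spike_factor=3.0)
trackio.watch("val/loss", patience=10, mode="min")  # stagnation via patience

for step in range(200):
    train_loss, val_loss = training_step(step)  # hypothetical helper
    trackio.log({"train/loss": train_loss, "val/loss": val_loss})
    if trackio.should_stop():
        break

trackio.finish()
```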

### S7. `agent_runner.py` LR-schedule bug records the wrong learning rate

`autonomous-experiments/test_harness/agent_runner.py` (in the `failure_recovery` flow): line 94 mutates `lr *= 0.1`, line 96 records `attempts.append({"run": run_name, "status": "failed", "lr": lr * 10})` — using `lr * 10` as a back-calculated original. This is a time-bomb: only correct because the multiplier is the exact inverse of the mutation. Any change to the schedule silently records the wrong value.

**Fix:** Stash the original in a temp variable before mutating: `prev_lr = lr; lr *= 0.1; attempts.append({..., "lr": prev_lr})`.

### S8. Changeset body is too vague

`.changeset/every-spoons-smash.md` reads `feat:Add additional support for autonomous ML experiments`. No mention of `watch()` / `should_stop()`, `AlertReason`, run-status tracking, the new CLI commands, or the `alerts.data` SQL migration. Users reading the changelog learn nothing.

**Fix:** Enumerate the surface-area additions (Python API, CLI commands, schema migrations) so readers can find what they need.

---

## Medium

### M1. `get_run_records` LEFT JOIN drops just-initialized runs

`trackio/sqlite_storage.py:612-621`: `FROM metrics m LEFT JOIN configs c` — `metrics` is the driver. A run that's been initialized but has logged nothing yet (its `configs` row exists from `set_run_status('running')` but no `metrics` rows) is **never returned**. `get_runs()` is built on `get_run_records` (line 2118), so `cli list runs`, `cli compare`, and `cli summary` skip just-initialized runs. The dashboard inherits the same gap.

**Fix:** Either UNION against `configs` for run-ids that exist in `configs` but not `metrics`, or invert the join (drive from `configs`, LEFT JOIN `metrics`).
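
The inverted-join variant, with column names assumed from the review's description of the schema:

```python
# Drive from configs so a status-only run (no metrics rows yet) still
# produces an output row, with NULL metric columns.
QUERY = """
    SELECT c.run_name, c.config, m.step, m.timestamp, m.metrics
    FROM configs c
    LEFT JOIN metrics m ON m.run_name = c.run_name
"""
```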

### M2. `_cleanup_current_run` can block interpreter shutdown for up to 30 s

`__init__.py:103-110` calls `run.finish(status="failed")`, which sets `_stop_flag` then `thread.join(timeout=30)` (`run.py:988` / `run.py:1001`). At interpreter shutdown this can block 30 s and prints `"* Run finished. Uploading logs..."` (`run.py:987`/`999`) to stdout — jarring on Ctrl-C. `_emit_nonfatal_warning` post-shutdown can also fail because the `warnings` module may already be torn down. The `set_run_status('failed')` runs *after* the join (`run.py:1032`), so on a hard kill the status update may never land.

**Fix:** Add a short-deadline path for atexit (1–2 s), suppress the "Uploading logs" print when called from atexit, and write `set_run_status('failed')` *before* the flush attempt so the status update is durable even if the join times out.
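
A sketch of the reordered hook; `_current_run`, `run._finished`, and the `deadline`/`quiet` kwargs on `finish()` are all hypothetical names for what the fix would add:

```python
import atexit

from trackio.sqlite_storage import SQLiteStorage

def _cleanup_current_run():
    run = _current_run  # module-level handle, name assumed
    if run is None or run._finished:
        return
    # Durable bit first: the status lands even if the flush below times out
    # or the process is killed mid-join.
    SQLiteStorage.set_run_status(run.project, run.name, "failed")
    run.finish(status="failed", deadline=2.0, quiet=True)  # hypothetical kwargs

atexit.register(_cleanup_current_run)
```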

### M3. `_seed()` is rebuilt from scratch in every CLI test

`tests/e2e-local/test_cli_agent_commands.py:62, 90, 119, 148, 177, 197` — every test calls `_seed()` to spawn 6 runs through the full `init → log×N → finish` pipeline. The recent commit `66a4eff "Simplify tests: consolidate redundant cases, seed once"` advertises a fix that didn't land. Slow and redundant.

**Fix:** Make `_seed()` a session- or module-scoped fixture that returns the project name; have each test reuse it.
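
A sketch; the fixture scope must be compatible with however `temp_dir` is scoped, and `_run_cli` is a hypothetical helper standing in for the tests' subprocess invocation:

```python
import pytest

@pytest.fixture(scope="module")
def seeded_project(temp_dir):  # temp_dir from tests/conftest.py
    """Run the expensive init → log×N → finish pipeline once per module."""
    return _seed()  # assumed to return the project name

def test_summary(seeded_project):
    r = _run_cli("summary", "--project", seeded_project)  # hypothetical helper
    assert r.returncode == 0
```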

### M4. Weak assertions in CLI tests

- `test_summary` asserts `data["total_alerts"] >= 1` (`test_cli_agent_commands.py:184`). `_seed` deterministically fires exactly one alert, so this should be `== 1`. The `>=` masks regressions where extra alerts get fired (e.g., a watcher accidentally registered).
- `test_best_error_cases` (`test_cli_agent_commands.py:196-210`) asserts only `returncode != 0`. A non-zero exit from an unrelated cause (import error, etc.) would still pass. Add `r.stderr` content check.
- No test for non-JSON / human-formatted output of `best` / `compare` / `summary`. The `66a4eff` commit message says "drop format tests" — that's a coverage hole, not a simplification.

### M5. `test_run_status` doesn't assert `metric=` filter actually filters

`tests/e2e-local/test_run_status.py:55-60` asserts `len(loss_history) == 5` for `Run.history(metric="loss")`, but never asserts that `acc` is absent from each entry. Add `assert all("acc" not in entry["metrics"] for entry in loss_history)` (matching the actual schema).

### M6. `agent_runner.py` may flake from clock skew on `--since`

`autonomous-experiments/test_harness/agent_runner.py:134, 173` generate `since` timestamps via `datetime.now(timezone.utc).isoformat()` and pass them to `trackio list alerts --since=...`. If the simulator subprocess writes an alert in the same wall-clock second as `since`, the `>=` vs `>` boundary may drop or duplicate the alert — depending on which side `SQLiteStorage.get_alerts` uses.

**Fix:** Track alerts by `alert_id` instead of timestamp window, or subtract a small slop (e.g. 1 s) from `since`.

### M7. `agent_runner.py` `subprocess.Popen` can deadlock on full stderr

`agent_runner.py` polls `proc.poll()` and only calls `proc.communicate()` after exit. If the simulator child writes >64 KB to stderr, the pipe buffer fills and the child blocks forever.

**Fix:** Either drain pipes via threads / `asyncio`, or set `stderr=subprocess.DEVNULL` if the harness doesn't need stderr.
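
If the harness doesn't need to do other work while waiting, `communicate()` with a timeout both drains and waits:

```python
import subprocess

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
try:
    out, err = proc.communicate(timeout=600)  # drains both pipes while waiting
except subprocess.TimeoutExpired:
    proc.kill()
    out, err = proc.communicate()
```

If it does need to poll alerts concurrently, drain stderr in a background thread instead.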

### M8. No migration test for `alerts.data` on a pre-PR DB

`tests/unit/test_sqlite_storage.py` adds a legacy-schema test for the configs table, but nothing exercises the `ALTER TABLE alerts ADD COLUMN data` migration on a pre-PR DB. Symmetric coverage would catch any future schema drift.

**Fix:** Add `test_alerts_data_migration_on_legacy_db` mirroring the existing `test_get_run_records_with_run_id_but_no_finished_at` pattern.
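
A sketch; the pre-PR `alerts` DDL and the `get_project_db_path` helper name are assumed, not copied from the repo:

```python
import sqlite3

from trackio.sqlite_storage import SQLiteStorage

def test_alerts_data_migration_on_legacy_db(temp_dir):
    db_path = SQLiteStorage.get_project_db_path("legacy-project")  # helper name assumed
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE alerts (id INTEGER PRIMARY KEY, run_name TEXT, "
            "title TEXT, text TEXT, timestamp TEXT)"  # pre-PR columns assumed
        )
    SQLiteStorage.init_db("legacy-project")  # should ALTER TABLE alerts ADD COLUMN data
    with sqlite3.connect(db_path) as conn:
        cols = {row[1] for row in conn.execute("PRAGMA table_info(alerts)")}
    assert "data" in cols
```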

### M9. `Api`/`Run`/CLI methods for Space projects return wrong/empty data

(Cross-reference of B1.) Even after fixing B1's local-DB pollution, `trackio.Api().runs(project)`, `Run.status`, `Run.final_metrics`, `Run.metrics()`, `Run.history()` all read from local SQLite. If the project is hosted only on a Space, callers get empty/stale data with no warning.

**Fix:** Either route through `RemoteClient` like `cli.py` `_get_remote` does, or fail fast with a clear error message when the project's metadata says it's remote.

### M10. `test_watchers.py` patience semantics need a comment

`test_patience_min_mode` (`test_watchers.py:64-74`) calls `check` 5 times with `patience=3` and expects the alert only on the final call. Walking the code: the first call seeds `_best_value=1.0`; 0.9 is an improvement → reset; 0.95/0.95/0.95 → counter=3 → fires. So `patience=3` means 3 *non-improvements after the last best*, not 3 calls total. This subtlety deserves an inline comment in the test or the docstring; otherwise readers will mis-tune the parameter.

---

## Nits

### N1. `trackio.watch()` docstring contradicts implementation

`trackio/__init__.py:855-856` says `watch()` "must be called after `trackio.init()`". Nothing enforces this — calling `watch()` before any `init()` succeeds silently and the watcher persists across the first `init()`. Either enforce it (raise `RuntimeError`) or update the docstring.

### N2. `init()`-clears-watchers warning order

`trackio/__init__.py:685-689` emits the warning *only* when there were watchers, then clears. Behavior is correct, but the warning currently reads "trackio.init() cleared existing metric watchers" — phrased in past tense before clearing actually happens. Reorder for clarity, or rephrase ("…will clear…").

### N3. Inconsistent invocation of CLI vs simulator in agent_runner

`agent_runner.py` uses bare `trackio` from PATH for CLI invocation (line ~22) but `sys.executable simulate_training.py` for the trainer (line ~36). If a venv mismatches, `trackio` may resolve elsewhere. Standardize on `[sys.executable, "-m", "trackio", ...]`.

### N4. Dead code in `simulate_training.py`

Line 109-ish: `if math.isnan(train_loss) or math.isinf(train_loss):` can never fire, because `simulate_loss` returns `max(0.01, base + noise)`. Remove it, or actually inject a NaN to exercise the branch.

Also: `if args.spike_at_step and step == args.spike_at_step:` is always false when `--spike-at-step 0` is passed, because `0` is falsy. Use `args.spike_at_step is not None`.

### N5. `_watcher_manager` thread safety

The singleton `_watcher_manager` (`__init__.py:838`) has no locking. `MetricWatcher.check` mutates `_values`, `_best_value`, `_*_alerted`, `_steps_without_improvement`. If two threads in the same process both call `trackio.log()` against the same run (unusual but documented for some PyTorch DataLoader patterns), the deque mutations race. At minimum add a docstring caveat.

### N6. `tests/e2e-local/` directory not wired into CI / docs

The directory is new and contains no `conftest.py` of its own. It depends on `tests/conftest.py`'s `temp_dir` fixture, which monkey-patches `TRACKIO_DIR` as a module attribute. `test_cli_agent_commands.py` then spawns `subprocess.run` with `env={"TRACKIO_DIR": ...}`. Confirm:

1. `pytest` discovery picks up `tests/e2e-local/` (the existing convention is `tests/e2e/` and `tests/e2e-spaces/`);
2. The CLI subprocess respects `TRACKIO_DIR` env var the same way the in-process module attribute does.

If both are fine, leave as-is. If not, expect mysterious "Project not found" failures only on CI.

### N7. `EXPERIMENTS["all"] = None` sentinel

`agent_runner.py` (line ~187) maps `"all"` to `None` then special-cases it. Awkward — use a list of names or a separate flag.

### N8. `format_compare` truncation

`cli_helpers.py:193-194` does `run_w = max(...)` but never truncates long run names — a comparison of runs with 60-char names produces output wider than the terminal. Optional: cap at, say, 40 chars and ellipsize.

### N9. `trackio.watch()` doesn't validate `mode` / `fn` signature

If a user passes `mode="MIN"` or `fn=lambda v: ...` (one arg instead of two), they get an opaque error mid-training rather than at registration. Add eager validation in `watch()`.
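
Eager checks could be as small as the following (the two-argument `fn` contract is an assumption taken from the review's example):

```python
import inspect

def _validate_watch_args(mode: str, fn) -> None:
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    if fn is not None and len(inspect.signature(fn).parameters) != 2:
        raise TypeError("fn must accept exactly two arguments (value, step)")
```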

### N10. `Run.final_metrics` is a property that does N queries

`api.py:29-40` exposes `final_metrics` as a `@property`. Readers will mistake it for a cheap accessor; it actually re-parses every metrics row once per metric (via `get_final_metric_for_runs`). At minimum, drop the `@property` so callers see `()` and recognize the cost. Better: implement it via the SQL fix proposed in S4.

### N11. `_extract_reports` ignores `run_id`

`cli.py:152-174` filters reports by run name only; if duplicate-named runs exist, reports merge silently. Same disambiguation pattern that `cli list runs` uses (`cli.py:973-977`) should apply.

### N12. `should_stop()` summary in docs is incomplete

`docs/source/alerts.md:121` lists "NaN/Inf, max_value exceeded, or patience exhausted" — but custom watchers can also trigger stop via `"stop": True` in their returned dict (covered later at line 178). Fold it into the early summary.

### N13. `min_value` symmetry

Docs and code: `min_value` does **not** set `should_stop()` (`watchers.py:148`), while `max_value` does (line 128). The table at `alerts.md:80` correctly omits "Also sets `should_stop`" from `min_value`, but readers may infer symmetry. One sentence in the prose section would close that.

### N14. `Api.alerts` keyword inconsistency check

`docs/source/alerts.md:138` calls `trackio.Api().alerts("my-project", run="brave-sunset-0")`. Verify `Api.alerts()` actually accepts `run=` (not `run_name=`) — `api.py:126-135` does, but worth eyeballing before merge.

---

## Verification plan

For each finding above, the fix can be verified against the existing test infrastructure:

- **B1, B2, M9** — add e2e tests that initialize against a mocked `RemoteClient` and assert no local DB is created and that `data` arrives at the remote endpoint.
- **S1** — extend the `--space` deny-list test in `test_cli_agent_commands.py` to cover `best`/`compare`/`summary`.
- **S2** — new tests in `test_watchers.py` calling `trackio.init() → trackio.watch() → trackio.log()` and asserting alerts via `trackio.Api().alerts(...)`.
- **S3** — subprocess-based test in `test_run_status.py` that exits without `finish()`.
- **S4** — micro-benchmark with R=20, M=10, N=1000 before/after the SQL rewrite; assert `compare` runtime drops by ≥10×.
- **S5** — re-read `alerts.md` section vs. `watchers.py:152-176` after rewrite.
- **S6, S7, M6, M7, N3, N4, N7** — manually run `python autonomous-experiments/test_harness/agent_runner.py` against a fresh project and confirm output looks coherent and uses watchers.
- **S8** — eyeball `.changeset/every-spoons-smash.md`.
- **M1** — unit test that calls `init()` (no log) then `get_run_records()` and asserts the run is present.
- **M2** — manual test: kill a Python process mid-training with SIGTERM and confirm the status lands as `failed` quickly.
- **M3, M4, M5, M10** — direct edits to test files; rerun `pytest tests/e2e-local/`.
- **M8** — add legacy-schema test mirroring `test_get_run_records_with_run_id_but_no_finished_at`.
- All — `ruff check --fix --select I && ruff format && pytest`.

## Critical files (modify these)

- `trackio/run.py` (B1, M2)
- `trackio/server.py` (B2)
- `trackio/cli.py` (S1)
- `trackio/sqlite_storage.py` (S4, M1)
- `trackio/api.py` (M9, N10)
- `trackio/__init__.py` (M2, N1, N2)
- `trackio/watchers.py` (N5, N9)
- `docs/source/alerts.md` (S5, N12, N13)
- `tests/e2e-local/test_watchers.py` (S2, M10)
- `tests/e2e-local/test_run_status.py` (S3, M5)
- `tests/e2e-local/test_cli_agent_commands.py` (M3, M4)
- `tests/unit/test_sqlite_storage.py` (M8)
- `autonomous-experiments/test_harness/simulate_training.py` (S6, N4)
- `autonomous-experiments/test_harness/agent_runner.py` (S7, M6, M7, N3, N7)
- `.changeset/every-spoons-smash.md` (S8)
