Scope bucket sync to trackio/ subtree to avoid walking the HF cache #506
download_bucket_to_trackio_dir() previously called sync_bucket() with dest=TRACKIO_DIR.parent (i.e. ~/.cache/huggingface/). When computing the sync plan, sync_bucket os.walk()s the local dest to build an mtime map. On ML clients with large ~/.cache/huggingface/hub model caches (common on HPC / Lustre filesystems), that walk can take many minutes and makes trackio.init(bucket_id=...) appear to hang before the first log call.

Scoping source to the "trackio/" prefix and dest to TRACKIO_DIR limits the local walk to the trackio directory itself. Files in the bucket already live under trackio/ so the layout on disk is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
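The cost difference described in this commit message can be illustrated with a minimal sketch. It assumes only that the sync plan needs one mtime per local file. `build_mtime_map` and `make_fake_cache` are hypothetical helpers, not the actual sync_bucket code, and the temporary directory (including the `project.db` file name) merely imitates the layout of ~/.cache/huggingface/.

```python
import os
import tempfile
from pathlib import Path


def build_mtime_map(root: Path) -> dict[str, float]:
    """Walk root and record one mtime per file.

    Hypothetical stand-in for the walk a sync plan performs; not the real sync_bucket.
    """
    mtimes: dict[str, float] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            mtimes[str(path.relative_to(root))] = path.stat().st_mtime
    return mtimes


def make_fake_cache(base: Path, n_model_files: int = 200) -> tuple[Path, Path]:
    """Create an imitation HF cache: many files under hub/, one under trackio/."""
    hf_cache = base / "huggingface"
    hub = hf_cache / "hub"
    trackio_dir = hf_cache / "trackio"
    hub.mkdir(parents=True)
    trackio_dir.mkdir()
    for i in range(n_model_files):
        (hub / f"blob-{i}").write_bytes(b"x")
    (trackio_dir / "project.db").write_bytes(b"x")  # assumed file name
    return hf_cache, trackio_dir


with tempfile.TemporaryDirectory() as tmp:
    hf_cache, trackio_dir = make_fake_cache(Path(tmp))
    wide = build_mtime_map(hf_cache)       # old scope: dest = TRACKIO_DIR.parent
    narrow = build_mtime_map(trackio_dir)  # new scope: dest = TRACKIO_DIR
    print(len(wide), len(narrow))  # prints "201 1"
```

The wide walk stats every model blob even though only the single trackio file is relevant to the sync, so the number of stat calls, slow on network filesystems, grows with the size of the model cache rather than with the size of the bucket.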
🪼 branch checks and previews
🦄 change detected
This Pull Request includes changes to the following packages.
Install Trackio from this PR (includes built frontend):
pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/c59621e1221244309301529c16a3e1ed38074cf0/trackio-0.23.0-py3-none-any.whl"
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8c360da.
_emit_nonfatal_warning(
    f"trackio.init() could not recover the previous step for run '{run_name}' on Space '{space_id}': {e}. Continuing from step 0."
)
return None
Space-backed last step lookup falls through to SQLite
Medium Severity
In _safe_get_last_step_for_init, when space_id is not None and the Space API call succeeds but returns a non-dict summary, execution falls through to the local SQLiteStorage.get_max_step_for_run path instead of returning None. The sibling function _safe_get_runs_for_init correctly handles this by always returning within the space_id is not None block (via return runs if isinstance(runs, list) else []), but here the isinstance(summary, dict) guard only returns inside the if, leaving the false branch without a return statement. This defeats the PR's goal of avoiding local SQLite/bucket access for Space-backed runs.
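The control-flow issue reported above can be reproduced in isolation. The following is a minimal sketch, not the trackio source: `last_step_buggy`, `last_step_fixed`, and the callables passed in are hypothetical stand-ins, and the `"last_step"` dictionary key is assumed from the PR description.

```python
from typing import Any, Callable, Optional


def last_step_buggy(
    space_id: Optional[str],
    fetch_summary: Callable[[], Any],
    local_max_step: Callable[[], Optional[int]],
) -> Optional[int]:
    if space_id is not None:
        summary = fetch_summary()
        if isinstance(summary, dict):
            return summary.get("last_step")
        # No return here: a non-dict summary falls through to the local path.
    return local_max_step()


def last_step_fixed(
    space_id: Optional[str],
    fetch_summary: Callable[[], Any],
    local_max_step: Callable[[], Optional[int]],
) -> Optional[int]:
    if space_id is not None:
        summary = fetch_summary()
        # Always return inside the Space branch, mirroring the sibling function.
        return summary.get("last_step") if isinstance(summary, dict) else None
    return local_max_step()


local_calls = []


def local_max_step() -> int:
    """Stand-in for the local SQLite lookup; records that it was touched."""
    local_calls.append("sqlite")
    return 41


buggy = last_step_buggy("user/space", lambda: None, local_max_step)
calls_after_buggy = len(local_calls)
fixed = last_step_fixed("user/space", lambda: None, local_max_step)
calls_after_fixed = len(local_calls)
print(buggy, fixed, calls_after_buggy, calls_after_fixed)  # prints "41 None 1 1"
```

With a non-dict summary, the buggy variant reaches the local lookup and returns its value, while the fixed variant returns None without touching local storage.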
Pull request overview
Fixes long stalls during trackio.init(..., space_id=..., bucket_id=...) by narrowing bucket sync scope and avoiding local SQLite/bucket hydration for Space-backed resume checks.
Changes:
- Scope bucket sync from hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR to avoid walking the broader HF cache.
- Update trackio.init() resume logic for Space-backed runs to query the Space API (/get_runs_for_project, /get_run_summary) instead of local SQLite.
- Plumb pre-fetched existing_runs and initial_last_step into Run to avoid redundant lookups and preserve step continuity when resuming.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| trackio/run.py | Accepts and consumes pre-fetched run list / last-step metadata to avoid redundant lookups during initialization. |
| trackio/bucket_storage.py | Narrows bucket sync source/destination to the trackio/ subtree to reduce local filesystem traversal. |
| trackio/__init__.py | Implements Space API–backed resume inspection and step recovery; passes pre-fetched metadata into Run. |
| .changeset/few-bars-find.md | Adds a patch changeset entry describing the feature. |
def _safe_get_runs_for_init(
    project: str, space_id: str | None, resume: str
) -> list[str]:
    if space_id is not None:
        if resume == "never":
New Space-backed resume behavior (skip lookup for resume="never", fetch runs via /get_runs_for_project, and restore last_step via /get_run_summary) is not covered by unit tests. There are existing unit tests for local resume modes (tests/unit/test_run.py); consider adding tests that monkeypatch RemoteClient.predict for these Space flows to prevent regressions in resume semantics and step continuity.
Thanks for the suggestion. I am intentionally not adding new unit tests in this follow-up because this pass is focused on unblocking the PR branch and keeping the change surface minimal. I addressed the behavioral issues directly in trackio.init() by making Space resume checks deterministic and by reusing a single RemoteClient for both lookups.
if resume == "never":
    return []
For Space-backed runs, returning an empty list when resume == "never" disables the duplicate-name protection later in init(). If the caller passes an explicit name that already exists on the Space, the run will effectively append to the existing remote run (and step numbering will restart at 0), which violates the expected semantics of resume="never". Consider either (a) forcing a fresh generated name when space_id is set and resume == "never" (even if name is provided) or (b) doing a lightweight remote existence check only when name is provided, so you can still avoid collisions without hydrating local storage.
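Option (b) from the comment above could look roughly like the sketch below. It is an illustration under stated assumptions, not the change that was merged: `runs_to_check` and `fetch_run_names` are hypothetical names, and the remote lookup is simulated with an in-memory list.

```python
from typing import Callable, List, Optional


def runs_to_check(
    resume: str,
    name: Optional[str],
    fetch_run_names: Callable[[], List[str]],
) -> List[str]:
    """Return the run names init() should compare against for collisions."""
    if resume != "never":
        # resume in {"allow", "must"}: the full list of names is needed.
        return list(fetch_run_names())
    if name is None:
        # No explicit name was given, so the lookup is skipped entirely.
        return []
    # Explicit name with resume="never": one remote call, keep only a collision.
    return [existing for existing in fetch_run_names() if existing == name]


remote_calls = []


def fetch_run_names() -> List[str]:
    """Simulated Space lookup; records each call."""
    remote_calls.append("lookup")
    return ["baseline", "sweep-1"]


generated = runs_to_check("never", None, fetch_run_names)
explicit = runs_to_check("never", "baseline", fetch_run_names)
print(generated, explicit, len(remote_calls))  # prints "[] ['baseline'] 1"
```

Only the explicit-name case pays for a remote call, so duplicate-name protection is kept without hydrating local storage. A dedicated existence endpoint would be lighter still, but the thread does not name one.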
client = RemoteClient(
    space_id,
    hf_token=huggingface_hub.utils.get_token(),
    verbose=False,
)
RemoteClient is constructed here and again in _safe_get_last_step_for_init(). For Space-backed init() flows that resume a run, this can trigger multiple connection/probing steps (e.g., repeated /version checks) and extra network overhead during init. Consider creating a single RemoteClient in init() (when needed) and reusing it across both lookups.
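The suggestion to share one client can be sketched as follows. `CountingClient` is a hypothetical stand-in for RemoteClient, its `predict(*args, api_name=...)` signature and canned responses are assumptions made for illustration, and `get_runs` / `get_last_step` are simplified placeholders for the two lookups.

```python
from typing import Any, List, Optional


class CountingClient:
    """Hypothetical stand-in for RemoteClient that counts constructions."""

    constructed = 0

    def __init__(self, space_id: str) -> None:
        type(self).constructed += 1  # a real client would connect/probe here
        self.space_id = space_id

    def predict(self, *args: Any, api_name: str) -> Any:
        # Canned responses; the real payload shapes are assumptions.
        if api_name == "/get_runs_for_project":
            return ["run-a"]
        if api_name == "/get_run_summary":
            return {"last_step": 7}
        raise ValueError(f"unknown endpoint: {api_name}")


def get_runs(client: CountingClient, project: str) -> List[str]:
    runs = client.predict(project, api_name="/get_runs_for_project")
    return runs if isinstance(runs, list) else []


def get_last_step(client: CountingClient, project: str, run: str) -> Optional[int]:
    summary = client.predict(project, run, api_name="/get_run_summary")
    return summary.get("last_step") if isinstance(summary, dict) else None


# Build the client once and pass it to both lookups.
client = CountingClient("user/space")
runs = get_runs(client, "demo")
last_step = get_last_step(client, "demo", runs[0])
print(CountingClient.constructed, runs, last_step)  # prints "1 ['run-a'] 7"
```

Passing the client in as a parameter also makes both lookups easy to exercise with a fake in tests.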
Ensure Space summary lookups never fall through to local SQLite, reuse a single RemoteClient across resume metadata calls, and keep resume='never' duplicate-name protection for explicit run names.

Made-with: Cursor
Going to merge this in (but not do release) to unblock subsequent PRs |
Resolves conflicts in trackio/__init__.py and trackio/run.py, integrating the Space-backed resume lookups from #506 with the run_id-based resume logic on this branch. Space-backed init now queries /get_runs_for_project and /get_run_summary via the RemoteClient and passes the resolved run_id (when available) so step continuity works for multi-run-per-name projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
This PR fixes the long trackio.init(..., space_id=..., bucket_id=...) stall before the first log() call.

It does two things:
- Syncs only hf://buckets/<bucket_id>/trackio -> TRACKIO_DIR instead of walking the full Hugging Face cache directory
- Checks resume state for Space-backed runs through the Space API instead of local SQLite

That means:
- resume="never" skips run lookup entirely for Space-backed runs
- resume in {"allow", "must"} fetches run names from /get_runs_for_project
- Resumed runs restore last_step from /get_run_summary so step numbering continues correctly

Why
The original hang came from sync_bucket(...) computing a sync plan against a very large local HF cache. On machines with lots of downloaded models, walking ~/.cache/huggingface/ can take minutes even for a tiny Trackio bucket.

Separately, trackio.init() was checking resume state through local SQLite, which triggered bucket hydration on Space-backed runs even though the run metadata already exists on the Space.

Test plan
- PYTHONPATH=. pytest -q tests/unit/test_run.py
- python -m compileall trackio/__init__.py trackio/run.py tests/unit/test_run.py

Note
Medium Risk
Changes trackio.init() resume behavior for Space-backed runs and alters bucket hydration paths, which could affect run resumption/step continuity and local cache expectations if the Space APIs or bucket layout differ.

Overview
Fixes long stalls when initializing with space_id + bucket_id by syncing only hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR (instead of walking the broader cache/parent directory).

For Space-backed runs, trackio.init() now avoids local SQLite/bucket hydration for resume checks: it skips run lookup entirely for resume="never", fetches existing run names via /get_runs_for_project, and (when resuming) restores step continuity by fetching last_step via /get_run_summary and passing pre-fetched resume metadata into Run to avoid redundant lookups.

Reviewed by Cursor Bugbot for commit 8c360da.