Scope bucket sync to trackio/ subtree to avoid walking the HF cache #506

Merged
abidlabs merged 8 commits into main from fix-bucket-sync-hang on Apr 17, 2026

Member

@abidlabs abidlabs commented Apr 17, 2026

Summary

This PR fixes the long trackio.init(..., space_id=..., bucket_id=...) stall before the first log() call.

It does two things:

  • scopes bucket syncs to hf://buckets/<bucket_id>/trackio -> TRACKIO_DIR instead of walking the full Hugging Face cache directory
  • changes Space-backed resume checks to use the Space API instead of local SQLite/bucket hydration

That means:

  • resume="never" skips run lookup entirely for Space-backed runs
  • resume in {"allow", "must"} fetches run names from /get_runs_for_project
  • resumed Space runs fetch last_step from /get_run_summary so step numbering continues correctly
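
The three resume modes above can be sketched as one small decision function. This is an illustrative sketch, not Trackio's actual code: `plan_space_resume`, `fetch_run_names`, and `fetch_last_step` are hypothetical names, the two callables stand in for the `/get_runs_for_project` and `/get_run_summary` Space API calls, and raising an error for `resume="must"` on a missing run is an assumption.

```python
from typing import Callable, Optional


def plan_space_resume(
    name: str,
    resume: str,
    fetch_run_names: Callable[[], list],
    fetch_last_step: Callable[[str], Optional[int]],
) -> dict:
    """Decide whether to resume `name` and which step to continue from.

    Hypothetical helper: the callables stand in for the Space API
    endpoints /get_runs_for_project and /get_run_summary.
    """
    if resume == "never":
        # No run lookup at all: always start a fresh run.
        return {"resumed": False, "last_step": None}

    existing = fetch_run_names()
    if name in existing:
        # Resume and continue step numbering after the recorded last step.
        return {"resumed": True, "last_step": fetch_last_step(name)}

    if resume == "must":
        # Assumed behaviour: "must" requires the run to exist already.
        raise ValueError(f"Run '{name}' not found; cannot resume.")

    # resume == "allow" and the run does not exist yet: start fresh.
    return {"resumed": False, "last_step": None}


calls = []


def fake_names():
    calls.append("names")
    return ["run-a"]


def fake_last_step(run):
    calls.append("summary")
    return 41


fresh = plan_space_resume("run-a", "never", fake_names, fake_last_step)
resumed = plan_space_resume("run-a", "allow", fake_names, fake_last_step)
```

In this sketch the `"never"` call makes no remote request, while the `"allow"` call makes exactly one request per endpoint.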

Why

The original hang came from sync_bucket(...) computing a sync plan against a very large local HF cache. On machines with lots of downloaded models, walking ~/.cache/huggingface/ can take minutes even for a tiny Trackio bucket.
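
To make the cost difference concrete, here is a minimal sketch of an mtime map built with `os.walk`, run once over a parent directory and once over a `trackio/` subtree. `build_mtime_map` is a hypothetical stand-in written for illustration, not the real `sync_bucket` internals, and the temporary directory layout only imitates `~/.cache/huggingface/`.

```python
import os
import tempfile
from pathlib import Path


def build_mtime_map(root: Path) -> dict:
    """Walk `root` and map each relative file path to its mtime."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            mtimes[str(path.relative_to(root))] = path.stat().st_mtime
    return mtimes


with tempfile.TemporaryDirectory() as tmp:
    cache_root = Path(tmp) / "huggingface"  # imitates ~/.cache/huggingface
    trackio_dir = cache_root / "trackio"    # imitates TRACKIO_DIR
    hub_dir = cache_root / "hub"            # imitates the model cache

    trackio_dir.mkdir(parents=True)
    (trackio_dir / "project.db").write_text("metrics")
    for i in range(200):
        model_dir = hub_dir / f"model-{i}"
        model_dir.mkdir(parents=True)
        (model_dir / "weights.bin").write_text("x")

    whole_cache = build_mtime_map(cache_root)  # walk the parent directory
    scoped = build_mtime_map(trackio_dir)      # walk the trackio/ subtree only

    print(len(whole_cache), len(scoped))  # 201 1
```

The scoped walk touches only the Trackio files, so its cost no longer grows with the size of the model cache.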

Separately, trackio.init() was checking resume state through local SQLite, which triggered bucket hydration on Space-backed runs even though the run metadata already exists on the Space.

Test plan

  • PYTHONPATH=. pytest -q tests/unit/test_run.py
  • python -m compileall trackio/__init__.py trackio/run.py tests/unit/test_run.py

Note

Medium Risk
Changes trackio.init() resume behavior for Space-backed runs and alters bucket hydration paths, which could affect run resumption/step continuity and local cache expectations if the Space APIs or bucket layout differ.

Overview
Fixes long stalls when initializing with space_id+bucket_id by syncing only hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR (instead of walking the broader cache/parent directory).

For Space-backed runs, trackio.init() now avoids local SQLite/bucket hydration for resume checks: it skips run lookup entirely for resume="never", fetches existing run names via /get_runs_for_project, and (when resuming) restores step continuity by fetching last_step via /get_run_summary and passing pre-fetched resume metadata into Run to avoid redundant lookups.

Reviewed by Cursor Bugbot for commit 8c360da.

download_bucket_to_trackio_dir() previously called sync_bucket() with
dest=TRACKIO_DIR.parent (i.e. ~/.cache/huggingface/). When computing
the sync plan, sync_bucket os.walk()s the local dest to build an mtime
map. On ML clients with large ~/.cache/huggingface/hub model caches
(common on HPC / Lustre filesystems), that walk can take many minutes
and makes trackio.init(bucket_id=...) appear to hang before the first
log call.

Scoping source to the "trackio/" prefix and dest to TRACKIO_DIR limits
the local walk to the trackio directory itself. Files in the bucket
already live under trackio/ so the layout on disk is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 17, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 17, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

| Package | Version |
| --- | --- |
| trackio | patch |

  • Scope bucket sync to trackio/ subtree to avoid walking the HF cache

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 17, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/c59621e1221244309301529c16a3e1ed38074cf0/trackio-0.23.0-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@abidlabs abidlabs requested a review from qgallouedec April 17, 2026 08:02

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread trackio/__init__.py
_emit_nonfatal_warning(
    f"trackio.init() could not recover the previous step for run '{run_name}' on Space '{space_id}': {e}. Continuing from step 0."
)
return None

Space-backed last step lookup falls through to SQLite

Medium Severity

In _safe_get_last_step_for_init, when space_id is not None and the Space API call succeeds but returns a non-dict summary, execution falls through to the local SQLiteStorage.get_max_step_for_run path instead of returning None. The sibling function _safe_get_runs_for_init correctly handles this by always returning within the space_id is not None block (via return runs if isinstance(runs, list) else []), but here the isinstance(summary, dict) guard only returns inside the if, leaving the false branch without a return statement. This defeats the PR's goal of avoiding local SQLite/bucket access for Space-backed runs.
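
The fall-through described here can be reproduced in a few lines. This is a simplified sketch, not the code under review: `last_step_buggy`, `last_step_fixed`, and the two callables are hypothetical stand-ins for the Space API call and for `SQLiteStorage.get_max_step_for_run`, and the `"last_step"` key in the summary is an assumption.

```python
def last_step_buggy(space_id, fetch_summary, local_max_step):
    if space_id is not None:
        summary = fetch_summary()
        if isinstance(summary, dict):
            return summary.get("last_step")
        # Missing return: a non-dict summary falls through to the local path.
    return local_max_step()


def last_step_fixed(space_id, fetch_summary, local_max_step):
    if space_id is not None:
        summary = fetch_summary()
        # Always return inside the Space branch, mirroring
        # `return runs if isinstance(runs, list) else []`.
        return summary.get("last_step") if isinstance(summary, dict) else None
    return local_max_step()


local_calls = []


def local_max_step():
    # Stand-in for the local SQLite lookup that should be avoided.
    local_calls.append("sqlite")
    return 7


buggy = last_step_buggy("user/space", lambda: "unexpected", local_max_step)
fixed = last_step_fixed("user/space", lambda: "unexpected", local_max_step)
```

With a non-dict summary, the buggy variant reaches the local lookup and returns its value, while the fixed variant returns `None` without touching local storage.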

Contributor

Copilot AI left a comment

Pull request overview

Fixes long stalls during trackio.init(..., space_id=..., bucket_id=...) by narrowing bucket sync scope and avoiding local SQLite/bucket hydration for Space-backed resume checks.

Changes:

  • Scope bucket sync from hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR to avoid walking the broader HF cache.
  • Update trackio.init() resume logic for Space-backed runs to query the Space API (/get_runs_for_project, /get_run_summary) instead of local SQLite.
  • Plumb pre-fetched existing_runs and initial_last_step into Run to avoid redundant lookups and preserve step continuity when resuming.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| trackio/run.py | Accepts and consumes pre-fetched run list / last-step metadata to avoid redundant lookups during initialization. |
| trackio/bucket_storage.py | Narrows bucket sync source/destination to the trackio/ subtree to reduce local filesystem traversal. |
| trackio/__init__.py | Implements Space API–backed resume inspection and step recovery; passes pre-fetched metadata into Run. |
| .changeset/few-bars-find.md | Adds a patch changeset entry describing the feature. |

Comment thread trackio/__init__.py Outdated
Comment on lines +100 to +104
def _safe_get_runs_for_init(
    project: str, space_id: str | None, resume: str
) -> list[str]:
    if space_id is not None:
        if resume == "never":

Copilot AI Apr 17, 2026

New Space-backed resume behavior (skip lookup for resume="never", fetch runs via /get_runs_for_project, and restore last_step via /get_run_summary) is not covered by unit tests. There are existing unit tests for local resume modes (tests/unit/test_run.py); consider adding tests that monkeypatch RemoteClient.predict for these Space flows to prevent regressions in resume semantics and step continuity.
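
A test along these lines might look like the sketch below. It uses `unittest.mock.patch.object` rather than pytest's `monkeypatch` so that it runs standalone. Everything here is an assumption made for illustration: the `RemoteClient` class is a local stand-in (the real class and its `predict` signature may differ), and `get_runs_for_init` is a hypothetical helper, not the function in `trackio/__init__.py`.

```python
from unittest import mock


class RemoteClient:
    """Stand-in for Trackio's RemoteClient; the real signature may differ."""

    def predict(self, *args, api_name: str):
        raise RuntimeError("network access is not available in unit tests")


def get_runs_for_init(client, project, resume):
    """Hypothetical helper mirroring the Space-backed run lookup."""
    if resume == "never":
        return []
    runs = client.predict(project, api_name="/get_runs_for_project")
    return runs if isinstance(runs, list) else []


def test_never_skips_remote_lookup():
    with mock.patch.object(RemoteClient, "predict") as predict:
        assert get_runs_for_init(RemoteClient(), "demo", "never") == []
        predict.assert_not_called()


def test_allow_fetches_run_names():
    with mock.patch.object(
        RemoteClient, "predict", return_value=["run-a"]
    ) as predict:
        assert get_runs_for_init(RemoteClient(), "demo", "allow") == ["run-a"]
        predict.assert_called_once_with("demo", api_name="/get_runs_for_project")


# Run the tests directly so the sketch works without a test runner.
test_never_skips_remote_lookup()
test_allow_fetches_run_names()
```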

Member Author

Thanks for the suggestion. I am intentionally not adding new unit tests in this follow-up because this pass is focused on unblocking the PR branch and keeping the change surface minimal. I addressed the behavioral issues directly in trackio.init() by making Space resume checks deterministic and by reusing a single RemoteClient for both lookups.

Comment thread trackio/__init__.py Outdated
Comment on lines +104 to +105
if resume == "never":
    return []

Copilot AI Apr 17, 2026

For Space-backed runs, returning an empty list when resume == "never" disables the duplicate-name protection later in init(). If the caller passes an explicit name that already exists on the Space, the run will effectively append to the existing remote run (and step numbering will restart at 0), which violates the expected semantics of resume="never". Consider either (a) forcing a fresh generated name when space_id is set and resume == "never" (even if name is provided) or (b) doing a lightweight remote existence check only when name is provided, so you can still avoid collisions without hydrating local storage.

Suggested change
if resume == "never":
    return []
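
Option (b) from the comment above could be sketched roughly as follows. The names `runs_for_duplicate_check` and `name_collides` are hypothetical, `fetch_run_names` stands in for a single `/get_runs_for_project` call, and treating generated names as collision-free is an assumption.

```python
from typing import Callable, Optional


def runs_for_duplicate_check(
    name: Optional[str],
    resume: str,
    fetch_run_names: Callable[[], list],
) -> list:
    """Return the run names that init() should compare `name` against."""
    if resume == "never" and name is None:
        # A generated name is assumed not to collide, so skip the remote call.
        return []
    # Explicit name (or a resume mode that needs the list): one remote
    # lookup, still without touching local SQLite or the bucket.
    return fetch_run_names()


def name_collides(name, resume, fetch_run_names) -> bool:
    """True when resume='never' was requested but `name` already exists."""
    return resume == "never" and name in runs_for_duplicate_check(
        name, resume, fetch_run_names
    )


lookups = []


def fake_fetch():
    lookups.append("remote")
    return ["run-a"]


explicit_clash = name_collides("run-a", "never", fake_fetch)
generated_name = name_collides(None, "never", fake_fetch)
```

Only the explicit-name case pays for a remote lookup, so the common `resume="never"` path with a generated name stays free of network calls.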

Comment thread trackio/__init__.py Outdated
Comment on lines +107 to +111
client = RemoteClient(
    space_id,
    hf_token=huggingface_hub.utils.get_token(),
    verbose=False,
)

Copilot AI Apr 17, 2026

RemoteClient is constructed here and again in _safe_get_last_step_for_init(). For Space-backed init() flows that resume a run, this can trigger multiple connection/probing steps (e.g., repeated /version checks) and extra network overhead during init. Consider creating a single RemoteClient in init() (when needed) and reusing it across both lookups.
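
The suggested reuse could look something like this sketch, where one client is built once and passed to both lookups. `CountingClient` is a made-up stand-in that only counts constructions; the helper names, the argument order passed to `predict`, and the `"last_step"` summary key are all assumptions.

```python
class CountingClient:
    """Stand-in for RemoteClient that counts how often it is constructed."""

    instances = 0

    def __init__(self, space_id: str):
        CountingClient.instances += 1
        self.space_id = space_id

    def predict(self, *args, api_name: str):
        # Canned responses in place of real Space API calls.
        if api_name == "/get_runs_for_project":
            return ["run-a"]
        if api_name == "/get_run_summary":
            return {"last_step": 12}
        raise ValueError(api_name)


def get_runs(client, project):
    runs = client.predict(project, api_name="/get_runs_for_project")
    return runs if isinstance(runs, list) else []


def get_last_step(client, project, run_name):
    summary = client.predict(project, run_name, api_name="/get_run_summary")
    return summary.get("last_step") if isinstance(summary, dict) else None


def init_resume_metadata(space_id, project, run_name):
    # Build the client once and hand it to both lookups.
    client = CountingClient(space_id)
    runs = get_runs(client, project)
    last_step = get_last_step(client, project, run_name) if run_name in runs else None
    return runs, last_step


runs, last_step = init_resume_metadata("user/space", "demo", "run-a")
```

Passing the client as an argument also makes both lookups easy to exercise with a fake in tests.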

Ensure Space summary lookups never fall through to local SQLite, reuse a single RemoteClient across resume metadata calls, and keep resume='never' duplicate-name protection for explicit run names.

Made-with: Cursor

@abidlabs
Member Author

Going to merge this in (but not do release) to unblock subsequent PRs

@abidlabs abidlabs merged commit 498bbc4 into main Apr 17, 2026
9 checks passed
abidlabs added a commit that referenced this pull request Apr 17, 2026
Resolves conflicts in trackio/__init__.py and trackio/run.py, integrating
the Space-backed resume lookups from #506 with the run_id-based resume
logic on this branch. Space-backed init now queries /get_runs_for_project
and /get_run_summary via the RemoteClient and passes the resolved run_id
(when available) so step continuity works for multi-run-per-name projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>