Scope bucket sync to trackio/ subtree to avoid walking the HF cache by abidlabs · Pull Request #506 · gradio-app/trackio

abidlabs · 2026-04-17T07:00:53Z

Summary

This PR fixes the long trackio.init(..., space_id=..., bucket_id=...) stall before the first log() call.

It does two things:

scopes bucket syncs to hf://buckets/<bucket_id>/trackio -> TRACKIO_DIR instead of walking the full Hugging Face cache directory
changes Space-backed resume checks to use the Space API instead of local SQLite/bucket hydration

That means:

resume="never" skips run lookup entirely for Space-backed runs
resume in {"allow", "must"} fetches run names from /get_runs_for_project
resumed Space runs fetch last_step from /get_run_summary so step numbering continues correctly
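
The resume rules above can be sketched as follows. This is a minimal illustration, not Trackio's actual code: `FakeSpaceClient`, `plan_space_resume`, and the response shapes (a list of run names, a dict with a `last_step` key) are assumptions. Only the endpoint names `/get_runs_for_project` and `/get_run_summary` come from this PR.

```python
from typing import Any


class FakeSpaceClient:
    """Hypothetical stand-in for a Space API client; records each call."""

    def __init__(self, runs: list[str], last_steps: dict[str, int]):
        self.runs = runs
        self.last_steps = last_steps
        self.calls: list[str] = []

    def predict(self, *args: Any, api_name: str) -> Any:
        self.calls.append(api_name)
        if api_name == "/get_runs_for_project":
            return list(self.runs)
        if api_name == "/get_run_summary":
            run_name = args[1]  # args are (project, run_name) in this sketch
            return {"last_step": self.last_steps.get(run_name)}
        raise ValueError(f"unknown endpoint: {api_name}")


def plan_space_resume(client, project: str, name: str, resume: str):
    """Return (existing_runs, last_step) for a Space-backed init."""
    if resume == "never":
        # No run lookup at all, as described in the summary.
        return [], None
    existing = client.predict(project, api_name="/get_runs_for_project")
    if name not in existing:
        # Nothing to resume; error handling for "must" is left to the caller.
        return existing, None
    summary = client.predict(project, name, api_name="/get_run_summary")
    last_step = summary.get("last_step") if isinstance(summary, dict) else None
    return existing, last_step


client = FakeSpaceClient(runs=["baseline"], last_steps={"baseline": 41})
never = plan_space_resume(client, "demo", "baseline", "never")
calls_after_never = list(client.calls)  # still empty: no network calls made
allow = plan_space_resume(client, "demo", "baseline", "allow")
```

With `resume="never"` the fake client records no calls; with `resume="allow"` it records one call per endpoint and the resumed run picks up its stored `last_step`.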

Why

The original hang came from sync_bucket(...) computing a sync plan against a very large local HF cache. On machines with lots of downloaded models, walking ~/.cache/huggingface/ can take minutes even for a tiny Trackio bucket.

Separately, trackio.init() was checking resume state through local SQLite, which triggered bucket hydration on Space-backed runs even though the run metadata already exists on the Space.

Test plan

PYTHONPATH=. pytest -q tests/unit/test_run.py
python -m compileall trackio/__init__.py trackio/run.py tests/unit/test_run.py

Note

Medium Risk
Changes trackio.init() resume behavior for Space-backed runs and alters bucket hydration paths, which could affect run resumption/step continuity and local cache expectations if the Space APIs or bucket layout differ.

Overview
Fixes long stalls when initializing with space_id+bucket_id by syncing only hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR (instead of walking the broader cache/parent directory).

For Space-backed runs, trackio.init() now avoids local SQLite/bucket hydration for resume checks: it skips run lookup entirely for resume="never", fetches existing run names via /get_runs_for_project, and (when resuming) restores step continuity by fetching last_step via /get_run_summary and passing pre-fetched resume metadata into Run to avoid redundant lookups.

Reviewed by Cursor Bugbot for commit 8c360da. Bugbot is set up for automated code reviews on this repo.

download_bucket_to_trackio_dir() previously called sync_bucket() with dest=TRACKIO_DIR.parent (i.e. ~/.cache/huggingface/). When computing the sync plan, sync_bucket os.walk()s the local dest to build an mtime map. On ML clients with large ~/.cache/huggingface/hub model caches (common on HPC / Lustre filesystems), that walk can take many minutes and makes trackio.init(bucket_id=...) appear to hang before the first log call.

Scoping source to the "trackio/" prefix and dest to TRACKIO_DIR limits the local walk to the trackio directory itself. Files in the bucket already live under trackio/ so the layout on disk is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
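
To make the cost difference concrete, here is a small sketch of the local walk at the two scopes. `build_mtime_map` is an illustrative stand-in for the mtime map described in the commit message, not the `huggingface_hub` implementation, and the temporary directory layout is made up to mimic `~/.cache/huggingface/` with a `hub/` model cache next to `trackio/`.

```python
import os
import tempfile
from pathlib import Path


def build_mtime_map(dest: Path) -> dict[str, float]:
    """Walk dest and map each file's relative path to its mtime."""
    mtimes: dict[str, float] = {}
    for root, _dirs, files in os.walk(dest):
        for fname in files:
            full = Path(root) / fname
            mtimes[str(full.relative_to(dest))] = full.stat().st_mtime
    return mtimes


with tempfile.TemporaryDirectory() as tmp:
    cache_root = Path(tmp) / "huggingface"  # stands in for ~/.cache/huggingface
    trackio_dir = cache_root / "trackio"    # stands in for TRACKIO_DIR
    hub_dir = cache_root / "hub"            # stands in for a large model cache
    trackio_dir.mkdir(parents=True)
    hub_dir.mkdir()
    (trackio_dir / "project.db").write_text("x")
    for i in range(500):
        (hub_dir / f"blob_{i}").write_text("x")

    wide = build_mtime_map(cache_root)     # old scope: dest = TRACKIO_DIR.parent
    narrow = build_mtime_map(trackio_dir)  # new scope: dest = TRACKIO_DIR

print(len(wide), len(narrow))  # prints: 501 1
```

The wide walk stats every file in the simulated model cache even though only one file belongs to Trackio; on a real cache with many thousands of files on a slow filesystem, that is where the time goes.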

gradio-pr-bot · 2026-04-17T07:01:26Z

🪼 branch checks and previews

•	Name	Status	URL
🦄	Changes	detected!	Details

gradio-pr-bot · 2026-04-17T07:01:27Z

🦄 change detected

This Pull Request includes changes to the following packages.

Package	Version
`trackio`	`patch`

Scope bucket sync to trackio/ subtree to avoid walking the HF cache

HuggingFaceDocBuilderDev · 2026-04-17T07:02:53Z

🪼 branch checks and previews

•	Name	Status	URL
	Spaces	ready!	Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/c59621e1221244309301529c16a3e1ed38074cf0/trackio-0.23.0-py3-none-any.whl"

HuggingFaceDocBuilderDev · 2026-04-17T07:03:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-04-17T08:31:08Z

+            _emit_nonfatal_warning(
+                f"trackio.init() could not recover the previous step for run '{run_name}' on Space '{space_id}': {e}. Continuing from step 0."
+            )
+            return None


Space-backed last step lookup falls through to SQLite

Medium Severity

In _safe_get_last_step_for_init, when space_id is not None and the Space API call succeeds but returns a non-dict summary, execution falls through to the local SQLiteStorage.get_max_step_for_run path instead of returning None. The sibling function _safe_get_runs_for_init correctly handles this by always returning within the space_id is not None block (via return runs if isinstance(runs, list) else []), but here the isinstance(summary, dict) guard only returns inside the if, leaving the false branch without a return statement. This defeats the PR's goal of avoiding local SQLite/bucket access for Space-backed runs.
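
The control-flow problem described above can be reproduced in isolation. This is a hedged sketch: `last_step_buggy`, `last_step_fixed`, and the callables passed in are hypothetical simplifications of `_safe_get_last_step_for_init`, with the Space API call and the SQLite lookup replaced by plain functions.

```python
def last_step_buggy(space_id, fetch_summary, local_max_step):
    if space_id is not None:
        summary = fetch_summary()
        if isinstance(summary, dict):
            return summary.get("last_step")
        # No return here: a non-dict summary falls through to the local path.
    return local_max_step()


def last_step_fixed(space_id, fetch_summary, local_max_step):
    if space_id is not None:
        summary = fetch_summary()
        # Always return inside the Space branch, mirroring the sibling helper.
        return summary.get("last_step") if isinstance(summary, dict) else None
    return local_max_step()


local_calls = []


def local_max_step():
    """Stand-in for the local SQLite lookup; records that it was hit."""
    local_calls.append("sqlite")
    return 99


# The Space API "succeeds" but returns a non-dict summary (None here).
buggy = last_step_buggy("user/space", lambda: None, local_max_step)
fixed = last_step_fixed("user/space", lambda: None, local_max_step)
```

Only the buggy variant touches the local path: `local_calls` ends up with a single entry, recorded during the first call.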

Copilot

Pull request overview

Fixes long stalls during trackio.init(..., space_id=..., bucket_id=...) by narrowing bucket sync scope and avoiding local SQLite/bucket hydration for Space-backed resume checks.

Changes:

Scope bucket sync from hf://buckets/<bucket_id>/trackio directly into TRACKIO_DIR to avoid walking the broader HF cache.
Update trackio.init() resume logic for Space-backed runs to query the Space API (/get_runs_for_project, /get_run_summary) instead of local SQLite.
Plumb pre-fetched existing_runs and initial_last_step into Run to avoid redundant lookups and preserve step continuity when resuming.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`trackio/run.py`	Accepts and consumes pre-fetched run list / last-step metadata to avoid redundant lookups during initialization.
`trackio/bucket_storage.py`	Narrows bucket sync source/destination to the `trackio/` subtree to reduce local filesystem traversal.
`trackio/__init__.py`	Implements Space API–backed resume inspection and step recovery; passes pre-fetched metadata into `Run`.
`.changeset/few-bars-find.md`	Adds a patch changeset entry describing the feature.

Copilot · 2026-04-17T08:33:47Z

+def _safe_get_runs_for_init(
+    project: str, space_id: str | None, resume: str
+) -> list[str]:
+    if space_id is not None:
+        if resume == "never":


New Space-backed resume behavior (skip lookup for resume="never", fetch runs via /get_runs_for_project, and restore last_step via /get_run_summary) is not covered by unit tests. There are existing unit tests for local resume modes (tests/unit/test_run.py); consider adding tests that monkeypatch RemoteClient.predict for these Space flows to prevent regressions in resume semantics and step continuity.

Thanks for the suggestion. I am intentionally not adding new unit tests in this follow-up because this pass is focused on unblocking the PR branch and keeping the change surface minimal. I addressed the behavioral issues directly in trackio.init() by making Space resume checks deterministic and by reusing a single RemoteClient for both lookups.

Copilot · 2026-04-17T08:33:48Z

+        if resume == "never":
+            return []


For Space-backed runs, returning an empty list when resume == "never" disables the duplicate-name protection later in init(). If the caller passes an explicit name that already exists on the Space, the run will effectively append to the existing remote run (and step numbering will restart at 0), which violates the expected semantics of resume="never". Consider either (a) forcing a fresh generated name when space_id is set and resume == "never" (even if name is provided) or (b) doing a lightweight remote existence check only when name is provided, so you can still avoid collisions without hydrating local storage.

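
Option (b) from the comment above could look roughly like this. It is a sketch under stated assumptions: `FakeSpaceClient` and `existing_runs_for_never` are hypothetical names, and the assumption that an auto-generated name needs no remote check is mine, not something this PR states.

```python
class FakeSpaceClient:
    """Hypothetical Space client that counts remote lookups."""

    def __init__(self, runs):
        self.runs = runs
        self.calls = 0

    def predict(self, project, api_name):
        if api_name != "/get_runs_for_project":
            raise ValueError(f"unknown endpoint: {api_name}")
        self.calls += 1
        return list(self.runs)


def existing_runs_for_never(client, project, name):
    """For resume="never": query the Space only when a name was supplied."""
    if name is None:
        return []  # assumed: generated names do not need a collision check
    runs = client.predict(project, api_name="/get_runs_for_project")
    return runs if isinstance(runs, list) else []


client = FakeSpaceClient(["baseline"])
auto_named = existing_runs_for_never(client, "demo", None)
explicit = existing_runs_for_never(client, "demo", "baseline")
collides = "baseline" in explicit  # caller can now reject or rename the run
```

The remote call happens only for the explicit-name case, so the common auto-named path keeps the fast startup while explicit names still get duplicate-name protection.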

Copilot · 2026-04-17T08:33:48Z

+            client = RemoteClient(
+                space_id,
+                hf_token=huggingface_hub.utils.get_token(),
+                verbose=False,
+            )


RemoteClient is constructed here and again in _safe_get_last_step_for_init(). For Space-backed init() flows that resume a run, this can trigger multiple connection/probing steps (e.g., repeated /version checks) and extra network overhead during init. Consider creating a single RemoteClient in init() (when needed) and reusing it across both lookups.
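
A minimal sketch of the reuse suggested above, assuming a hypothetical `CountingClient` in place of `RemoteClient` and made-up helper names. The point is only that the client is built once and passed to both lookups.

```python
class CountingClient:
    """Hypothetical client that counts how many times it is constructed."""

    created = 0

    def __init__(self, space_id):
        CountingClient.created += 1
        self.space_id = space_id

    def predict(self, *args, api_name):
        if api_name == "/get_runs_for_project":
            return ["baseline"]
        if api_name == "/get_run_summary":
            return {"last_step": 12}
        raise ValueError(f"unknown endpoint: {api_name}")


def get_runs(client, project):
    runs = client.predict(project, api_name="/get_runs_for_project")
    return runs if isinstance(runs, list) else []


def get_last_step(client, project, run_name):
    summary = client.predict(project, run_name, api_name="/get_run_summary")
    return summary.get("last_step") if isinstance(summary, dict) else None


def resume_metadata(space_id, project, run_name):
    client = CountingClient(space_id)  # constructed once, shared by both lookups
    runs = get_runs(client, project)
    last_step = get_last_step(client, project, run_name) if run_name in runs else None
    return runs, last_step


result = resume_metadata("user/space", "demo", "baseline")
```

Passing the client in as a parameter also makes both helpers easy to exercise with a fake in unit tests.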

Ensure Space summary lookups never fall through to local SQLite, reuse a single RemoteClient across resume metadata calls, and keep resume='never' duplicate-name protection for explicit run names.

Made-with: Cursor

abidlabs · 2026-04-17T16:54:48Z

Going to merge this in (but not do release) to unblock subsequent PRs

Resolves conflicts in trackio/__init__.py and trackio/run.py, integrating the Space-backed resume lookups from #506 with the run_id-based resume logic on this branch. Space-backed init now queries /get_runs_for_project and /get_run_summary via the RemoteClient and passes the resolved run_id (when available) so step continuity works for multi-run-per-name projects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

add changeset

f65aec2

changes

b7df8b7

abidlabs added the v: patch label Apr 17, 2026

add changeset

8238c24

abidlabs requested a review from qgallouedec April 17, 2026 08:02

abidlabs and others added 3 commits April 17, 2026 01:24

changes

18dd810

changes

7daf36e

Merge branch 'main' into fix-bucket-sync-hang

8c360da

abidlabs requested a review from Copilot April 17, 2026 08:26

Copilot started reviewing on behalf of abidlabs April 17, 2026 08:27

abidlabs mentioned this pull request Apr 17, 2026

Add an id field to Run which is used internally, allowing users to have multiple runs with the same run name #505

Merged

cursor Bot reviewed Apr 17, 2026

Copilot AI reviewed Apr 17, 2026

Fix Space-backed init resume metadata handling

c59621e

Ensure Space summary lookups never fall through to local SQLite, reuse a single RemoteClient across resume metadata calls, and keep resume='never' duplicate-name protection for explicit run names.

Made-with: Cursor

abidlabs merged commit 498bbc4 into main Apr 17, 2026
9 checks passed

gradio-pr-bot mentioned this pull request Apr 17, 2026

chore: update versions #504

Merged

4 participants