Reduce HF Spaces 429s: polling tuning and batched metric logs API #513
Conversation
🪼 branch checks and previews
🦄 change detected: This Pull Request includes changes to the following packages.
Install Trackio from this PR (includes built frontend):

```
pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/4de52e92161a649f4f22197cddec9352e2e7758a/trackio-0.23.0-py3-none-any.whl"
```
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
This PR reduces Hugging Face Spaces dashboard 429s by cutting request volume (batching log reads) and reducing/pausing polling, while adding a Spaces-only in-process cache for metric/system log reads from the local SQLite DB.
Changes:
- Add `get_logs_batch`/`get_system_logs_batch` backend APIs and update the Metrics/SystemMetrics pages to fetch logs in a single batch request per refresh.
- Introduce host-aware polling utilities (slower on `*.hf.space`, pause when the tab is hidden, cooldown after 429) and wire `realtimeEnabled` through to the metrics pages.
- Add a Spaces-only in-process read cache for metric/system logs, keyed by project/run/max_points and invalidated by DB mtime (see the sketch after this list).
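A minimal sketch of how such an mtime-invalidated, key-scoped read cache could look; the helper name and fetch signature are illustrative, not this PR's actual code:

```python
import os
from collections.abc import Callable
from typing import Any

# Sketch only: cache key is (project, run, max_points), version is the
# DB file's mtime, so any write to the DB invalidates every entry.
_cache: dict[tuple[str, str, int | None], tuple[float, list[Any]]] = {}

def cached_get_logs(
    db_path: str,
    project: str,
    run: str,
    max_points: int | None,
    fetch: Callable[[str, str, int | None], list[Any]],
) -> list[Any]:
    key = (project, run, max_points)
    mtime = os.path.getmtime(db_path)  # cache version: last write to the DB file
    hit = _cache.get(key)
    if hit is not None and hit[0] == mtime:
        return hit[1]  # DB unchanged since this entry was stored
    rows = fetch(project, run, max_points)  # the real SQLite read
    _cache[key] = (mtime, rows)
    return rows
```

On Spaces the dashboard is read-heavy and the SQLite file only changes on writes, so a single mtime comparison can serve repeated chart refreshes without touching the DB.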
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| trackio/sqlite_storage.py | Adds batch log retrieval, refactors metric/system log fetching, and introduces Spaces-only read caches. |
| trackio/server.py | Registers new batch API endpoints in the HTTP API registry. |
| trackio/frontend/src/pages/Metrics.svelte | Switches metrics refresh from N get_logs calls to a single get_logs_batch call + host-aware polling gates. |
| trackio/frontend/src/pages/SystemMetrics.svelte | Switches system metrics refresh to get_system_logs_batch + host-aware polling gates. |
| trackio/frontend/src/lib/hostPolling.js | Adds host detection, polling interval selection, tab-hidden check, and 429 cooldown tracking. |
| trackio/frontend/src/lib/api.js | Adds getLogsBatch/getSystemLogsBatch and registers 429 cooldown hits in the shared API caller. |
| trackio/frontend/src/App.svelte | Uses host-aware polling intervals and passes realtimeEnabled down to metrics pages. |
| .changeset/polite-tigers-drop.md | Declares a minor release for the new batching/caching/polling behavior. |
```diff
@@ -185,12 +195,16 @@
   }

   async function refreshCachedRuns() {
+    if (!realtimeEnabled) return;
     if (!project || selectedRuns.length === 0) return;
+    if (isTabHidden()) return;
+    if (isRateLimitCooldownActive()) return;

+    const batch = await getLogsBatch(project, selectedRuns);
     let changed = false;
-    for (const run of selectedRuns) {
-      const logs = await getLogs(project, run);
-      const runKey = run.id ?? run.name;
+    for (const entry of batch) {
```
fetchNewRuns() / refreshCachedRuns() call getLogsBatch() without any error handling. If the API returns 429/500, callApi() throws and these async functions can produce unhandled promise rejections (they’re invoked from $effect / setInterval without awaiting). Wrap the batch call in a try/catch (and optionally no-op on 429 since registerRateLimitHit() already sets a cooldown).
```diff
@@ -200,12 +210,16 @@
   }

   async function refreshCachedRuns() {
+    if (!realtimeEnabled) return;
     if (!project || selectedRuns.length === 0) return;
+    if (isTabHidden()) return;
+    if (isRateLimitCooldownActive()) return;

+    const batch = await getSystemLogsBatch(project, selectedRuns);
     let changed = false;
-    for (const run of selectedRuns) {
-      const logs = await getSystemLogs(project, run);
-      const runKey = run.id ?? run.name;
+    for (const entry of batch) {
+      const runKey = entry.run_id ?? entry.run;
```
fetchNewRuns() / refreshCachedRuns() call getSystemLogsBatch() without try/catch. Because they’re triggered via $effect and setInterval, any thrown error (notably 429, which callApi() converts into an exception) can become an unhandled promise rejection. Add error handling around the batch call and consider ignoring 429s while the cooldown is active.
```python
_LOGS_READ_CACHE_MAX_KEYS = 512
```
The cache is bounded by number of keys (_LOGS_READ_CACHE_MAX_KEYS = 512) but each entry can hold up to ~1500 rows of decoded metric dicts, so worst-case memory can still grow very large on Spaces with many runs/max_points and wide metric payloads. Consider lowering the default key cap and/or bounding by approximate bytes (LRU-by-size), or caching a more compact representation (e.g., raw JSON bytes) to avoid OOM risk.
Suggested change:

```diff
-_LOGS_READ_CACHE_MAX_KEYS = 512
+def _get_logs_read_cache_max_keys() -> int:
+    default_max_keys = 64
+    raw_value = os.environ.get("TRACKIO_LOGS_READ_CACHE_MAX_KEYS", "").strip()
+    if not raw_value:
+        return default_max_keys
+    try:
+        parsed_value = int(raw_value)
+    except ValueError:
+        return default_max_keys
+    return max(0, parsed_value)
+
+_LOGS_READ_CACHE_MAX_KEYS = _get_logs_read_cache_max_keys()
```
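For the LRU-by-size alternative mentioned in the comment above, a rough sketch; all names and the byte budget are hypothetical, not from this PR:

```python
import sys
from collections import OrderedDict
from typing import Any

# Sketch: bound the cache by approximate payload bytes, evicting the
# least-recently-used entries first. The 32 MiB budget is illustrative.
_CACHE_MAX_BYTES = 32 * 1024 * 1024

_cache: OrderedDict[tuple, tuple[list[Any], int]] = OrderedDict()
_cache_bytes = 0

def _approx_size(rows: list[Any]) -> int:
    # Rough estimate: shallow size of the list plus each row object.
    return sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)

def cache_put(key: tuple, rows: list[Any]) -> None:
    global _cache_bytes
    size = _approx_size(rows)
    if size > _CACHE_MAX_BYTES:
        return  # never cache a snapshot larger than the whole budget
    old = _cache.pop(key, None)
    if old is not None:
        _cache_bytes -= old[1]
    _cache[key] = (rows, size)
    _cache_bytes += size
    # Evict least-recently-used entries until back under budget.
    while _cache_bytes > _CACHE_MAX_BYTES:
        _, (_, evicted_size) = _cache.popitem(last=False)
        _cache_bytes -= evicted_size

def cache_get(key: tuple) -> list[Any] | None:
    entry = _cache.get(key)
    if entry is None:
        return None
    _cache.move_to_end(key)  # mark as most recently used
    return entry[0]
```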
```python
def _subsample_metric_rows(rows: list[Any], max_points: int | None) -> list[Any]:
    if max_points is None or len(rows) <= max_points:
        return rows
    step = len(rows) / max_points
    indices = {int(i * step) for i in range(max_points)}
    indices.add(len(rows) - 1)
    return [rows[i] for i in sorted(indices)]
```
_subsample_metric_rows() divides by max_points without guarding against 0/negative values. With the new get_logs_batch API allowing max_points to be passed from the client, max_points=0 will raise ZeroDivisionError and return a 500. Validate max_points as None or a positive int (and ideally cap it) before using it for subsampling.
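A possible guard, sketched against the function quoted above (the cap value is an illustrative choice, not from this PR):

```python
from typing import Any

# Sketch of the suggested validation; _MAX_POINTS_CAP is hypothetical.
_MAX_POINTS_CAP = 10_000

def _subsample_metric_rows(rows: list[Any], max_points: int | None) -> list[Any]:
    if max_points is None:
        return rows
    if not isinstance(max_points, int) or max_points <= 0:
        raise ValueError("max_points must be a positive integer or None")
    max_points = min(max_points, _MAX_POINTS_CAP)
    if len(rows) <= max_points:
        return rows
    step = len(rows) / max_points
    indices = {int(i * step) for i in range(max_points)}
    indices.add(len(rows) - 1)  # always keep the newest row
    return [rows[i] for i in sorted(indices)]
```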
|
|
```python
def _spaces_logs_read_cache_enabled() -> bool:
    if not on_spaces():
        return False
```
The logs read-cache invalidation uses only the main DB file mtime. If TRACKIO_SQLITE_JOURNAL_MODE is overridden to WAL on Spaces, writes may land in *.db-wal and the DB file mtime may not change promptly, causing stale cached metric/system logs. Consider disabling this cache when journal_mode is WAL, or incorporate the WAL/SHM mtimes (or PRAGMA data_version) into the cache versioning.
Suggested change:

```diff
-def _spaces_logs_read_cache_enabled() -> bool:
-    if not on_spaces():
-        return False
+def _sqlite_journal_mode_uses_wal() -> bool:
+    journal_mode = os.environ.get("TRACKIO_SQLITE_JOURNAL_MODE", "").strip().lower()
+    return journal_mode == "wal"
+
+def _spaces_logs_read_cache_enabled() -> bool:
+    if not on_spaces():
+        return False
+    if _sqlite_journal_mode_uses_wal():
+        return False
```
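As a sketch of the PRAGMA data_version alternative mentioned in the comment (hypothetical helper, not part of this PR): data_version only changes relative to the same connection, so a long-lived probe connection per DB is required.

```python
import sqlite3

# Sketch: keep one persistent "probe" connection per DB file and use
# PRAGMA data_version as the cache version. It increments whenever
# another connection commits to the database, even in WAL mode where
# the main file's mtime may not change promptly.
_probe_connections: dict[str, sqlite3.Connection] = {}

def _db_data_version(db_path: str) -> int:
    conn = _probe_connections.get(db_path)
    if conn is None:
        conn = sqlite3.connect(db_path, check_same_thread=False)
        _probe_connections[db_path] = conn
    return conn.execute("PRAGMA data_version").fetchone()[0]
```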
```python
def get_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
    max_points: int | None = 1500,
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_logs_batch(project, runs, max_points=max_points)
```
get_logs_batch exposes max_points and runs directly from the HTTP request body, but there’s no validation/capping. A client can send max_points=0/negative (triggering server errors) or an extremely large runs list (DoS / very expensive DB reads). Add input validation here (e.g., require max_points to be a positive int within a sane upper bound, and cap len(runs)).
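A sketch of what such validation could look like for both batch endpoints; the limits and helper name are illustrative, not from this PR:

```python
from typing import Any

# Illustrative bounds; the actual limits would be a design choice.
_MAX_RUNS_PER_BATCH = 100
_MAX_POINTS_UPPER_BOUND = 10_000

def _validate_batch_request(
    runs: list[dict[str, Any]], max_points: int | None
) -> int | None:
    # Cap the runs list to bound DB work per request.
    if not isinstance(runs, list) or len(runs) > _MAX_RUNS_PER_BATCH:
        raise ValueError(f"runs must be a list of at most {_MAX_RUNS_PER_BATCH} items")
    # Each element must be a dict carrying a run identifier.
    for item in runs:
        if not isinstance(item, dict) or not (item.get("run") or item.get("run_id")):
            raise ValueError("each run entry must be a dict with 'run' or 'run_id'")
    # max_points must be None or a positive int within a sane bound.
    if max_points is not None:
        if not isinstance(max_points, int) or max_points <= 0:
            raise ValueError("max_points must be a positive integer or None")
        max_points = min(max_points, _MAX_POINTS_UPPER_BOUND)
    return max_points
```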
```python
def get_system_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_system_logs_batch(project, runs)
```
get_system_logs_batch similarly accepts an arbitrary runs list from the request with no size/type validation. Consider capping the number of runs per request and validating each element is a dict containing run/run_id to avoid 500s (e.g., when a client sends non-dict items) and to reduce DoS risk.
|
Addresses Copilot review feedback: batch log fetches wrapped in try/catch (Metrics + System metrics); read-cache uses max(DB, WAL) mtime and skips caching snapshots over 4000 rows; …tch errors

Made-with: Cursor
- Restore conservative HF Space polling intervals (2500ms app, 3500ms metrics) so single-run views don't generate more traffic than before batching.
- SystemMetrics: surface an error state on initial batch-load failure instead of marking the page loaded, and fall back to per-run `get_system_logs` when the batch endpoint is unavailable (older backends).
- Bound `get_system_logs_batch` payloads with `max_points` (default 1500), mirroring `get_logs_batch` server-side subsampling to avoid OOM/timeouts on long histories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR reduces HTTP 429 (rate limit) errors on Hugging Face Spaces when many people use the Trackio dashboard or when many runs are selected.
Approach
Dashboard polling (`hostPolling.js`, `App.svelte`, `Metrics.svelte`, `SystemMetrics.svelte`)
- On `*.hf.space`, both the app-level poll and the metrics/system poll use 2 second intervals; on localhost they stay at 1 second for both. Documented in `docs/source/track.md`.

Batched metric logs API
- `POST /api/get_logs_batch`: one request returns logs for multiple runs (the Metrics page uses this instead of N× `get_logs`).
- `get_logs_batch` in `SQLiteStorage`: single DB connection, one loop over runs.
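For illustration, a sketch of calling the batch endpoint over HTTP from Python. The JSON body shape mirrors the Python signature quoted earlier and the run-entry keys follow the review comments above; neither is confirmed wire format, and the base URL assumes a local server:

```python
import requests

# Assumed request shape: project name, a list of run-entry dicts, and an
# optional max_points; "run" as the entry key is an assumption.
resp = requests.post(
    "http://localhost:7860/api/get_logs_batch",
    json={
        "project": "my-project",
        "runs": [{"run": "run-1"}, {"run": "run-2"}],
        "max_points": 1500,
    },
    timeout=30,
)
resp.raise_for_status()
logs = resp.json()
# Per the frontend code above, each entry carries a run identifier
# (entry.run_id ?? entry.run), so the client can group rows by run.
```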