Reduce HF Spaces 429s: polling tuning and batched metric logs API #513
Conversation
🪼 branch checks and previews
🦄 change detected: This Pull Request includes changes to the following packages.
Install Trackio from this PR (includes built frontend):

```
pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/4de52e92161a649f4f22197cddec9352e2e7758a/trackio-0.23.0-py3-none-any.whl"
```
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
This PR reduces Hugging Face Spaces dashboard 429s by cutting request volume (batching log reads) and reducing/pausing polling, while adding a Spaces-only in-process cache for metric/system log reads from the local SQLite DB.
Changes:
- Add `get_logs_batch`/`get_system_logs_batch` backend APIs and update the Metrics/SystemMetrics pages to fetch logs in a single batch request per refresh.
- Introduce host-aware polling utilities (slower on `*.hf.space`, pause when the tab is hidden, cooldown after 429) and wire `realtimeEnabled` through to the metrics pages.
- Add a Spaces-only in-process read cache for metric/system logs, keyed by project/run/max_points and invalidated by DB mtime (see the sketch after this list).
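A minimal sketch of how such an mtime-invalidated, key-scoped read cache could look; the helper name and fetch signature are illustrative, not this PR's actual code:

```python
import os
from collections.abc import Callable
from typing import Any

# Sketch only: cache key is (project, run, max_points), version is the
# DB file's mtime, so any write to the DB invalidates every entry.
_cache: dict[tuple[str, str, int | None], tuple[float, list[Any]]] = {}

def cached_get_logs(
    db_path: str,
    project: str,
    run: str,
    max_points: int | None,
    fetch: Callable[[str, str, int | None], list[Any]],
) -> list[Any]:
    key = (project, run, max_points)
    mtime = os.path.getmtime(db_path)  # cache version: last write to the DB file
    hit = _cache.get(key)
    if hit is not None and hit[0] == mtime:
        return hit[1]  # DB unchanged since this entry was stored
    rows = fetch(project, run, max_points)  # the real SQLite read
    _cache[key] = (mtime, rows)
    return rows
```

On Spaces the dashboard is read-heavy and the SQLite file only changes on writes, so a single mtime comparison can serve repeated chart refreshes without touching the DB.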
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| trackio/sqlite_storage.py | Adds batch log retrieval, refactors metric/system log fetching, and introduces Spaces-only read caches. |
| trackio/server.py | Registers new batch API endpoints in the HTTP API registry. |
| trackio/frontend/src/pages/Metrics.svelte | Switches metrics refresh from N get_logs calls to a single get_logs_batch call + host-aware polling gates. |
| trackio/frontend/src/pages/SystemMetrics.svelte | Switches system metrics refresh to get_system_logs_batch + host-aware polling gates. |
| trackio/frontend/src/lib/hostPolling.js | Adds host detection, polling interval selection, tab-hidden check, and 429 cooldown tracking. |
| trackio/frontend/src/lib/api.js | Adds getLogsBatch/getSystemLogsBatch and registers 429 cooldown hits in the shared API caller. |
| trackio/frontend/src/App.svelte | Uses host-aware polling intervals and passes realtimeEnabled down to metrics pages. |
| .changeset/polite-tigers-drop.md | Declares a minor release for the new batching/caching/polling behavior. |
```diff
@@ -185,12 +195,16 @@
   }

   async function refreshCachedRuns() {
+    if (!realtimeEnabled) return;
     if (!project || selectedRuns.length === 0) return;
+    if (isTabHidden()) return;
+    if (isRateLimitCooldownActive()) return;

+    const batch = await getLogsBatch(project, selectedRuns);
     let changed = false;
-    for (const run of selectedRuns) {
-      const logs = await getLogs(project, run);
-      const runKey = run.id ?? run.name;
+    for (const entry of batch) {
```
fetchNewRuns() / refreshCachedRuns() call getLogsBatch() without any error handling. If the API returns 429/500, callApi() throws and these async functions can produce unhandled promise rejections (they’re invoked from $effect / setInterval without awaiting). Wrap the batch call in a try/catch (and optionally no-op on 429 since registerRateLimitHit() already sets a cooldown).
```diff
@@ -200,12 +210,16 @@
   }

   async function refreshCachedRuns() {
+    if (!realtimeEnabled) return;
     if (!project || selectedRuns.length === 0) return;
+    if (isTabHidden()) return;
+    if (isRateLimitCooldownActive()) return;

+    const batch = await getSystemLogsBatch(project, selectedRuns);
     let changed = false;
-    for (const run of selectedRuns) {
-      const logs = await getSystemLogs(project, run);
-      const runKey = run.id ?? run.name;
+    for (const entry of batch) {
+      const runKey = entry.run_id ?? entry.run;
```
fetchNewRuns() / refreshCachedRuns() call getSystemLogsBatch() without try/catch. Because they’re triggered via $effect and setInterval, any thrown error (notably 429, which callApi() converts into an exception) can become an unhandled promise rejection. Add error handling around the batch call and consider ignoring 429s while the cooldown is active.
```python
_LOGS_READ_CACHE_MAX_KEYS = 512
```
The cache is bounded by number of keys (_LOGS_READ_CACHE_MAX_KEYS = 512) but each entry can hold up to ~1500 rows of decoded metric dicts, so worst-case memory can still grow very large on Spaces with many runs/max_points and wide metric payloads. Consider lowering the default key cap and/or bounding by approximate bytes (LRU-by-size), or caching a more compact representation (e.g., raw JSON bytes) to avoid OOM risk.
Suggested change:

```diff
-_LOGS_READ_CACHE_MAX_KEYS = 512
+def _get_logs_read_cache_max_keys() -> int:
+    default_max_keys = 64
+    raw_value = os.environ.get("TRACKIO_LOGS_READ_CACHE_MAX_KEYS", "").strip()
+    if not raw_value:
+        return default_max_keys
+    try:
+        parsed_value = int(raw_value)
+    except ValueError:
+        return default_max_keys
+    return max(0, parsed_value)
+
+_LOGS_READ_CACHE_MAX_KEYS = _get_logs_read_cache_max_keys()
```
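For the LRU-by-size alternative mentioned in the comment above, a rough sketch; all names and the byte budget are hypothetical, not from this PR:

```python
import sys
from collections import OrderedDict
from typing import Any

# Sketch: bound the cache by approximate payload bytes, evicting the
# least-recently-used entries first. The 32 MiB budget is illustrative.
_CACHE_MAX_BYTES = 32 * 1024 * 1024

_cache: OrderedDict[tuple, tuple[list[Any], int]] = OrderedDict()
_cache_bytes = 0

def _approx_size(rows: list[Any]) -> int:
    # Rough estimate: shallow size of the list plus each row object.
    return sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)

def cache_put(key: tuple, rows: list[Any]) -> None:
    global _cache_bytes
    size = _approx_size(rows)
    if size > _CACHE_MAX_BYTES:
        return  # never cache a snapshot larger than the whole budget
    old = _cache.pop(key, None)
    if old is not None:
        _cache_bytes -= old[1]
    _cache[key] = (rows, size)
    _cache_bytes += size
    # Evict least-recently-used entries until back under budget.
    while _cache_bytes > _CACHE_MAX_BYTES:
        _, (_, evicted_size) = _cache.popitem(last=False)
        _cache_bytes -= evicted_size

def cache_get(key: tuple) -> list[Any] | None:
    entry = _cache.get(key)
    if entry is None:
        return None
    _cache.move_to_end(key)  # mark as most recently used
    return entry[0]
```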
```python
def _subsample_metric_rows(rows: list[Any], max_points: int | None) -> list[Any]:
    if max_points is None or len(rows) <= max_points:
        return rows
    step = len(rows) / max_points
    indices = {int(i * step) for i in range(max_points)}
    indices.add(len(rows) - 1)
    return [rows[i] for i in sorted(indices)]
```
_subsample_metric_rows() divides by max_points without guarding against 0/negative values. With the new get_logs_batch API allowing max_points to be passed from the client, max_points=0 will raise ZeroDivisionError and return a 500. Validate max_points as None or a positive int (and ideally cap it) before using it for subsampling.
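A possible guard, sketched against the function quoted above (the cap value is an illustrative choice, not from this PR):

```python
from typing import Any

# Sketch of the suggested validation; _MAX_POINTS_CAP is hypothetical.
_MAX_POINTS_CAP = 10_000

def _subsample_metric_rows(rows: list[Any], max_points: int | None) -> list[Any]:
    if max_points is None:
        return rows
    if not isinstance(max_points, int) or max_points <= 0:
        raise ValueError("max_points must be a positive integer or None")
    max_points = min(max_points, _MAX_POINTS_CAP)
    if len(rows) <= max_points:
        return rows
    step = len(rows) / max_points
    indices = {int(i * step) for i in range(max_points)}
    indices.add(len(rows) - 1)  # always keep the newest row
    return [rows[i] for i in sorted(indices)]
```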
|
|
```python
def _spaces_logs_read_cache_enabled() -> bool:
    if not on_spaces():
        return False
```
The logs read-cache invalidation uses only the main DB file mtime. If TRACKIO_SQLITE_JOURNAL_MODE is overridden to WAL on Spaces, writes may land in *.db-wal and the DB file mtime may not change promptly, causing stale cached metric/system logs. Consider disabling this cache when journal_mode is WAL, or incorporate the WAL/SHM mtimes (or PRAGMA data_version) into the cache versioning.
Suggested change:

```diff
-def _spaces_logs_read_cache_enabled() -> bool:
-    if not on_spaces():
-        return False
+def _sqlite_journal_mode_uses_wal() -> bool:
+    journal_mode = os.environ.get("TRACKIO_SQLITE_JOURNAL_MODE", "").strip().lower()
+    return journal_mode == "wal"
+
+def _spaces_logs_read_cache_enabled() -> bool:
+    if not on_spaces():
+        return False
+    if _sqlite_journal_mode_uses_wal():
+        return False
```
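As a sketch of the PRAGMA data_version alternative mentioned in the comment (hypothetical helper, not part of this PR): data_version only changes relative to the same connection, so a long-lived probe connection per DB is required.

```python
import sqlite3

# Sketch: keep one persistent "probe" connection per DB file and use
# PRAGMA data_version as the cache version. It increments whenever
# another connection commits to the database, even in WAL mode where
# the main file's mtime may not change promptly.
_probe_connections: dict[str, sqlite3.Connection] = {}

def _db_data_version(db_path: str) -> int:
    conn = _probe_connections.get(db_path)
    if conn is None:
        conn = sqlite3.connect(db_path, check_same_thread=False)
        _probe_connections[db_path] = conn
    return conn.execute("PRAGMA data_version").fetchone()[0]
```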
```python
def get_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
    max_points: int | None = 1500,
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_logs_batch(project, runs, max_points=max_points)
```
get_logs_batch exposes max_points and runs directly from the HTTP request body, but there’s no validation/capping. A client can send max_points=0/negative (triggering server errors) or an extremely large runs list (DoS / very expensive DB reads). Add input validation here (e.g., require max_points to be a positive int within a sane upper bound, and cap len(runs)).
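A sketch of what such validation could look like for both batch endpoints; the limits and helper name are illustrative, not from this PR:

```python
from typing import Any

# Illustrative bounds; the actual limits would be a design choice.
_MAX_RUNS_PER_BATCH = 100
_MAX_POINTS_UPPER_BOUND = 10_000

def _validate_batch_request(
    runs: list[dict[str, Any]], max_points: int | None
) -> int | None:
    # Cap the runs list to bound DB work per request.
    if not isinstance(runs, list) or len(runs) > _MAX_RUNS_PER_BATCH:
        raise ValueError(f"runs must be a list of at most {_MAX_RUNS_PER_BATCH} items")
    # Each element must be a dict carrying a run identifier.
    for item in runs:
        if not isinstance(item, dict) or not (item.get("run") or item.get("run_id")):
            raise ValueError("each run entry must be a dict with 'run' or 'run_id'")
    # max_points must be None or a positive int within a sane bound.
    if max_points is not None:
        if not isinstance(max_points, int) or max_points <= 0:
            raise ValueError("max_points must be a positive integer or None")
        max_points = min(max_points, _MAX_POINTS_UPPER_BOUND)
    return max_points
```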
```python
def get_system_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_system_logs_batch(project, runs)
```
get_system_logs_batch similarly accepts an arbitrary runs list from the request with no size/type validation. Consider capping the number of runs per request and validating each element is a dict containing run/run_id to avoid 500s (e.g., when a client sends non-dict items) and to reduce DoS risk.
|
Addresses Copilot review feedback: batch log fetches wrapped in try/catch (Metrics + System metrics); read-cache uses max(DB, WAL) mtime and skips caching snapshots over 4000 rows; …tch errors

Made-with: Cursor
- Restore conservative HF Space polling intervals (2500ms app, 3500ms metrics) so single-run views don't generate more traffic than before batching.
- SystemMetrics: surface an error state on initial batch-load failure instead of marking the page loaded, and fall back to per-run `get_system_logs` when the batch endpoint is unavailable (older backends).
- Bound `get_system_logs_batch` payloads with `max_points` (default 1500), mirroring `get_logs_batch` server-side subsampling to avoid OOM/timeouts on long histories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR reduces HTTP 429 (rate limit) errors on Hugging Face Spaces when many people use the Trackio dashboard or when many runs are selected.
Approach
Dashboard polling (`hostPolling.js`, `App.svelte`, `Metrics.svelte`, `SystemMetrics.svelte`)
- On `*.hf.space`, both the app-level poll and the metrics/system poll use 2 second intervals; on localhost they stay at 1 second for both. Documented in `docs/source/track.md`.

Batched metric logs API
- `POST /api/get_logs_batch`: one request returns logs for multiple runs (the Metrics page uses this instead of N× `get_logs`).
- `get_logs_batch` in `SQLiteStorage`: single DB connection, one loop over runs.
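For illustration, a sketch of calling the batch endpoint over HTTP from Python. The JSON body shape mirrors the Python signature quoted earlier and the run-entry keys follow the review comments above; neither is confirmed wire format, and the base URL assumes a local server:

```python
import requests

# Assumed request shape: project name, a list of run-entry dicts, and an
# optional max_points; "run" as the entry key is an assumption.
resp = requests.post(
    "http://localhost:7860/api/get_logs_batch",
    json={
        "project": "my-project",
        "runs": [{"run": "run-1"}, {"run": "run-2"}],
        "max_points": 1500,
    },
    timeout=30,
)
resp.raise_for_status()
logs = resp.json()
# Per the frontend code above, each entry carries a run identifier
# (entry.run_id ?? entry.run), so the client can group rows by run.
```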