
Reduce HF Spaces 429s: polling tuning and batched metric logs API#513

Merged
abidlabs merged 9 commits into main from reduce-traffic on Apr 17, 2026

Conversation

@abidlabs (Member) commented Apr 17, 2026

This PR reduces HTTP 429 (rate-limit) errors on Hugging Face Spaces when many people view the Trackio dashboard or when many runs are selected.

Approach

  1. Dashboard polling (hostPolling.js, App.svelte, Metrics.svelte, SystemMetrics.svelte)

    • On *.hf.space, both the app-level poll and the metrics/system-metrics poll use 2-second intervals; on localhost both stay at 1 second. Documented in docs/source/track.md.
    • Skip polling while the browser tab is hidden.
    • Short backoff after API responses with status 429.
    • The “Refresh metrics realtime” control now gates the metrics-oriented polling loops that are wired to it.
  2. Batched metric logs API

    • POST /api/get_logs_batch: one request returns logs for multiple runs (Metrics page uses this instead of N× get_logs).
    • get_logs_batch in SQLiteStorage: single DB connection, one loop over runs.
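The batched read can be pictured with a sketch like the following. The table schema and function shape here are illustrative assumptions, not the PR's actual code; the point it demonstrates is the one named above: a single DB connection and one loop over runs, instead of N separate requests.

```python
import sqlite3


def get_logs_batch(db_path: str, project: str, runs: list[str]) -> list[dict]:
    """Fetch logs for several runs over one connection (hypothetical schema)."""
    results = []
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        for run in runs:  # one loop, one connection: no per-run round trips
            rows = conn.execute(
                "SELECT step, metrics FROM logs"
                " WHERE project = ? AND run = ? ORDER BY step",
                (project, run),
            ).fetchall()
            results.append({"run": run, "logs": [dict(r) for r in rows]})
    return results
```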

@gradio-pr-bot (Contributor) commented Apr 17, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot (Contributor) commented Apr 17, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
trackio minor

  • Reduce HF Spaces 429s: polling tuning and batched metric logs API

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@HuggingFaceDocBuilderDev commented Apr 17, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/4de52e92161a649f4f22197cddec9352e2e7758a/trackio-0.23.0-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI (Contributor) left a comment

Pull request overview

This PR reduces Hugging Face Spaces dashboard 429s by cutting request volume (batching log reads) and reducing/pausing polling, while adding a Spaces-only in-process cache for metric/system log reads from the local SQLite DB.

Changes:

  • Add get_logs_batch / get_system_logs_batch backend APIs and update Metrics/SystemMetrics pages to fetch logs in a single batch request per refresh.
  • Introduce host-aware polling utilities (slower on *.hf.space, pause when tab hidden, cooldown after 429) and wire realtimeEnabled through to metrics pages.
  • Add a Spaces-only in-process read cache for metric/system logs keyed by project/run/max_points and invalidated by DB mtime.
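The mtime-invalidated read cache described above could be sketched roughly as follows. The helper name, cache-key shape, and structure are illustrative assumptions, not the PR's implementation; only the invalidation idea (compare the DB file's mtime against the one stored with the entry) comes from the review summary.

```python
import os

# key (e.g. (project, run, max_points)) -> (db_mtime_at_store_time, cached_rows)
_READ_CACHE: dict = {}


def cached_logs(db_path, key, compute):
    """Serve from the in-process cache unless the DB file's mtime changed."""
    mtime = os.path.getmtime(db_path)
    hit = _READ_CACHE.get(key)
    if hit is not None and hit[0] == mtime:
        return hit[1]  # DB unchanged since the entry was stored
    value = compute()
    _READ_CACHE[key] = (mtime, value)
    return value
```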

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
trackio/sqlite_storage.py Adds batch log retrieval, refactors metric/system log fetching, and introduces Spaces-only read caches.
trackio/server.py Registers new batch API endpoints in the HTTP API registry.
trackio/frontend/src/pages/Metrics.svelte Switches metrics refresh from N get_logs calls to a single get_logs_batch call + host-aware polling gates.
trackio/frontend/src/pages/SystemMetrics.svelte Switches system metrics refresh to get_system_logs_batch + host-aware polling gates.
trackio/frontend/src/lib/hostPolling.js Adds host detection, polling interval selection, tab-hidden check, and 429 cooldown tracking.
trackio/frontend/src/lib/api.js Adds getLogsBatch/getSystemLogsBatch and registers 429 cooldown hits in the shared API caller.
trackio/frontend/src/App.svelte Uses host-aware polling intervals and passes realtimeEnabled down to metrics pages.
.changeset/polite-tigers-drop.md Declares a minor release for the new batching/caching/polling behavior.


Comment on lines +182 to +205
@@ -185,12 +195,16 @@
 }
 
 async function refreshCachedRuns() {
+  if (!realtimeEnabled) return;
   if (!project || selectedRuns.length === 0) return;
+  if (isTabHidden()) return;
+  if (isRateLimitCooldownActive()) return;
 
+  const batch = await getLogsBatch(project, selectedRuns);
   let changed = false;
-  for (const run of selectedRuns) {
-    const logs = await getLogs(project, run);
-    const runKey = run.id ?? run.name;
+  for (const entry of batch) {
Copilot AI commented Apr 17, 2026
fetchNewRuns() / refreshCachedRuns() call getLogsBatch() without any error handling. If the API returns 429/500, callApi() throws and these async functions can produce unhandled promise rejections (they’re invoked from $effect / setInterval without awaiting). Wrap the batch call in a try/catch (and optionally no-op on 429 since registerRateLimitHit() already sets a cooldown).

Copilot uses AI. Check for mistakes.
Comment on lines +197 to +221
@@ -200,12 +210,16 @@
 }
 
 async function refreshCachedRuns() {
+  if (!realtimeEnabled) return;
   if (!project || selectedRuns.length === 0) return;
+  if (isTabHidden()) return;
+  if (isRateLimitCooldownActive()) return;
 
+  const batch = await getSystemLogsBatch(project, selectedRuns);
   let changed = false;
-  for (const run of selectedRuns) {
-    const logs = await getSystemLogs(project, run);
-    const runKey = run.id ?? run.name;
+  for (const entry of batch) {
+    const runKey = entry.run_id ?? entry.run;
Copilot AI commented Apr 17, 2026

fetchNewRuns() / refreshCachedRuns() call getSystemLogsBatch() without try/catch. Because they’re triggered via $effect and setInterval, any thrown error (notably 429, which callApi() converts into an exception) can become an unhandled promise rejection. Add error handling around the batch call and consider ignoring 429s while the cooldown is active.

Comment thread trackio/sqlite_storage.py
Comment on lines +177 to +179
_LOGS_READ_CACHE_MAX_KEYS = 512


Copilot AI commented Apr 17, 2026

The cache is bounded by number of keys (_LOGS_READ_CACHE_MAX_KEYS = 512) but each entry can hold up to ~1500 rows of decoded metric dicts, so worst-case memory can still grow very large on Spaces with many runs/max_points and wide metric payloads. Consider lowering the default key cap and/or bounding by approximate bytes (LRU-by-size), or caching a more compact representation (e.g., raw JSON bytes) to avoid OOM risk.

Suggested change
-_LOGS_READ_CACHE_MAX_KEYS = 512
+def _get_logs_read_cache_max_keys() -> int:
+    default_max_keys = 64
+    raw_value = os.environ.get("TRACKIO_LOGS_READ_CACHE_MAX_KEYS", "").strip()
+    if not raw_value:
+        return default_max_keys
+    try:
+        parsed_value = int(raw_value)
+    except ValueError:
+        return default_max_keys
+    return max(0, parsed_value)
+
+
+_LOGS_READ_CACHE_MAX_KEYS = _get_logs_read_cache_max_keys()

Comment thread trackio/sqlite_storage.py
Comment on lines +1702 to +1708
def _subsample_metric_rows(rows: list[Any], max_points: int | None) -> list[Any]:
    if max_points is None or len(rows) <= max_points:
        return rows
    step = len(rows) / max_points
    indices = {int(i * step) for i in range(max_points)}
    indices.add(len(rows) - 1)
    return [rows[i] for i in sorted(indices)]
Copilot AI commented Apr 17, 2026

_subsample_metric_rows() divides by max_points without guarding against 0/negative values. With the new get_logs_batch API allowing max_points to be passed from the client, max_points=0 will raise ZeroDivisionError and return a 500. Validate max_points as None or a positive int (and ideally cap it) before using it for subsampling.

Comment thread trackio/sqlite_storage.py
Comment on lines +179 to +182

def _spaces_logs_read_cache_enabled() -> bool:
    if not on_spaces():
        return False
Copilot AI commented Apr 17, 2026

The logs read-cache invalidation uses only the main DB file mtime. If TRACKIO_SQLITE_JOURNAL_MODE is overridden to WAL on Spaces, writes may land in *.db-wal and the DB file mtime may not change promptly, causing stale cached metric/system logs. Consider disabling this cache when journal_mode is WAL, or incorporate the WAL/SHM mtimes (or PRAGMA data_version) into the cache versioning.

Suggested change
-def _spaces_logs_read_cache_enabled() -> bool:
-    if not on_spaces():
-        return False
+def _sqlite_journal_mode_uses_wal() -> bool:
+    journal_mode = os.environ.get("TRACKIO_SQLITE_JOURNAL_MODE", "").strip().lower()
+    return journal_mode == "wal"
+
+
+def _spaces_logs_read_cache_enabled() -> bool:
+    if not on_spaces():
+        return False
+    if _sqlite_journal_mode_uses_wal():
+        return False

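The follow-up commit took a different route than the suggestion: instead of disabling the cache under WAL, the read-cache versions itself on the newest of the DB and WAL mtimes. A minimal sketch of that idea (the function name is illustrative):

```python
import os


def db_cache_version(db_path: str) -> float:
    """Cache version: newest mtime across the DB file and its WAL sidecar."""
    version = os.path.getmtime(db_path)
    wal_path = db_path + "-wal"  # SQLite's WAL file naming convention
    if os.path.exists(wal_path):
        # A WAL-mode write may touch only the -wal file, not the main DB file
        version = max(version, os.path.getmtime(wal_path))
    return version
```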
Comment thread trackio/server.py Outdated
Comment on lines +723 to +728
def get_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
    max_points: int | None = 1500,
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_logs_batch(project, runs, max_points=max_points)
Copilot AI commented Apr 17, 2026

get_logs_batch exposes max_points and runs directly from the HTTP request body, but there’s no validation/capping. A client can send max_points=0/negative (triggering server errors) or an extremely large runs list (DoS / very expensive DB reads). Add input validation here (e.g., require max_points to be a positive int within a sane upper bound, and cap len(runs)).

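The limits the author later adopted (at most 64 runs, default 1500 points, cap 10000) could be enforced with a guard like this sketch; this is not the PR's code, just an illustration of validating client-controlled inputs before touching the DB:

```python
def validate_batch_request(
    runs, max_points, max_runs=64, default_points=1500, points_cap=10000
):
    """Validate and clamp request inputs for a batch-logs endpoint."""
    if not isinstance(runs, list) or not all(isinstance(r, dict) for r in runs):
        raise ValueError("runs must be a list of dicts")
    if len(runs) > max_runs:
        raise ValueError(f"at most {max_runs} runs per request")
    if not isinstance(max_points, int) or isinstance(max_points, bool) or max_points < 1:
        max_points = default_points  # reject 0/negative/non-int: fall back to default
    return runs, min(max_points, points_cap)
```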
Comment thread trackio/server.py Outdated
Comment on lines +690 to +694
def get_system_logs_batch(
    project: str,
    runs: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    return SQLiteStorage.get_system_logs_batch(project, runs)
Copilot AI commented Apr 17, 2026

get_system_logs_batch similarly accepts an arbitrary runs list from the request with no size/type validation. Consider capping the number of runs per request and validating each element is a dict containing run/run_id to avoid 500s (e.g., when a client sends non-dict items) and to reduce DoS risk.

@abidlabs abidlabs marked this pull request as ready for review April 17, 2026 20:33
@abidlabs (Member, Author) commented

Addresses Copilot review feedback: batch log fetches wrapped in try/catch (Metrics + System metrics); read-cache uses max(DB, WAL) mtime and skips caching snapshots over 4000 rows; _subsample_metric_rows treats max_points < 1 as no subsampling; get_logs_batch / get_system_logs_batch validate runs (max 64, dict entries) and clamp max_points (default 1500, cap 10000).

@abidlabs abidlabs changed the title Reduce Spaces 429s: slower polling, get_logs_batch, read cache Reduce HF Spaces 429s: polling tuning and batched metric logs API Apr 17, 2026
gradio-pr-bot and others added 3 commits April 17, 2026 23:04
- Restore conservative HF Space polling intervals (2500ms app, 3500ms metrics)
  so single-run views don't generate more traffic than before batching.
- SystemMetrics: surface an error state on initial batch-load failure instead
  of marking the page loaded, and fall back to per-run get_system_logs when
  the batch endpoint is unavailable (older backends).
- Bound get_system_logs_batch payloads with max_points (default 1500),
  mirroring get_logs_batch server-side subsampling to avoid OOM/timeouts on
  long histories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@abidlabs abidlabs merged commit d54d290 into main Apr 17, 2026
8 of 9 checks passed