
Commit 5aeb9ed

Make Trackio logging much more robust (#427)

Authored by abidlabs
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gradio-pr-bot <gradio-pr-bot@users.noreply.github.com>
Co-authored-by: Gradio PR Bot <121576822+gradio-pr-bot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

1 parent fcdc167, commit 5aeb9ed

24 files changed: 1521 additions & 446 deletions

.changeset/smart-cameras-travel.md

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+---
+"trackio": minor
+---
+
+feat:Make Trackio logging much more robust

README.md

Lines changed: 27 additions & 2 deletions
@@ -20,10 +20,13 @@
 </div>

-`trackio` is a lightweight, free experiment tracking Python library built by Hugging Face 🤗.
+Welcome to `trackio`: a lightweight, <u>free</u> experiment tracking Python library built by Hugging Face 🤗. It is local-first, supports very high logging throughput for many parallel experiments, and provides an easy CLI interface for querying, perfect for LLM-driven experimentation.
+
+Trackio also ships with a Gradio-based dashboard you can use to view metrics locally:

 ![Screen Recording 2025-11-06 at 5 34 50 PM](https://github.com/user-attachments/assets/8c9c1b96-f17a-401c-83a4-26ac754f89c7)

+Trackio's main features:

 - **API compatible** with `wandb.init`, `wandb.log`, and `wandb.finish`. Drop-in replacement: just

@@ -36,9 +39,10 @@
 - Persists logs in a Sqlite database locally (or, if you provide a `space_id`, in a private Hugging Face Dataset)
 - Visualize experiments with a Gradio dashboard locally (or, if you provide a `space_id`, on Hugging Face Spaces)
 - **LLM-friendly**: Built with autonomous ML experiments in mind, Trackio includes a CLI for programmatic access and a Python API for run management, making it easy for LLMs to log metrics and query experiment data.
+
 - Everything here, including hosting on Hugging Face, is **free**!

-Trackio is designed to be lightweight (the core codebase is <5,000 lines of Python code), not fully-featured. It is designed in an extensible way and written entirely in Python so that developers can easily fork the repository and add functionality that they care about.
+Trackio is designed to be lightweight and extensible. It is written entirely in Python so that developers can easily fork the repository and add functionality that they care about.

 ## Installation

@@ -205,6 +209,27 @@ To get started and see basic examples of usage, see these files:
 - [Persisting metrics in a Hugging Face Dataset](https://github.com/gradio-app/trackio/blob/main/examples/persist-dataset.py)
 - [Deploying the dashboard to Spaces](https://github.com/gradio-app/trackio/blob/main/examples/deploy-on-spaces.py)

+## Throughput & Rate Limits
+
+### Local logging
+
+`trackio.log()` is a non-blocking call that appends to an in-memory queue and returns immediately. A background thread drains the queue every **0.5 s** and writes to the local SQLite database. Because log calls never touch the network or disk on the calling thread, the client-side throughput is effectively **unlimited** -- you can burst thousands of calls per second without slowing down your training loop.
+
+### Logging to a Hugging Face Space
+
+When a `space_id` is provided, the same background thread batches queued entries and pushes them to the Space via the Gradio client API. The main factors that affect end-to-end throughput are:
+
+| Metric | Measured | Notes |
+|---|---|---|
+| **Burst from a single run** | **2,000 logs delivered in < 8 s** | `log()` calls themselves complete in ~0.01 s; the rest is network drain time. |
+| **Parallel runs (32 threads)** | **32,000 logs (32 × 1,000) delivered in ~14 s wall time** | Each thread opens its own Gradio client connection to the Space. |
+| **Logs per batch** | No hard cap | All entries queued during the 0.5 s interval are sent in a single `predict()` call. |
+| **Data safety** | Zero-loss | If a batch fails to send, it is persisted to local SQLite and retried automatically when the connection recovers. |
+
+These numbers were measured against a free-tier Hugging Face Space (2 vCPU / 16 GB RAM). Throughput will scale with the Space hardware tier, and local-only logging is orders of magnitude faster since no network round-trip is involved.
+
+> **Tip:** For high-frequency logging (e.g. logging every training step), Trackio's queue-and-batch design means your training loop is never blocked by network I/O. Even if the Space is temporarily unreachable, logs accumulate locally and are replayed once the connection is restored.

 ## Note: Trackio is in Beta (DB Schema May Change)

 Note that Trackio is in pre-release right now and we may release breaking changes. In particular, the schema of the Trackio sqlite database may change, which may require migrating or deleting existing database files (located by default at: `~/.cache/huggingface/trackio`).
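The queue-and-batch design described in the new Throughput section can be sketched in a few lines. The sketch below is illustrative only, not Trackio's actual implementation: the `BatchLogger` class and the `send_batch` callback are hypothetical names; only the pattern (a non-blocking `log()` that enqueues, plus a background thread that drains the queue on a fixed interval) mirrors what the README describes.

```python
import queue
import threading
import time


class BatchLogger:
    """Sketch of a queue-and-batch logger (hypothetical, for illustration)."""

    def __init__(self, send_batch, interval=0.5):
        self._queue = queue.Queue()
        self._send_batch = send_batch  # callable receiving a list of entries
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._drain_loop, daemon=True)
        self._thread.start()

    def log(self, metrics):
        # Non-blocking: no disk or network I/O on the caller's thread.
        self._queue.put(metrics)

    def _drain_loop(self):
        while not self._stop.is_set():
            time.sleep(self._interval)
            self._flush()
        self._flush()  # final drain on shutdown

    def _flush(self):
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._send_batch(batch)

    def finish(self):
        self._stop.set()
        self._thread.join()


# Demo: a burst of 100 log calls returns immediately; the background
# thread delivers everything in batches before finish() returns.
received = []
logger = BatchLogger(received.extend, interval=0.1)
for i in range(100):
    logger.log({"step": i})
logger.finish()
```

Because the sender is a single background thread, entries are delivered in the order they were logged, which is why the demo's `received` list ends up in step order.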

tests/conftest.py

Lines changed: 9 additions & 0 deletions

@@ -6,6 +6,7 @@
 import pytest
 from PIL import Image as PILImage

+from trackio import context_vars
 from trackio.media import write_audio, write_video


@@ -25,7 +26,15 @@ def temp_dir(monkeypatch):
         monkeypatch.setattr(f"{name}.TRACKIO_DIR", Path(tmpdir))
     for name in ["trackio.media.media", "trackio.media.utils", "trackio.utils"]:
         monkeypatch.setattr(f"{name}.MEDIA_DIR", Path(tmpdir) / "media")
+    context_vars.current_run.set(None)
+    context_vars.current_project.set(None)
+    context_vars.current_server.set(None)
+    context_vars.current_space_id.set(None)
     yield tmpdir
+    context_vars.current_run.set(None)
+    context_vars.current_project.set(None)
+    context_vars.current_server.set(None)
+    context_vars.current_space_id.set(None)


 @pytest.fixture(autouse=True)
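The fixture change above resets Trackio's module-level `ContextVar`s before and after each test so run state cannot leak from one test into the next. The same pattern can be shown standalone with the stdlib; the `current_run` variable and `run_isolated` helper below are stand-ins for illustration, not Trackio's actual objects:

```python
from contextvars import ContextVar

# Stand-in for module-level state like trackio.context_vars.current_run.
current_run: ContextVar = ContextVar("current_run", default=None)


def run_isolated(test_body):
    # Mirror the fixture: clear shared state before the test body runs...
    current_run.set(None)
    try:
        return test_body()
    finally:
        # ...and clear it again afterwards, even on failure, so the
        # next test never observes a stale run.
        current_run.set(None)


def leaky_test():
    current_run.set("run_a")  # a test that mutates global state and forgets to clean up
    return current_run.get()


result = run_isolated(leaky_test)
```

Without the `finally` reset, `current_run` would still hold `"run_a"` when the next test starts, which is exactly the cross-test leakage the conftest change guards against.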

tests/e2e-local/test_bulk_logging.py

Lines changed: 2 additions & 3 deletions

@@ -14,7 +14,7 @@ def test_rapid_bulk_logging(temp_dir):
     run1_name = "bulk_test_run1"
     run2_name = "bulk_test_run2"

-    trackio.init(project=project_name, name=run1_name)
+    run1 = trackio.init(project=project_name, name=run1_name)
     start_time = time.time()

     num_logs_run1 = 300

@@ -31,8 +31,7 @@ def test_rapid_bulk_logging(temp_dir):
         f"1000 calls of trackio.log() took {time_to_run_1000_logs} seconds, which is too long"
     )
     trackio.finish()
-
-    time.sleep(0.6)  # Wait for the client to send the logs
+    run1.finish()

     # Verify run1 metrics
     metrics_run1 = SQLiteStorage.get_logs(project_name, run1_name)

tests/e2e-spaces/conftest.py

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+import time
+
+import pytest
+
+
+@pytest.fixture
+def wait_for_client():
+    def _wait(run, timeout=60):
+        deadline = time.time() + timeout
+        while run._client is None:
+            if time.time() > deadline:
+                raise TimeoutError("Client did not connect within timeout")
+            time.sleep(0.1)
+
+    return _wait
Lines changed: 162 additions & 0 deletions

@@ -0,0 +1,162 @@
+import secrets
+import time
+
+from gradio_client import Client
+
+import trackio
+from trackio.sqlite_storage import SQLiteStorage
+
+
+def test_data_not_lost_on_transient_network_error(
+    test_space_id, temp_dir, wait_for_client
+):
+    """
+    When predict() fails once due to a transient network error and then
+    recovers, the failed batch should be retried and all data should
+    eventually reach the Space.
+    """
+    project_name = f"test_transient_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 1:
+            raise Exception("ReadTimeout: The read operation timed out")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5})
+    trackio.log({"loss": 0.3})
+    time.sleep(5)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 2
+
+
+def test_failed_data_persisted_locally(test_space_id, temp_dir, wait_for_client):
+    """
+    When predict() permanently fails (Space unreachable), data should be
+    persisted to the local SQLite database as a fallback buffer so it is
+    not lost.
+    """
+    project_name = f"test_persist_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+
+    def always_fail_writes(*args, **kwargs):
+        if kwargs.get("api_name") in ("/bulk_log", "/bulk_log_system"):
+            raise Exception("Connection refused")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = always_fail_writes
+
+    trackio.log({"loss": 0.5})
+    trackio.log({"loss": 0.3})
+    time.sleep(3)
+    trackio.finish()
+
+    local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
+    assert len(local_logs) >= 2, (
+        f"Expected at least 2 logs persisted in local SQLite, got {len(local_logs)}"
+    )
+
+
+def test_data_delivered_after_batch_sender_crash(
+    test_space_id, temp_dir, wait_for_client
+):
+    """
+    After a network error crashes the batch sender, subsequently-logged
+    data should still be delivered to the Space (either by retrying within
+    the same thread or by restarting the sender).
+    """
+    project_name = f"test_crash_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] == 1:
+            raise Exception("ReadTimeout: The read operation timed out")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5})
+    time.sleep(3)
+
+    trackio.log({"loss": 0.3})
+    trackio.log({"loss": 0.1})
+    time.sleep(5)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] >= 2, (
+        f"Expected at least 2 logs on Space after recovery, got {summary['num_logs']}"
+    )
+
+
+def test_local_buffer_flushed_after_recovery(test_space_id, temp_dir, wait_for_client):
+    """
+    When the connection recovers after several failures, data that was
+    persisted in the local SQLite fallback buffer should be flushed to the
+    Space and cleaned up from the local database.
+    """
+    project_name = f"test_flush_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 3:
+            raise Exception("Connection refused")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5, "epoch": 1})
+    trackio.log({"loss": 0.3, "epoch": 2})
+    time.sleep(2)
+    trackio.log({"loss": 0.1, "epoch": 3})
+    time.sleep(10)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 3, (
+        f"Expected all 3 logs on Space after recovery, got {summary['num_logs']}"
+    )
+
+    local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
+    assert len(local_logs) == 0, (
+        f"Expected local buffer to be empty after flush, but found {len(local_logs)} rows"
+    )
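The behavior these tests pin down (buffer on failure, replay on recovery) can be sketched independently of Trackio. Everything below is hypothetical illustration: the `FallbackBuffer` class, its `pending` table, and the injectable `transport` callable are invented names, not the library's actual storage schema or API.

```python
import json
import sqlite3


class FallbackBuffer:
    """Sketch of buffer-on-failure / replay-on-recovery (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")

    def send(self, batch, transport):
        try:
            self.flush(transport)  # replay any previously failed batches first
            transport(batch)
        except Exception:
            # Send failed: persist the batch locally so no data is lost.
            self.db.execute("INSERT INTO pending VALUES (?)", (json.dumps(batch),))
            self.db.commit()

    def flush(self, transport):
        rows = self.db.execute("SELECT rowid, payload FROM pending").fetchall()
        for rowid, payload in rows:
            transport(json.loads(payload))  # raises again if still offline
            self.db.execute("DELETE FROM pending WHERE rowid = ?", (rowid,))
            self.db.commit()


# Demo: the first send fails and is buffered locally; once the fake
# connection recovers, the buffered batch is replayed before new data.
sent, online = [], [False]


def transport(batch):
    if not online[0]:
        raise ConnectionError("Space unreachable")
    sent.append(batch)


buf = FallbackBuffer()
buf.send([{"loss": 0.5}], transport)  # offline: lands in the local buffer
online[0] = True
buf.send([{"loss": 0.3}], transport)  # online: buffered batch replays first
pending_rows = buf.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

Deleting a pending row only after its `transport` call succeeds is what gives the zero-loss property the tests assert: a crash or repeated failure leaves the row in SQLite for the next recovery attempt.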
Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+import secrets
+import time
+from unittest.mock import patch
+
+import numpy as np
+from gradio_client import Client
+
+import trackio
+from trackio import gpu
+
+
+def test_config_persisted_on_spaces(test_space_id, wait_for_client):
+    project_name = f"test_config_{secrets.token_urlsafe(8)}"
+    run_name = "config_run"
+
+    run = trackio.init(
+        project=project_name,
+        name=run_name,
+        space_id=test_space_id,
+        config={"lr": 0.001, "batch_size": 32, "model": "resnet50"},
+    )
+    wait_for_client(run)
+
+    trackio.log({"loss": 0.5, "acc": 0.8})
+    trackio.log({"loss": 0.3, "acc": 0.9})
+    trackio.finish()
+
+    client = Client(test_space_id)
+
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 2
+    assert "loss" in summary["metrics"]
+    assert "acc" in summary["metrics"]
+
+
+def test_system_metrics_on_spaces(test_space_id, wait_for_client):
+    project_name = f"test_system_{secrets.token_urlsafe(8)}"
+    run_name = "system_run"
+
+    def fake_gpu_metrics(device=None):
+        return {
+            "gpu/0/utilization": 75,
+            "gpu/0/allocated_memory": 4.5,
+            "gpu/0/total_memory": 12.0,
+            "gpu/0/temp": 65,
+            "gpu/0/power": 150.0,
+            "gpu/mean_utilization": 75,
+        }
+
+    with patch.object(gpu, "collect_gpu_metrics", fake_gpu_metrics):
+        with patch.object(gpu, "get_gpu_count", return_value=(1, [0])):
+            run = trackio.init(
+                project=project_name,
+                name=run_name,
+                space_id=test_space_id,
+                auto_log_gpu=True,
+                gpu_log_interval=0.2,
+            )
+            wait_for_client(run)
+
+            trackio.log({"loss": 0.5})
+            time.sleep(1)
+            trackio.finish()
+
+    client = Client(test_space_id)
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] >= 1
+
+
+def test_image_upload_on_spaces(test_space_id, wait_for_client, temp_dir):
+    project_name = f"test_image_{secrets.token_urlsafe(8)}"
+    run_name = "image_run"
+
+    run = trackio.init(
+        project=project_name,
+        name=run_name,
+        space_id=test_space_id,
+    )
+    wait_for_client(run)
+
+    img_array = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
+    image = trackio.Image(img_array, caption="test_image")
+
+    trackio.log({"loss": 0.5, "sample": image})
+    trackio.finish()
+
+    client = Client(test_space_id)
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 1
+    assert "loss" in summary["metrics"]
+    assert "sample" in summary["metrics"]