
Commit 5aeb9ed

Make Trackio logging much more robust (#427)

Authored by abidlabs
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gradio-pr-bot <gradio-pr-bot@users.noreply.github.com>
Co-authored-by: Gradio PR Bot <121576822+gradio-pr-bot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

1 parent fcdc167, commit 5aeb9ed

24 files changed: 1521 additions & 446 deletions

.changeset/smart-cameras-travel.md

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+---
+"trackio": minor
+---
+
+feat:Make Trackio logging much more robust

README.md

Lines changed: 27 additions & 2 deletions
@@ -20,10 +20,13 @@
 </div>

-`trackio` is a lightweight, free experiment tracking Python library built by Hugging Face 🤗.
+Welcome to `trackio`: a lightweight, <u>free</u> experiment tracking Python library built by Hugging Face 🤗. It is local-first, supports very high logging throughput for many parallel experiments, and provides an easy CLI interface for querying, perfect for LLM-driven experimentation.
+
+Trackio also ships with a Gradio-based dashboard you can use to view metrics locally:

 ![Screen Recording 2025-11-06 at 5 34 50 PM](https://github.com/user-attachments/assets/8c9c1b96-f17a-401c-83a4-26ac754f89c7)

+Trackio's main features:

 - **API compatible** with `wandb.init`, `wandb.log`, and `wandb.finish`. Drop-in replacement: just

@@ -36,9 +39,10 @@
 - Persists logs in a Sqlite database locally (or, if you provide a `space_id`, in a private Hugging Face Dataset)
 - Visualize experiments with a Gradio dashboard locally (or, if you provide a `space_id`, on Hugging Face Spaces)
 - **LLM-friendly**: Built with autonomous ML experiments in mind, Trackio includes a CLI for programmatic access and a Python API for run management, making it easy for LLMs to log metrics and query experiment data.
+
 - Everything here, including hosting on Hugging Face, is **free**!

-Trackio is designed to be lightweight (the core codebase is <5,000 lines of Python code), not fully-featured. It is designed in an extensible way and written entirely in Python so that developers can easily fork the repository and add functionality that they care about.
+Trackio is designed to be lightweight and extensible. It is written entirely in Python so that developers can easily fork the repository and add functionality that they care about.

 ## Installation

@@ -205,6 +209,27 @@ To get started and see basic examples of usage, see these files:
 - [Persisting metrics in a Hugging Face Dataset](https://github.com/gradio-app/trackio/blob/main/examples/persist-dataset.py)
 - [Deploying the dashboard to Spaces](https://github.com/gradio-app/trackio/blob/main/examples/deploy-on-spaces.py)

+## Throughput & Rate Limits
+
+### Local logging
+
+`trackio.log()` is a non-blocking call that appends to an in-memory queue and returns immediately. A background thread drains the queue every **0.5 s** and writes to the local SQLite database. Because log calls never touch the network or disk on the calling thread, the client-side throughput is effectively **unlimited** -- you can burst thousands of calls per second without slowing down your training loop.
+
+### Logging to a Hugging Face Space
+
+When a `space_id` is provided, the same background thread batches queued entries and pushes them to the Space via the Gradio client API. The main factors that affect end-to-end throughput are:
+
+| Metric | Measured | Notes |
+|---|---|---|
+| **Burst from a single run** | **2,000 logs delivered in < 8 s** | `log()` calls themselves complete in ~0.01 s; the rest is network drain time. |
+| **Parallel runs (32 threads)** | **32,000 logs (32 × 1,000) delivered in ~14 s wall time** | Each thread opens its own Gradio client connection to the Space. |
+| **Logs per batch** | No hard cap | All entries queued during the 0.5 s interval are sent in a single `predict()` call. |
+| **Data safety** | Zero-loss | If a batch fails to send, it is persisted to local SQLite and retried automatically when the connection recovers. |
+
+These numbers were measured against a free-tier Hugging Face Space (2 vCPU / 16 GB RAM). Throughput will scale with the Space hardware tier, and local-only logging is orders of magnitude faster since no network round-trip is involved.
+
+> **Tip:** For high-frequency logging (e.g. logging every training step), Trackio's queue-and-batch design means your training loop is never blocked by network I/O. Even if the Space is temporarily unreachable, logs accumulate locally and are replayed once the connection is restored.

 ## Note: Trackio is in Beta (DB Schema May Change)

 Note that Trackio is in pre-release right now and we may release breaking changes. In particular, the schema of the Trackio sqlite database may change, which may require migrating or deleting existing database files (located by default at: `~/.cache/huggingface/trackio`).
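The queue-and-batch design described in the new Throughput section can be sketched in a few lines. The sketch below is illustrative only, not Trackio's actual implementation: the `BatchLogger` class and the `send_batch` callback are hypothetical names; only the pattern (a non-blocking `log()` that enqueues, plus a background thread that drains the queue on a fixed interval) mirrors what the README describes.

```python
import queue
import threading
import time


class BatchLogger:
    """Sketch of a queue-and-batch logger (hypothetical, for illustration)."""

    def __init__(self, send_batch, interval=0.5):
        self._queue = queue.Queue()
        self._send_batch = send_batch  # callable receiving a list of entries
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._drain_loop, daemon=True)
        self._thread.start()

    def log(self, metrics):
        # Non-blocking: no disk or network I/O on the caller's thread.
        self._queue.put(metrics)

    def _drain_loop(self):
        while not self._stop.is_set():
            time.sleep(self._interval)
            self._flush()
        self._flush()  # final drain on shutdown

    def _flush(self):
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._send_batch(batch)

    def finish(self):
        self._stop.set()
        self._thread.join()


# Demo: a burst of 100 log calls returns immediately; the background
# thread delivers everything in batches before finish() returns.
received = []
logger = BatchLogger(received.extend, interval=0.1)
for i in range(100):
    logger.log({"step": i})
logger.finish()
```

Because the sender is a single background thread, entries are delivered in the order they were logged, which is why the demo's `received` list ends up in step order.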

tests/conftest.py

Lines changed: 9 additions & 0 deletions

@@ -6,6 +6,7 @@
 import pytest
 from PIL import Image as PILImage

+from trackio import context_vars
 from trackio.media import write_audio, write_video


@@ -25,7 +26,15 @@ def temp_dir(monkeypatch):
         monkeypatch.setattr(f"{name}.TRACKIO_DIR", Path(tmpdir))
     for name in ["trackio.media.media", "trackio.media.utils", "trackio.utils"]:
         monkeypatch.setattr(f"{name}.MEDIA_DIR", Path(tmpdir) / "media")
+    context_vars.current_run.set(None)
+    context_vars.current_project.set(None)
+    context_vars.current_server.set(None)
+    context_vars.current_space_id.set(None)
     yield tmpdir
+    context_vars.current_run.set(None)
+    context_vars.current_project.set(None)
+    context_vars.current_server.set(None)
+    context_vars.current_space_id.set(None)


 @pytest.fixture(autouse=True)
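The fixture change above resets Trackio's module-level `ContextVar`s before and after each test so run state cannot leak from one test into the next. The same pattern can be shown standalone with the stdlib; the `current_run` variable and `run_isolated` helper below are stand-ins for illustration, not Trackio's actual objects:

```python
from contextvars import ContextVar

# Stand-in for module-level state like trackio.context_vars.current_run.
current_run: ContextVar = ContextVar("current_run", default=None)


def run_isolated(test_body):
    # Mirror the fixture: clear shared state before the test body runs...
    current_run.set(None)
    try:
        return test_body()
    finally:
        # ...and clear it again afterwards, even on failure, so the
        # next test never observes a stale run.
        current_run.set(None)


def leaky_test():
    current_run.set("run_a")  # a test that mutates global state and forgets to clean up
    return current_run.get()


result = run_isolated(leaky_test)
```

Without the `finally` reset, `current_run` would still hold `"run_a"` when the next test starts, which is exactly the cross-test leakage the conftest change guards against.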

tests/e2e-local/test_bulk_logging.py

Lines changed: 2 additions & 3 deletions

@@ -14,7 +14,7 @@ def test_rapid_bulk_logging(temp_dir):
     run1_name = "bulk_test_run1"
     run2_name = "bulk_test_run2"

-    trackio.init(project=project_name, name=run1_name)
+    run1 = trackio.init(project=project_name, name=run1_name)
     start_time = time.time()

     num_logs_run1 = 300

@@ -31,8 +31,7 @@ def test_rapid_bulk_logging(temp_dir):
         f"1000 calls of trackio.log() took {time_to_run_1000_logs} seconds, which is too long"
     )
     trackio.finish()
-
-    time.sleep(0.6)  # Wait for the client to send the logs
+    run1.finish()

     # Verify run1 metrics
     metrics_run1 = SQLiteStorage.get_logs(project_name, run1_name)

tests/e2e-spaces/conftest.py

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+import time
+
+import pytest
+
+
+@pytest.fixture
+def wait_for_client():
+    def _wait(run, timeout=60):
+        deadline = time.time() + timeout
+        while run._client is None:
+            if time.time() > deadline:
+                raise TimeoutError("Client did not connect within timeout")
+            time.sleep(0.1)
+
+    return _wait
Lines changed: 162 additions & 0 deletions

@@ -0,0 +1,162 @@
+import secrets
+import time
+
+from gradio_client import Client
+
+import trackio
+from trackio.sqlite_storage import SQLiteStorage
+
+
+def test_data_not_lost_on_transient_network_error(
+    test_space_id, temp_dir, wait_for_client
+):
+    """
+    When predict() fails once due to a transient network error and then
+    recovers, the failed batch should be retried and all data should
+    eventually reach the Space.
+    """
+    project_name = f"test_transient_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 1:
+            raise Exception("ReadTimeout: The read operation timed out")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5})
+    trackio.log({"loss": 0.3})
+    time.sleep(5)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 2
+
+
+def test_failed_data_persisted_locally(test_space_id, temp_dir, wait_for_client):
+    """
+    When predict() permanently fails (Space unreachable), data should be
+    persisted to the local SQLite database as a fallback buffer so it is
+    not lost.
+    """
+    project_name = f"test_persist_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+
+    def always_fail_writes(*args, **kwargs):
+        if kwargs.get("api_name") in ("/bulk_log", "/bulk_log_system"):
+            raise Exception("Connection refused")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = always_fail_writes
+
+    trackio.log({"loss": 0.5})
+    trackio.log({"loss": 0.3})
+    time.sleep(3)
+    trackio.finish()
+
+    local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
+    assert len(local_logs) >= 2, (
+        f"Expected at least 2 logs persisted in local SQLite, got {len(local_logs)}"
+    )
+
+
+def test_data_delivered_after_batch_sender_crash(
+    test_space_id, temp_dir, wait_for_client
+):
+    """
+    After a network error crashes the batch sender, subsequently-logged
+    data should still be delivered to the Space (either by retrying within
+    the same thread or by restarting the sender).
+    """
+    project_name = f"test_crash_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] == 1:
+            raise Exception("ReadTimeout: The read operation timed out")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5})
+    time.sleep(3)
+
+    trackio.log({"loss": 0.3})
+    trackio.log({"loss": 0.1})
+    time.sleep(5)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] >= 2, (
+        f"Expected at least 2 logs on Space after recovery, got {summary['num_logs']}"
+    )
+
+
+def test_local_buffer_flushed_after_recovery(test_space_id, temp_dir, wait_for_client):
+    """
+    When the connection recovers after several failures, data that was
+    persisted in the local SQLite fallback buffer should be flushed to the
+    Space and cleaned up from the local database.
+    """
+    project_name = f"test_flush_{secrets.token_urlsafe(8)}"
+    run_name = "test_run"
+
+    run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
+    wait_for_client(run)
+
+    original_predict = run._client.predict
+    call_count = [0]
+
+    def wrapped_predict(*args, **kwargs):
+        call_count[0] += 1
+        if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 3:
+            raise Exception("Connection refused")
+        return original_predict(*args, **kwargs)
+
+    run._client.predict = wrapped_predict
+
+    trackio.log({"loss": 0.5, "epoch": 1})
+    trackio.log({"loss": 0.3, "epoch": 2})
+    time.sleep(2)
+    trackio.log({"loss": 0.1, "epoch": 3})
+    time.sleep(10)
+    trackio.finish()
+
+    verify_client = Client(test_space_id)
+    summary = verify_client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 3, (
+        f"Expected all 3 logs on Space after recovery, got {summary['num_logs']}"
+    )
+
+    local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
+    assert len(local_logs) == 0, (
+        f"Expected local buffer to be empty after flush, but found {len(local_logs)} rows"
+    )
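The behavior these tests pin down (buffer on failure, replay on recovery) can be sketched independently of Trackio. Everything below is hypothetical illustration: the `FallbackBuffer` class, its `pending` table, and the injectable `transport` callable are invented names, not the library's actual storage schema or API.

```python
import json
import sqlite3


class FallbackBuffer:
    """Sketch of buffer-on-failure / replay-on-recovery (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")

    def send(self, batch, transport):
        try:
            self.flush(transport)  # replay any previously failed batches first
            transport(batch)
        except Exception:
            # Send failed: persist the batch locally so no data is lost.
            self.db.execute("INSERT INTO pending VALUES (?)", (json.dumps(batch),))
            self.db.commit()

    def flush(self, transport):
        rows = self.db.execute("SELECT rowid, payload FROM pending").fetchall()
        for rowid, payload in rows:
            transport(json.loads(payload))  # raises again if still offline
            self.db.execute("DELETE FROM pending WHERE rowid = ?", (rowid,))
            self.db.commit()


# Demo: the first send fails and is buffered locally; once the fake
# connection recovers, the buffered batch is replayed before new data.
sent, online = [], [False]


def transport(batch):
    if not online[0]:
        raise ConnectionError("Space unreachable")
    sent.append(batch)


buf = FallbackBuffer()
buf.send([{"loss": 0.5}], transport)  # offline: lands in the local buffer
online[0] = True
buf.send([{"loss": 0.3}], transport)  # online: buffered batch replays first
pending_rows = buf.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

Deleting a pending row only after its `transport` call succeeds is what gives the zero-loss property the tests assert: a crash or repeated failure leaves the row in SQLite for the next recovery attempt.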
Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+import secrets
+import time
+from unittest.mock import patch
+
+import numpy as np
+from gradio_client import Client
+
+import trackio
+from trackio import gpu
+
+
+def test_config_persisted_on_spaces(test_space_id, wait_for_client):
+    project_name = f"test_config_{secrets.token_urlsafe(8)}"
+    run_name = "config_run"
+
+    run = trackio.init(
+        project=project_name,
+        name=run_name,
+        space_id=test_space_id,
+        config={"lr": 0.001, "batch_size": 32, "model": "resnet50"},
+    )
+    wait_for_client(run)
+
+    trackio.log({"loss": 0.5, "acc": 0.8})
+    trackio.log({"loss": 0.3, "acc": 0.9})
+    trackio.finish()
+
+    client = Client(test_space_id)
+
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 2
+    assert "loss" in summary["metrics"]
+    assert "acc" in summary["metrics"]
+
+
+def test_system_metrics_on_spaces(test_space_id, wait_for_client):
+    project_name = f"test_system_{secrets.token_urlsafe(8)}"
+    run_name = "system_run"
+
+    def fake_gpu_metrics(device=None):
+        return {
+            "gpu/0/utilization": 75,
+            "gpu/0/allocated_memory": 4.5,
+            "gpu/0/total_memory": 12.0,
+            "gpu/0/temp": 65,
+            "gpu/0/power": 150.0,
+            "gpu/mean_utilization": 75,
+        }
+
+    with patch.object(gpu, "collect_gpu_metrics", fake_gpu_metrics):
+        with patch.object(gpu, "get_gpu_count", return_value=(1, [0])):
+            run = trackio.init(
+                project=project_name,
+                name=run_name,
+                space_id=test_space_id,
+                auto_log_gpu=True,
+                gpu_log_interval=0.2,
+            )
+            wait_for_client(run)
+
+            trackio.log({"loss": 0.5})
+            time.sleep(1)
+            trackio.finish()
+
+    client = Client(test_space_id)
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] >= 1
+
+
+def test_image_upload_on_spaces(test_space_id, wait_for_client, temp_dir):
+    project_name = f"test_image_{secrets.token_urlsafe(8)}"
+    run_name = "image_run"
+
+    run = trackio.init(
+        project=project_name,
+        name=run_name,
+        space_id=test_space_id,
+    )
+    wait_for_client(run)
+
+    img_array = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
+    image = trackio.Image(img_array, caption="test_image")
+
+    trackio.log({"loss": 0.5, "sample": image})
+    trackio.finish()
+
+    client = Client(test_space_id)
+    summary = client.predict(
+        project=project_name, run=run_name, api_name="/get_run_summary"
+    )
+    assert summary["num_logs"] == 1
+    assert "loss" in summary["metrics"]
+    assert "sample" in summary["metrics"]