Merged
5 changes: 5 additions & 0 deletions .changeset/smart-cameras-travel.md
@@ -0,0 +1,5 @@
---
"trackio": minor
---

feat: Make Trackio logging much more robust
29 changes: 27 additions & 2 deletions README.md
@@ -20,10 +20,13 @@

</div>

`trackio` is a lightweight, free experiment tracking Python library built by Hugging Face 🤗.
Welcome to `trackio`: a lightweight, <u>free</u> experiment tracking Python library built by Hugging Face 🤗. It is local-first, supports very high logging throughput across many parallel experiments, and provides an easy CLI for querying, making it well suited to LLM-driven experimentation.

Trackio also ships with a Gradio-based dashboard you can use to view metrics locally:

![Screen Recording 2025-11-06 at 5 34 50 PM](https://github.com/user-attachments/assets/8c9c1b96-f17a-401c-83a4-26ac754f89c7)

Trackio's main features:

- **API compatible** with `wandb.init`, `wandb.log`, and `wandb.finish`. Drop-in replacement: just

@@ -36,9 +39,10 @@
- Persists logs in a SQLite database locally (or, if you provide a `space_id`, in a private Hugging Face Dataset)
- Visualize experiments with a Gradio dashboard locally (or, if you provide a `space_id`, on Hugging Face Spaces)
- **LLM-friendly**: Built with autonomous ML experiments in mind, Trackio includes a CLI for programmatic access and a Python API for run management, making it easy for LLMs to log metrics and query experiment data.

- Everything here, including hosting on Hugging Face, is **free**!

Trackio is designed to be lightweight (the core codebase is <5,000 lines of Python code), not fully-featured. It is designed in an extensible way and written entirely in Python so that developers can easily fork the repository and add functionality that they care about.
Trackio is designed to be lightweight and extensible. It is written entirely in Python so that developers can easily fork the repository and add functionality that they care about.

## Installation

@@ -205,6 +209,27 @@ To get started and see basic examples of usage, see these files:
- [Persisting metrics in a Hugging Face Dataset](https://github.com/gradio-app/trackio/blob/main/examples/persist-dataset.py)
- [Deploying the dashboard to Spaces](https://github.com/gradio-app/trackio/blob/main/examples/deploy-on-spaces.py)

## Throughput & Rate Limits

### Local logging

`trackio.log()` is a non-blocking call that appends to an in-memory queue and returns immediately. A background thread drains the queue every **0.5 s** and writes the batch to the local SQLite database. Because log calls never touch the network or disk on the calling thread, client-side throughput is effectively **unlimited**: you can burst thousands of calls per second without slowing down your training loop.
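The queue-and-drain pattern described above can be sketched in a few lines of Python. This is an illustrative stand-in, not Trackio's actual internals: the class and attribute names are hypothetical, and the `flushed` list stands in for the SQLite write.

```python
import queue
import threading
import time


class LogQueue:
    """Minimal sketch of a non-blocking log queue drained by a background thread."""

    def __init__(self, flush_interval=0.5):
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self.flushed = []  # stand-in for the local SQLite database
        self._thread = threading.Thread(
            target=self._drain_loop, args=(flush_interval,), daemon=True
        )
        self._thread.start()

    def log(self, metrics):
        # Returns immediately: no disk or network I/O on the caller's thread.
        self._queue.put(metrics)

    def _drain_loop(self, flush_interval):
        while not self._stop.is_set():
            time.sleep(flush_interval)
            self._flush()

    def _flush(self):
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.flushed.extend(batch)  # a real implementation writes the batch here

    def finish(self):
        self._stop.set()
        self._thread.join()
        self._flush()  # drain anything still queued


lq = LogQueue()
for step in range(1000):
    lq.log({"step": step, "loss": 1.0 / (step + 1)})
lq.finish()
print(len(lq.flushed))  # 1000
```

Because `log()` only enqueues, a burst of a thousand calls costs microseconds each; all I/O happens on the drain thread at the flush interval.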

### Logging to a Hugging Face Space

When a `space_id` is provided, the same background thread batches queued entries and pushes them to the Space via the Gradio client API. The main factors that affect end-to-end throughput are:

| Metric | Measured | Notes |
|---|---|---|
| **Burst from a single run** | **2,000 logs delivered in < 8 s** | `log()` calls themselves complete in ~0.01 s; the rest is network drain time. |
| **Parallel runs (32 threads)** | **32,000 logs (32 × 1,000) delivered in ~14 s wall time** | Each thread opens its own Gradio client connection to the Space. |
| **Logs per batch** | No hard cap | All entries queued during the 0.5 s interval are sent in a single `predict()` call. |
| **Data safety** | Zero-loss | If a batch fails to send, it is persisted to local SQLite and retried automatically when the connection recovers. |

These numbers were measured against a free-tier Hugging Face Space (2 vCPU / 16 GB RAM). Throughput will scale with the Space hardware tier, and local-only logging is orders of magnitude faster since no network round-trip is involved.

> **Tip:** For high-frequency logging (e.g. logging every training step), Trackio's queue-and-batch design means your training loop is never blocked by network I/O. Even if the Space is temporarily unreachable, logs accumulate locally and are replayed once the connection is restored.
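The zero-loss behavior described above can be sketched as a send-with-fallback step: if the network push fails, the batch is persisted to a local SQLite buffer and replayed on the next successful send. This is a simplified illustration under assumed names; the `pending` table and `send_batch` function are hypothetical, not Trackio's actual schema or API.

```python
import json
import sqlite3


def make_buffer(path=":memory:"):
    """Open the local fallback buffer (in-memory here for illustration)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")
    return conn


def send_batch(conn, batch, push):
    """Try to push `batch` plus any previously buffered entries; buffer on failure."""
    replay = [json.loads(r[0]) for r in conn.execute("SELECT payload FROM pending")]
    try:
        push(replay + batch)                  # e.g. a network call to the Space
        conn.execute("DELETE FROM pending")   # push succeeded: clear the buffer
    except Exception:
        for entry in batch:                   # push failed: persist locally for retry
            conn.execute(
                "INSERT INTO pending (payload) VALUES (?)", (json.dumps(entry),)
            )
    conn.commit()


# Simulate one failed push followed by a successful one.
delivered = []
conn = make_buffer()


def flaky_push(rows, state={"calls": 0}):
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("Space unreachable")
    delivered.extend(rows)


send_batch(conn, [{"loss": 0.5}], flaky_push)  # fails, batch buffered locally
send_batch(conn, [{"loss": 0.3}], flaky_push)  # succeeds, replays the buffer
print(len(delivered))  # 2
```

After the second send, both entries have reached the remote side and the local buffer is empty, which mirrors the recovery behavior the tests in this PR exercise.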

## Note: Trackio is in Beta (DB Schema May Change)

Trackio is currently in pre-release, and we may ship breaking changes. In particular, the schema of the Trackio SQLite database may change, which may require migrating or deleting existing database files (located by default at `~/.cache/huggingface/trackio`).
9 changes: 9 additions & 0 deletions tests/conftest.py
@@ -6,6 +6,7 @@
import pytest
from PIL import Image as PILImage

from trackio import context_vars
from trackio.media import write_audio, write_video


@@ -25,7 +26,15 @@ def temp_dir(monkeypatch):
monkeypatch.setattr(f"{name}.TRACKIO_DIR", Path(tmpdir))
for name in ["trackio.media.media", "trackio.media.utils", "trackio.utils"]:
monkeypatch.setattr(f"{name}.MEDIA_DIR", Path(tmpdir) / "media")
context_vars.current_run.set(None)
context_vars.current_project.set(None)
context_vars.current_server.set(None)
context_vars.current_space_id.set(None)
yield tmpdir
context_vars.current_run.set(None)
context_vars.current_project.set(None)
context_vars.current_server.set(None)
context_vars.current_space_id.set(None)


@pytest.fixture(autouse=True)
5 changes: 2 additions & 3 deletions tests/e2e-local/test_bulk_logging.py
@@ -14,7 +14,7 @@ def test_rapid_bulk_logging(temp_dir):
run1_name = "bulk_test_run1"
run2_name = "bulk_test_run2"

trackio.init(project=project_name, name=run1_name)
run1 = trackio.init(project=project_name, name=run1_name)
start_time = time.time()

num_logs_run1 = 300
@@ -31,8 +31,7 @@
f"1000 calls of trackio.log() took {time_to_run_1000_logs} seconds, which is too long"
)
trackio.finish()

time.sleep(0.6) # Wait for the client to send the logs
run1.finish()

# Verify run1 metrics
metrics_run1 = SQLiteStorage.get_logs(project_name, run1_name)
15 changes: 15 additions & 0 deletions tests/e2e-spaces/conftest.py
@@ -0,0 +1,15 @@
import time

import pytest


@pytest.fixture
def wait_for_client():
def _wait(run, timeout=60):
deadline = time.time() + timeout
while run._client is None:
if time.time() > deadline:
raise TimeoutError("Client did not connect within timeout")
time.sleep(0.1)

return _wait
162 changes: 162 additions & 0 deletions tests/e2e-spaces/test_data_robustness.py
@@ -0,0 +1,162 @@
import secrets
import time

from gradio_client import Client

import trackio
from trackio.sqlite_storage import SQLiteStorage


def test_data_not_lost_on_transient_network_error(
test_space_id, temp_dir, wait_for_client
):
"""
When predict() fails once due to a transient network error and then
recovers, the failed batch should be retried and all data should
eventually reach the Space.
"""
project_name = f"test_transient_{secrets.token_urlsafe(8)}"
run_name = "test_run"

run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
wait_for_client(run)

original_predict = run._client.predict
call_count = [0]

def wrapped_predict(*args, **kwargs):
call_count[0] += 1
if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 1:
raise Exception("ReadTimeout: The read operation timed out")
return original_predict(*args, **kwargs)

run._client.predict = wrapped_predict

trackio.log({"loss": 0.5})
trackio.log({"loss": 0.3})
time.sleep(5)
trackio.finish()

verify_client = Client(test_space_id)
summary = verify_client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] == 2


def test_failed_data_persisted_locally(test_space_id, temp_dir, wait_for_client):
"""
When predict() permanently fails (Space unreachable), data should be
persisted to the local SQLite database as a fallback buffer so it is
not lost.
"""
project_name = f"test_persist_{secrets.token_urlsafe(8)}"
run_name = "test_run"

run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
wait_for_client(run)

original_predict = run._client.predict

def always_fail_writes(*args, **kwargs):
if kwargs.get("api_name") in ("/bulk_log", "/bulk_log_system"):
raise Exception("Connection refused")
return original_predict(*args, **kwargs)

run._client.predict = always_fail_writes

trackio.log({"loss": 0.5})
trackio.log({"loss": 0.3})
time.sleep(3)
trackio.finish()

local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
assert len(local_logs) >= 2, (
f"Expected at least 2 logs persisted in local SQLite, got {len(local_logs)}"
)


def test_data_delivered_after_batch_sender_crash(
test_space_id, temp_dir, wait_for_client
):
"""
After a network error crashes the batch sender, subsequently-logged
data should still be delivered to the Space (either by retrying within
the same thread or by restarting the sender).
"""
project_name = f"test_crash_{secrets.token_urlsafe(8)}"
run_name = "test_run"

run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
wait_for_client(run)

original_predict = run._client.predict
call_count = [0]

def wrapped_predict(*args, **kwargs):
call_count[0] += 1
if kwargs.get("api_name") == "/bulk_log" and call_count[0] == 1:
raise Exception("ReadTimeout: The read operation timed out")
return original_predict(*args, **kwargs)

run._client.predict = wrapped_predict

trackio.log({"loss": 0.5})
time.sleep(3)

trackio.log({"loss": 0.3})
trackio.log({"loss": 0.1})
time.sleep(5)
trackio.finish()

verify_client = Client(test_space_id)
summary = verify_client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] >= 2, (
f"Expected at least 2 logs on Space after recovery, got {summary['num_logs']}"
)


def test_local_buffer_flushed_after_recovery(test_space_id, temp_dir, wait_for_client):
"""
When the connection recovers after several failures, data that was
persisted in the local SQLite fallback buffer should be flushed to the
Space and cleaned up from the local database.
"""
project_name = f"test_flush_{secrets.token_urlsafe(8)}"
run_name = "test_run"

run = trackio.init(project=project_name, name=run_name, space_id=test_space_id)
wait_for_client(run)

original_predict = run._client.predict
call_count = [0]

def wrapped_predict(*args, **kwargs):
call_count[0] += 1
if kwargs.get("api_name") == "/bulk_log" and call_count[0] <= 3:
raise Exception("Connection refused")
return original_predict(*args, **kwargs)

run._client.predict = wrapped_predict

trackio.log({"loss": 0.5, "epoch": 1})
trackio.log({"loss": 0.3, "epoch": 2})
time.sleep(2)
trackio.log({"loss": 0.1, "epoch": 3})
time.sleep(10)
trackio.finish()

verify_client = Client(test_space_id)
summary = verify_client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] == 3, (
f"Expected all 3 logs on Space after recovery, got {summary['num_logs']}"
)

local_logs = SQLiteStorage.get_logs(project=project_name, run=run_name)
assert len(local_logs) == 0, (
f"Expected local buffer to be empty after flush, but found {len(local_logs)} rows"
)
97 changes: 97 additions & 0 deletions tests/e2e-spaces/test_spaces_features.py
@@ -0,0 +1,97 @@
import secrets
import time
from unittest.mock import patch

import numpy as np
from gradio_client import Client

import trackio
from trackio import gpu


def test_config_persisted_on_spaces(test_space_id, wait_for_client):
project_name = f"test_config_{secrets.token_urlsafe(8)}"
run_name = "config_run"

run = trackio.init(
project=project_name,
name=run_name,
space_id=test_space_id,
config={"lr": 0.001, "batch_size": 32, "model": "resnet50"},
)
wait_for_client(run)

trackio.log({"loss": 0.5, "acc": 0.8})
trackio.log({"loss": 0.3, "acc": 0.9})
trackio.finish()

client = Client(test_space_id)

summary = client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] == 2
assert "loss" in summary["metrics"]
assert "acc" in summary["metrics"]


def test_system_metrics_on_spaces(test_space_id, wait_for_client):
project_name = f"test_system_{secrets.token_urlsafe(8)}"
run_name = "system_run"

def fake_gpu_metrics(device=None):
return {
"gpu/0/utilization": 75,
"gpu/0/allocated_memory": 4.5,
"gpu/0/total_memory": 12.0,
"gpu/0/temp": 65,
"gpu/0/power": 150.0,
"gpu/mean_utilization": 75,
}

with patch.object(gpu, "collect_gpu_metrics", fake_gpu_metrics):
with patch.object(gpu, "get_gpu_count", return_value=(1, [0])):
run = trackio.init(
project=project_name,
name=run_name,
space_id=test_space_id,
auto_log_gpu=True,
gpu_log_interval=0.2,
)
wait_for_client(run)

trackio.log({"loss": 0.5})
time.sleep(1)
trackio.finish()

client = Client(test_space_id)
summary = client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] >= 1


def test_image_upload_on_spaces(test_space_id, wait_for_client, temp_dir):
project_name = f"test_image_{secrets.token_urlsafe(8)}"
run_name = "image_run"

run = trackio.init(
project=project_name,
name=run_name,
space_id=test_space_id,
)
wait_for_client(run)

img_array = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
image = trackio.Image(img_array, caption="test_image")

trackio.log({"loss": 0.5, "sample": image})
trackio.finish()

client = Client(test_space_id)
summary = client.predict(
project=project_name, run=run_name, api_name="/get_run_summary"
)
assert summary["num_logs"] == 1
assert "loss" in summary["metrics"]
assert "sample" in summary["metrics"]