Commit 8e26ab9

abidlabs, gradio-pr-bot, and claude authored
Add an id field to Run which is used internally, allowing users to have multiple runs with the same run name (#505)
Co-authored-by: gradio-pr-bot <gradio-pr-bot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 498bbc4 commit 8e26ab9

32 files changed

Lines changed: 1696 additions & 473 deletions

.changeset/fifty-bugs-make.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+"trackio": minor
+---
+
+feat:Add an `id` field to `Run` which is used internally, allowing users to have multiple runs with the same run name

README.md

Lines changed: 1 addition & 1 deletion
@@ -266,7 +266,7 @@ These numbers were measured against a free-tier Hugging Face Space (2 vCPU / 16
 
 ## Note: Trackio is in Beta (DB Schema May Change)
 
-Note that Trackio is in pre-release right now and we may release breaking changes. In particular, the schema of the Trackio sqlite database may change, which may require migrating or deleting existing database files (located by default at: `~/.cache/huggingface/trackio`).
+Note that Trackio is in pre-release right now and we may release breaking changes. In particular, the schema of the Trackio sqlite database may change. Newer Trackio databases now use a stable `run_id` plus a non-unique `run_name`, while older databases remain readable in compatibility mode by treating `run_name` as the effective run identifier. Existing database files are located by default at: `~/.cache/huggingface/trackio`.
 
 The current SQLite and parquet layout is documented in the [Storage Schema and Direct Queries](https://huggingface.co/docs/trackio/storage_schema) guide, including examples for `trackio query`.
 
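
The compatibility mode described in the README change can be pictured with a small sketch. This is not Trackio's implementation: the table name `metrics`, its columns, and the helper `effective_run_ids` are assumptions made purely for illustration. The only idea taken from the text is that a database without a `run_id` column falls back to `run_name` as the identifier.

```python
import sqlite3


def effective_run_ids(conn: sqlite3.Connection, table: str = "metrics") -> list[str]:
    """Return distinct run identifiers, falling back to run_name on old schemas."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    key = "run_id" if "run_id" in columns else "run_name"
    rows = conn.execute(f"SELECT DISTINCT {key} FROM {table} ORDER BY {key}")
    return [row[0] for row in rows]


# Hypothetical old-style database: run_name is the only identifier.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE metrics (run_name TEXT, step INTEGER)")
old.executemany("INSERT INTO metrics VALUES (?, ?)", [("a", 0), ("a", 1), ("b", 0)])

# Hypothetical new-style database: two runs share a name but have distinct ids.
new = sqlite3.connect(":memory:")
new.execute("CREATE TABLE metrics (run_id TEXT, run_name TEXT, step INTEGER)")
new.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [("id-1", "a", 0), ("id-2", "a", 0)],
)

print(effective_run_ids(old))  # identifiers are run names
print(effective_run_ids(new))  # identifiers are run ids
```

In the old layout the two rows named `a` collapse into one run, while in the new layout the same name maps to two separate runs.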

docs/source/python_api.md

Lines changed: 4 additions & 3 deletions
@@ -89,11 +89,13 @@ Represents a single run in a project.
 
 #### Properties
 
-- **`id`**: The run name (same as `name`)
-- **`name`**: The run name
+- **`id`**: The stable run identifier used internally by Trackio
+- **`name`**: The human-readable run name. Multiple runs can share the same name.
 - **`project`**: The project this run belongs to
 - **`config`**: The run's configuration dictionary (lazy-loaded)
 
+Note: Multiple runs can share the same `name`, as Trackio will use the `id` identifier to disambiguate them internally.
+
 #### Methods
 
 - **`delete() -> bool`**: Deletes the run from its project. Returns `True` if successful, `False` otherwise.
@@ -149,4 +151,3 @@ for run in source_runs:
     run.move("archive")
     print(f"Moved {run.name} to archive")
 ```
-
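
Because `name` is no longer unique, code that used to key on the run name may need to key on `id` instead. The sketch below illustrates that with a stand-in `RunRecord` dataclass. Both `RunRecord` and `group_by_name` are hypothetical names and do not call the Trackio API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical stand-in for a run: a stable id plus a non-unique name."""

    id: str
    name: str


def group_by_name(runs):
    """Map each run name to the list of run ids that share it, in input order."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.name].append(run.id)
    return dict(grouped)


runs = [
    RunRecord("r1", "baseline"),
    RunRecord("r2", "baseline"),
    RunRecord("r3", "ablation"),
]
by_name = group_by_name(runs)
print(by_name)
```

A name that maps to more than one id is ambiguous, so any lookup that must hit a single run should use the id.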

docs/source/track.md

Lines changed: 14 additions & 3 deletions
@@ -309,15 +309,26 @@ trackio.finish()
 
 ## Resuming a Run
 
-If you need to continue a run (for example, after an interruption), you can resume it by calling [`init`] again with the same project and run name, and setting `resume="must"`:
+Trackio identifies runs internally by a stable `run_id`. The human-readable `name`
+is no longer required to be unique, so you can create multiple runs with the same
+display name.
+
+If you need to continue the latest run with a given name (for example, after an
+interruption), call [`init`] again with the same project and run name, and set
+`resume="must"`:
 
 ```python
 trackio.init(project="my_project", name="my_first_run", resume="must")
 ```
 
-This will load the existing run so you can keep logging data.
+This will load the most recently created run with that name so you can keep
+logging data using the same `run_id`. If you set `resume="must"` and no previous run with that name exists, Trackio raises an error.
+
+For more flexibility, use `resume="allow"`. This will resume the latest run with
+that name if one exists, or create a new run otherwise.
 
-For more flexibility, use `resume="allow"`. This will resume the run if it exists, or create a new one otherwise.
+The default is `resume="never"`, which always creates a fresh run with a new
+`run_id`, even if another run with the same `name` already exists.
 
 ## Tracking Configuration
 
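
The three resume modes documented in this hunk can be summarized as a small decision function. This is a toy model under stated assumptions, not Trackio's code: `resolve_run_id` is a hypothetical helper, runs are modeled as `(run_id, name)` pairs in creation order, and `LookupError` stands in for whatever error Trackio actually raises, which the source does not specify.

```python
import uuid


def resolve_run_id(existing, name, resume="never"):
    """Pick the run id to log to, given prior runs as (run_id, name) pairs.

    `existing` is assumed to be ordered oldest first, so the last match is
    the most recently created run with that name.
    """
    if resume not in ("never", "allow", "must"):
        raise ValueError(f"unknown resume mode: {resume!r}")
    matches = [run_id for run_id, run_name in existing if run_name == name]
    if resume != "never" and matches:
        return matches[-1]  # resume the latest run with that name
    if resume == "must":
        # Assumption: the real error type is not stated in the docs.
        raise LookupError(f"no run named {name!r} to resume")
    return uuid.uuid4().hex  # fresh run with a new id


existing = [("id-1", "my_first_run"), ("id-2", "my_first_run")]
print(resolve_run_id(existing, "my_first_run", resume="must"))
print(resolve_run_id(existing, "my_first_run", resume="never"))
```

With two prior runs named `my_first_run`, `"must"` and `"allow"` both select the later one, while `"never"` produces an id that matches neither.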

examples/crash-and-resume-same-run-name.py

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+"""
+Demonstrates Trackio's run resume behavior when a job crashes and restarts with
+the same human-readable run name.
+
+Usage:
+    python examples/crash-and-resume-same-run-name.py
+    python examples/crash-and-resume-same-run-name.py --resume never
+    python examples/crash-and-resume-same-run-name.py --resume allow
+    python examples/crash-and-resume-same-run-name.py --resume must
+
+This example runs both phases in a single invocation:
+- phase 1 always starts a fresh run and logs 50 steps by default
+- a simulated crash interrupts the job
+- phase 2 restarts the job with the configured resume mode and logs 100 more steps
+
+The restart behavior is controlled by `--resume`:
+- `never`: restart creates a second run with the same name and a new run_id
+- `allow`: restart resumes the latest run with that name if it exists
+- `must`: restart must resume an existing run with that name
+"""
+
+import argparse
+import math
+import uuid
+import warnings
+
+warnings.filterwarnings(
+    "ignore",
+    category=SyntaxWarning,
+    module=r"pydub\.utils",
+)
+
+import trackio  # noqa: E402
+
+DEFAULT_PROJECT = f"crash-and-resume-demo-{uuid.uuid4().hex[:8]}"
+DEFAULT_RUN_NAME = "trainer-job-42"
+DEFAULT_CRASH_STEPS = 50
+DEFAULT_RESTART_STEPS = 100
+
+
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--project", default=DEFAULT_PROJECT)
+    parser.add_argument("--run-name", default=DEFAULT_RUN_NAME)
+    parser.add_argument("--crash-steps", type=int, default=DEFAULT_CRASH_STEPS)
+    parser.add_argument("--restart-steps", type=int, default=DEFAULT_RESTART_STEPS)
+    parser.add_argument(
+        "--resume",
+        choices=["never", "allow", "must"],
+        default="never",
+        help="Resume mode used for the simulated restart phase.",
+    )
+    return parser.parse_args()
+
+
+def log_phase(
+    start_step: int, num_steps: int, start_loss: float, end_loss: float
+) -> None:
+    print(f"Logging steps {start_step}..{start_step + num_steps - 1}")
+    for offset in range(num_steps):
+        progress = offset / max(1, num_steps - 1)
+        loss = (
+            start_loss
+            + ((end_loss - start_loss) * progress)
+            + (0.01 * math.sin(offset / 6))
+        )
+        accuracy = (
+            0.25
+            + (0.7 * (1 - (loss / max(start_loss, 0.01))))
+            + (0.02 * math.cos(offset / 9))
+        )
+        trackio.log(
+            {
+                "loss": round(loss, 4),
+                "accuracy": round(max(0.0, min(0.999, accuracy)), 4),
+                "phase_progress": offset + 1,
+            },
+            step=None,
+        )
+
+
+def start_run(
+    project: str,
+    run_name: str,
+    resume: str,
+    phase: str,
+    crash_steps: int,
+    restart_steps: int,
+):
+    run = trackio.init(
+        project=project,
+        name=run_name,
+        resume=resume,
+        config={
+            "phase": phase,
+            "resume_mode": resume,
+            "crash_steps": crash_steps,
+            "restart_steps": restart_steps,
+        },
+    )
+    print(f"Trackio run name: {run.name}")
+    print(f"Trackio run id: {run.id}")
+    print(f"Phase: {phase}")
+    print(f"Resume mode: {resume}")
+    return run
+
+
+def main() -> None:
+    args = parse_args()
+
+    print("=== phase 1: start fresh run ===")
+    first_run = start_run(
+        project=args.project,
+        run_name=args.run_name,
+        resume="never",
+        phase="crash",
+        crash_steps=args.crash_steps,
+        restart_steps=args.restart_steps,
+    )
+    log_phase(start_step=0, num_steps=args.crash_steps, start_loss=0.7, end_loss=0.6)
+    trackio.finish()
+
+    print(f"Simulated crash after {args.crash_steps} steps. Restarting the job now.")
+
+    print("=== phase 2: restart job ===")
+    restarted_run = start_run(
+        project=args.project,
+        run_name=args.run_name,
+        resume=args.resume,
+        phase="restart",
+        crash_steps=args.crash_steps,
+        restart_steps=args.restart_steps,
+    )
+    log_phase(
+        start_step=args.crash_steps,
+        num_steps=args.restart_steps,
+        start_loss=0.7,
+        end_loss=0.2,
+    )
+    trackio.finish()
+
+    resumed_same_run = restarted_run.id == first_run.id
+    print(f"Restart reused original run id: {resumed_same_run}")
+    print(f"Project: {args.project}")
+    print("Done. Open the dashboard to inspect the resulting run list and charts.")
+
+
+if __name__ == "__main__":
+    main()

tests/e2e-local/test_api.py

Lines changed: 5 additions & 2 deletions
@@ -191,7 +191,8 @@ def test_local_dashboard_supports_remote_client(temp_dir):
         settings = client.predict(api_name="/get_settings")
 
         assert project in projects
-        assert runs == [run_name]
+        assert len(runs) == 1
+        assert runs[0]["name"] == run_name
         assert "logo_urls" in settings
     finally:
         trackio.delete_project(project, force=True)
@@ -362,7 +363,9 @@ async def check_mcp() -> None:
                     "get_runs_for_project",
                     {"project": project},
                 )
-                assert runs.structuredContent["result"] == [run_name]
+                result = runs.structuredContent["result"]
+                assert len(result) == 1
+                assert result[0]["name"] == run_name
 
                 run_summary = await session.call_tool(
                     "get_run_summary",

tests/e2e-spaces/test_metrics_on_spaces.py

Lines changed: 2 additions & 1 deletion
@@ -128,9 +128,10 @@ def test_runs_data_persisted_after_restart(test_space_id):
     deadline = time.time() + 300
     while time.time() < deadline:
         try:
-            run_names = client.predict(
+            run_records = client.predict(
                 project=project_name, api_name="/get_runs_for_project"
             )
+            run_names = [r["name"] if isinstance(r, dict) else r for r in run_records]
             if run_name in run_names:
                 break
         except Exception:

tests/e2e-spaces/test_throughput.py

Lines changed: 2 additions & 1 deletion
@@ -95,9 +95,10 @@ def worker(thread_idx):
     deadline = time.time() + 120
     while time.time() < deadline:
         try:
-            runs = verify_client.predict(
+            run_records = verify_client.predict(
                 project=project_name, api_name="/get_runs_for_project"
             )
+            runs = [r["name"] if isinstance(r, dict) else r for r in run_records]
             if len(runs) == num_threads:
                 break
         except Exception:
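
Both Spaces tests normalize the `/get_runs_for_project` response with the same list comprehension, so that they accept the new record shape as well as the older plain-string shape. That logic could be factored into a helper along these lines. The function name `run_names_from_records` and the `"id"` key in the sample data are assumptions. Only the `"name"` key appears in the diff.

```python
def run_names_from_records(run_records):
    """Extract run names from records that may be dicts (new) or strings (old)."""
    return [r["name"] if isinstance(r, dict) else r for r in run_records]


# New-style response: one dict per run (the "id" key is illustrative).
new_style = [{"id": "id-1", "name": "run-a"}, {"id": "id-2", "name": "run-a"}]
# Old-style response: a bare list of run names.
old_style = ["run-a", "run-b"]

print(run_names_from_records(new_style))
print(run_names_from_records(old_style))
```

Names can now repeat in the result, so a check such as `len(runs) == num_threads` counts runs rather than distinct names.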
