Add multi-GPU system metrics support #481
Conversation
`GpuMonitor` now queries all physical GPUs from any process by ignoring `CUDA_VISIBLE_DEVICES`, so rank 0 can collect metrics for every GPU on the machine during distributed training.

- Add `get_all_gpu_count()` that bypasses `CUDA_VISIBLE_DEVICES`
- Add `all_gpus` parameter to `collect_gpu_metrics()`
- Update `GpuMonitor` to use `get_all_gpu_count()` and `all_gpus=True`
- Add per-GPU sub-accordions to SystemMetrics frontend (multi-GPU only)
- Keep single-GPU UI unchanged (no sub-accordions)
- Manual `log_gpu()` API still respects `CUDA_VISIBLE_DEVICES`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
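The logical-vs-physical split described above can be sketched in pure Python. This is an illustration, not trackio's actual code: the function names and the `physical_count` argument are stand-ins, since the real implementation asks NVML (via pynvml) for the device count instead of taking it as a parameter.

```python
import os

def get_visible_gpu_indices(physical_count):
    """Map logical GPU indices to physical ones, honoring CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; trackio's real get_gpu_count()
    queries NVML rather than accepting a count.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return list(range(physical_count))
    return [int(i) for i in visible.split(",") if i.strip()]

def get_all_gpu_indices(physical_count):
    """Ignore CUDA_VISIBLE_DEVICES entirely -- the new all_gpus path."""
    return list(range(physical_count))

# A rank-0 process launched with CUDA_VISIBLE_DEVICES=2,3 on a 4-GPU box:
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
print(get_visible_gpu_indices(4))  # [2, 3]
print(get_all_gpu_indices(4))      # [0, 1, 2, 3]
```

This is why the monitor can report all four devices while `log_gpu()` still sees only the two the launcher assigned.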
🪼 branch checks and previews
🦄 change detected: This Pull Request includes changes to the following packages.
Install Trackio from this PR (includes built frontend): `pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/afcda92b6d9e0986ba1ac98afb2f6512bc6dc6c2/trackio-0.22.0-py3-none-any.whl"`
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks @Saba9! Were you able to test on a multi-GPU machine (potentially with HF Jobs)? Would be great to see how it looks.
@abidlabs Not yet. I ran tests where I replaced
@abidlabs Tested it with HF jobs on a dual GPU machine. Seems to be working! |
- Add unit labels to chart titles (%, GiB, W, °C) in SystemMetrics
- Per-GPU sub-accordions default to closed
- Per-GPU accordion labels use "GPU 0", "GPU 1", etc.
- Strip `gpu/` prefix from summary chart titles
- Add HF Jobs stress test script for real multi-GPU validation
- Format fixes from ruff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace manual save/restore in GPU unit tests with a pytest fixture that also restores `_energy_baseline` (was leaking between tests)
- Move `keyMetricSuffixes` to the script section in SystemMetrics.svelte
- Remove test_multi_gpu_hf_job.py (temporary monkeypatch workaround, not a useful example once the feature ships)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove test_collect_gpu_metrics_default_respects_cuda_visible (tests pre-existing behavior unchanged by this PR)
- Remove test_multi_gpu_mock.py and test_single_gpu_mock.py (developer testing aids, not user-facing examples; automated tests cover this)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```python
assert "timestamp" in log
...
def test_auto_log_gpu_multi(temp_dir):
```
Not sure if this test is adding much since everything is mocked.
```python
assert gpu._energy_baseline == {}
...
def _make_mock_pynvml(num_gpus=4):
```
Again, I don't think we really need this whole mock fixture just to test whether the GPUs are being counted correctly. I think it'd be better to remove it or replace it with a simpler test.
Amazing, @Saba9! I was exploring the UI, and I think it might be useful to actually plot the system metrics from multiple GPUs on the same graph, as users may want to compare metrics across the different GPUs easily. What do you think? Here's how wandb seems to do it, for reference:
I know this might get a bit crowded, but what we could do is, for the System Metrics page, have a list of devices/GPUs in the left sidebar, just like we have runs, allowing people to trim the number of devices if it becomes too unwieldy. cc @qgallouedec @kashif for visibility
thanks! looks good
Pull request overview
This PR adds multi-GPU system metrics collection on the backend (so rank 0 can report all physical GPUs regardless of CUDA_VISIBLE_DEVICES) and updates the System Metrics UI to display per-GPU plots when multiple devices are present.
Changes:
- Backend: add `get_all_gpu_count()` and an `all_gpus` mode in `collect_gpu_metrics()`, and switch `GpuMonitor` to log system-wide GPU metrics.
- Frontend: render subgroup accordions intended for per-GPU metrics and improve chart titles/units.
- Tests: extend unit/e2e-local coverage for multi-GPU logging behavior and add a changeset entry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| trackio/gpu.py | Adds physical GPU enumeration + all_gpus collection path; GpuMonitor now logs all GPUs system-wide. |
| trackio/frontend/src/pages/SystemMetrics.svelte | Adds subgroup rendering (intended per-GPU) and metric title/unit formatting. |
| tests/unit/test_gpu.py | Adds tests for get_all_gpu_count() and collect_gpu_metrics(all_gpus=True). |
| tests/e2e-local/test_basic_logging.py | Updates mocks and adds an e2e-local multi-GPU auto-log test. |
| .changeset/forty-pigs-beg.md | Declares a minor release for the multi-GPU system metrics feature. |
Comments suppressed due to low confidence (1)
trackio/gpu.py:151
`collect_gpu_metrics()` now has two indexing modes (logical indices from `CUDA_VISIBLE_DEVICES` vs physical indices when `all_gpus=True`). `_energy_baseline` is keyed by `logical_idx`, so calling `collect_gpu_metrics()` in both modes within a run can mix baselines across different physical GPUs and produce incorrect/negative `energy_consumed` values. Key the baseline by `physical_idx` (or otherwise disambiguate by mode) so energy deltas are tracked per physical device.
```python
if all_gpus and device is None:
    gpu_count, visible_gpus = get_all_gpu_count()
else:
    gpu_count, visible_gpus = get_gpu_count()
if gpu_count == 0:
    return {}
if device is not None:
    if device < 0 or device >= gpu_count:
        return {}
    gpu_indices = [(device, visible_gpus[device])]
else:
    gpu_indices = list(enumerate(visible_gpus))
```
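The baseline fix Copilot suggests can be sketched in a few lines. This is a minimal illustration of the idea (per-physical-device baseline keys), not trackio's actual internals; the `total_mj` readings are hypothetical values that the real code would pull from NVML.

```python
# Energy baselines keyed by the PHYSICAL GPU index, so the logical and
# all_gpus indexing modes can never mix baselines for different devices.
_energy_baseline = {}

def energy_consumed(physical_idx, total_mj):
    """Energy (mJ) consumed since the first reading for this physical GPU."""
    # setdefault records the first reading as the baseline and returns it.
    baseline = _energy_baseline.setdefault(physical_idx, total_mj)
    return total_mj - baseline

print(energy_consumed(2, 1000))  # 0   (first reading sets the baseline)
print(energy_consumed(2, 1500))  # 500 (delta against GPU 2's baseline)
print(energy_consumed(3, 9000))  # 0   (independent baseline per device)
```

Keying by `logical_idx` instead would let GPU 0's baseline (physical GPU 2 under `CUDA_VISIBLE_DEVICES=2,3`) be subtracted from physical GPU 0's reading in `all_gpus` mode, which is the negative-delta bug described above.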
```svelte
{@const subEntries = Object.entries(group.subgroups)}
{#if subEntries.length > 1}
  <div class="subgroup-list">
    {#each subEntries as [subName, subMetrics]}
```
The subgroup rendering only runs when subEntries.length > 1. For single-GPU runs, gpu/0/* metrics end up in group.subgroups (see groupMetricsByPrefix), so this condition prevents any per-GPU plots from rendering, which breaks the stated “single-GPU UI unchanged” behavior. Handle the subEntries.length === 1 case by rendering that subgroup’s metrics inline (without an extra accordion) or by merging them into orderedDirect.
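To make the single-GPU failure mode concrete, here is a Python model of the grouping behavior the comment describes. It is a reconstruction for illustration only: `groupMetricsByPrefix` lives in the Svelte frontend, and this sketch only assumes that `gpu/0/utilization`-style names are split into per-device subgroups.

```python
def group_metrics_by_prefix(names):
    """Rough Python model of the frontend's grouping: 'gpu/0/utilization'
    becomes subgroup '0' under group 'gpu'; 'cpu/utilization' stays direct."""
    groups = {}
    for name in names:
        head, _, rest = name.partition("/")
        group = groups.setdefault(head, {"direct": [], "subgroups": {}})
        sub, sep, _metric = rest.partition("/")
        if sep and sub.isdigit():
            group["subgroups"].setdefault(sub, []).append(name)
        else:
            group["direct"].append(name)
    return groups

single = group_metrics_by_prefix(["gpu/0/utilization", "gpu/0/memory_used"])
# One subgroup only, so a `subEntries.length > 1` guard renders nothing
# for this run -- the bug described above.
print(len(single["gpu"]["subgroups"]))  # 1
print(single["gpu"]["direct"])          # []
```

Under this model, a single-GPU run has all of its metrics in subgroup `"0"` and none in `direct`, so guarding the subgroup UI behind `length > 1` hides them entirely.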
```svelte
{@const filteredSub = subMetrics.filter((m) => keyMetricSuffixes.some((s) => m.endsWith("/" + s)))}
{@const subKey = `sys:${groupName}:${subName}`}
{@const orderedSub = getOrderedMetrics(subKey, filteredSub)}
<Accordion label="GPU {subName}" open={false}>
```
This subgroup UI is hard-coded to GPUs (label="GPU {subName}") and filters subgroup metrics using keyMetricSuffixes, but groupMetricsByPrefix() will also create subgroups for other prefixes (e.g. cpu/{core}/utilization from Apple Silicon metrics). As written, this will mislabel non-GPU subgroups as “GPU …” and may hide their metrics. Either restrict this subgroup block to groupName === "gpu", or make the labeling/filtering generic per group.
Suggested change:

```diff
-{@const filteredSub = subMetrics.filter((m) => keyMetricSuffixes.some((s) => m.endsWith("/" + s)))}
-{@const subKey = `sys:${groupName}:${subName}`}
-{@const orderedSub = getOrderedMetrics(subKey, filteredSub)}
-<Accordion label="GPU {subName}" open={false}>
+{@const filteredSub = groupName === "gpu"
+  ? subMetrics.filter((m) => keyMetricSuffixes.some((s) => m.endsWith("/" + s)))
+  : subMetrics}
+{@const subKey = `sys:${groupName}:${subName}`}
+{@const orderedSub = getOrderedMetrics(subKey, filteredSub)}
+{@const subgroupLabel = `${groupName === "gpu" ? "GPU" : groupName.charAt(0).toUpperCase() + groupName.slice(1)} ${subName}`}
+<Accordion label={subgroupLabel} open={false}>
```
```python
if all_gpus and device is None:
    gpu_count, visible_gpus = get_all_gpu_count()
else:
    gpu_count, visible_gpus = get_gpu_count()
if gpu_count == 0:
```
Test coverage doesn't currently exercise the `all_gpus=False` path of `collect_gpu_metrics()` with `CUDA_VISIBLE_DEVICES` set (i.e., verifying that logical indices are used in metric keys while NVML handles use physical indices). Adding a unit test for `collect_gpu_metrics()` default behavior under a non-trivial `CUDA_VISIBLE_DEVICES` (e.g. `"2,3"`) would help prevent regressions alongside the new `all_gpus` mode.
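The invariant such a test would pin down can be sketched without any GPU at all. `logical_to_physical` here is a hypothetical stand-in for the index pairing inside `collect_gpu_metrics()`, not trackio's actual function; only the env-var handling is assumed.

```python
import os
from unittest import mock

def logical_to_physical():
    """Hypothetical stand-in: metric keys use the logical index,
    NVML device handles use the physical index."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    physical = [int(i) for i in visible.split(",") if i.strip()]
    return list(enumerate(physical))  # (logical_idx, physical_idx) pairs

# Simulate a process that was handed GPUs 2 and 3 by the launcher.
with mock.patch.dict(os.environ, {"CUDA_VISIBLE_DEVICES": "2,3"}):
    pairs = logical_to_physical()

# Logical keys 0 and 1 should map to physical handles 2 and 3.
print(pairs)  # [(0, 2), (1, 3)]
```

A real regression test would patch pynvml the same way the existing unit tests do and assert that the emitted metric keys are `gpu/0/...` and `gpu/1/...` while the NVML lookups hit devices 2 and 3.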
Made some UI tweaks to put all of the devices on the same graph @Saba9!
https://huggingface.co/spaces/abidlabs/pr-481-multigpu-demo-20260415-1935

Everything else LGTM, so I'll go ahead and merge this in after CI is green.




Summary
- `GpuMonitor` now queries all physical GPUs from any process by ignoring `CUDA_VISIBLE_DEVICES`, so rank 0 can collect metrics for every GPU on the machine during distributed training (uses pynvml's `nvmlDeviceGetCount()` directly)
- `trackio.log_gpu()` still respects `CUDA_VISIBLE_DEVICES`

Changes

- `trackio/gpu.py` — Add `get_all_gpu_count()`, add `all_gpus` param to `collect_gpu_metrics()`, update `GpuMonitor` to use them
- `trackio/frontend/src/pages/SystemMetrics.svelte` — Add subgroup rendering for multi-GPU, strip `gpu/` prefix from summary chart titles
- `tests/unit/test_gpu.py` — Unit tests for `get_all_gpu_count()` and `collect_gpu_metrics(all_gpus=True/False)`
- `tests/e2e-local/test_basic_logging.py` — Update existing mock, add multi-GPU e2e test
- `examples/test_multi_gpu_mock.py` — Mock script to test 4-GPU UI locally
- `examples/test_single_gpu_mock.py` — Mock script to test single-GPU UI locally

Test plan

- `pytest tests/unit/test_gpu.py` — 6 tests pass
- `pytest tests/e2e-local/test_basic_logging.py` — 7 tests pass (including new multi-GPU test)
- `pytest` — full suite passes (1 pre-existing flaky failure in `test_import_export`)
- Run `examples/test_multi_gpu_mock.py` → verify System Metrics shows per-GPU accordions
- Run `examples/test_single_gpu_mock.py` → verify single-GPU UI unchanged

🤖 Generated with Claude Code
pytest tests/unit/test_gpu.py— 6 tests passpytest tests/e2e-local/test_basic_logging.py— 7 tests pass (including new multi-GPU test)pytest— full suite passes (1 pre-existing flaky failure intest_import_export)examples/test_multi_gpu_mock.py→ verify System Metrics shows per-GPU accordionsexamples/test_single_gpu_mock.py→ verify single-GPU UI unchanged🤖 Generated with Claude Code