
feat: Thor GPU process table and VRAM chart via pynvml #852

Open

whitesscott wants to merge 2 commits into rbonghi:master from whitesscott:thor-gpu-memory
Conversation

@whitesscott (Contributor) commented May 2, 2026

Add per-process GPU memory tracking and VRAM visualization using pynvml
for Jetson Thor (nvidia.ko stack), matching the intent of upstream PR #849
without using nvidia-smi.

  • When a process is actively using the GPU, this PR reports per-process GPU memory identical to the nvidia-smi method while avoiding the overhead of invoking nvidia-smi.
  • processes.py: detect Thor via is_thor(); use pynvml to query compute and graphics running processes, deduplicating by PID with max() to avoid double-counting processes that appear in both lists (see the sketch after this list).
  • thor_gpu.py: add nvml_process_table() with lazy nvmlInit() (fork-safe: jtop forks a monitoring subprocess, and NVML state does not survive fork, so init must happen in the child); add nvml_gpu_used_kb() with a 1-second TTL cache; update read_gpu_mem_rows_for_gui() to populate vram_used_b from the NVML process sum.
  • pgpu_thor.py (2GPU page): the VRAM chart now plots two series, grey for total system RAM in use (context) and yellow for GPU VRAM (process allocations); the label shows used/total without a redundant prefix.
  • memory.py: add shared_label='VRAM' on Thor (unified memory, no NvMapMemUsed) vs 'Shared' on Orin; expose it in the RAM dict.
  • pmem.py (4MEM page): read shared_label from the memory dict; compute spacing dynamically so the value column stays flush-aligned regardless of label length.
  • Implement a GPC0 meter on the 2GPU page for Thor.
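
For concreteness, here is a minimal sketch of the dedup logic these bullets describe, using the public pynvml API. It is illustrative, not this PR's actual code, and collect_gpu_process_memory is a hypothetical name:

```python
# Illustrative sketch of the per-PID dedup, not the PR's exact code.
import pynvml

def collect_gpu_process_memory():
    """Return {pid: used_bytes}, merging compute and graphics process lists."""
    pynvml.nvmlInit()  # in jtop this happens lazily in the forked child,
                       # since NVML state does not survive fork()
    try:
        usage = {}
        for idx in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            procs = (pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
                     + pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle))
            for p in procs:
                mem = p.usedGpuMemory or 0  # None when the driver can't report it
                # A PID can appear in both lists: keep the larger figure
                # rather than summing, so it is not counted twice.
                usage[p.pid] = max(usage.get(p.pid, 0), mem)
        return usage
    finally:
        pynvml.nvmlShutdown()
```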

Tested on:
Jetson AGX Thor Developer Kit (L4T R38.4)
Jetson AGX Orin Developer Kit 32GB (L4T R36.5)


Summary by Sourcery

Add NVML-based GPU memory accounting and visualization for Thor boards, integrating per-process VRAM usage into process reporting, memory stats, and the Thor GPU UI.

New Features:

  • Expose a Thor-specific NVML-backed GPU process table that reports per-process GPU memory usage for the process service.
  • Provide a Thor GPU memory summary API that surfaces NVML-derived VRAM usage for use by the GUI.
  • Add a dual-series RAM/VRAM chart on the Thor GPU page showing both total system RAM usage and GPU VRAM allocations.
  • Label unified Thor GPU memory as VRAM in RAM statistics and propagate this label into the 4MEM page legend.
  • Introduce a GPC frequency meter for Thor in both generic NVML GPU status and the Thor GPU page.

Enhancements:

  • Select NVML-based GPU process data on Thor and nvmap-based data on Orin with shared post-processing logic in the process service.
  • Cache the NVML GPU memory sum with a short TTL to avoid redundant NVML queries per sampling tick (see the sketch after this list).
  • Adjust Thor GPU page labels, colors, and layout to better present VRAM usage and power control mode.
  • Make the 4MEM page legend spacing robust to variable-length shared/VRAM labels to keep the value column aligned.
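
A minimal sketch of the TTL caching pattern named above. The function and cache names follow this PR's description; the cache layout and the query_total_kb callable (a stand-in for the real NVML process-table sum) are illustrative:

```python
import time

_GPU_USED_CACHE = {"ts": 0.0, "kb": None}  # layout illustrative
_TTL_S = 1.0  # refresh the NVML sum at most once per second

def nvml_gpu_used_kb(query_total_kb):
    """Return the cached NVML memory sum, refreshing at most once per TTL."""
    now = time.monotonic()
    if _GPU_USED_CACHE["kb"] is not None and now - _GPU_USED_CACHE["ts"] < _TTL_S:
        return _GPU_USED_CACHE["kb"]  # cache hit: no NVML traffic this tick
    _GPU_USED_CACHE["kb"] = query_total_kb()  # cache miss: one NVML sweep
    _GPU_USED_CACHE["ts"] = now
    return _GPU_USED_CACHE["kb"]
```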

sourcery-ai bot (Contributor) commented May 2, 2026

Reviewer's Guide

Implements NVML-based per-process GPU memory tracking and VRAM visualization for Thor (nvidia.ko stack), wiring the new data path into process accounting, GPU status, and GUI pages while preserving existing Orin/nvmap behavior.

Sequence diagram for Thor NVML-backed GPU process table in get_status

```mermaid
sequenceDiagram
    actor User
    participant ProcessService
    participant Processes as processes_py
    participant ThorGPU as thor_gpu_py
    participant NVML as pynvml

    User ->> ProcessService: request GPU process list
    ProcessService ->> Processes: get_status()

    activate Processes
    Processes ->> Processes: check _isThor and _nvml_process_table
    alt Thor board (nvidia_ko stack)
        Processes ->> ThorGPU: nvml_process_table()
        activate ThorGPU
        ThorGPU ->> NVML: nvmlInit()
        ThorGPU ->> NVML: nvmlDeviceGetCount()
        loop for each GPU device
            ThorGPU ->> NVML: nvmlDeviceGetHandleByIndex(idx)
            ThorGPU ->> NVML: nvmlDeviceGetComputeRunningProcesses(handle)
            ThorGPU ->> NVML: nvmlDeviceGetGraphicsRunningProcesses(handle)
            ThorGPU ->> ThorGPU: deduplicate by PID and max(usedGpuMemory)
            ThorGPU ->> NVML: nvmlSystemGetProcessName(pid)
        end
        ThorGPU ->> Processes: total_kb, rows [pid_str, user, name, gpu_mem_kb]
        deactivate ThorGPU
        Processes ->> Processes: map rows via get_process_info(pid, gpu_mem_kb, name, uptime)
    else Orin or other nvgpu stack
        Processes ->> processes_py: read_process_table(nvmap_debugfs_path)
        processes_py -->> Processes: total_kb, rows
        Processes ->> Processes: map rows via get_process_info(...)
    end
    Processes ->> ProcessService: total, table (filtered nonempty)
    deactivate Processes
    ProcessService ->> User: render per-process GPU memory table
```

Sequence diagram for Thor NVML VRAM usage caching for GUI and memory service

```mermaid
sequenceDiagram
    participant GUI as pgpu_thor_py
    participant MemoryService as memory_service
    participant ThorGPU as thor_gpu_py
    participant NVML as pynvml

    rect rgb(230,230,230)
        GUI ->> ThorGPU: read_gpu_mem_rows_for_gui(device_index)
        MemoryService ->> ThorGPU: read_gpu_mem_rows_for_gui(device_index)
    end

    activate ThorGPU
    ThorGPU ->> ThorGPU: nvml_gpu_used_kb()
    alt cache hit (< 1s TTL)
        ThorGPU ->> ThorGPU: return cached kb from _GPU_USED_CACHE
    else cache miss
        ThorGPU ->> ThorGPU: nvml_process_table()
        ThorGPU ->> NVML: nvmlInit(), nvmlDeviceGetCount(), process queries
        ThorGPU ->> ThorGPU: sum per-PID usedGpuMemory to total_kb
        ThorGPU ->> ThorGPU: update _GPU_USED_CACHE{ts, kb or None}
    end
    ThorGPU ->> ThorGPU: vram_used_b = used_kb * 1024 or 0
    ThorGPU ->> ThorGPU: vram_total_b = memtotal if NVML else 0
    ThorGPU ->> ThorGPU: shared_used_b = memtotal - memavailable
    ThorGPU ->> GUI: dict{vram_used_b, vram_total_b, shared_used_b, shared_total_b}
    ThorGPU ->> MemoryService: same dict
    deactivate ThorGPU

    GUI ->> GUI: update_chart_ram() computes series and label_mem
    MemoryService ->> MemoryService: use vram_used_b in summaries
```

Flow diagram for NVML-based VRAM data into Thor GPU and memory UIs

```mermaid
flowchart TD
    subgraph NVML_stack
        NVML[pynvml NVML API]
    end

    subgraph Core_Thor_GPU
        A[nvml_process_table
        - compute+graphics processes
        - deduplicate by PID
        - sum usedGpuMemory]
        B[nvml_gpu_used_kb
        - 1s TTL cache of total_kb]
        C[read_gpu_mem_rows_for_gui
        - vram_used_b
        - vram_total_b
        - shared_used_b
        - shared_total_b]
    end

    subgraph Core_Memory
        D[memory_get_status
        - reads MemInfo
        - computes RAM stats
        - sets shared_label VRAM or Shared]
    end

    subgraph Process_Accounting
        E[processes_ProcessService
        get_status
        - uses nvml_process_table on Thor
        - uses read_process_table on Orin]
    end

    subgraph GUI_Thor_GPU_Page
        F[pgpu_thor_update_chart_ram
        - series0 grey: shared_used_b
        - series1 yellow: vram_used_b]
        G[pgpu_thor_draw
        - VRAM label used/total
        - shows power_control
        - shows GPC meter]
    end

    subgraph GUI_4MEM_Page
        H[pmem_draw_ram_legend
        - uses shared_label
        - dynamic spacing]
    end

    NVML --> A
    A --> B
    B --> C
    C --> F
    C --> G

    C --> E
    D --> H

    D -->|shared_label and RAM stats| H
    A -->|total_kb and rows| E

    subgraph GPU_Status_Common
        I[nvml_read_gpu_status
        - augments freq_data
        - adds GPC lane meter]
    end

    NVML --> I
    I --> G
```

File-Level Changes

Change: Add NVML-backed GPU process table and cached VRAM aggregation for Thor and expose it via the Thor GPU status API.
  • Introduce nvml_process_table() to query compute and graphics running processes via pynvml, deduplicating by PID and returning an nvidia-smi–compatible row format and total memory sum.
  • Add nvml_gpu_used_kb() with a 1-second TTL cache to avoid duplicate NVML queries per tick, and have read_gpu_mem_rows_for_gui() surface vram_used_b/vram_total_b from NVML alongside shared RAM usage.
  • Extend Thor NVML status reporting to include a GPC frequency lane so the UI can render a dedicated GPC meter.
Files: jtop/core/thor_gpu.py, jtop/core/gpu.py

Change: Branch process GPU-memory accounting between Thor (NVML) and Orin (nvmap) while reusing the existing ProcessService post-processing.
  • Detect Thor boards in the Processes service via is_thor() and lazily bind nvml_process_table when available.
  • Update get_status() to read uptime once and then choose NVML-based process tables on Thor or nvmap debugfs tables on Orin, mapping both into get_process_info() and filtering out empty entries (see the sketch below).
Files: jtop/core/processes.py

Change: Enhance the Thor GPU page to visualize VRAM vs total system RAM and show power control mode alongside existing GPU metrics.
  • Convert the Thor RAM chart into a dual-series chart (grey for total system RAM in use, yellow for NVML-based GPU VRAM allocations), rescale it against total unified memory, and update colors and label text accordingly.
  • Change the VRAM label to show used/total VRAM without a leading prefix, reposition the frequency meters, and add a textual power-control mode indicator in the Thor GPU view.
Files: jtop/gui/pgpu_thor.py

Change: Label unified Thor GPU memory as VRAM in memory stats and propagate this into the 4MEM page legend with dynamic spacing.
  • Use is_thor() in memory.get_status() to set shared_label to 'VRAM' on Thor and expose it in the RAM dictionary while keeping 'Shared' on other platforms.
  • Update the 4MEM page legend rendering to read shared_label, compute spacing based on label length so the value column stays aligned, and continue to color it as shared/VRAM memory.
Files: jtop/core/memory.py, jtop/gui/pmem.py
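
A schematic sketch of that Thor/Orin branch with shared post-processing. Helper names follow this PR's description, but the free-function form and its parameters are simplifications, not the actual ProcessService code:

```python
# Schematic: callables stand in for ProcessService's bound helpers.
def get_status_sketch(is_thor, nvml_process_table, read_nvmap_table, get_process_info):
    # Read uptime once, via a context manager so the fd is always closed
    with open('/proc/uptime', 'r') as f:
        uptime = float(f.readline().split()[0])

    if is_thor and nvml_process_table is not None:
        total, rows = nvml_process_table()   # Thor: NVML-backed table
    else:
        total, rows = read_nvmap_table()     # Orin: nvmap debugfs table

    # Common post-processing: row -> (pid, gpu_mem_kb, name, uptime),
    # then drop empty entries
    table = [get_process_info(prc[0], prc[3], prc[2], uptime) for prc in rows]
    return total, [entry for entry in table if entry]
```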


@sourcery-ai bot left a comment

Hey - I've found 2 issues

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="jtop/core/thor_gpu.py" line_range="81-90" />
<code_context>
+def nvml_process_table() -> Tuple[int, List]:
</code_context>
<issue_to_address>
**issue (bug_risk):** Broad exception handling in nvml_process_table can hide real NVML issues and makes debugging harder.

Wrapping the entire NVML interaction in a blanket `Exception` handler turns any unexpected error (driver misconfig, transient NVML failure, coding bug) into `(0, [])`, which hides real issues and can produce misleading always-zero VRAM data. Please narrow the try/except to the specific NVML calls that can fail and/or log the exception so operational problems are visible while still allowing the app to continue.
</issue_to_address>

### Comment 2
<location path="jtop/core/processes.py" line_range="145" />
<code_context>
-            # Use the memory table to measure
-            total, table = read_process_table(self._root_path + "/debug/nvmap/iovmm/maps")
-
-            uptime = float(open('/proc/uptime', 'r').readline().split()[0])
-
-            table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The /proc/uptime file is opened without a context manager, which can lead to small resource leaks over time.

Now that this path runs unconditionally and may execute often, unclosed file descriptors can accumulate in long‑running processes. Please wrap this in a context manager, e.g. `with open('/proc/uptime') as f:`, to ensure the file is always closed and the intent is clear.

```suggestion
        with open('/proc/uptime', 'r') as f:
            uptime = float(f.readline().split()[0])
```
</issue_to_address>


Comment thread jtop/core/thor_gpu.py
Comment thread jtop/core/processes.py Outdated
  Comment 1 — nvml_process_table(): the blanket except Exception is gone. Init/count failures now catch NVMLError specifically and log at debug level before returning (0, []); the per-device handle call does the same, and per-getter and per-process-name lookups also narrow to NVMLError. A real coding bug (e.g. a TypeError or AttributeError) will now propagate visibly instead of silently returning zeros (see the sketch below).

  Comment 2 — /proc/uptime: wrapped with with open(...) as f: — the file descriptor is now guaranteed closed on every exit path, including exceptions.
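
A sketch of the narrowed handling described in this reply, assuming pynvml's NVMLError exception class; the structure is illustrative, not the PR's exact diff:

```python
import logging
import pynvml

logger = logging.getLogger(__name__)

def nvml_process_table():
    """Tolerate NVML failures; let genuine coding bugs propagate."""
    try:
        pynvml.nvmlInit()
        count = pynvml.nvmlDeviceGetCount()
    except pynvml.NVMLError as err:
        logger.debug("NVML init/count failed: %s", err)
        return 0, []
    total_kb, rows = 0, []
    for idx in range(count):
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
        except pynvml.NVMLError as err:
            logger.debug("device %d handle failed: %s", idx, err)
            continue  # skip this device, keep the rest
        try:
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        except pynvml.NVMLError as err:
            logger.debug("device %d process query failed: %s", idx, err)
            procs = []
        for p in procs:
            kb = (p.usedGpuMemory or 0) // 1024
            total_kb += kb
            rows.append([str(p.pid), kb])
    return total_kb, rows
```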
@whitesscott (Author) commented:

@sourcery-ai review

@sourcery-ai bot left a comment

Hey - I've found 1 issue

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="jtop/gui/pmem.py" line_range="266-267" />
<code_context>
         plot_name_info(self.stdscr, pos_y + 1, pos_x + 2, 'Used', used, spacing=3, color=NColors.cyan())
         shared = size_to_string(self.jetson.memory['RAM']['shared'], 'k')
-        plot_name_info(self.stdscr, pos_y + 2, pos_x + 2, 'Shared', shared, spacing=1, color=NColors.green())
+        shared_label = self.jetson.memory['RAM'].get('shared_label', 'Shared')
+        shared_spacing = max(0, 7 - len(shared_label))
+        plot_name_info(self.stdscr, pos_y + 2, pos_x + 2, shared_label, shared, spacing=shared_spacing, color=NColors.green())
         buffers = size_to_string(self.jetson.memory['RAM']['buffers'], 'k')
</code_context>
<issue_to_address>
**nitpick:** Avoid zero spacing in the memory legend to keep layout readable with longer labels.

`shared_spacing = max(0, 7 - len(shared_label))` allows `spacing=0` for longer labels, which can visually run the value into the label. If `plot_name_info` relies on at least one separating space, consider clamping to `max(1, 7 - len(shared_label))` (or another minimum) to preserve readability for long labels.
</issue_to_address>


Comment thread jtop/gui/pmem.py
Comment on lines +266 to +267

```python
shared_label = self.jetson.memory['RAM'].get('shared_label', 'Shared')
shared_spacing = max(0, 7 - len(shared_label))
```
