feat: Thor GPU process table and VRAM chart via pynvml #852

whitesscott wants to merge 2 commits into rbonghi:master
Conversation
Add per-process GPU memory tracking and VRAM visualization for Jetson Thor (nvidia.ko stack), matching the intent of upstream PR rbonghi#849 without using nvidia-smi.

- processes.py: detect Thor via is_thor(); use pynvml NVML to query compute+graphics running processes, deduplicating by PID with max() to avoid double-counting processes that appear in both lists (see the sketch below this list)
- thor_gpu.py: add nvml_process_table() with lazy nvmlInit() (fork-safe: jtop forks a monitoring subprocess; NVML state does not survive fork, so init must happen in the child); add nvml_gpu_used_kb() with a 1-second TTL cache; update read_gpu_mem_rows_for_gui() to populate vram_used_b from the NVML process sum
- pgpu_thor.py (2GPU page): the VRAM chart now plots two series — grey for total system RAM in use (context) and yellow for GPU VRAM (process allocations); the label shows used/total without a redundant prefix
- memory.py: add shared_label='VRAM' on Thor (unified memory, no NvMapMemUsed) vs 'Shared' on Orin; expose it in the RAM dict
- pmem.py (4MEM page): read shared_label from the memory dict; compute spacing dynamically so the value column stays flush-aligned regardless of label length
- Implement the GPC0 meter on 2GPU for Thor.

Tested on:
- Jetson AGX Thor Developer Kit (L4T R38.4)
- Jetson AGX Orin Developer Kit 32GB (L4T R36.5)
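As a rough illustration of the dedup step referenced above (assuming pynvml is installed and NVML is already initialized; merged_process_memory is a hypothetical helper, not the PR's code):

```python
# Hypothetical helper, not the PR's exact code: merge NVML's compute and
# graphics process lists, keeping the larger usedGpuMemory per PID so a
# process that appears in both lists is only counted once.
from pynvml import (nvmlDeviceGetComputeRunningProcesses,
                    nvmlDeviceGetGraphicsRunningProcesses)


def merged_process_memory(handle):
    usage = {}
    procs = (nvmlDeviceGetComputeRunningProcesses(handle) +
             nvmlDeviceGetGraphicsRunningProcesses(handle))
    for proc in procs:
        mem = proc.usedGpuMemory or 0  # NVML reports None when unsupported
        usage[proc.pid] = max(usage.get(proc.pid, 0), mem)
    return usage  # {pid: bytes}
```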
Reviewer's Guide

Implements NVML-based per-process GPU memory tracking and VRAM visualization for Thor (nvidia.ko stack), wiring the new data path into process accounting, GPU status, and GUI pages while preserving existing Orin/nvmap behavior.

Sequence diagram for Thor NVML-backed GPU process table in get_status

```mermaid
sequenceDiagram
actor User
participant ProcessService
participant Processes as processes_py
participant ThorGPU as thor_gpu_py
participant NVML as pynvml
User ->> ProcessService: request GPU process list
ProcessService ->> Processes: get_status()
activate Processes
Processes ->> Processes: check _isThor and _nvml_process_table
alt Thor board (nvidia_ko stack)
Processes ->> ThorGPU: nvml_process_table()
activate ThorGPU
ThorGPU ->> NVML: nvmlInit()
ThorGPU ->> NVML: nvmlDeviceGetCount()
loop for each GPU device
ThorGPU ->> NVML: nvmlDeviceGetHandleByIndex(idx)
ThorGPU ->> NVML: nvmlDeviceGetComputeRunningProcesses(handle)
ThorGPU ->> NVML: nvmlDeviceGetGraphicsRunningProcesses(handle)
ThorGPU ->> ThorGPU: deduplicate by PID and max(usedGpuMemory)
ThorGPU ->> NVML: nvmlSystemGetProcessName(pid)
end
ThorGPU ->> Processes: total_kb, rows [pid_str, user, name, gpu_mem_kb]
deactivate ThorGPU
Processes ->> Processes: map rows via get_process_info(pid, gpu_mem_kb, name, uptime)
else Orin or other nvgpu stack
Processes ->> processes_py: read_process_table(nvmap_debugfs_path)
processes_py -->> Processes: total_kb, rows
Processes ->> Processes: map rows via get_process_info(...)
end
Processes ->> ProcessService: total, table (filtered nonempty)
deactivate Processes
ProcessService ->> User: render per-process GPU memory table
```
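The nvmlInit() call at the top of this diagram is deliberately lazy: jtop forks its monitoring subprocess, and NVML state does not survive fork(). A minimal sketch of that pattern (_ensure_nvml is an illustrative name, not the PR's code):

```python
import logging

from pynvml import NVMLError, nvmlInit

_nvml_ready = False


def _ensure_nvml():
    """Initialize NVML lazily in the calling process, so init happens in
    the forked monitoring child rather than at import time."""
    global _nvml_ready
    if not _nvml_ready:
        try:
            nvmlInit()
            _nvml_ready = True
        except NVMLError as error:
            logging.debug("NVML init failed: %s", error)
    return _nvml_ready
```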
Sequence diagram for Thor NVML VRAM usage caching for GUI and memory service

```mermaid
sequenceDiagram
participant GUI as pgpu_thor_py
participant MemoryService as memory_service
participant ThorGPU as thor_gpu_py
participant NVML as pynvml
rect rgb(230,230,230)
GUI ->> ThorGPU: read_gpu_mem_rows_for_gui(device_index)
MemoryService ->> ThorGPU: read_gpu_mem_rows_for_gui(device_index)
end
activate ThorGPU
ThorGPU ->> ThorGPU: nvml_gpu_used_kb()
alt cache hit (< 1s TTL)
ThorGPU ->> ThorGPU: return cached kb from _GPU_USED_CACHE
else cache miss
ThorGPU ->> ThorGPU: nvml_process_table()
ThorGPU ->> NVML: nvmlInit(), nvmlDeviceGetCount(), process queries
ThorGPU ->> ThorGPU: sum per-PID usedGpuMemory to total_kb
ThorGPU ->> ThorGPU: update _GPU_USED_CACHE{ts, kb or None}
end
ThorGPU ->> ThorGPU: vram_used_b = used_kb * 1024 or 0
ThorGPU ->> ThorGPU: vram_total_b = memtotal if NVML else 0
ThorGPU ->> ThorGPU: shared_used_b = memtotal - memavailable
ThorGPU ->> GUI: dict{vram_used_b, vram_total_b, shared_used_b, shared_total_b}
ThorGPU ->> MemoryService: same dict
deactivate ThorGPU
GUI ->> GUI: update_chart_ram() computes series and label_mem
MemoryService ->> MemoryService: use vram_used_b in summaries
```
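The cache branch above can be sketched as follows; _GPU_USED_CACHE matches the name shown in the diagram, while query_total_kb stands in for the expensive NVML process-table sum:

```python
import time

_GPU_USED_CACHE = {'ts': 0.0, 'kb': None}
_TTL_S = 1.0


def nvml_gpu_used_kb(query_total_kb):
    """Return the summed VRAM usage in kB, re-running the NVML query at
    most once per second; 'kb' stays None while NVML is unavailable."""
    now = time.monotonic()
    if now - _GPU_USED_CACHE['ts'] >= _TTL_S:
        _GPU_USED_CACHE['kb'] = query_total_kb()  # may return None
        _GPU_USED_CACHE['ts'] = now
    return _GPU_USED_CACHE['kb']
```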
Flow diagram for NVML-based VRAM data into Thor GPU and memory UIs

```mermaid
flowchart TD
subgraph NVML_stack
NVML[pynvml NVML API]
end
subgraph Core_Thor_GPU
A[nvml_process_table
- compute+graphics processes
- deduplicate by PID
- sum usedGpuMemory]
B[nvml_gpu_used_kb
- 1s TTL cache of total_kb]
C[read_gpu_mem_rows_for_gui
- vram_used_b
- vram_total_b
- shared_used_b
- shared_total_b]
end
subgraph Core_Memory
D[memory_get_status
- reads MemInfo
- computes RAM stats
- sets shared_label VRAM or Shared]
end
subgraph Process_Accounting
E[processes_ProcessService
get_status
- uses nvml_process_table on Thor
- uses read_process_table on Orin]
end
subgraph GUI_Thor_GPU_Page
F[pgpu_thor_update_chart_ram
- series0 grey: shared_used_b
- series1 yellow: vram_used_b]
G[pgpu_thor_draw
- VRAM label used/total
- shows power_control
- shows GPC meter]
end
subgraph GUI_4MEM_Page
H[pmem_draw_ram_legend
- uses shared_label
- dynamic spacing]
end
NVML --> A
A --> B
B --> C
C --> F
C --> G
C --> E
D --> H
D -->|shared_label and RAM stats| H
A -->|total_kb and rows| E
subgraph GPU_Status_Common
I[nvml_read_gpu_status
- augments freq_data
- adds GPC lane meter]
end
NVML --> I
I --> G
```
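Putting the flow together, the dict handed to the GUI and memory service can be assembled roughly as below; the /proc/meminfo parsing and byte conversions are assumptions for illustration, not the module's exact code:

```python
def read_gpu_mem_rows_for_gui(used_kb):
    """Illustrative assembly of the dict consumed by the 2GPU page and the
    memory service (all values in bytes)."""
    meminfo = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            meminfo[key] = int(value.split()[0])  # values reported in kB
    memtotal, memavailable = meminfo['MemTotal'], meminfo['MemAvailable']
    return {
        'vram_used_b': (used_kb or 0) * 1024,
        'vram_total_b': memtotal * 1024 if used_kb is not None else 0,
        'shared_used_b': (memtotal - memavailable) * 1024,
        'shared_total_b': memtotal * 1024,
    }
```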
File-Level Changes
Hey - I've found 2 issues
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="jtop/core/thor_gpu.py" line_range="81-90" />
<code_context>
+def nvml_process_table() -> Tuple[int, List]:
</code_context>
<issue_to_address>
**issue (bug_risk):** Broad exception handling in nvml_process_table can hide real NVML issues and makes debugging harder.
Wrapping the entire NVML interaction in a blanket `Exception` handler turns any unexpected error (driver misconfig, transient NVML failure, coding bug) into `(0, [])`, which hides real issues and can produce misleading always-zero VRAM data. Please narrow the try/except to the specific NVML calls that can fail and/or log the exception so operational problems are visible while still allowing the app to continue.
</issue_to_address>
### Comment 2
<location path="jtop/core/processes.py" line_range="145" />
<code_context>
- # Use the memory table to measure
- total, table = read_process_table(self._root_path + "/debug/nvmap/iovmm/maps")
-
- uptime = float(open('/proc/uptime', 'r').readline().split()[0])
-
- table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The /proc/uptime file is opened without a context manager, which can lead to small resource leaks over time.
Now that this path runs unconditionally and may execute often, unclosed file descriptors can accumulate in long‑running processes. Please wrap this in a context manager, e.g. `with open('/proc/uptime') as f:`, to ensure the file is always closed and the intent is clear.
```suggestion
with open('/proc/uptime', 'r') as f:
uptime = float(f.readline().split()[0])
```
</issue_to_address>
Comment 1 — nvml_process_table(): The blanket except Exception is gone. Init/count failures catch NVMLError specifically and log at debug level before returning (0, []). The per-device handle call does the same, and the per-getter and per-process-name lookups also narrow to NVMLError. A real coding bug (e.g. a TypeError or AttributeError) will now propagate visibly instead of silently returning zeros.

Comment 2 — /proc/uptime: Wrapped with with open(...) as f: — the file descriptor is now guaranteed closed on every exit path, including exceptions.
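A condensed sketch of the narrowed handling (illustrative; the graphics-process and process-name lookups are omitted for brevity):

```python
import logging

from pynvml import (NVMLError, nvmlDeviceGetComputeRunningProcesses,
                    nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex, nvmlInit)


def nvml_process_table():
    try:
        nvmlInit()
        count = nvmlDeviceGetCount()
    except NVMLError as error:  # driver/NVML problems degrade to empty data
        logging.debug("NVML unavailable: %s", error)
        return 0, []
    total_kb, rows = 0, []
    for index in range(count):
        try:
            handle = nvmlDeviceGetHandleByIndex(index)
            procs = nvmlDeviceGetComputeRunningProcesses(handle)
        except NVMLError as error:
            logging.debug("NVML query failed on GPU %d: %s", index, error)
            continue
        for proc in procs:
            kb = (proc.usedGpuMemory or 0) // 1024
            total_kb += kb
            rows.append([proc.pid, kb])
    # TypeError, AttributeError, etc. now propagate instead of being
    # swallowed and turned into (0, []).
    return total_kb, rows
```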
@sourcery-ai review
Hey - I've found 1 issue
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location path="jtop/gui/pmem.py" line_range="266-267" />
<code_context>
plot_name_info(self.stdscr, pos_y + 1, pos_x + 2, 'Used', used, spacing=3, color=NColors.cyan())
shared = size_to_string(self.jetson.memory['RAM']['shared'], 'k')
- plot_name_info(self.stdscr, pos_y + 2, pos_x + 2, 'Shared', shared, spacing=1, color=NColors.green())
+ shared_label = self.jetson.memory['RAM'].get('shared_label', 'Shared')
+ shared_spacing = max(0, 7 - len(shared_label))
+ plot_name_info(self.stdscr, pos_y + 2, pos_x + 2, shared_label, shared, spacing=shared_spacing, color=NColors.green())
buffers = size_to_string(self.jetson.memory['RAM']['buffers'], 'k')
</code_context>
<issue_to_address>
**nitpick:** Avoid zero spacing in the memory legend to keep layout readable with longer labels.
`shared_spacing = max(0, 7 - len(shared_label))` allows `spacing=0` for longer labels, which can visually run the value into the label. If `plot_name_info` relies on at least one separating space, consider clamping to `max(1, 7 - len(shared_label))` (or another minimum) to preserve readability for long labels.
</issue_to_address>
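A self-contained illustration of the suggested clamp (legend_spacing is a hypothetical helper; the column width 7 matches the review's example):

```python
def legend_spacing(label, column=7, minimum=1):
    """Align the value column for short labels while guaranteeing at least
    one separating space for long ones."""
    return max(minimum, column - len(label))


assert legend_spacing('VRAM') == 3       # 7 - 4
assert legend_spacing('Shared') == 1     # 7 - 6
assert legend_spacing('SharedMem') == 1  # clamped, never 0 or negative
```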
Add per-process GPU memory tracking and VRAM visualization using pynvml for Jetson Thor (nvidia.ko stack), matching the intent of upstream PR #849 without using nvidia-smi.
Tested on:
Jetson AGX Thor Developer Kit (L4T R38.4)
Jetson AGX Orin Developer Kit 32GB (L4T R36.5)
Summary by Sourcery
Add NVML-based per-process GPU memory tracking and VRAM visualization for Thor boards, aligning process tables, memory stats, and GPU UI components with the nvidia.ko stack.