@@ -0,0 +1,11 @@
---
config:
  theme: base
  themeVariables:
    primaryColor: "#9f62eb"
---
xychart-beta
title "NCCL all_reduce_perf — Avg Bus Bandwidth (GB/s)"
x-axis ["1nic-unaligned (cross-NUMA)", "1nic-aligned (same NUMA)", "2nic-aligned (same NUMA)"]
y-axis "Avg busbw (GB/s)" 0 --> 120
bar [25, 56, 112]
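
To put the three bars in proportion, here is a minimal Python sketch that computes each configuration's speedup over the cross-NUMA baseline and a rough link utilization. The bandwidth figures are the bar values from the chart. The 100 GB/s per-NIC figure is an assumption derived from the 800 Gb/s NIC label in the node topology diagram of this change, and comparing `busbw` against nominal line rate is only an approximation.

```python
# Average bus bandwidth (GB/s), copied from the chart's bar values.
busbw = {
    "1nic-unaligned": 25,
    "1nic-aligned": 56,
    "2nic-aligned": 112,
}

# NIC count per configuration, taken from the configuration names.
nics = {
    "1nic-unaligned": 1,
    "1nic-aligned": 1,
    "2nic-aligned": 2,
}

# Assumption: each NIC is an 800 Gb/s link, i.e. 100 GB/s nominal.
NIC_GB_PER_S = 800 / 8


def speedup(cfg, baseline="1nic-unaligned"):
    """Ratio of a configuration's busbw to the baseline's busbw."""
    return busbw[cfg] / busbw[baseline]


def link_utilization(cfg):
    """Fraction of the assumed nominal NIC bandwidth that busbw reaches."""
    return busbw[cfg] / (nics[cfg] * NIC_GB_PER_S)


for cfg in busbw:
    print(
        f"{cfg}: {speedup(cfg):.2f}x baseline, "
        f"{link_utilization(cfg):.0%} of nominal NIC rate"
    )
```

Under these assumptions, NUMA alignment alone accounts for a 2.24x gain on one NIC, and adding a second aligned NIC doubles throughput again while per-NIC utilization stays the same.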
@@ -0,0 +1,37 @@
flowchart TB
subgraph User["User / Workload Author"]
RCT["ResourceClaimTemplate<br/>(CEL selectors:<br/>NUMA match)"]
PodSpec["Pod Spec<br/>(resourceClaims reference)"]
end

subgraph CP["Kubernetes Control Plane"]
API["API Server"]
Sched["Scheduler<br/>(DRA-aware)"]
RS_GPU["ResourceSlice<br/>(gpu.nvidia.com)<br/>pciBusID, NUMA, pcieRoot"]
RS_NIC["ResourceSlice<br/>(dra.net)<br/>rdmaDevice, NUMA, pciAddress"]
end

subgraph Node["AKS Node (ND GB300-v6)"]
NVDRV["NVIDIA GPU DRA Driver<br/>(DaemonSet)"]
DRANETDRV["DRANET DRA Driver<br/>(DaemonSet)"]
end

%% User submits workload
PodSpec -->|"Submit pod with<br/>resource claims"| API
RCT -->|"Define GPU+NIC<br/>co-location constraints"| API

%% Drivers publish device topology
NVDRV -->|"Discover GPUs &<br/>publish topology"| RS_GPU
DRANETDRV -->|"Discover NICs &<br/>publish topology"| RS_NIC

%% Scheduler uses slices to allocate
RS_GPU --> Sched
RS_NIC --> Sched
API -->|"Pending pod"| Sched
Sched -->|"Evaluate CEL selectors<br/>& allocate NUMA-aligned<br/>GPU + NIC"| API
API -->|"Bind pod to node<br/>with allocation result"| Node

%% Styling
style User fill:#fef7e0,stroke:#fbbc04
style CP fill:#e8f0fe,stroke:#4285f4
style Node fill:#f3e8fd,stroke:#9f62eb
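
The allocation step in this diagram (the scheduler matching GPU and NIC devices by NUMA node) can be illustrated with a small Python model. This is a simplified stand-in for the idea, not the scheduler's CEL evaluation or the DRA API. The `Device` type, its fields, and the `numa_aligned_allocations` helper are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Device:
    """Hypothetical, flattened view of one device published in a ResourceSlice."""
    name: str
    kind: str  # "gpu" or "nic"
    numa: int


def numa_aligned_allocations(devices, gpus=1, nics=1):
    """Yield (gpu_tuple, nic_tuple) pairs whose devices share one NUMA node."""
    for node in sorted({d.numa for d in devices}):
        node_gpus = [d for d in devices if d.kind == "gpu" and d.numa == node]
        node_nics = [d for d in devices if d.kind == "nic" and d.numa == node]
        for gpu_set in combinations(node_gpus, gpus):
            for nic_set in combinations(node_nics, nics):
                yield gpu_set, nic_set


# Illustrative inventory: two NUMA nodes, two GPUs and two NICs each.
inventory = [
    Device("gpu0", "gpu", 0), Device("gpu1", "gpu", 0),
    Device("gpu2", "gpu", 1), Device("gpu3", "gpu", 1),
    Device("mlx5_0", "nic", 0), Device("mlx5_1", "nic", 0),
    Device("mlx5_2", "nic", 1), Device("mlx5_3", "nic", 1),
]

# First candidate for a claim asking for one GPU and two NICs.
gpu_set, nic_set = next(numa_aligned_allocations(inventory, gpus=1, nics=2))
print([d.name for d in gpu_set], [d.name for d in nic_set])
```

Grouping candidates by NUMA node first means a cross-NUMA pairing is never generated, which is the property the claim template's selectors are meant to enforce.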
@@ -0,0 +1,46 @@
flowchart TB
Kubelet["Kubelet"]
CRI["containerd"]
NRI["NRI Plugin<br/>(DRANET)"]

subgraph NUMA0["NUMA Node 0"]
GPU0["GPU 0<br/>NVIDIA GB300"]
GPU1["GPU 1<br/>NVIDIA GB300"]
NIC0["NIC 0 · mlx5_0<br/>800 Gb/s IB"]
NIC1["NIC 1 · mlx5_1<br/>800 Gb/s IB"]
end

subgraph NUMA1["NUMA Node 1"]
GPU2["GPU 2<br/>NVIDIA GB300"]
GPU3["GPU 3<br/>NVIDIA GB300"]
NIC2["NIC 2 · mlx5_2<br/>800 Gb/s IB"]
NIC3["NIC 3 · mlx5_3<br/>800 Gb/s IB"]
end

subgraph Pod["Scheduled Pod"]
Container["Container<br/>/dev/infiniband/uverbs0<br/>/dev/infiniband/uverbs1"]
end

%% Runtime flow
Kubelet -->|"1. Receive allocation<br/>result from API Server"| CRI
CRI -->|"2. NRI CreateContainer<br/>event"| NRI
NRI -->|"3. Inject only allocated<br/>/dev/infiniband/* devices"| Pod

%% NUMA-aligned GDR paths
GPU0 <-.->|"PCIe · GDR ✓"| NIC0
GPU1 <-.->|"PCIe · GDR ✓"| NIC1
GPU2 <-.->|"PCIe · GDR ✓"| NIC2
GPU3 <-.->|"PCIe · GDR ✓"| NIC3

%% Cross-NUMA penalty
GPU0 <-.->|"Cross-NUMA interconnect · No GDR ✗"| NIC2

%% Pod uses aligned devices
Container -.->|"NCCL uses<br/>GPU 0 + mlx5_0, mlx5_1"| GPU0
Container -.->|"RDMA traffic"| NIC0
Container -.->|"RDMA traffic"| NIC1

%% Styling
style NUMA0 fill:#e6f4ea,stroke:#34a853
style NUMA1 fill:#fce8e6,stroke:#ea4335
style Pod fill:#fef7e0,stroke:#fbbc04
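
Step 3 of the runtime flow (inject only the allocated `/dev/infiniband/*` devices) amounts to a filter over the node's device files. The Python sketch below illustrates that filter only; it is not DRANET's implementation. Both helper names are hypothetical, and the `mlx5_N` to `uverbsN` index mapping is an assumption taken from the diagram. On a real node the mapping would be read from sysfs (`/sys/class/infiniband_verbs/uverbsN/ibdev`) rather than inferred from names.

```python
import re


def uverbs_paths(allocated_rdma_devices):
    """Map RDMA device names to uverbs paths (assumes mlx5_N <-> uverbsN)."""
    paths = []
    for dev in allocated_rdma_devices:
        match = re.fullmatch(r"mlx5_(\d+)", dev)
        if match is None:
            raise ValueError(f"unexpected RDMA device name: {dev!r}")
        paths.append(f"/dev/infiniband/uverbs{match.group(1)}")
    return paths


def devices_to_inject(node_device_paths, allocated_rdma_devices):
    """Keep only the node device paths that belong to the allocation."""
    allowed = set(uverbs_paths(allocated_rdma_devices))
    return [path for path in node_device_paths if path in allowed]


# All four uverbs devices exist on the node; the pod was allocated two NICs.
node_paths = [f"/dev/infiniband/uverbs{i}" for i in range(4)]
injected = devices_to_inject(node_paths, ["mlx5_0", "mlx5_1"])
print(injected)
```

With the allocation from the diagram (`mlx5_0` and `mlx5_1`), the container ends up with `uverbs0` and `uverbs1` only, matching the device list shown in the Scheduled Pod box.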