Add dranet blog by anson627 · Pull Request #5697 · Azure/AKS

anson627 · 2026-04-02T19:37:24Z

No description provided.

Copilot

Pull request overview

Adds a new AKS website blog post explaining how to use DRANET + Dynamic Resource Allocation (DRA) for NUMA-aware GPU/NIC alignment (RDMA performance) and includes accompanying control-plane/data-plane diagrams (Mermaid sources + exported SVGs).

Changes:

- New blog post: RDMA/NUMA scheduling problem statement, DRANET architecture, DRA ResourceClaimTemplate examples, and NCCL benchmark walkthrough/results.
- Added control-plane and data-plane diagrams as both .mmd sources and rendered .svg assets.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md | New blog post content + configuration examples + benchmark walkthrough |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/control-plane-diagram.mmd | Mermaid source for control-plane diagram |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/control-plane-diagram.svg | Rendered control-plane diagram used by the post |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/data-plane-diagram.mmd | Mermaid source for data-plane diagram |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/data-plane-diagram.svg | Rendered data-plane diagram used by the post |

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

Copilot · 2026-04-02T19:41:38Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+
+Large-scale AI training and inferencing on Kubernetes depends on high-throughput, low-latency GPU-to-GPU communication. [DRANET](https://github.com/kubernetes-sigs/dranet) is an open-source DRA network driver that discovers RDMA capable devices, exposes their topology as Kubernetes DRA attributes, and injects only desired devices into each container. Combined with the [NVIDIA GPU DRA driver](https://github.com/kubernetes-purgatory/nvidia-dra-driver-gpu), it enables topology-aware co-scheduling of GPUs and NICs to deliver high-performance networking for demanding applications in Kubernetes.
+
+In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.


Grammar: "In previous post" reads like a missing article. Consider changing to "In a previous post, we covered..." (or similar) for correct English.

Suggested change

-In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.
+In a previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.

Copilot · 2026-04-02T19:41:39Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+
+Large-scale AI training and inferencing on Kubernetes depends on high-throughput, low-latency GPU-to-GPU communication. [DRANET](https://github.com/kubernetes-sigs/dranet) is an open-source DRA network driver that discovers RDMA capable devices, exposes their topology as Kubernetes DRA attributes, and injects only desired devices into each container. Combined with the [NVIDIA GPU DRA driver](https://github.com/kubernetes-purgatory/nvidia-dra-driver-gpu), it enables topology-aware co-scheduling of GPUs and NICs to deliver high-performance networking for demanding applications in Kubernetes.
+
+In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.


The Microsoft Learn URL uses a locale-specific path (/en-us/). Repo blog guidance recommends using locale-agnostic Learn links (no /en-us/) to avoid unnecessary redirects and keep links consistent.

Suggested change

-In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.
+In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.

Copilot · 2026-04-02T19:41:39Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+| Resource | Count | Detail |
+|---|---|---|
+| GPU | 4x NVIDIA GB300 | 288 GB HBM3E each, NVLink-18 all-to-all |
+| NIC | 4x Mellanox ConnectX | 800 Gb/s InfiniBand each |
+| NUMA nodes | 2 | 2 GPUs + 2 NICs per NUMA node |


Several Markdown tables start with || (double leading pipe), which renders as an extra empty first column in CommonMark/Docusaurus tables. Removing the extra leading | (use | ... | ... |) will make the tables render as intended.

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

Copilot

Pull request overview

Copilot reviewed 4 out of 7 changed files in this pull request and generated 4 comments.

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

Copilot · 2026-04-02T20:55:08Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+      - name: nic
+        exactly:
+          deviceClassName: dranet.net
+          count: 1
+          selectors:
+          - cel:
+              expression: >-
+                device.attributes["dra.net"]["rdmaDevice"] == "mlx5_2"
+```


Same inconsistency here as in the earlier templates: deviceClassName: dranet.net should match the driver/DeviceClass identifier used elsewhere in the post (e.g., the dra.net driver/attribute namespace shown above).
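
For context, the fragment under review sits inside a ResourceClaimTemplate. A minimal sketch of the whole object follows, assuming the `resource.k8s.io/v1` API available in Kubernetes 1.34. The template name and the unfiltered GPU request are hypothetical; the NIC request is copied from the diff, including the `dranet.net` DeviceClass name that the comment questions.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-nic-pinned            # hypothetical name, not from the post
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          count: 1                # hypothetical: no GPU selector shown here
      - name: nic
        exactly:
          deviceClassName: dranet.net   # as written in the diff under review
          count: 1
          selectors:
          - cel:
              expression: >-
                device.attributes["dra.net"]["rdmaDevice"] == "mlx5_2"
```

A pod would reference such a template through `spec.resourceClaims[].resourceClaimTemplateName`, and each container that needs the devices would list the claim under `resources.claims`.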

Copilot

Pull request overview

Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 8 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-03T17:41:16Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+      dra.net/numaNode:
+        int: 0
+      dra.net/pciAddress:
+        string: "0101:00:00.0"
+      dra.net/rdma:
+        bool: true
+      dra.net/rdmaDevice:
+        string: mlx5_0
+      dra.net/pciVendor:
+        string: Mellanox Technologies


The ResourceSlice example defines attribute keys like dra.net/numaNode / dra.net/rdmaDevice with typed values (int, string, bool), but the CEL selectors later access them as device.attributes["dra.net"]["numaNode"] / ... ["rdmaDevice"] and compare directly to primitives. Please make the attribute schema and the selector syntax consistent (either update the ResourceSlice example to match the selector structure, or update selectors to reference the exact attribute keys/types shown in the ResourceSlice example).

Suggested change

-      dra.net/numaNode:
-        int: 0
-      dra.net/pciAddress:
-        string: "0101:00:00.0"
-      dra.net/rdma:
-        bool: true
-      dra.net/rdmaDevice:
-        string: mlx5_0
-      dra.net/pciVendor:
-        string: Mellanox Technologies
+      dra.net:
+        numaNode:
+          int: 0
+        pciAddress:
+          string: "0101:00:00.0"
+        rdma:
+          bool: true
+        rdmaDevice:
+          string: mlx5_0
+        pciVendor:
+          string: Mellanox Technologies

Copilot · 2026-04-03T17:41:16Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+              expression: >-
+                device.attributes["dra.net"]["rdmaDevice"] == "mlx5_0"
+```
+


The CEL selector examples appear to assume attributes are nested under device.attributes["dra.net"] and directly comparable (e.g., == 0 / == true). If the published attributes follow the dra.net/<key>: {int|bool|string: ...} pattern shown earlier, these selectors won’t match as written. Please update the selector examples to the correct attribute access pattern so readers can copy/paste them successfully.
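
One way to check the two notations against each other is sketched below. It assumes upstream Kubernetes DRA semantics apply to DRANET: a ResourceSlice keys each attribute by a qualified name (`<domain>/<name>`) with a typed value, and CEL exposes the same attributes grouped by domain, with the typed wrapper removed. If that assumption holds, the flat keys and the nested lookups describe the same data. The attribute values are taken from the ResourceSlice excerpt in this thread; the combined expression is illustrative only.

```yaml
# Fragment of a ResourceSlice device entry (keys as in the excerpt above).
attributes:
  dra.net/numaNode:
    int: 0
  dra.net/rdmaDevice:
    string: mlx5_0
---
# Fragment of a request selector.
# Assumption: device.attributes["dra.net"] evaluates to an object holding
# every attribute whose qualified name starts with "dra.net/", and each
# field already has its primitive value (0, "mlx5_0").
selectors:
- cel:
    expression: >-
      device.attributes["dra.net"].numaNode == 0 &&
      device.attributes["dra.net"].rdmaDevice == "mlx5_0"
```

Under the same assumption, the index form used in the post, `device.attributes["dra.net"]["rdmaDevice"]`, is another spelling of the same lookup.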

Copilot

Pull request overview

Copilot reviewed 5 out of 8 changed files in this pull request and generated 3 comments.

Copilot · 2026-04-08T02:47:25Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+| Resource | Count | Detail |
+|---|---|---|
+| GPU | 4x NVIDIA GB300 | 288 GB HBM3E each, NVLink-18 all-to-all |
+| NIC | 4x Mellanox ConnectX | 800 Gb/s InfiniBand each |
+| NUMA nodes | 2 | 2 GPUs + 2 NICs per NUMA node |


The hardware-topology table is written with a leading || on each row (for example || Resource | Count | Detail |), which renders as an extra empty first column in Markdown. Convert these rows to standard table syntax (single leading |, or | | only when you intentionally need a blank header cell) so the table renders as intended.

Copilot · 2026-04-08T02:47:26Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+      dra.net/numaNode:
+        int: 0
+      dra.net/pciAddress:
+        string: "0101:00:00.0"
+      dra.net/rdma:
+        bool: true
+      dra.net/rdmaDevice:
+        string: mlx5_0
+      dra.net/pciVendor:
+        string: Mellanox Technologies


In the ResourceSlice example, attributes are shown with flat keys like dra.net/numaNode and dra.net/pciAddress, but later the CEL selectors access attributes as a nested map (device.attributes["dra.net"]["..."]). These formats are inconsistent; update the examples so the published attributes and selector expressions use the same schema.

Suggested change

-      dra.net/numaNode:
-        int: 0
-      dra.net/pciAddress:
-        string: "0101:00:00.0"
-      dra.net/rdma:
-        bool: true
-      dra.net/rdmaDevice:
-        string: mlx5_0
-      dra.net/pciVendor:
-        string: Mellanox Technologies
+      dra.net:
+        numaNode:
+          int: 0
+        pciAddress:
+          string: "0101:00:00.0"
+        rdma:
+          bool: true
+        rdmaDevice:
+          string: mlx5_0
+        pciVendor:
+          string: Mellanox Technologies

Copilot · 2026-04-08T02:47:26Z

website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md

+
+## ResourceClaimTemplates for topology-aware allocation
+
+With both drivers publishing ResourceSlices, we can write ResourceClaimTemplates that use CEL selectors to express precise GPU-NIC co-location constraints. Each template creates a per-pod ResourceClaim that requests devices from both the `gpu.nvidia.com` and `dranet.net` DeviceClasses, filtered by attributes like NUMA node or PCI address. We define three templates to demonstrate different NUMA placement strategies.


This section says the NIC devices come from the dranet.net DeviceClass, but elsewhere in the post the NIC driver/namespace is dra.net (for example driver: dra.net and dra.net/* attributes). Please clarify the intended identifiers (DeviceClass vs driver name) and make the examples consistent so readers can copy/paste them reliably.

Suggested change

-With both drivers publishing ResourceSlices, we can write ResourceClaimTemplates that use CEL selectors to express precise GPU-NIC co-location constraints. Each template creates a per-pod ResourceClaim that requests devices from both the `gpu.nvidia.com` and `dranet.net` DeviceClasses, filtered by attributes like NUMA node or PCI address. We define three templates to demonstrate different NUMA placement strategies.
+With both drivers publishing ResourceSlices, we can write ResourceClaimTemplates that use CEL selectors to express precise GPU-NIC co-location constraints. Each template creates a per-pod ResourceClaim that requests devices from both the `gpu.nvidia.com` DeviceClass and the DRANET NIC DeviceClass, `dranet.net`, filtered by attributes like NUMA node or PCI address. In these examples, `dranet.net` is the DeviceClass name, while `dra.net` is the DRANET driver name and attribute namespace used in the published device attributes and CEL selectors. We define three templates to demonstrate different NUMA placement strategies.
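
To make the DeviceClass-versus-driver distinction concrete, a DeviceClass that carries the `dranet.net` name while selecting devices published by the `dra.net` driver could look like the sketch below. It assumes the `resource.k8s.io/v1` API; the PR does not show how the cluster used for the post actually defines the class, so the selector is an assumption.

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: dranet.net                 # DeviceClass name referenced by deviceClassName
spec:
  selectors:
  - cel:
      # Assumption: the class admits every device published by the dra.net driver.
      expression: device.driver == "dra.net"
```

With a class defined this way, `deviceClassName: dranet.net` in a request and `device.attributes["dra.net"]` in a selector refer to two different identifiers without conflicting.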

anson627 added 5 commits April 1, 2026 23:04

Add blog on dranet

f7c40bf

add system diagram for control plane and data plane

c3326c2

update blog to remove placement group and reference system diagrams

cb06cdf

consolidate getting started and next steps

6e300bd

update intro section

2ed6e8a

anson627 requested review from a team, AllenWen-at-Azure and Copilot April 2, 2026 19:37

Copilot started reviewing on behalf of anson627 April 2, 2026 19:38

Copilot AI reviewed Apr 2, 2026

anson627 added 3 commits April 2, 2026 20:43

add benchmark results chart

1979650

remove benchmark results table

5731b9a

polish intro

1d2975d

Copilot AI review requested due to automatic review settings April 2, 2026 20:50

Copilot started reviewing on behalf of anson627 April 2, 2026 20:51

Copilot AI reviewed Apr 2, 2026

add authors

183b9a0

Copilot AI review requested due to automatic review settings April 3, 2026 17:24

Copilot started reviewing on behalf of anson627 April 3, 2026 17:25

Copilot AI reviewed Apr 3, 2026

fix lint

ff109d6

anson627 force-pushed the add-dranet-blog branch from 9842ea5 to ff109d6 Compare April 3, 2026 17:34

Apply suggestions from code review

96218ab

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 3, 2026 17:37

Copilot started reviewing on behalf of anson627 April 3, 2026 17:38

Copilot AI reviewed Apr 3, 2026

anson627 added 2 commits April 3, 2026 17:51

update benchmark

14ab76f

update problem section

08f6cf4

Copilot AI review requested due to automatic review settings April 8, 2026 02:43

Copilot started reviewing on behalf of anson627 April 8, 2026 02:44

Copilot AI reviewed Apr 8, 2026

polish problem section

2c16d95



2 participants