# Karpenter DRA KWOK Driver
## Summary
The upstream kubernetes/perf-tests repository includes a [DRA KWOK Driver](https://github.com/kubernetes/perf-tests/pull/3491/files), but it is designed for **ClusterLoader2 scale testing** with pre-created static nodes, so it cannot be used for Karpenter testing.
This design introduces a **Karpenter DRA KWOK Driver** - a mock DRA driver that acts on behalf of KWOK nodes created by Karpenter. When KWOK nodes register with the cluster, the driver creates ResourceSlices advertising fake GPU/device resources. This simulates what a real DRA driver (like NVIDIA GPU Operator) would do, but with fake devices for testing purposes. The driver uses a polling approach (30-second interval) to periodically reconcile all KWOK nodes and creates corresponding ResourceSlices based on either Node Overlay or DRAConfig CRD. The driver acts independently as a standard Kubernetes controller, ensuring ResourceSlices exist on the API server for both the scheduler and Karpenter's cluster state to discover.
1. **Test creates DRAConfig CRD** defining device pools and node selectors
2. **Test creates DRA pod** with a ResourceClaim referencing device attributes
3. **Karpenter provisions KWOK node** in response to the unschedulable pod
4. **Driver polling loop detects new node** (within 30 seconds) and creates ResourceSlices based on:
   - **Case 1:** Check for a matching NodeOverlay with embedded ResourceSlice objects (future enhancement)
   - **Case 2:** Use DRAConfig CRD pools if no NodeOverlay matches
   - **Case 3:** Eventually, cloud providers will be able to provide potential ResourceSlice shapes through the InstanceType interface (future TODO: implement a way for cloud providers to inform the DRA KWOK driver of those shapes)
5. **Kubernetes scheduler discovers ResourceSlices** and binds the pod to the node
6. **Pod successfully schedules** to the node with available DRA resources
7. **Test validates** node creation, ResourceSlice creation, pod scheduling, and Karpenter behavior
8. **Cleanup automatically removes** ResourceSlices in the next polling cycle when nodes are deleted
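Steps 1-2 above can be sketched with a minimal claim and pod. This is an illustrative sketch, not part of the design: the `DeviceClass` name `kwok-gpu` and the pod image are assumptions, and the `resource.k8s.io/v1beta1` API version may differ depending on the cluster's Kubernetes release.

```yaml
# Hypothetical ResourceClaim selecting one of the driver's fake devices
# by attribute (the DeviceClass name "kwok-gpu" is an assumption).
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: kwok-gpu
      selectors:
      - cel:
          expression: device.attributes["test.karpenter.sh"].vendor == "nvidia"
---
# Pod that stays unschedulable until Karpenter provisions a KWOK node
# and the driver publishes matching ResourceSlices.
apiVersion: v1
kind: Pod
metadata:
  name: dra-test-pod
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: gpu-claim
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu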
## Implementation
### Case 1: NodeOverlay Configuration

Tests **Karpenter's integrated DRA scheduling** where DRA device counts are known.
**Example Node Overlay with DRA** (future API extension):
```yaml
apiVersion: test.karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: gpu-dra-config
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["g5.48xlarge"]
  capacity:
    test.karpenter.sh/device: "8"  # Custom extended resource for DRA devices
  # TODO: Extend NodeOverlay API to embed ResourceSlice templates
  # nodeName will be filled in by the driver when the node is created
  driver: "test.karpenter.sh"
  devices:
  - name: "nvidia-h100-0"
    driver: "test.karpenter.sh"
    attributes:
      memory: "80Gi"
      compute-capability: "9.0"
      vendor: "nvidia"
  - name: "nvidia-h100-1"
    driver: "test.karpenter.sh"
    attributes:
      memory: "80Gi"
      compute-capability: "9.0"
      vendor: "nvidia"
  # ... (6 more devices for a total of 8)
```
**How it works**:
1. **Test author defines NodeOverlay configuration**: "g5.48xlarge KWOK nodes should have 8x fake H100 GPUs" via ResourceSlices
2. **Karpenter creates KWOK node**: Node with `instance-type: g5.48xlarge` is created
3. **Driver polling detects new node**: Within 30 seconds, the driver reconciliation loop discovers the node
4. **NodeOverlay match found**: Driver checks for a NodeOverlay with embedded ResourceSlice objects and finds a matching configuration
5. **Driver creates ResourceSlice**: Acts as a fake DRA driver using the embedded ResourceSlice objects from the NodeOverlay
6. **Scheduler sees configured devices**: ResourceSlices with fake devices become available for DRA pod scheduling
7. **Test validation**: Validates that the driver correctly provides DRA resources and enables successful pod scheduling
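The ResourceSlice the driver publishes for step 5 might look like the sketch below. The object name and attribute set are illustrative assumptions; the field layout follows the upstream `resource.k8s.io/v1beta1` API, and the pool name follows the `<driver>/<node>` convention used by this driver.

```yaml
# Hypothetical ResourceSlice published by the driver for a KWOK node.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: kwok-node-1-test.karpenter.sh   # assumed generated name
spec:
  driver: test.karpenter.sh
  nodeName: kwok-node-1                 # the KWOK node the slice is bound to
  pool:
    name: test.karpenter.sh/kwok-node-1 # "<driver>/<node>" naming convention
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: nvidia-h100-0
    basic:
      attributes:
        memory: {string: "80Gi"}
        compute-capability: {string: "9.0"}
        vendor: {string: "nvidia"}
```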
71
-
### Case 2: ConfigMap Fallback Configuration
72
-
Tests **DRA resource provisioning when no NodeOverlay configuration is found** - simulating scenarios where ResourceSlices exist on nodes but weren't defined through NodeOverlay configuration. This addresses when other out of band components manage nodes, partial NodeOverlay coverage (only some instance types configured), and 3rd party DRA driver integration (GPU operators working independently). The driver falls back to ConfigMap-based device configuration when no matching NodeOverlay is found, creating ResourceSlices that Karpenter must then discover and incorporate into future scheduling decisions. This ensures we correctly test that Karpenter successfully discovers ResourceSlices and schedules against them, even if they weren't defined on any NodeOverlays.
71
+
### Case 2: CRD-Based Fallback Configuration
Tests **DRA resource provisioning via a strongly-typed CRD when no NodeOverlay configuration is found**, simulating scenarios where ResourceSlices exist on nodes but weren't defined through NodeOverlay configuration. This covers nodes managed by other out-of-band components, partial NodeOverlay coverage (only some instance types configured), and third-party DRA driver integration (GPU operators working independently). The driver falls back to DRAConfig CRD-based device configuration when no matching NodeOverlay is found, creating ResourceSlices that Karpenter must then discover and incorporate into future scheduling decisions. This ensures we correctly test that Karpenter successfully discovers ResourceSlices and schedules against them, even if they weren't defined on any NodeOverlay.
```yaml
apiVersion: test.karpenter.sh/v1alpha1
kind: DRAConfig
metadata:
  name: gpu-config              # User-chosen name
spec:
  driver: "test.karpenter.sh"   # Simulated driver name
  pools:
  - name: "h100-pool"
    nodeSelectorTerms:
    - matchExpressions:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g5.48xlarge"]
    resourceSlices:
    - devices:
      - name: "nvidia-h100-0"
        attributes:
          memory: {stringValue: "80Gi"}
          compute-capability: {stringValue: "9.0"}
          device_class: {stringValue: "gpu"}
          vendor: {stringValue: "nvidia"}
  - name: "fpga-pool"
    nodeSelectorTerms:
    - matchExpressions:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["f1.2xlarge"]
    resourceSlices:
    - devices:
      - name: "xilinx-u250-0"
        attributes:
          memory: {stringValue: "16Gi"}
          device_class: {stringValue: "fpga"}
          vendor: {stringValue: "xilinx"}
```
**How it works**:
1. **Test author defines DRAConfig CRD**: "g5.48xlarge KWOK nodes should have fake H100 GPUs when no NodeOverlay is found"
2. **Karpenter creates KWOK node**: Node with `instance-type: g5.48xlarge` is created
3. **Driver polling detects new node**: Within 30 seconds, the driver reconciliation loop discovers the node
4. **No NodeOverlay match found**: Driver checks for a NodeOverlay with embedded ResourceSlice objects, finds none, and falls back to the DRAConfig CRD
5. **Driver reads DRAConfig**: Gets the `test.karpenter.sh` DRAConfig (checked during each 30s polling cycle)
6. **Driver creates ResourceSlices**: For each KWOK node matching the pool's nodeSelectorTerms
7. **Scheduler sees configured devices**: ResourceSlices with fake devices become available for DRA pod scheduling
8. **Test validation**: Validates that the driver correctly provides DRA resources and enables successful pod scheduling
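A test pod can target the fake devices from a pool through a DeviceClass whose CEL selector matches the attributes defined in the DRAConfig. This is a sketch, not part of the design: the class name `kwok-fpga` is an assumption, and the expression follows the upstream `resource.k8s.io/v1beta1` CEL conventions.

```yaml
# Hypothetical DeviceClass matching the "fpga-pool" devices above.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: kwok-fpga
spec:
  selectors:
  - cel:
      expression: >-
        device.attributes["test.karpenter.sh"].device_class == "fpga" &&
        device.attributes["test.karpenter.sh"].vendor == "xilinx"
```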
**Architecture:**
1. `main.go` starts the ResourceSlice controller with its namespace
2. `resourceslice.go` polls nodes every 30 seconds, LISTs all DRAConfig CRDs, groups them by driver, and creates ResourceSlices for each driver independently
3. `draconfig_types.go` defines the CRD types with Pool structs (pool names auto-generated as `<driver>/<node>`)
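The node-matching and pool-naming behavior described above can be sketched in Go. The types here are simplified, hypothetical stand-ins for the real `draconfig_types.go` structs (only the `In` operator is handled), not the actual implementation.

```go
package main

import "fmt"

// Requirement is a simplified matchExpression (hypothetical stand-in
// for the real DRAConfig types; only "In" is handled in this sketch).
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// matches reports whether a node's labels satisfy any nodeSelectorTerm
// (terms are ORed; requirements within a term are ANDed).
func matches(labels map[string]string, terms [][]Requirement) bool {
	for _, term := range terms {
		ok := true
		for _, req := range term {
			if req.Operator != "In" {
				ok = false
				break
			}
			found := false
			for _, v := range req.Values {
				if labels[req.Key] == v {
					found = true
					break
				}
			}
			if !found {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

// poolName mirrors the "<driver>/<node>" naming convention from this design.
func poolName(driver, node string) string {
	return fmt.Sprintf("%s/%s", driver, node)
}

func main() {
	terms := [][]Requirement{{{
		Key:      "node.kubernetes.io/instance-type",
		Operator: "In",
		Values:   []string{"g5.48xlarge"},
	}}}
	node := map[string]string{
		"node.kubernetes.io/instance-type": "g5.48xlarge",
		"kwok.x-k8s.io/node":               "fake",
	}
	fmt.Println(matches(node, terms))                         // true
	fmt.Println(poolName("test.karpenter.sh", "kwok-node-1")) // test.karpenter.sh/kwok-node-1
}
```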