
Karpenter Provider Flex

Overview

This guide walks through deploying the karpenter controller to an AKS Flex cluster and using Karpenter to automatically provision and deprovision cloud nodes. By the end you will have:

  • The karpenter controller running in the cluster
  • NodeClass and NodePool resources configured for Azure and/or Nebius compute instances
  • Workloads that trigger automatic node scale-up
  • An understanding of how to scale down and clean up provisioned nodes

Karpenter watches for unschedulable pods and automatically provisions new nodes to meet demand. This provider extends Karpenter with support for multiple clouds:

  • Azure (AKSNodeClass) — provisions Azure VMs directly into the cluster's node resource group, joining the existing AKS cluster.
  • Nebius (NebiusNodeClass) — provisions Nebius VMs that join the AKS cluster as worker nodes over WireGuard or Unbounded CNI.

Getting Started

Prerequisites

  • AKS Flex CLI -- installed and configured with a .env file. See CLI Setup.
  • AKS cluster -- an AKS cluster provisioned via the CLI. For Nebius nodes, the cluster must also have WireGuard or Unbounded CNI enabled for cross-cloud connectivity. See AKS Cluster Setup.
  • Nebius service account credentials (Nebius only) -- a Nebius credentials JSON file for the karpenter controller. See the Nebius authorized keys documentation.
  • Helm -- required for installing the karpenter chart.

Configuration

Ensure your .env file contains the standard Azure settings:

export LOCATION=southcentralus
export AZURE_SUBSCRIPTION_ID=<your-subscription-id>
export RESOURCE_GROUP_NAME=rg-aks-flex-<username>
export CLUSTER_NAME=aks

The CLI resolves all Helm chart values from these environment variables and the live AKS cluster. No additional Karpenter-specific environment variables are required.

Installing the Provider via Helm

1. Create the karpenter namespace

$ kubectl create namespace karpenter

2. Locate your Nebius credentials file

The karpenter controller needs Nebius API credentials to provision VMs. The credentials file is a JSON file generated by the Nebius console (see the Nebius authorized keys documentation).

Note the local path to this file — you will pass it to the CLI in step 4 via --nebius-credentials-file. The chart will create the nebius-credentials Secret in the karpenter namespace automatically during helm upgrade --install; no separate kubectl create secret step is needed.

3. Azure permissions for the karpenter identity

The AKS ARM template (aks-flex-cli aks deploy) automatically provisions a user-assigned managed identity named karpenter-flex and assigns the following roles:

| Role                        | Scope               | Purpose                                                         |
| --------------------------- | ------------------- | --------------------------------------------------------------- |
| Network Contributor         | Resource group      | VNET GUID resolution at startup, subnet join when creating NICs |
| Virtual Machine Contributor | Node resource group | VM lifecycle — create and delete Azure VMs                      |
| Network Contributor         | Node resource group | NIC lifecycle — create and delete NICs for provisioned VMs      |
| Managed Identity Operator   | Node resource group | Assign managed identities to provisioned VMs                    |

The template also creates a federated identity credential that pairs the managed identity with the AKS cluster's OIDC issuer, granting access to the karpenter service account in the karpenter namespace. This enables workload identity — no manual role assignment steps are required.

4. Generate the Helm values file and install

Use the CLI to generate a karpenter_values.yaml file with all required values pre-populated. Pass --nebius-credentials-file to have the chart create the nebius-credentials Secret automatically, and --ssh-public-key-file to embed the SSH public key used when bootstrapping provisioned nodes:

$ aks-flex-cli config karpenter helm \
    --nebius-credentials-file ~/.nebius/credentials.json \
    --ssh-public-key-file ~/.ssh/id_ed25519.pub

The command reads both files, embeds their contents into karpenter_values.yaml, and prints the install command to stdout:

helm upgrade --install karpenter charts/karpenter \
  --namespace karpenter --create-namespace \
  --values karpenter_values.yaml

The generated karpenter_values.yaml looks like:

# Karpenter Helm values — generated by: aks-flex config karpenter helm
settings:
  clusterName: "aks"
  clusterEndpoint: "https://aks-xxxx.hcp.eastus2.azmk8s.io:443"

logLevel: debug
replicas: 1

serviceAccount:
  annotations:
    azure.workload.identity/client-id: "<karpenter-flex-client-id>"

podLabels:
  azure.workload.identity/use: "true"

controller:
  nebiusCredentials:
    enabled: true
  image:
    digest: ""
  env:
    - name: ARM_CLOUD
      value: "AzurePublicCloud"
    - name: LOCATION
      value: "southcentralus"
    - name: ARM_RESOURCE_GROUP
      value: "rg-aks-flex-<username>"
    - name: AZURE_TENANT_ID
      value: "<tenant-id>"
    - name: AZURE_CLIENT_ID
      value: "<karpenter-flex-client-id>"
    - name: AZURE_SUBSCRIPTION_ID
      value: "<subscription-id>"
    - name: AZURE_NODE_RESOURCE_GROUP
      value: "<node-resource-group>"
    - name: SSH_PUBLIC_KEY
      value: "ssh-key-not-set"
    - name: VNET_SUBNET_ID
      value: "/subscriptions/.../subnets/aks"
    - name: KUBELET_BOOTSTRAP_TOKEN
      value: "<token-id>.<token-secret>"
    - name: DISABLE_LEADER_ELECTION
      value: "false"

If any value cannot be resolved (e.g. the cluster is not reachable), it is replaced with <replace-with-actual-value>. Edit the file before running the install command.

Run the install from the karpenter/ directory:

$ helm upgrade --install karpenter charts/karpenter \
  --namespace karpenter --create-namespace \
  --values karpenter_values.yaml

Specifying a custom output path

By default the values file is written to karpenter_values.yaml in the current directory. Use --output to write it elsewhere:

$ aks-flex-cli config karpenter helm \
    --nebius-credentials-file ~/.nebius/credentials.json \
    --ssh-public-key-file ~/.ssh/id_ed25519.pub \
    --output /path/to/my-values.yaml

Overriding the controller image

To use a custom controller image instead of the chart default, pass the --image flag:

$ aks-flex-cli config karpenter helm \
    --nebius-credentials-file ~/.nebius/credentials.json \
    --ssh-public-key-file ~/.ssh/id_ed25519.pub \
    --image myregistry.io/karpenter:v0.2.0

This adds controller.image.repository and controller.image.tag entries to the generated values file.
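For the invocation above, the addition to the generated values file looks like this:

```yaml
controller:
  image:
    repository: myregistry.io/karpenter
    tag: v0.2.0
```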

5. Verify the controller is running

$ kubectl -n karpenter get pods
NAME                         READY   STATUS    RESTARTS      AGE
karpenter-6b55df659d-m2d5g   1/1     Running   7 (13m ago)   20m

Creating Nodes on Azure via Karpenter

With the karpenter controller running, you can define an AKSNodeClass and NodePool to provision Azure VMs directly into the cluster's node resource group.

Creating an AKSNodeClass

The AKSNodeClass defines the Azure-specific configuration for provisioned nodes:

$ kubectl apply -f examples/azure/nodeclass.yaml

Verify the node class is ready:

$ kubectl get aksnodeclass
NAME    READY   AGE
azure   True    5s
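The contents of examples/azure/nodeclass.yaml are not reproduced here. As a rough sketch only — the apiVersion and the spec field below are assumptions modeled on the upstream Azure provider's AKSNodeClass, so check the installed CRD for the authoritative schema — a minimal node class might look like:

```yaml
# Illustrative sketch; verify the real schema with:
#   kubectl explain aksnodeclass.spec
apiVersion: karpenter.azure.com/v1beta1   # assumed group/version
kind: AKSNodeClass
metadata:
  name: azure
spec:
  imageFamily: Ubuntu2204   # assumed field, as in the upstream Azure provider
```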

Creating a CPU NodePool

$ kubectl apply -f examples/azure/cpu_nodepool.yaml

Verify the node pool is ready:

$ kubectl get nodepool
NAME                 NODECLASS   NODES   READY   AGE
azure-cpu-nodepool   azure       0       True    4s
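The example NodePool manifest is not shown above. A minimal sketch using the standard karpenter.sh/v1 NodePool API could look like the following — the nodeClassRef group and the limits are assumptions; match them to the installed AKSNodeClass CRD and your capacity needs:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: azure-cpu-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com   # assumed; must match the AKSNodeClass CRD group
        kind: AKSNodeClass
        name: azure
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "64"   # illustrative cap on total provisioned CPU
```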

Creating a GPU NodePool

For GPU workloads, create a NodePool that pins to a specific GPU SKU via node.kubernetes.io/instance-type:

$ kubectl apply -f examples/azure/gpu_nodepool.yaml

Both node pools should now be ready:

$ kubectl get nodepool
NAME                 NODECLASS   NODES   READY   AGE
azure-cpu-nodepool   azure       0       True    4s
azure-gpu-nodepool   azure       0       True    2s
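As a sketch of the instance-type pinning described above — the GPU SKU is illustrative and the nodeClassRef group is assumed to match the installed AKSNodeClass CRD:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: azure-gpu-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com   # assumed; must match the AKSNodeClass CRD group
        kind: AKSNodeClass
        name: azure
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["Standard_NC24ads_A100_v4"]   # illustrative GPU SKU; substitute your own
```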

Deploy a workload to trigger scale-up

$ kubectl apply -f examples/azure/cpu_deployment.yaml

Karpenter detects the unschedulable pod and creates a NodeClaim:

$ kubectl get nodeclaims
NAME                       TYPE           CAPACITY   ZONE   NODE                         READY   AGE
azure-cpu-nodepool-6rhlk                                    aks-azure-cpu-nodepool-6rhlk True    2m

Note: GPU workloads require an NVIDIA plugin to advertise GPU resources. Install one with the CLI before creating GPU workloads:

# NVIDIA GPU Device Plugin (standard resource-based allocation)
aks-flex-cli aks deploy --nvidia-device-plugin --skip-arm

# NVIDIA DRA Driver (Dynamic Resource Allocation)
aks-flex-cli aks deploy --nvidia-dra-driver --skip-arm

Creating Nodes on Nebius via Karpenter

With the karpenter controller running, you can define a NebiusNodeClass and NodePool to tell Karpenter how and when to provision Nebius nodes.

Creating a NebiusNodeClass

The NebiusNodeClass defines the Nebius-specific configuration for provisioned nodes:

$ kubectl apply -f examples/nebius/nodeclass.yaml

Verify the node class is ready:

$ kubectl get nebiusnodeclass
NAME     READY   AGE
nebius   True    3s

Note: The wireguardPeerCIDR field in the NebiusNodeClass is only required when using WireGuard for cross-cloud connectivity. When using Unbounded CNI, this field should not be set.
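The shape of the example node class is not reproduced here. Apart from wireguardPeerCIDR, everything below — including the apiVersion and the CIDR value — is an assumption; inspect the installed CRD for the real schema:

```yaml
apiVersion: karpenter.nebius.com/v1alpha1   # assumed group/version
kind: NebiusNodeClass
metadata:
  name: nebius
spec:
  wireguardPeerCIDR: 172.16.2.0/24   # WireGuard only; omit when using Unbounded CNI
```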

Creating a NodePool

The NodePool defines scheduling constraints and references the NebiusNodeClass:

$ kubectl apply -f examples/nebius/cpu_nodepool.yaml

Verify the node pool is ready:

$ kubectl get nodepool
NAME                  NODECLASS   NODES   READY   AGE
nebius-cpu-nodepool   nebius      0       True    4s

Creating a GPU NodePool

For GPU workloads, create a separate NodePool that does not restrict by CPU SKU. The karpenter.azure.com/sku-cpu label is not present on GPU instance types, so the CPU NodePool's Gt requirement would prevent GPU instances from ever being selected. The GPU NodePool omits that constraint and relies on the workload's node affinity (via node.kubernetes.io/instance-type) to select the appropriate GPU instance:

$ kubectl apply -f examples/nebius/gpu_nodepool.yaml

Both node pools should now be ready:

$ kubectl get nodepool
NAME                  NODECLASS   NODES   READY   AGE
nebius-cpu-nodepool   nebius      0       True    4s
nebius-gpu-nodepool   nebius      0       True    2s

Deploy a workload to trigger scale-up

Create a deployment that schedules pods away from system nodes. Karpenter will detect the unschedulable pods and provision a new Nebius node:

$ kubectl apply -f examples/nebius/cpu_deployment.yaml
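The deployment manifest is not reproduced here. A minimal sketch that keeps pods off the system nodes by selecting on the standard karpenter.sh/nodepool label (the replicas, image, and resource requests are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-cpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-cpu-app
  template:
    metadata:
      labels:
        app: sample-cpu-app
    spec:
      nodeSelector:
        karpenter.sh/nodepool: nebius-cpu-nodepool   # only Karpenter-provisioned nodes carry this label
      containers:
        - name: app
          image: nginx:1.21
          resources:
            requests:
              cpu: "4"        # illustrative; sized so the pod cannot fit on existing nodes
              memory: 8Gi
```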

The pod is initially Pending while Karpenter provisions a new node; once the node joins the cluster, it transitions to Running:

$ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
sample-cpu-app-6c7bb4ccb-wbl5h   1/1     Running   0          9m51s

Karpenter creates a NodeClaim to request a new node from Nebius:

$ kubectl get nodeclaims
NAME                        TYPE                 CAPACITY    ZONE   NODE                                 READY   AGE
nebius-cpu-nodepool-6g8v8   cpu-d3-16vcpu-64gb   on-demand   1      computeinstance-e00a4p0rrnms9n24jp   True    9m35s

After a few minutes, the new Nebius node should appear:

$ kubectl get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
aks-system-94214615-vmss000000       Ready    <none>   3h13m   v1.34.2   172.16.1.4     <none>          Ubuntu 22.04.5 LTS   5.15.0-1102-azure    containerd://1.7.30-1
aks-system-94214615-vmss000001       Ready    <none>   3h13m   v1.34.2   172.16.1.5     <none>          Ubuntu 22.04.5 LTS   5.15.0-1102-azure    containerd://1.7.30-1
aks-system-94214615-vmss000002       Ready    <none>   3h13m   v1.34.2   172.16.1.6     <none>          Ubuntu 22.04.5 LTS   5.15.0-1102-azure    containerd://1.7.30-1
aks-wireguard-23306360-vmss000000    Ready    <none>   3h9m    v1.34.2   172.16.2.4     20.91.194.208   Ubuntu 22.04.5 LTS   5.15.0-1102-azure    containerd://1.7.30-1
computeinstance-e00a4p0rrnms9n24jp   Ready    <none>   8m30s   v1.33.3   100.96.1.237   <none>          Ubuntu 24.04.4 LTS   6.11.0-1016-nvidia   containerd://2.0.4

The pod is now running on the Nebius node:

$ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                 NOMINATED NODE   READINESS GATES
sample-cpu-app-6c7bb4ccb-wbl5h   1/1     Running   0          10m   10.0.10.159   computeinstance-e00a4p0rrnms9n24jp   <none>           <none>
$ kubectl logs -f sample-cpu-app-6c7bb4ccb-wbl5h
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2026/02/27 20:37:12 [notice] 1#1: using the "epoll" event method
2026/02/27 20:37:12 [notice] 1#1: nginx/1.21.6
2026/02/27 20:37:12 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 
2026/02/27 20:37:12 [notice] 1#1: OS: Linux 6.11.0-1016-nvidia
2026/02/27 20:37:12 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576

Deploy a GPU workload

For GPU workloads, create a deployment that requests GPU resources and targets GPU instance types:

$ kubectl apply -f examples/nebius/gpu_deployment.yaml
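The GPU manifest is not shown above. A sketch of a deployment that requests a GPU and pins the H100 instance type seen in the NodeClaim output below — the image and command are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-gpu-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-gpu-app
  template:
    metadata:
      labels:
        app: sample-gpu-app
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: gpu-h100-sxm-8gpu-128vcpu-1600gb
      containers:
        - name: app
          image: nvcr.io/nvidia/cuda:12.8.0-base-ubuntu24.04   # assumed image
          # Print GPU status once, then idle so the pod stays Running.
          command: ["bash", "-c", "nvidia-smi && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1   # requires an NVIDIA plugin to be advertised
```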

The GPU pod will be pending while Karpenter provisions a GPU node:

$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
sample-cpu-app-6c7bb4ccb-wbl5h    1/1     Running   0          11m
sample-gpu-app-5d8b85c989-5l9zt   0/1     Pending   0          5s

Karpenter creates a new NodeClaim for the GPU instance:

$ kubectl get nodeclaims
NAME                        TYPE                               CAPACITY    ZONE   NODE                                 READY     AGE
nebius-cpu-nodepool-6g8v8   cpu-d3-16vcpu-64gb                 on-demand   1      computeinstance-e00a4p0rrnms9n24jp   True      11m
nebius-gpu-nodepool-r2qwq   gpu-h100-sxm-8gpu-128vcpu-1600gb   on-demand                                              Unknown   16s

Note: GPU workloads require an NVIDIA plugin to be installed so that GPU resources are advertised to the scheduler. Install one with the CLI before creating GPU workloads:

# NVIDIA GPU Device Plugin (standard resource-based allocation)
aks-flex-cli aks deploy --nvidia-device-plugin --skip-arm

# NVIDIA DRA Driver (Dynamic Resource Allocation)
aks-flex-cli aks deploy --nvidia-dra-driver --skip-arm

After the GPU node is provisioned, both nodes and pods should be running:

$ kubectl get nodes
NAME                                 STATUS   ROLES    AGE     VERSION
aks-system-94214615-vmss000000       Ready    <none>   3h19m   v1.34.2
aks-system-94214615-vmss000001       Ready    <none>   3h19m   v1.34.2
aks-system-94214615-vmss000002       Ready    <none>   3h18m   v1.34.2
aks-wireguard-23306360-vmss000000    Ready    <none>   3h14m   v1.34.2
computeinstance-e00a4p0rrnms9n24jp   Ready    <none>   13m     v1.33.3
computeinstance-e00zjdx1e50bxcfekk   Ready    <none>   107s    v1.33.3
$ kubectl get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP            NODE                                 NOMINATED NODE   READINESS GATES
sample-cpu-app-6c7bb4ccb-7dg9t    1/1     Running   0          75s     10.0.12.199   computeinstance-e00zjdx1e50bxcfekk   <none>           <none>
sample-gpu-app-5d8b85c989-5l9zt   1/1     Running   0          4m17s   10.0.12.66    computeinstance-e00zjdx1e50bxcfekk   <none>           <none>
$ kubectl logs -f sample-gpu-app-76b4884cbd-m8bft
Sun Feb 22 22:05:49 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:8D:00.0 Off |                    0 |
| N/A   28C    P0             68W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Scale down nodes

When demand decreases, Karpenter automatically deprovisions nodes that are no longer needed. To test this, scale the deployments down:

$ kubectl scale deployment sample-cpu-app --replicas=0
$ kubectl scale deployment sample-gpu-app --replicas=0

You can observe the disruption lifecycle by describing a node claim:

$ kubectl describe nodeclaims nebius-nodepool-dfpmw
Events:
  Type    Reason                 Age                   From       Message
  ----    ------                 ----                  ----       -------
  Normal  Launched               9m15s                 karpenter  Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
  Normal  DisruptionBlocked      7m5s (x2 over 9m15s)  karpenter  Nodeclaim does not have an associated node
  Normal  Registered             5m30s                 karpenter  Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
  Normal  DisruptionBlocked      4m41s                 karpenter  Node isn't initialized
  Normal  Initialized            2m56s                 karpenter  Status condition transitioned, Type: Initialized, Status: Unknown -> True, Reason: Initialized
  Normal  Ready                  2m56s                 karpenter  Status condition transitioned, Type: Ready, Status: Unknown -> True, Reason: Ready
  Normal  DisruptionBlocked      2m35s                 karpenter  Node is nominated for a pending pod
  Normal  Unconsolidatable       111s                  karpenter  Not all pods would schedule, default/sample-gpu-app-76b4884cbd-m8bft => would schedule against uninitialized nodeclaim/nebius-nodepool-7g2rq default/sample-app-66986dd6c6-qs6gt => would schedule against uninitialized nodeclaim/nebius-nodepool-7g2rq
  Normal  DisruptionTerminating  19s                   karpenter  Disrupting NodeClaim: Underutilized
  Normal  DisruptionBlocked      19s                   karpenter  Node is deleting or marked for deletion

After the disruption grace period, Karpenter will terminate the idle Nebius nodes and they will be removed from the cluster:

$ kubectl get nodes
NAME                                 STATUS   ROLES    AGE    VERSION
aks-system-32742974-vmss000000       Ready    <none>   18h    v1.33.6
aks-system-32742974-vmss000001       Ready    <none>   18h    v1.33.6
aks-wireguard-12237243-vmss000000    Ready    <none>   18h    v1.33.6