Summary
The oras binary is missing from the AzureLinux V3 VHD image 202601.27.0, causing the Custom Script Extension (CSE) bootstrap to fail with exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) on Karpenter-provisioned AKS nodes. Affected nodes never join the cluster.
The previous production image 202601.13.0 works correctly.
Environment
| Field | Value |
| --- | --- |
| OS | AzureLinux V3 |
| Failing image version | 202601.27.0 |
| Last known-good image | 202601.13.0 |
| Node provisioner | Karpenter (karpenter-provider-azure) |
| Failure symptom | Node never joins cluster; CSE exits 211 |
Root Cause
oras is called early in the CSE bootstrap flow via oras_login_with_kubelet_identity and related functions in cse_helpers.sh. If the oras binary is absent from the image (not installed during the VHD build, or accidentally omitted), every call to oras fails without a clear message and surfaces as exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT). That code is misleading: the real cause is a missing binary, not a network timeout.
The true failure mode, oras not found in $PATH, is never logged, so it is very hard to diagnose from CSE logs alone.
Impact
- Blast radius: All AzureLinux V3 nodes provisioned with image 202601.27.0 via Karpenter fail to join the cluster.
- User impact: NodeClaims remain in a pending/disrupted state indefinitely; workloads cannot schedule.
- Diagnosability: Exit code 211 (ERR_ORAS_PULL_NETWORK_TIMEOUT) points to a network issue, causing engineers to investigate firewalls and IMDS reachability before discovering the binary is simply missing.
Steps to Reproduce
- Provision an AKS cluster with Karpenter enabled.
- Create a NodePool that targets AzureLinux V3 nodes.
- Schedule a workload that triggers node provisioning using image 202601.27.0.
- Observe that the node never reaches Ready state and CSE exits with code 211.
Expected vs Actual Behavior
| Scenario | Expected | Actual |
| --- | --- | --- |
| oras present | CSE completes, node joins cluster | n/a |
| oras missing | CSE exits with ERR_ORAS_BINARY_NOT_FOUND and logs clear diagnostic info | CSE exits with ERR_ORAS_PULL_NETWORK_TIMEOUT (211), no indication binary is missing |
Fix
This PR adds a pre-flight check at the top of oras_login_with_kubelet_identity (in cse_helpers.sh) that:
- Calls command -v oras to verify the binary is in $PATH.
- If missing: logs $PATH, probes known install locations (/usr/local/bin/oras, /usr/bin/oras, /opt/bin/oras), dumps /etc/os-release, and queries rpm or dpkg for any installed oras packages.
- Returns a new, unambiguous error code ERR_ORAS_BINARY_NOT_FOUND=232 so operators immediately understand what happened.
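The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name assert_oras_installed and the log messages are hypothetical. Only the error code value, the probed paths, and the rpm/dpkg queries come from the description above.

```shell
#!/usr/bin/env bash
# Value taken from the PR description.
ERR_ORAS_BINARY_NOT_FOUND=232

# Hypothetical helper name; the PR places the check at the top of
# oras_login_with_kubelet_identity instead.
assert_oras_installed() {
    # Fast path: oras resolves through PATH.
    if command -v oras >/dev/null 2>&1; then
        return 0
    fi

    echo "oras binary not found in PATH: ${PATH}" >&2

    # Probe the known install locations in case the binary exists outside PATH.
    local candidate
    for candidate in /usr/local/bin/oras /usr/bin/oras /opt/bin/oras; do
        if [ -e "${candidate}" ]; then
            echo "found ${candidate} (outside PATH or not executable)" >&2
        else
            echo "not present: ${candidate}" >&2
        fi
    done

    # Record which OS image the node is running.
    if [ -r /etc/os-release ]; then
        cat /etc/os-release >&2
    fi

    # Ask whichever package manager is available about oras packages.
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa 2>/dev/null | grep -i oras >&2 || echo "rpm: no oras package installed" >&2
    elif command -v dpkg >/dev/null 2>&1; then
        dpkg -l 2>/dev/null | grep -i oras >&2 || echo "dpkg: no oras package installed" >&2
    fi

    return "${ERR_ORAS_BINARY_NOT_FOUND}"
}
```

A caller would run the check before any other oras invocation, for example `assert_oras_installed || return $?`, so the dedicated code reaches the CSE exit status instead of being replaced by a later failure.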
A separate investigation is needed to determine why oras was not included in the 202601.27.0 VHD build. That is a VHD pipeline issue tracked separately.
Related
- Fix PR: (this PR)
- karpenter-provider-azure tracking issue: (filed separately — surface CSE exit code in NodeClaim conditions)
Additional Context
Exit code 211 is defined as ERR_ORAS_PULL_NETWORK_TIMEOUT and is reached in retrycmd_get_refresh_token_for_oras when the ACR token exchange fails. Without oras in $PATH, however, the function that calls it (retrycmd_can_oras_ls_acr_anonymously) fails immediately, and the error propagates upward as a generic network timeout rather than a missing-binary error.
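To illustrate how a missing binary can surface as a timeout code: a shell reports exit status 127 when a command is not found, and a retry helper that treats every non-zero status alike will map that 127 to its own timeout code. The sketch below is a simplified, hypothetical model of that masking. retry_then_timeout and retry_unless_missing are illustrative names and do not reproduce the actual retrycmd_* helpers, whose bodies are not shown here.

```shell
#!/usr/bin/env bash
# Value taken from the issue description.
ERR_ORAS_PULL_NETWORK_TIMEOUT=211

# Hypothetical retry wrapper: every non-zero status counts as a retryable
# failure, so "command not found" (127) is reported as a timeout.
retry_then_timeout() {
    local retries=$1
    shift
    local i=0
    while [ "${i}" -lt "${retries}" ]; do
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
    done
    return "${ERR_ORAS_PULL_NETWORK_TIMEOUT}"
}

# Variant that stops immediately on 127, preserving the real cause.
retry_unless_missing() {
    local retries=$1
    shift
    local i=0
    local rc=0
    while [ "${i}" -lt "${retries}" ]; do
        "$@" >/dev/null 2>&1
        rc=$?
        [ "${rc}" -eq 0 ] && return 0
        [ "${rc}" -eq 127 ] && return 127
        i=$((i + 1))
    done
    return "${ERR_ORAS_PULL_NETWORK_TIMEOUT}"
}
```

With a command name that does not exist, the first wrapper returns 211 after exhausting its retries, while the second returns 127 on the first attempt. The pre-flight check in this PR achieves the same clarity earlier, before any retry logic runs.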