fix: enforce control plane upgrade before node pool upgrades (fixes #465) #735
Merged
lonegunmanb merged 2 commits into main on Mar 18, 2026
Conversation
Fixes #465

When upgrading AKS cluster versions, the node pool `orchestrator_version` was being updated by the azurerm provider before the control plane upgrade (via `azapi_update_resource`) completed, causing `NodePoolMcVersionIncompatible` errors.

Changes:
- Add `default_node_pool[0].orchestrator_version` to `ignore_changes` in `azurerm_kubernetes_cluster.main` to prevent the azurerm provider from racing the control plane upgrade
- Add `orchestrator_version` to `ignore_changes` in both `node_pool_create_before_destroy` and `node_pool_create_after_destroy`
- Add `azapi_update_resource.aks_cluster_default_nodepool_version` to upgrade the default node pool version after the control plane completes
- Add `azapi_update_resource.node_pool_version` (`for_each`) to upgrade extra node pools after the control plane completes
- Add `null_resource` keepers to track version changes and trigger replacements

Verified via sandbox experiment: a 1.32 -> 1.33 upgrade completed successfully with correct ordering (control plane first, then node pools) and zero `NodePoolMcVersionIncompatible` errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mpotency

The `orchestrator_version_keeper` and `node_pool_orchestrator_version_keeper` `null_resource`s were always created, even when `var.orchestrator_version` was null (the default). This caused upgrade tests to fail with 'terraform configuration not idempotent', because upgrading from the released version showed these new resources as 'Plan: 1 to add'.

Fix: add `count`/`for_each` conditions matching the corresponding `azapi_update_resource` resources, so the keepers are only created when `orchestrator_version` is actually set.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
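The idempotency fix described in this commit amounts to gating the keeper on the same condition as its `azapi_update_resource` counterpart. A simplified sketch, assuming a plain `triggers` map (the exact resource body is not shown in the PR text):

```hcl
# Keeper for the default node pool version. The count condition matches
# azapi_update_resource.aks_cluster_default_nodepool_version, so when no
# orchestrator_version is set (the default), neither resource is planned
# and the configuration stays idempotent.
resource "null_resource" "orchestrator_version_keeper" {
  count = var.orchestrator_version == null ? 0 : 1

  triggers = {
    orchestrator_version = var.orchestrator_version
  }
}
```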
Problem
Fixes #465
When upgrading AKS cluster Kubernetes versions (e.g., from 1.27 to 1.28), users encounter a `NodePoolMcVersionIncompatible` error: AKS requires the control plane to be upgraded before node pools, but the module's current code violates this ordering, causing the node pool upgrade to race ahead of the control plane upgrade.
Multiple users independently confirmed this issue (see #465 comments by @dunefro, @zioproto, @nnstt1, and others).
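For context, the failure is triggered by nothing more than a version bump on the module inputs. A minimal sketch (the module `source` and surrounding arguments are illustrative, not taken from the issue; only `kubernetes_version` and `orchestrator_version` appear in the PR text):

```hcl
module "aks" {
  source = "Azure/aks/azurerm" # illustrative module source

  # Bumping both versions in one apply previously raced: the node pools
  # tried to move to 1.28 before the control plane got there.
  kubernetes_version   = "1.28" # control plane target
  orchestrator_version = "1.28" # node pool target

  # ... other required module arguments elided ...
}
```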
Root Cause
PR #336 refactored control plane upgrades to use `azapi_update_resource.aks_cluster_post_create` and added `kubernetes_version` to `ignore_changes` on `azurerm_kubernetes_cluster.main`. However, it did not add `default_node_pool[0].orchestrator_version` to `ignore_changes`. This creates a DAG race condition:

1. `azurerm_kubernetes_cluster.main` sees a diff on `default_node_pool[0].orchestrator_version` and attempts to update the default node pool version directly via the azurerm provider, before `aks_cluster_post_create` upgrades the control plane.
2. `aks_cluster_post_create` (which would have upgraded the control plane first) never gets a chance to run, because it depends on `azurerm_kubernetes_cluster.main` completing successfully.

Extra node pools have the same issue: although they declare `depends_on = [azapi_update_resource.aks_cluster_post_create]`, without `ignore_changes = [orchestrator_version]` Terraform still computes a diff and sends a version update via the azurerm provider. The `depends_on` only controls execution ordering — it does not prevent the azurerm provider from issuing its own update. As module maintainer @zioproto confirmed: the fix must cover both `azurerm_kubernetes_cluster` (default node pool) and `azurerm_kubernetes_cluster_node_pool` (extra node pools).

Key code references:

- `main.tf:602-608` — existing `lifecycle.ignore_changes` (missing `default_node_pool[0].orchestrator_version`)
- `main.tf:749-766` — `aks_cluster_post_create` (control plane upgrade via azapi, introduced by PR #336, "[Breaking] - Ignore changes on `kubernetes_version` from outside of Terraform")
- `extra_node_pool.tf:148-152` — `node_pool_create_before_destroy` lifecycle (missing `orchestrator_version`)
- `extra_node_pool.tf:320` — `node_pool_create_after_destroy` lifecycle (no `ignore_changes` at all)

Solution
Three-tier approach following the established pattern from PR #336:
1. Blind the azurerm provider (`ignore_changes`)

Prevent the azurerm provider from computing diffs on `orchestrator_version` for all node pool resources:

- Add `default_node_pool[0].orchestrator_version` to `ignore_changes` in `azurerm_kubernetes_cluster.main`
- Add `orchestrator_version` to the existing `ignore_changes` in `node_pool_create_before_destroy`
- Add a new `ignore_changes` block with `orchestrator_version` to `node_pool_create_after_destroy`

2. Build external schedulers (`azapi_update_resource`)

Delegate node pool version upgrades to `azapi_update_resource` resources that we can explicitly order:

- `aks_cluster_default_nodepool_version` — upgrades the default node pool via the Azure REST API (`PUT .../agentPools/{pool_name}`)
  - `resource_id` follows the same pattern as the existing `aks_cluster_agents_pool_local_dns_config` (main.tf:861): `"${azurerm_kubernetes_cluster.main.id}/agentPools/${var.agents_pool_name}"`
  - Uses `local.aks_api_version` (2025-09-01), consistent with all other azapi resources in the module
  - Created only when `var.orchestrator_version != null`
- `node_pool_version` — upgrades extra node pools via `for_each`
  - `resource_id` follows the same `try()` pattern as the existing `aks_cluster_local_dns_config` (main.tf:807-810): `try(node_pool_create_before_destroy[each.key].id, node_pool_create_after_destroy[each.key].id)`

3. Enforce ordering (`depends_on`)

Both new `azapi_update_resource` resources declare `depends_on = [azapi_update_resource.aks_cluster_post_create]`, ensuring the control plane upgrade completes before any node pool upgrade begins.

Version change tracking uses `null_resource` + `replace_triggered_by` (same pattern as the existing `kubernetes_version_keeper`):

- `orchestrator_version_keeper` — tracks `var.orchestrator_version` for the default node pool
- `node_pool_orchestrator_version_keeper` — tracks `each.value.orchestrator_version` for extra pools

DAG Execution Order
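The ordering described above can be sketched in HCL roughly as follows. This is a simplified sketch of the default-node-pool resource only; the exact `body` shape, and the assumption that the azapi v2 object syntax is used, are mine rather than copied from the module:

```hcl
# Sketch: the node pool upgrade may only run after the control plane upgrade.
resource "azapi_update_resource" "aks_cluster_default_nodepool_version" {
  count = var.orchestrator_version == null ? 0 : 1

  type        = "Microsoft.ContainerService/managedClusters/agentPools@2025-09-01"
  resource_id = "${azurerm_kubernetes_cluster.main.id}/agentPools/${var.agents_pool_name}"

  body = {
    properties = {
      # ARM property for the node pool Kubernetes version.
      orchestratorVersion = var.orchestrator_version
    }
  }

  # Control plane first: this explicit edge is what fixes #465.
  depends_on = [azapi_update_resource.aks_cluster_post_create]

  lifecycle {
    # Re-run the upgrade whenever the tracked version changes.
    replace_triggered_by = [null_resource.orchestrator_version_keeper]
  }
}
```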
Changes (2 files, +64/-1 lines)
main.tf

- Add `default_node_pool[0].orchestrator_version` to the `azurerm_kubernetes_cluster.main` lifecycle `ignore_changes`
- Add `orchestrator_version_keeper` `null_resource` to track `var.orchestrator_version` changes (triggers `replace_triggered_by`)
- Add `aks_cluster_default_nodepool_version` `azapi_update_resource` to upgrade the default node pool version via the Azure REST API, gated by `depends_on = [aks_cluster_post_create]`

extra_node_pool.tf

- Add `orchestrator_version` to the `node_pool_create_before_destroy` lifecycle `ignore_changes`
- Add `orchestrator_version` to the `node_pool_create_after_destroy` lifecycle `ignore_changes`
- Add `node_pool_orchestrator_version_keeper` `null_resource` with `for_each` to track per-pool `orchestrator_version` changes
- Add `node_pool_version` `azapi_update_resource` with `for_each` to upgrade extra node pool versions, gated by `depends_on = [aks_cluster_post_create]`

Verification
Sandbox experiment performed: AKS cluster upgraded from 1.32 → 1.33 with 1 default node pool + 1 extra node pool.
Execution log (confirms correct ordering):
Key assertions verified:
- Control plane upgrade (`aks_cluster_post_create`) completed before any node pool upgrade started
- Zero `NodePoolMcVersionIncompatible` errors throughout the entire process
- `azurerm_kubernetes_cluster_node_pool` showed no `orchestrator_version` diff (`ignore_changes` confirmed working)

Compatibility
- `ignore_changes` prevents Terraform from detecting version drift on `orchestrator_version`, which is intentional — version management is delegated to `azapi_update_resource`.
- This mirrors PR #336 ("[Breaking] - Ignore changes on `kubernetes_version` from outside of Terraform") for control plane upgrades (`kubernetes_version` → `ignore_changes` + `azapi_update_resource` + `null_resource` keeper).
- The `azapi_update_resource` pattern has been battle-tested for control plane upgrades since PR [Breaking] - Ignore changes on `kubernetes_version` from outside of Terraform #336.
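The first compatibility point corresponds to a lifecycle block of roughly this shape after the PR. A simplified sketch, assuming only the two version attributes are ignored (the real block in `main.tf` ignores additional attributes):

```hcl
resource "azurerm_kubernetes_cluster" "main" {
  # ... cluster arguments elided ...

  lifecycle {
    ignore_changes = [
      kubernetes_version,                        # PR #336: control plane version managed via azapi
      default_node_pool[0].orchestrator_version, # this PR: node pool version managed via azapi
    ]
  }
}
```

The trade-off is that a node pool upgraded outside Terraform will not show drift in `terraform plan`; reconciliation happens only through the azapi resources when the module inputs change.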