feat: topology-aware drift disruption ordering via DriftPolicy#2963
BrunoChauvet wants to merge 17 commits into kubernetes-sigs:main
Conversation
Spec for topology-aware drift disruption ordering via a new DriftPolicy field on Disruption. Supports three modes: default (globally oldest-first), sequential by domain, and sequential by domain with concurrency.
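The three modes described in the spec might look like the following NodePool snippet. This is a sketch based on the field names in this PR's description; the `apiVersion` and exact schema are assumptions, and the authoritative shape is the upstream CRD:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Omit driftPolicy entirely for the default behaviour
    # (globally oldest-first, mode 1).
    driftPolicy:
      # Mode 2: disrupt one topology domain at a time.
      topologyKey: topology.kubernetes.io/zone
      # Mode 3 only: additionally cap in-flight replacements within the
      # active domain; accepts an absolute count or a percentage ("25%").
      maxConcurrentPerDomain: 2
```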
Expand the field-level and type-level comments to explain the what, why, and how of DriftPolicy:
- Clarify the relationship with Budgets (orthogonal, not overlapping)
- Document the alphabetical domain selection rationale
- Note that unlabelled nodes are unaffected
- Add examples to TopologyKey and MaxConcurrentPerDomain
Extract the per-domain budget check into a domainBudgetOK helper to bring applyDriftPolicies complexity from 14 down to <= 11 (the gocyclo limit).
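The extracted helper might look like the following simplified sketch. The type, names, and signature here are assumptions for illustration, not the upstream code; the real helper operates on Karpenter's candidate and NodeClaim types:

```go
package main

import "fmt"

// candidate is a hypothetical stand-in for a drifted node considered
// for disruption.
type candidate struct {
	domain   string
	deleting bool // has a DeletionTimestamp, i.e. an in-flight replacement
}

// domainBudgetOK reports whether another disruption may start in the given
// domain: the count of in-flight replacements there must stay below
// maxConcurrent. Factoring this loop out of the caller is the kind of
// extraction that lowers the caller's cyclomatic complexity.
func domainBudgetOK(domain string, maxConcurrent int, candidates []candidate) bool {
	inFlight := 0
	for _, c := range candidates {
		if c.domain == domain && c.deleting {
			inFlight++
		}
	}
	return inFlight < maxConcurrent
}

func main() {
	nodes := []candidate{
		{domain: "us-east-1a", deleting: true},
		{domain: "us-east-1a", deleting: false},
		{domain: "us-east-1b", deleting: false},
	}
	fmt.Println(domainBudgetOK("us-east-1a", 1, nodes)) // budget of 1 exhausted by the in-flight claim
	fmt.Println(domainBudgetOK("us-east-1b", 1, nodes)) // no in-flight claims, budget available
}
```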
…reformat - Fix British spelling "behaviour" → "behavior" in nodepool.go godoc (golangci-lint misspell linter, locale: US) - Remove extra column-alignment whitespace in drift_test.go struct literals (golangci-lint goimports formatter) - Regenerate CRDs with updated YAML indentation style from controller-gen (go generate ./... + cp pkg/apis/crds kwok/charts after spelling fix) The root cause was a sequencing issue: go generate ran before golangci-lint fixed the spelling in nodepool.go, so previously committed CRDs still had the British spelling. Running go generate after the spelling fix produces stable, idempotent output that satisfies make verify.
Two DriftPolicy tests had wrong expectations:

1. "should respect maxConcurrentPerDomain budget gate":
   - nodeClaim1 has a DeletionTimestamp (in-flight); budget=1 is exhausted
   - The controller correctly returns 0 new commands
   - The test was asserting HaveLen(1); fixed to HaveLen(0)
2. "should allow second disruption when maxConcurrentPerDomain=2 with only 1 in-flight":
   - nodeClaim1 has a DeletionTimestamp; SimulateScheduling skips it (errCandidateDeleting)
   - The controller correctly disrupts nodeClaim2 (the non-deleting candidate)
   - The test was asserting nodeClaim1.Name; fixed to nodeClaim2.Name
These were internal working documents, not part of the upstream contribution.
Add a new "Topology-Aware Drift Ordering (DriftPolicy)" subsection under the Drift section of the disruption concepts page. Covers:
- What problem DriftPolicy solves (fan-out blast radius during fleet-wide drift)
- The two fields: topologyKey and maxConcurrentPerDomain
- How active-domain selection works (in-flight first, then alphabetical)
- Interaction with existing NodePool Disruption Budgets
- An example NodePool YAML

Related upstream PR: kubernetes-sigs/karpenter#2963
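The active-domain selection rule documented here (domains with in-flight replacements first, then alphabetical) could be sketched as follows. This is an illustrative simplification with assumed names, not the controller's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for a drifted node.
type node struct {
	domain   string
	deleting bool // in-flight replacement
}

// activeDomain picks the single domain drift disruption proceeds in:
// a domain that already has an in-flight replacement wins (alphabetically
// smallest if several), otherwise the alphabetically first domain with
// drifted nodes. Alphabetical ordering keeps the choice stable across
// controller restarts.
func activeDomain(nodes []node) string {
	var inFlight, all []string
	seenAll := map[string]bool{}
	seenIF := map[string]bool{}
	for _, n := range nodes {
		if !seenAll[n.domain] {
			seenAll[n.domain] = true
			all = append(all, n.domain)
		}
		if n.deleting && !seenIF[n.domain] {
			seenIF[n.domain] = true
			inFlight = append(inFlight, n.domain)
		}
	}
	sort.Strings(inFlight)
	sort.Strings(all)
	if len(inFlight) > 0 {
		return inFlight[0]
	}
	if len(all) > 0 {
		return all[0]
	}
	return ""
}

func main() {
	// An in-flight domain takes priority even though it sorts later.
	fmt.Println(activeDomain([]node{
		{domain: "us-east-1b", deleting: true},
		{domain: "us-east-1a", deleting: false},
	})) // us-east-1b
}
```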
Adds designs/drift-topology-policy.md per upstream maintainer request. The RFC leads with Kafka/Cassandra use cases (quorum loss, rebalance storms), explains why Budget.Nodes is insufficient, and covers the four key design choices: Budget scope, CRD placement, alphabetical domain selection, and drift-only scope.
- Add an EBS single-attach volume sequence to explain why concurrent cross-zone replacement creates simultaneous partition vacancies
- Explicitly address why PDB/surge eviction doesn't solve the problem: it controls eviction pacing (how), not node selection (which)
- Note custom eviction webhooks as complementary but insufficient
Description
When many nodes are drifted at once, Karpenter fans out disruption across all availability zones simultaneously. This is fast, but it can reduce availability for workloads that are sensitive to how replacements roll across failure domains.
For instance, stateful workloads that replicate data across topology domains may be impacted if nodes holding the same partition in distinct domains are disrupted concurrently.
This PR adds an opt-in `DriftPolicy` field to `NodePool.Spec.Disruption` that lets cluster operators control how drift disruption proceeds across topology domains. When `topologyKey` is set, Karpenter disrupts one domain at a time (finishing replacements in `us-east-1a` before moving to `us-east-1b`, for example) rather than spreading disruption across all zones at once. An optional `maxConcurrentPerDomain` field (absolute or percentage) controls how many nodes can be replaced simultaneously within the active domain.

The active domain is selected alphabetically, which gives stable and predictable behaviour across controller restarts and as replacements progress. Nodes that do not carry the topology label are unaffected and remain eligible for disruption regardless. Without `DriftPolicy` set, behaviour is unchanged from today.

`DriftPolicy` is intentionally separate from `Budget`: budgets control how many disruptions are allowed in total, while `DriftPolicy` controls the order in which domains are disrupted. The two are orthogonal and can be combined.

Documentation update: aws/karpenter-provider-aws#9073
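Resolving a `maxConcurrentPerDomain` that may be either an absolute count or a percentage could be handled as in the sketch below. The helper name and the round-up choice (so a nonzero percentage always permits at least one disruption) are assumptions for illustration; in a real controller this would likely use Kubernetes' IntOrString machinery:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxConcurrent resolves a maxConcurrentPerDomain value against the number
// of nodes in the active domain. "2" means two nodes; "25%" means 25% of
// the domain, rounded up so a nonzero percentage allows at least one
// disruption. Hypothetical helper, not the upstream implementation.
func maxConcurrent(value string, domainSize int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return (domainSize*pct + 99) / 100, nil // ceiling division
	}
	return strconv.Atoi(value)
}

func main() {
	for _, tc := range []struct {
		value string
		size  int
	}{{"2", 10}, {"25%", 10}, {"50%", 3}} {
		n, _ := maxConcurrent(tc.value, tc.size)
		fmt.Printf("%s of %d nodes -> %d\n", tc.value, tc.size, n)
	}
}
```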
How was this change tested?
- DriftPolicy API validation (valid keys, invalid keys, percentage values)
- make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.