feat: Add feature-gated pod deletion cost management controller #2894
nathangeology wants to merge 5 commits into kubernetes-sigs:main
Introduce a new controller that manages pod-deletion-cost annotations on pods scheduled to Karpenter-managed nodes. The annotation influences which pods the ReplicaSet controller prefers to delete during scale-down, so scale-down can be steered toward the nodes Karpenter wants to consolidate.

The controller is gated behind the PodDeletionCostManagement feature flag (default: disabled) and supports configurable ranking strategies:
- Random: random cost assignment
- LargestToSmallest: prefer evicting larger pods first
- SmallestToLargest: prefer evicting smaller pods first
- UnallocatedVCPUPerPodCost: rank by unallocated vCPU per pod

Key components:
- controller.go: reconciliation loop watching Karpenter nodes
- ranking.go: pluggable ranking strategies for cost calculation
- annotation.go: pod annotation management with batch updates
- changedetector.go: optimization to skip unchanged node states
- events.go/metrics.go: observability for ranking operations

Infrastructure changes:
- Add PodDeletionCostManagement feature gate to options.go
- Add ranking strategy and change detection CLI flags
- Register controller conditionally in controllers.go
- Add pod update/patch RBAC to kwok clusterrole
- Add example env vars to kwok chart values
…lysis and recommendations (kp-v9h)
… drift ranking (kp-368)

Remove all ranking strategies (Random, LargestToSmallest, SmallestToLargest, UnallocatedVCPUPerPodCost) and replace them with PodCount-only ranking.

Implement three-tier node partitioning:
1. Drifted nodes (lowest deletion costs): ConditionTypeDrifted=True
2. Normal nodes (middle)
3. Do-not-disrupt nodes (highest deletion costs)

Each tier is sorted by pod count ascending, with a deterministic node name tiebreak.

- Replace the hardcoded BaseRank=-1000 with -n, where n is the total number of managed nodes.
- Remove the PodDeletionCostRankingStrategy option field, CLI flag, and env var.
- Remove the strategy label from metrics.
- Simplify the RankingEngine constructor.
Add comprehensive unit tests covering ranking, annotation management, change detection, and controller reconciliation for the pod deletion cost feature. Tests validate two-tier partitioning, sentinel annotation detection, customer-managed annotation preservation, and change detection optimization.

23 tests across 4 test files:
- ranking_test.go: tier partitioning and rank assignment
- annotation_test.go: sentinel detection and pod annotation updates
- changedetector_test.go: hash-based change detection
- controller_test.go: reconciliation and feature gate behavior
…s, missing pieces (kp-2za)
…README, third-party detection, bounded labeling (kp-2iw)

Fix test compilation:
- Remove RankingStrategyRandom references (symbol removed)
- Use NewRankingEngine() with no args (API simplified)
- Compute BaseRank dynamically as -len(nodes) (not exported)
- Remove PodDeletionCostRankingStrategy from test setup (field removed)
- Add PodDeletionCostManagement to test FeatureGates

Remove orphaned IPVS files:
- Delete ipvs_steadystate.go referencing removed SteadyState labels
- Delete all ipvs_*_test.go files left from IPVS feature removal

Add third-party annotation conflict detection:
- Track lastAssignedValues per pod UID in AnnotationManager
- Detect when current value differs from what Karpenter last set
- Remove sentinel annotation on conflict, skip pod, emit warning event
- Clean up tracking map when pods are deleted

Add bounded node labeling with cleanup:
- Limit annotation updates to top 50 nodes per cycle
- Track previouslyLabeledNodes across reconciles
- Clean up pod annotations when nodes drop out of top-N

Add tests for new features:
- Third-party detection: value modified externally → sentinel removed
- Bounded labeling: nodes under limit → all annotated

Rewrite README to describe PodCount-only with three-tier drift ranking. Remove all references to Random, LargestToSmallest, SmallestToLargest, UnallocatedVCPUPerPodCost strategies.
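The third-party conflict detection in this commit can be pictured with a minimal sketch. It is not the PR's code: `Pod`, `Apply`, `Forget`, and the sentinel key `example.com/managed-deletion-cost` are hypothetical stand-ins (the real sentinel key is not shown here), and the warning event is omitted. Only `AnnotationManager`, `lastAssignedValues`, and the `controller.kubernetes.io/pod-deletion-cost` key come from the PR text.

```go
package main

import "fmt"

const (
	deletionCostKey = "controller.kubernetes.io/pod-deletion-cost"
	// sentinelKey is a hypothetical name; the real sentinel key is not shown in the PR text.
	sentinelKey = "example.com/managed-deletion-cost"
)

// Pod is a simplified stand-in for a Kubernetes pod object.
type Pod struct {
	UID         string
	Annotations map[string]string
}

// AnnotationManager remembers the last value it wrote for each pod UID.
type AnnotationManager struct {
	lastAssignedValues map[string]string
}

func NewAnnotationManager() *AnnotationManager {
	return &AnnotationManager{lastAssignedValues: map[string]string{}}
}

// Apply writes newValue unless the pod is customer-managed or a third party
// changed the value since the last write. It reports whether it wrote.
func (m *AnnotationManager) Apply(p *Pod, newValue string) bool {
	if p.Annotations == nil {
		p.Annotations = map[string]string{}
	}
	current, hasCost := p.Annotations[deletionCostKey]
	_, managed := p.Annotations[sentinelKey]
	if hasCost && !managed {
		return false // existing annotation without the sentinel: never modify
	}
	if last, tracked := m.lastAssignedValues[p.UID]; managed && tracked && current != last {
		// Conflict: yield the pod by dropping the sentinel and the tracking entry.
		delete(p.Annotations, sentinelKey)
		delete(m.lastAssignedValues, p.UID)
		return false
	}
	p.Annotations[deletionCostKey] = newValue
	p.Annotations[sentinelKey] = "true"
	m.lastAssignedValues[p.UID] = newValue
	return true
}

// Forget drops tracking state for a deleted pod.
func (m *AnnotationManager) Forget(uid string) {
	delete(m.lastAssignedValues, uid)
}

func main() {
	m := NewAnnotationManager()
	p := &Pod{UID: "pod-1"}
	fmt.Println(m.Apply(p, "-3"))          // first write succeeds
	p.Annotations[deletionCostKey] = "100" // simulate a third-party edit
	fmt.Println(m.Apply(p, "-2"))          // conflict: sentinel removed, pod skipped
}
```

Keying the map by pod UID rather than by name means a recreated pod with the same name starts with no tracked value, which is why the cleanup on pod deletion matters for keeping the map bounded.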
…date README

- Fix test compilation: remove references to deleted strategy symbols
- Add third-party annotation conflict detection with lastAssignedValues tracking
- Add bounded node labeling (maxNodesPerCycle=50) with orphan cleanup
- Update README to PodCount-only with three-tier drift ranking
- Add tests for conflict detection and bounded labeling
- Clean up event reasons and test options
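Bounded labeling with orphan cleanup amounts to a set difference between the nodes labeled last cycle and the top N this cycle. The sketch below is an assumption about the shape of that logic, not the PR's code: `SelectNodes` and its parameters are hypothetical, and only `maxNodesPerCycle=50` is taken from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// maxNodesPerCycle matches the limit named in the commit message.
const maxNodesPerCycle = 50

// SelectNodes takes node names already ordered by rank (lowest rank first) and
// the set of nodes labeled in the previous cycle. It returns the nodes to
// annotate now and the previously labeled nodes that dropped out of the top N.
func SelectNodes(ranked []string, previouslyLabeled map[string]bool, limit int) ([]string, []string) {
	if limit < 0 {
		limit = 0
	}
	if limit > len(ranked) {
		limit = len(ranked)
	}
	toLabel := ranked[:limit]

	current := make(map[string]bool, limit)
	for _, name := range toLabel {
		current[name] = true
	}

	var toClean []string
	for name := range previouslyLabeled {
		if !current[name] {
			toClean = append(toClean, name)
		}
	}
	sort.Strings(toClean) // map iteration order is random; sort for stable output
	return toLabel, toClean
}

func main() {
	ranked := []string{"node-a", "node-b", "node-c"}
	previous := map[string]bool{"node-a": true, "node-c": true}
	toLabel, toClean := SelectNodes(ranked, previous, 2)
	fmt.Println("annotate:", toLabel)
	fmt.Println("clean up:", toClean)
}
```

With the limit lowered to 2 for illustration, node-c was labeled before but no longer fits, so it is the only node returned for cleanup.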
podutils.HasDoNotDisrupt does not exist on upstream main. Use direct annotation check against v1.DoNotDisruptAnnotationKey instead. Also removes unused podutils import and cleans up go.mod/go.sum.
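A direct annotation check of this kind might look like the sketch below. It uses a local constant in place of `v1.DoNotDisruptAnnotationKey` so that it runs without Karpenter imports. The helper name `hasDoNotDisrupt` is hypothetical, and treating only the value "true" as opting out is an assumption rather than something this PR states.

```go
package main

import "fmt"

// doNotDisruptAnnotationKey is a local stand-in for v1.DoNotDisruptAnnotationKey.
const doNotDisruptAnnotationKey = "karpenter.sh/do-not-disrupt"

// hasDoNotDisrupt reports whether a pod's annotations opt it out of disruption.
// Assumption: only the literal value "true" counts. Reading a nil map is safe in Go.
func hasDoNotDisrupt(annotations map[string]string) bool {
	return annotations[doNotDisruptAnnotationKey] == "true"
}

func main() {
	fmt.Println(hasDoNotDisrupt(map[string]string{doNotDisruptAnnotationKey: "true"}))
	fmt.Println(hasDoNotDisrupt(nil))
}
```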
Pod Deletion Cost Management Controller
RFC: #2935
Status: Ready for review
What This PR Does
Adds a new feature-gated controller (pod.deletioncost) that manages controller.kubernetes.io/pod-deletion-cost annotations on pods running on Karpenter-managed nodes. This bridges the coordination gap between the ReplicaSet controller (which decides which pods to delete during scale-down) and Karpenter's consolidation controller (which decides which nodes to drain).

How It Works
PodCount ranking with three-tier drift partitioning:
1. Partition nodes into three tiers:
   - Drifted nodes (lowest deletion costs): ConditionTypeDrifted=True
   - Normal nodes (middle)
   - Do-not-disrupt nodes (highest deletion costs): nodes with a karpenter.sh/do-not-disrupt pod
2. Sort each tier by pod count ascending (fewest pods = lowest deletion cost = drained first), with deterministic tiebreak by node name.
3. Assign sequential ranks starting at -n (where n = total managed nodes), so the range is [-n, -1].
4. Annotate pods on the top 50 consolidation candidate nodes with their node's rank as the pod-deletion-cost value.

This aligns ReplicaSet scale-down with Karpenter's consolidation goals: scale-down events remove pods from nodes Karpenter wants to consolidate → those nodes empty faster → Karpenter consolidates them with less disruption.
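The ranking steps can be sketched as a single sort followed by sequential rank assignment. This is an illustration under stated assumptions, not the PR's ranking.go: the `Node` type, `tier`, and `RankNodes` are hypothetical names, and letting do-not-disrupt take precedence when a node is also drifted is an assumption the description does not settle.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of the state the controller ranks on.
type Node struct {
	Name         string
	PodCount     int
	Drifted      bool // stands in for ConditionTypeDrifted=True
	DoNotDisrupt bool // node hosts a karpenter.sh/do-not-disrupt pod
}

// tier returns 0 for drifted, 1 for normal, and 2 for do-not-disrupt nodes.
// Assumption: do-not-disrupt wins when a node is both drifted and protected.
func tier(n Node) int {
	switch {
	case n.DoNotDisrupt:
		return 2
	case n.Drifted:
		return 0
	default:
		return 1
	}
}

// RankNodes orders nodes by tier, then pod count ascending, then name, and
// assigns sequential ranks in [-n, -1], where n is the number of nodes.
func RankNodes(nodes []Node) map[string]int {
	sorted := append([]Node(nil), nodes...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if ta, tb := tier(a), tier(b); ta != tb {
			return ta < tb
		}
		if a.PodCount != b.PodCount {
			return a.PodCount < b.PodCount
		}
		return a.Name < b.Name
	})
	ranks := make(map[string]int, len(sorted))
	base := -len(sorted)
	for i, n := range sorted {
		ranks[n.Name] = base + i
	}
	return ranks
}

func main() {
	nodes := []Node{
		{Name: "node-a", PodCount: 5},
		{Name: "node-b", PodCount: 2},
		{Name: "node-c", PodCount: 9, Drifted: true},
		{Name: "node-d", PodCount: 1, DoNotDisrupt: true},
	}
	ranks := RankNodes(nodes)
	for _, n := range nodes {
		fmt.Println(n.Name, ranks[n.Name])
	}
}
```

In this example the drifted node-c ranks first at -4 despite having the most pods, and node-d ranks last at -1 despite having the fewest, because tier membership is compared before pod count.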
Feature Gate
Key Design Decisions
PodCount-only strategy: Ranks nodes by total pod count, matching Karpenter's own consolidation candidate sorting. No configurable strategies — simplicity over flexibility.
Three-tier drift priority: Drifted nodes get the lowest deletion costs so RS scale-down naturally drains them first, helping both consolidation and drift progress in a single action.
Bounded labeling (top 50 nodes): Only the top 50 consolidation candidate nodes are annotated per cycle. Nodes that drop out of the top 50 have their annotations cleaned up. This bounds API server write load for large clusters.
Third-party annotation conflict detection: Tracks the last value Karpenter set on each pod. If a third-party controller modifies a Karpenter-managed annotation, Karpenter detects the change, removes its sentinel annotation, and yields management of that pod to the third party.
Customer annotation protection: Pods with an existing pod-deletion-cost annotation but without Karpenter's sentinel are never modified.
Change detection: A SHA-256 hash of cluster state skips annotation updates when nothing has changed (zero API writes in steady state).
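Hash-based change detection can be sketched as follows. Which fields the real changedetector.go feeds into the hash is not stated here, so the `NodeState` fields below are assumptions chosen to match the ranking inputs, and `ChangeDetector`, `hashState`, and `HasChanged` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// NodeState holds the inputs assumed to affect ranking (hypothetical field set).
type NodeState struct {
	Name         string
	PodCount     int
	Drifted      bool
	DoNotDisrupt bool
}

// ChangeDetector remembers the hash of the last state it saw.
type ChangeDetector struct {
	lastHash string
}

// hashState produces an order-independent SHA-256 digest of the node states.
func hashState(nodes []NodeState) string {
	sorted := append([]NodeState(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := sha256.New()
	for _, n := range sorted {
		// One line per node; Kubernetes names cannot contain '|' or newlines.
		fmt.Fprintf(h, "%s|%d|%t|%t\n", n.Name, n.PodCount, n.Drifted, n.DoNotDisrupt)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// HasChanged reports whether the state differs from the previous call and
// records the new hash when it does.
func (d *ChangeDetector) HasChanged(nodes []NodeState) bool {
	sum := hashState(nodes)
	if sum == d.lastHash {
		return false
	}
	d.lastHash = sum
	return true
}

func main() {
	d := &ChangeDetector{}
	state := []NodeState{{Name: "node-a", PodCount: 3}, {Name: "node-b", PodCount: 1}}
	fmt.Println(d.HasChanged(state)) // first observation counts as a change
	fmt.Println(d.HasChanged(state)) // identical state: skip annotation updates
}
```

Sorting by name before hashing keeps the digest stable when the API returns nodes in a different order, so a reordered but otherwise identical list does not trigger writes.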
What's Included
- ranking.go: PodCount ranking with three-tier drift partitioning
- annotation.go: Safe pod annotation updates with third-party conflict detection and bounded cleanup
- changedetector.go: Hash-based optimization to skip unchanged state
- controller.go: Orchestrates ranking, bounded labeling (top 50), and orphan cleanup
- events.go: Events for ranking completion, failures, conflict detection
- metrics.go: Prometheus metrics for nodes ranked, pods updated, ranking duration
- ranking_test.go, annotation_test.go, changedetector_test.go, controller_test.go: Unit tests
- Feature gate: PodDeletionCostManagement (default: false)
- RBAC: pod update and patch verbs