feat: Add feature-gated pod deletion cost management controller #2894

Draft
nathangeology wants to merge 5 commits into kubernetes-sigs:main from nathangeology:feat/pod-deletion-cost-management

nathangeology commented Mar 5, 2026

Pod Deletion Cost Management Controller

RFC: #2935
Status: Ready for review

What This PR Does

Adds a new feature-gated controller (pod.deletioncost) that manages controller.kubernetes.io/pod-deletion-cost annotations on pods running on Karpenter-managed nodes. This bridges the coordination gap between the ReplicaSet controller (which decides which pods to delete during scale-down) and Karpenter's consolidation controller (which decides which nodes to drain).

How It Works

PodCount ranking with three-tier drift partitioning:

  1. Partition nodes into three tiers:

    • Tier 1 (lowest cost): drifted nodes (ConditionTypeDrifted=True)
    • Tier 2 (middle): normal nodes (not drifted, no do-not-disrupt pods)
    • Tier 3 (highest cost): do-not-disrupt nodes (at least one pod annotated with karpenter.sh/do-not-disrupt)
  2. Sort each tier by pod count ascending (fewest pods = lowest deletion cost = drained first), with deterministic tiebreak by node name.

  3. Assign sequential ranks starting at -n (where n = total managed nodes), so the range is [-n, -1].

  4. Annotate pods on the top 50 consolidation candidate nodes with their node's rank as the pod-deletion-cost value.

This aligns ReplicaSet scale-down with Karpenter's consolidation goals: scale-down events remove pods from nodes Karpenter wants to consolidate → those nodes empty faster → Karpenter consolidates them with less disruption.
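
The ranking steps above can be sketched roughly as follows. This is a minimal illustration, not the code in ranking.go: the Node type, the tier helper, and RankNodes are hypothetical names, and placing a node that is both drifted and do-not-disrupt into Tier 3 is an assumption the description does not settle.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, simplified view of a Karpenter-managed node.
type Node struct {
	Name         string
	PodCount     int
	Drifted      bool // stands in for ConditionTypeDrifted=True
	DoNotDisrupt bool // at least one karpenter.sh/do-not-disrupt pod
}

// tier maps a node to its partition: 1 = drifted, 2 = normal,
// 3 = do-not-disrupt. Letting do-not-disrupt win over drifted is an
// assumption made for this sketch.
func tier(n Node) int {
	switch {
	case n.DoNotDisrupt:
		return 3
	case n.Drifted:
		return 1
	default:
		return 2
	}
}

// RankNodes returns node name -> deletion cost in the range [-n, -1].
func RankNodes(nodes []Node) map[string]int {
	sorted := make([]Node, len(nodes))
	copy(sorted, nodes)
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if ta, tb := tier(a), tier(b); ta != tb {
			return ta < tb // lower tier first
		}
		if a.PodCount != b.PodCount {
			return a.PodCount < b.PodCount // fewest pods first
		}
		return a.Name < b.Name // deterministic tiebreak
	})
	ranks := make(map[string]int, len(sorted))
	for i, n := range sorted {
		ranks[n.Name] = -len(sorted) + i
	}
	return ranks
}

func main() {
	nodes := []Node{
		{Name: "node-a", PodCount: 5},
		{Name: "node-b", PodCount: 2},
		{Name: "node-c", PodCount: 9, Drifted: true},
		{Name: "node-d", PodCount: 1, DoNotDisrupt: true},
	}
	ranks := RankNodes(nodes)
	for _, n := range nodes {
		fmt.Printf("%s -> %d\n", n.Name, ranks[n.Name])
	}
}
```

With these four nodes the drifted node-c gets -4 despite having the most pods, node-b and node-a follow at -3 and -2, and node-d gets -1.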

Feature Gate

--feature-gates=PodDeletionCostManagement=true

Key Design Decisions

PodCount-only strategy: Ranks nodes by total pod count, matching Karpenter's own consolidation candidate sorting. No configurable strategies — simplicity over flexibility.

Three-tier drift priority: Drifted nodes get the lowest deletion costs so RS scale-down naturally drains them first, helping both consolidation and drift progress in a single action.

Bounded labeling (top 50 nodes): Only the top 50 consolidation candidate nodes are annotated per cycle. Nodes that drop out of the top 50 have their annotations cleaned up. This bounds API server write load for large clusters.
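
A rough sketch of the bounded selection and cleanup, assuming "top 50" means the first 50 nodes in rank order. SelectNodes and its parameters are hypothetical; only the limit of 50 and the idea of tracking previously labeled nodes come from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// maxNodesPerCycle mirrors the limit stated in this PR.
const maxNodesPerCycle = 50

// SelectNodes takes node names already sorted by rank (lowest deletion
// cost first) and the set of nodes labeled in the previous cycle. It
// returns the nodes to annotate now and the nodes whose pod annotations
// should be cleaned up because they dropped out of the window.
func SelectNodes(ranked []string, previous map[string]bool, limit int) (annotate, cleanup []string) {
	if limit < 0 {
		limit = 0
	}
	if limit > len(ranked) {
		limit = len(ranked)
	}
	annotate = ranked[:limit]

	current := make(map[string]bool, limit)
	for _, name := range annotate {
		current[name] = true
	}
	for name := range previous {
		if !current[name] {
			cleanup = append(cleanup, name)
		}
	}
	sort.Strings(cleanup) // stable output regardless of map iteration order
	return annotate, cleanup
}

func main() {
	ranked := []string{"node-a", "node-b", "node-c"}
	previous := map[string]bool{"node-a": true, "node-d": true}

	// A limit of 2 is used here only to keep the example small.
	annotate, cleanup := SelectNodes(ranked, previous, 2)
	fmt.Println("annotate:", annotate)
	fmt.Println("cleanup:", cleanup)
	fmt.Println("production limit:", maxNodesPerCycle)
}
```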

Third-party annotation conflict detection: Tracks the last value Karpenter set on each pod. If a third-party controller modifies a Karpenter-managed annotation, Karpenter detects the change, removes its sentinel annotation, and yields management of that pod to the third party.

Customer annotation protection: Pods with an existing pod-deletion-cost annotation but without Karpenter's sentinel are never modified.
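
As a small illustration of the protection rule, the ownership check reduces to two annotation lookups. The sentinel key below is a placeholder invented for this sketch; this description does not name the real key.

```go
package main

import "fmt"

const (
	// Annotation managed by this controller, as named in this PR.
	deletionCostKey = "controller.kubernetes.io/pod-deletion-cost"
	// Placeholder sentinel key; the real key is not stated here.
	sentinelKey = "example.com/deletion-cost-managed"
)

// shouldManage reports whether the controller may write the deletion
// cost: either the pod has no cost yet, or the sentinel marks the
// existing cost as one the controller set itself.
func shouldManage(annotations map[string]string) bool {
	_, hasCost := annotations[deletionCostKey]
	_, hasSentinel := annotations[sentinelKey]
	return !hasCost || hasSentinel
}

func main() {
	customer := map[string]string{deletionCostKey: "500"}
	managed := map[string]string{deletionCostKey: "-3", sentinelKey: "true"}
	fmt.Println(shouldManage(customer), shouldManage(managed), shouldManage(nil))
}
```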

Change detection: A SHA-256 hash of the cluster state lets the controller skip annotation updates when nothing has changed (zero API writes in steady state).
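
A minimal sketch of hash-based change detection. Which fields feed the hash is an assumption (the inputs used by ranking are chosen here), and NodeState, HashState, and ChangeDetector are hypothetical names rather than the contents of changedetector.go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// NodeState holds the ranking inputs for one node (assumed field set).
type NodeState struct {
	Name         string
	PodCount     int
	Drifted      bool
	DoNotDisrupt bool
}

// HashState returns a SHA-256 digest over the nodes in name order, so
// the result does not depend on the order the nodes were listed in.
func HashState(nodes []NodeState) string {
	sorted := append([]NodeState(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	h := sha256.New()
	for _, n := range sorted {
		fmt.Fprintf(h, "%s|%d|%t|%t\n", n.Name, n.PodCount, n.Drifted, n.DoNotDisrupt)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// ChangeDetector remembers the digest from the previous cycle.
type ChangeDetector struct {
	last string
}

// Changed reports whether the state differs from the previous call and
// stores the new digest when it does. The first call always reports true.
func (c *ChangeDetector) Changed(nodes []NodeState) bool {
	h := HashState(nodes)
	if h == c.last {
		return false
	}
	c.last = h
	return true
}

func main() {
	d := &ChangeDetector{}
	state := []NodeState{{Name: "node-a", PodCount: 2}, {Name: "node-b", PodCount: 5}}
	fmt.Println(d.Changed(state)) // first observation
	fmt.Println(d.Changed(state)) // steady state, no writes needed
	state[0].PodCount = 3
	fmt.Println(d.Changed(state)) // pod count changed
}
```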

What's Included

  • ranking.go — PodCount ranking with three-tier drift partitioning
  • annotation.go — Safe pod annotation updates with third-party conflict detection and bounded cleanup
  • changedetector.go — Hash-based optimization to skip unchanged state
  • controller.go — Orchestrates ranking, bounded labeling (top 50), and orphan cleanup
  • events.go — Events for ranking completion, failures, conflict detection
  • metrics.go — Prometheus metrics for nodes ranked, pods updated, ranking duration
  • ranking_test.go, annotation_test.go, changedetector_test.go, controller_test.go — Unit tests
  • Feature gate PodDeletionCostManagement (default: false)
  • RBAC: adds the update and patch verbs on pods

Introduce a new controller that manages pod-deletion-cost annotations
on pods scheduled to Karpenter-managed nodes. This influences which
pods the ReplicaSet controller prefers to delete during scale-down,
enabling smarter disruption decisions.

The controller is gated behind the PodDeletionCostManagement feature
flag (default: disabled) and supports configurable ranking strategies:
- Random: random cost assignment
- LargestToSmallest: prefer evicting larger pods first
- SmallestToLargest: prefer evicting smaller pods first
- UnallocatedVCPUPerPodCost: rank by unallocated vCPU per pod

Key components:
- controller.go: reconciliation loop watching Karpenter nodes
- ranking.go: pluggable ranking strategies for cost calculation
- annotation.go: pod annotation management with batch updates
- changedetector.go: optimization to skip unchanged node states
- events.go/metrics.go: observability for ranking operations

Infrastructure changes:
- Add PodDeletionCostManagement feature gate to options.go
- Add ranking strategy and change detection CLI flags
- Register controller conditionally in controllers.go
- Add pod update/patch RBAC to kwok clusterrole
- Add example env vars to kwok chart values
k8s-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Mar 5, 2026
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nathangeology
Once this PR has been reviewed and has the lgtm label, please assign jmdeal for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Mar 5, 2026
@k8s-ci-robot

Hi @nathangeology. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Mar 5, 2026
nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 17, 2026
… drift ranking (kp-368)

Remove all ranking strategies (Random, LargestToSmallest, SmallestToLargest,
UnallocatedVCPUPerPodCost) and replace with PodCount-only ranking.

Implement three-tier node partitioning:
1. Drifted nodes (lowest deletion costs) - ConditionTypeDrifted=True
2. Normal nodes (middle)
3. Do-not-disrupt nodes (highest deletion costs)

Each tier sorted by pod count ascending with deterministic node name tiebreak.

Replace hardcoded BaseRank=-1000 with -n where n is total managed nodes.

Remove PodDeletionCostRankingStrategy option field, CLI flag, and env var.
Remove strategy label from metrics. Simplify RankingEngine constructor.

Add comprehensive unit tests covering ranking, annotation management,
change detection, and controller reconciliation for the pod deletion
cost feature. Tests validate two-tier partitioning, sentinel annotation
detection, customer-managed annotation preservation, and change
detection optimization.

23 tests across 4 test files:
- ranking_test.go: tier partitioning and rank assignment
- annotation_test.go: sentinel detection and pod annotation updates
- changedetector_test.go: hash-based change detection
- controller_test.go: reconciliation and feature gate behavior
nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 17, 2026
nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 17, 2026
nathangeology added a commit to nathangeology/karpenter-core that referenced this pull request Apr 17, 2026
…README, third-party detection, bounded labeling (kp-2iw)

Fix test compilation:
- Remove RankingStrategyRandom references (symbol removed)
- Use NewRankingEngine() with no args (API simplified)
- Compute BaseRank dynamically as -len(nodes) (not exported)
- Remove PodDeletionCostRankingStrategy from test setup (field removed)
- Add PodDeletionCostManagement to test FeatureGates

Remove orphaned IPVS files:
- Delete ipvs_steadystate.go referencing removed SteadyState labels
- Delete all ipvs_*_test.go files left from IPVS feature removal

Add third-party annotation conflict detection:
- Track lastAssignedValues per pod UID in AnnotationManager
- Detect when current value differs from what Karpenter last set
- Remove sentinel annotation on conflict, skip pod, emit warning event
- Clean up tracking map when pods are deleted

Add bounded node labeling with cleanup:
- Limit annotation updates to top 50 nodes per cycle
- Track previouslyLabeledNodes across reconciles
- Clean up pod annotations when nodes drop out of top-N

Add tests for new features:
- Third-party detection: value modified externally → sentinel removed
- Bounded labeling: nodes under limit → all annotated

Rewrite README to describe PodCount-only with three-tier drift ranking.
Remove all references to Random, LargestToSmallest, SmallestToLargest,
UnallocatedVCPUPerPodCost strategies.
…date README

- Fix test compilation: remove references to deleted strategy symbols
- Add third-party annotation conflict detection with lastAssignedValues tracking
- Add bounded node labeling (maxNodesPerCycle=50) with orphan cleanup
- Update README to PodCount-only with three-tier drift ranking
- Add tests for conflict detection and bounded labeling
- Clean up event reasons and test options
podutils.HasDoNotDisrupt does not exist on upstream main. Use direct
annotation check against v1.DoNotDisruptAnnotationKey instead. Also
removes unused podutils import and cleans up go.mod/go.sum.
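
For reference, the direct annotation check this commit describes might look roughly like the sketch below. It works on plain annotation maps so it stays self-contained; the local constant stands in for v1.DoNotDisruptAnnotationKey, and treating only the value "true" as opted in is an assumption. The helper names are hypothetical.

```go
package main

import "fmt"

// Local stand-in for v1.DoNotDisruptAnnotationKey; the key itself is
// the one named in the PR description.
const doNotDisruptAnnotationKey = "karpenter.sh/do-not-disrupt"

// hasDoNotDisrupt checks a single pod's annotations. Reading from a nil
// map is safe in Go and yields the empty string.
func hasDoNotDisrupt(annotations map[string]string) bool {
	return annotations[doNotDisruptAnnotationKey] == "true"
}

// nodeHasDoNotDisruptPod reports whether any pod on a node opts out of
// disruption, which is what places the node in the do-not-disrupt tier.
func nodeHasDoNotDisruptPod(podAnnotations []map[string]string) bool {
	for _, a := range podAnnotations {
		if hasDoNotDisrupt(a) {
			return true
		}
	}
	return false
}

func main() {
	pods := []map[string]string{
		nil,
		{doNotDisruptAnnotationKey: "true"},
	}
	fmt.Println(nodeHasDoNotDisruptPod(pods))
}
```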
Labels

  • cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
  • do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress)
  • needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test)
  • size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)

2 participants