AEP 9726 Capacity-Aware In-Place Updates by omerap12 · Pull Request #9757 · kubernetes/autoscaler

omerap12 · 2026-06-05T14:25:32Z

What type of PR is this?

/kind documentation

What this PR does / why we need it:

AEP for #9726

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Signed-off-by: Omer Aplatony <omerap12@gmail.com>

k8s-ci-robot · 2026-06-05T14:25:41Z

This issue is currently awaiting triage.

If SIG Autoscaling contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

omerap12 · 2026-06-05T14:25:42Z

/cc @adrianmoisey @maxcao13

k8s-ci-robot · 2026-06-05T14:25:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omerap12

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Needs approval from an approver in each of these files:

~~vertical-pod-autoscaler/enhancements/OWNERS~~ [omerap12]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adrianmoisey · 2026-06-05T14:31:18Z

+```
+The function then evaluates whether the update mode is InPlace, the feature gate is enabled, and the node has sufficient allocatable capacity for the recommendation:
+```go
+if updateMode == vpa_types.UpdateModeInPlace && node != nil && features.Enabled(features.InPlaceCapacityAware) && !checkAllocatableNodeForInPlace(pod, recommendation, node.Status.Allocatable) {


node.Status.Allocatable contains the allocatable resources, not the available resources, so this value isn't sufficient to determine how much space is available on the node

https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable

'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods.

Correct, but if pods are scheduled on the node, that value doesn't change, so this value is only useful for nodes that don't have other pods (with requests defined) scheduled on them

Not sure I understand. This is the computation I thought we should do: https://github.com/kubernetes/autoscaler/pull/9758/changes#diff-96a6b6ca90d2574cc4e580f87a977e5f37c2e98e51af6964bc48530d330b89f1R349

The original issue says:

Currently, the updater is unaware of the capacity constraints of the node on which a pod is running. As a result, it may attempt an in-place resize without verifying whether the node has sufficient available resources.

I'm arguing what "sufficient available resources" means.

node.Status.Allocatable is what is available for pods, after Kubernetes and the system have taken their share.
Normally this value is close-ish to the total size of the node.

I.e., on my 20GB RAM and 4 CPU VM, I have these:

Capacity:
  cpu: 4
  memory: 20484252Ki
Allocatable:
  cpu: 3920m
  memory: 17445020Ki

GKE has taken away 80 Millicores and 2968MiB from my node for various tasks.
Meaning that the maximum sized pod I can have is 3920 Millicores and 17036MiB.
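The arithmetic behind those two figures can be checked with a short sketch. The constants are copied from the node output quoted in this comment; the `reserved` helper is a hypothetical name used only for illustration.

```go
package main

import "fmt"

// Values copied from the node output quoted in the comment.
const (
	capacityMilliCPU    = 4000     // cpu: 4
	allocatableMilliCPU = 3920     // cpu: 3920m
	capacityMemKi       = 20484252 // memory: 20484252Ki
	allocatableMemKi    = 17445020 // memory: 17445020Ki
)

// reserved returns what is held back from pods: capacity minus allocatable.
// (Hypothetical helper, not part of the VPA codebase.)
func reserved(capacity, allocatable int64) int64 {
	return capacity - allocatable
}

func main() {
	cpu := reserved(capacityMilliCPU, allocatableMilliCPU)
	memMi := reserved(capacityMemKi, allocatableMemKi) / 1024 // Ki -> Mi
	fmt.Printf("reserved: %dm CPU, %dMi memory\n", cpu, memMi)
	// Allocatable is the upper bound for any single pod on this node.
	fmt.Printf("largest possible pod: %dm CPU, %dMi memory\n",
		int64(allocatableMilliCPU), int64(allocatableMemKi)/1024)
}
```

The memory figure for the largest pod is rounded down to whole MiB.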

If I schedule some pods on this node, the "node.Status.Allocatable" value doesn't change, but my available resources decrease (and that isn't a field stored on the node).

If you run a kubectl describe on a node, you see 3 sets of values: capacity, allocatable and allocated.
I had assumed that the plan was to check if there was "available" space (allocatable minus allocated).
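A minimal sketch of that "allocatable minus allocated" check, written against simplified stand-in types rather than the real Kubernetes API objects. `Resources`, `Pod`, and `headroomFor` are hypothetical names, and the sketch assumes the resizing pod's own requests should be excluded from the allocated total, since a resize replaces them.

```go
package main

import "fmt"

// Resources is a simplified stand-in for a Kubernetes ResourceList
// (hypothetical type: CPU in millicores, memory in KiB).
type Resources struct {
	MilliCPU int64
	MemoryKi int64
}

// Pod is a hypothetical, minimal view of a pod scheduled on the node.
type Pod struct {
	Name     string
	Requests Resources
}

// headroomFor returns the largest requests the target pod could hold on the
// node: allocatable minus the requests of every other pod on that node.
// The target's own requests are skipped because a resize replaces them.
func headroomFor(target string, allocatable Resources, pods []Pod) Resources {
	free := allocatable
	for _, p := range pods {
		if p.Name == target {
			continue
		}
		free.MilliCPU -= p.Requests.MilliCPU
		free.MemoryKi -= p.Requests.MemoryKi
	}
	return free
}

func main() {
	allocatable := Resources{MilliCPU: 3920, MemoryKi: 17445020}
	pods := []Pod{
		{Name: "app", Requests: Resources{MilliCPU: 500, MemoryKi: 1048576}},
		{Name: "node-agent", Requests: Resources{MilliCPU: 200, MemoryKi: 524288}},
	}
	fmt.Printf("%+v\n", headroomFor("app", allocatable, pods))
}
```

Computing this in the updater needs the requests of every pod on the node, which is where the pod informer versus direct API call question comes from.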

This AEP seems to be doing its calculation on the allocatable value (which is what I think the PodReasonInfeasible status comes from).

The problem with this: if we cap the pod's resize to the allocatable value, we'll stop getting PodReasonInfeasible from the API server (which is good), but will likely start getting PodReasonDeferred instead. And I assume that if even a single other pod with resource requests is scheduled on the node, the likelihood of this deferred situation ever resolving is very low, especially if the node has a DaemonSet pod scheduled on it (which I assume is a common case).
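The distinction can be sketched as a three-way classification. This assumes, as the comment suggests rather than as a confirmed kubelet rule, that a request above allocatable is reported as infeasible and a request that fits allocatable but not the currently free space is reported as deferred. `Outcome` and `classify` are hypothetical names, and only CPU is modelled.

```go
package main

import "fmt"

// Outcome is a hypothetical label for a pre-checked resize. The names mirror
// the PodReasonInfeasible / PodReasonDeferred statuses discussed above.
type Outcome string

const (
	Feasible   Outcome = "Feasible"
	Deferred   Outcome = "Deferred"
	Infeasible Outcome = "Infeasible"
)

// classify compares a requested CPU size (millicores) against the node.
// othersMilli is the sum of requests of all other pods on the node.
func classify(requestMilli, allocatableMilli, othersMilli int64) Outcome {
	switch {
	case requestMilli > allocatableMilli:
		return Infeasible // can never fit on this node
	case requestMilli > allocatableMilli-othersMilli:
		return Deferred // fits the node, but not next to the other pods
	default:
		return Feasible
	}
}

func main() {
	const allocatable, daemonSet = 3920, 200
	for _, req := range []int64{4000, 3920, 3000} {
		fmt.Printf("%dm -> %s\n", req, classify(req, allocatable, daemonSet))
	}
}
```

With a single 200m DaemonSet pod on the node, a request capped at exactly the allocatable value (3920m) lands in the deferred case, which is the situation described above.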

As a user, I want the VPA to resize the pods to fill up the available space, so that my workloads are protected.

That totally makes sense.

I say let's bring this topic to sig-node - maybe there is another solution here that we are missing. If there is none, we can then debate the best way forward (a cluster-wide pod informer vs. direct API calls to the kube-apiserver). Does that work for you?
@maxcao13

I say let's bring this topic to sig-node - maybe there is another solution here that we are missing.

Agreed, or if there isn't a solution currently, maybe k/k can provide one? (i.e., a status field on the pod with details about what size it could be?)

If there is none, we can then debate the best way forward (a cluster-wide pod informer vs. direct API calls to the kube-apiserver). Does that work for you?

I think we should take a step back and debate if this is even a problem needing to be solved, before we debate on what the solution is.

Agreed, or if there isn't a solution currently, maybe k/k can provide one? (i.e., a status field on the pod with details about what size it could be?)

Yeah. I agree

I think I agree with what Adrian is saying here, in that this may be a premature optimization. IMO, I'd like to avoid making changes or giving more work to people (sig-node) without a clear benefit or a PoC showing the effects of the mitigation.

As Adrian mentioned, it makes sense for the VPA to fill available resource "holes" on a node. For example, if a node has 1 CPU available and the pod recommendation is 2 CPUs, and the VPA is configured with the InPlace update mode, the updater could attempt an in-place resize to 1 CPU instead of simply marking the recommendation as infeasible and leaving that available capacity unused.
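The capping idea in that example reduces to taking the smaller of the recommendation and the available capacity. A minimal sketch, assuming "available" means the largest CPU request the pod could hold on its node; `capRecommendation` is a hypothetical helper, not an existing VPA function.

```go
package main

import "fmt"

// capRecommendation limits a recommended CPU request (millicores) to what
// the node can currently accommodate for this pod.
func capRecommendation(recommendedMilli, availableMilli int64) int64 {
	if recommendedMilli > availableMilli {
		return availableMilli
	}
	return recommendedMilli
}

func main() {
	// The example from the comment: 2 CPUs recommended, 1 CPU available.
	fmt.Printf("resize target: %dm\n", capRecommendation(2000, 1000))
}
```

A recommendation that already fits is passed through unchanged, so the cap only applies when the node is the limiting factor.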

We could also reduce unnecessary API calls. Currently, we only discover that a resize attempt is infeasible by checking the status or receiving an admission error. Wouldn't it make sense for the updater to determine ahead of time that the resize would be infeasible and skip the attempt altogether (or cap it as mentioned above)?

Again, I'm not convinced these features should be implemented. This PR is also intended to start a discussion around whether they make sense and are desirable.

adrianmoisey · 2026-06-05T18:24:20Z

+- Tests are stable for 3 releases.
+- No open bugs against the feature gate.
+- Positive user feedback.


The description here says that this section needs to describe graduation from alpha to beta and then to GA.
The plan isn't clear from these bullet points

Agreed. I should have opened this as a draft; this is just a quick AEP for discussion.

adrianmoisey · 2026-06-05T18:25:48Z

+1. Reduce CPU cycles spent by the updater on infeasible resize attempts.
+2. Reduce API server load incurred by admission checks for infeasible in-place resize requests.


If the goal here is to reduce CPU cycles and API load, I think we need to have a measured before and after to ensure that we're achieving this goal

Yeah, I should have opened this as a draft, since this PR is mainly for discussion.

Not sure I understand, this is a discussion about the AEP

Yeah, what I meant is that this is just a draft.

AEP 9726 Capacity-Aware In-Place Updates

61a6284

Signed-off-by: Omer Aplatony <omerap12@gmail.com>

k8s-ci-robot added the area/vertical-pod-autoscaler and do-not-merge/needs-area labels Jun 5, 2026

k8s-ci-robot removed the do-not-merge/needs-area label Jun 5, 2026

k8s-ci-robot requested review from kwiesmueller and raywainman June 5, 2026 14:25

k8s-ci-robot added the approved label Jun 5, 2026

k8s-ci-robot requested review from adrianmoisey and maxcao13 June 5, 2026 14:25

k8s-ci-robot added the size/L label Jun 5, 2026

adrianmoisey reviewed Jun 5, 2026


omerap12 marked this pull request as draft June 5, 2026 22:41

k8s-ci-robot added the do-not-merge/work-in-progress label Jun 5, 2026


4 participants