AEP 9726 Capacity-Aware In-Place Updates#9757
Conversation
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
|
This issue is currently awaiting triage. If SIG Autoscaling contributors determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: omerap12 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| ``` | ||
| The function then evaluates whether the update mode is InPlace, the feature gate is enabled, and the node has sufficient allocatable capacity for the recommendation: | ||
| ```go | ||
| if updateMode == vpa_types.UpdateModeInPlace && node != nil && features.Enabled(features.InPlaceCapacityAware) && !checkAllocatableNodeForInPlace(pod, recommendation, node.Status.Allocatable) { |
There was a problem hiding this comment.
node.Status.Allocatable contains the allocatable resources, not the available resources, so this value isn't sufficient to determine how much space is available on the node
There was a problem hiding this comment.
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable
'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are available for pods.
There was a problem hiding this comment.
Correct, but if pods are scheduled on the node, that value doesn't change, so this value is only useful for nodes that don't have other pods (with requests defined) scheduled on them
There was a problem hiding this comment.
Not sure I understand. this is the computation I thought we should do: https://github.com/kubernetes/autoscaler/pull/9758/changes#diff-96a6b6ca90d2574cc4e580f87a977e5f37c2e98e51af6964bc48530d330b89f1R349
There was a problem hiding this comment.
The original issue says:
Currently, the updater is unaware of the capacity constraints of the node on which a pod is running. As a result, it may attempt an in-place resize without verifying whether the node has sufficient available resources.
I'm arguing what "sufficient available resources" means.
node.Status.Allocatable is what is available for pods, after Kubernetes and the system have taken their share.
Normally this value is close-ish to the total size of the node.
Ie, on my 20GB RAM and 4 CPU VM, I have these:
Capacity:
cpu: 4
memory: 20484252Ki
Allocatable:
cpu: 3920m
memory: 17445020Ki
GKE has taken away 80 Millicores and 2968MiB from my node for various tasks.
Meaning that the maximum sized pod I can have is 3920 Millicores and 17036MiB.
If I schedule some pods on this node, the "node.Status.Allocatable" value doesn't change, but my available resources decreases (which isn't a field stored on the node).
If you run a kubectl describe on a node, you see 3 sets of values: capacity, allocatable and allocated.
I had assume that the plan was to check if there was "available" space (allocatable minus allocated).
This AEP seems to be doing its calculation on the allocatable value (which is what I think the PodReasonInfeasible status comes from).
The problem with this, if we cap the pod's resize to the allocatable value, when we'll stop getting PodReasonInfeasible from API server (which is good), but will likely start getting PodReasonDeferred. And I assume that if a single pod is scheduled on the node (that has resources allocated) the likelihood of this deferred situation ever resolving is very low, especially if the node had a DaemonSet pod schedule on it (which I assume is a common use case).
There was a problem hiding this comment.
As a user I want the VPA to resize that pods to fill up the available space, so that my workloads are protected.
That's totally makes sense.
I say let's bring this topic to sig-node - maybe there is another solution here that we are missing. if there is none we can then debate what is the best way forward (which is using a cluster wide pod informer vs direct API calls to the Kube api server ). does that work for you?
@maxcao13
There was a problem hiding this comment.
I say let's bring this topic to sig-node - maybe there is another solution here that we are missing.
Agreed, or if there isn't a solution currently, may be k/k can provide one? (ie: a status field on the pod with details about what size it could be?)
if there is none we can then debate what is the best way forward (which is using a cluster wide pod informer vs direct API calls to the Kube api server ). does that work for you?
I think we should take a step back and debate if this is even a problem needing to be solved, before we debate on what the solution is.
There was a problem hiding this comment.
Agreed, or if there isn't a solution currently, may be k/k can provide one? (ie: a status field on the pod with details about what size it could be?
Yeah. I agree
There was a problem hiding this comment.
I think I agree with Adrian is saying here, with that this may be a premature optimization. IMO I'd like to avoid making changes or giving more work to people (sig-node) without a clear benefit or a PoC showing the effects of the mitigation.
There was a problem hiding this comment.
-
As Adrian mentioned, it makes sense for the VPA to fill available resource "holes" on a node. For example, if a node has 1 CPU available and the pod recommendation is 2 CPUs, and the VPA is configured with
updatePolicy: InPlace, the updater could attempt an in-place resize to 1 CPU instead of simply marking the recommendation as infeasible and leaving that available capacity unused. -
We could also reduce unnecessary API calls. Currently, we only discover that a resize attempt is infeasible by checking the status or receiving an admission error. Wouldn't it make sense for the updater to determine ahead of time that the resize would be infeasible and skip the attempt altogether (or cap it as mentioned above)?
Again, I'm not convinced these features should be implemented. This PR is also intended to start a discussion around whether they make sense and are desirable.
| - Tests are stable for 3 releases. | ||
| - No open bugs against the feature gate. | ||
| - Positive user feedback. |
There was a problem hiding this comment.
The description here says that this section needs to describe graduation from alpha to beta and then to GA.
The plan isn't clear from these bullet points
There was a problem hiding this comment.
Agree. I should have opened this as a draft this is just a quick AEP for discussion.
| 1. Reduce CPU cycles spent by the updater on infeasible resize attempts. | ||
| 2. Reduce API server load incurred by admission checks for infeasible in-place resize requests. |
There was a problem hiding this comment.
If the goal here is to reduce CPU cycles and API load, I think we need to have a measured before and after to ensure that we're achieving this goal
There was a problem hiding this comment.
Yeah I should have opened this as a draft since this is just a PR mainly for discussion.
There was a problem hiding this comment.
Not sure I understand, this is a discussion about the AEP
There was a problem hiding this comment.
Yeah, what I meant that is just a draft.
What type of PR is this?
/kind documentation
What this PR does / why we need it:
AEP for #9726
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: