[Resource quotas] Redistribute unclaimed capacity to similar node groups

/area cluster-autoscaler

https://github.com/kubernetes/autoscaler/pull/9494 discovered and fixed a bug in granular resource quotas: balancing across similar node groups didn't respect the resource quotas of those node groups. The fix caps the scale-ups in the similar node groups to their corresponding quotas **after** balancing.

As discussed in https://github.com/kubernetes/autoscaler/pull/9494#discussion_r3139063418, that leads to suboptimal results. Example scenario: CapacityQuotas are set to 3 nodes per zone, and CA picks up unschedulable pods that need 9 new nodes. In theory, this can be satisfied within a single scale-up loop, but `applyLimits` will limit the node count to 3. If I'm not mistaken, if node groups' max sizes were used instead of capacity quotas, each node group would get 3 new nodes.

Similarly, if zone a has 5 nodes remaining in its quota, and zones b and c each have 1 node remaining, the current scale-up logic will:
- pick some node group as the best option (honestly, I'm not sure which one; probably none of them will score better than the others)
- if zone a is picked, the scale-up will be capped to 5 due to quotas
- balancing will spread the scale-up across the zones, so we will get something like (2, 2, 1)
- the scale-up in zone b will be capped to 1 due to quotas, so the final scale-up will be (2, 1, 1)
- if zone b or c is picked in the first step instead, we get only 1 node in the scale-up
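
The steps above can be reproduced with a small simulation. This is only a sketch of the flow as described in this issue, not the actual Cluster Autoscaler code: `evenSplit` and `currentFlow` are hypothetical names, the balancing is simplified to an even split that starts at the picked group, and all groups are assumed to have the same current size.

```go
package main

import "fmt"

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// evenSplit (hypothetical) spreads total nodes across n groups as evenly
// as possible, handing out the first node to the group at index start.
func evenSplit(total, n, start int) []int {
	plan := make([]int, n)
	for i := 0; i < total; i++ {
		plan[(start+i)%n]++
	}
	return plan
}

// currentFlow (hypothetical) mimics the three steps described above:
//  1. cap the scale-up to the remaining quota of the picked group,
//  2. balance the capped count across all similar groups,
//  3. cap each group's share to its own remaining quota.
// Whatever step 3 cuts off is lost instead of being redistributed.
func currentFlow(needed, picked int, remaining []int) []int {
	total := minInt(needed, remaining[picked])
	plan := evenSplit(total, len(remaining), picked)
	for i := range plan {
		plan[i] = minInt(plan[i], remaining[i])
	}
	return plan
}

func main() {
	remaining := []int{5, 1, 1} // remaining quota in zones a, b, c

	fmt.Println(currentFlow(9, 0, remaining))         // zone a picked: [2 1 1]
	fmt.Println(currentFlow(9, 1, remaining))         // zone b picked: [0 1 0]
	fmt.Println(currentFlow(9, 0, []int{3, 3, 3}))    // first scenario: [1 1 1]
}
```

With 7 nodes of quota still unclaimed in total, none of the three runs gets close to using it in a single loop.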

We can see that the optimal outcome would be to claim all the remaining quota and initiate a (5, 1, 1) scale-up. This is how the `NodeGroup.MaxSize()` logic works. We should probably drop `applyLimits` and handle quotas the same way we handle node groups' max sizes:

- https://github.com/kubernetes/autoscaler/blob/91080c84cfdfdd29c24e767d26977311c5a17ab1/cluster-autoscaler/estimator/sng_capacity_threshold.go#L48
- https://github.com/kubernetes/autoscaler/blob/91080c84cfdfdd29c24e767d26977311c5a17ab1/cluster-autoscaler/processors/nodegroupset/balancing_processor.go#L95
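
To illustrate the proposed behavior, here is a minimal sketch of balancing that treats each group's remaining quota as an upper bound, in the same spirit as a max-size limit. It does not reproduce the linked estimator or balancing processor code: `quotaAwareSplit` is a hypothetical name, and the round-robin distribution assumes all groups start at the same size.

```go
package main

import "fmt"

// quotaAwareSplit (hypothetical) hands out nodes one at a time, round-robin,
// skipping any group that has already used up its remaining quota. Capacity
// that a full group cannot take is redistributed to groups that still have
// room, until either the demand is met or every group is at its quota.
func quotaAwareSplit(needed int, remaining []int) []int {
	plan := make([]int, len(remaining))
	for needed > 0 {
		progressed := false
		for i := range plan {
			if needed == 0 {
				break
			}
			if plan[i] < remaining[i] {
				plan[i]++
				needed--
				progressed = true
			}
		}
		if !progressed {
			break // every group is at its quota; the rest stays unsatisfied
		}
	}
	return plan
}

func main() {
	fmt.Println(quotaAwareSplit(9, []int{5, 1, 1})) // [5 1 1]
	fmt.Println(quotaAwareSplit(9, []int{3, 3, 3})) // [3 3 3]
	fmt.Println(quotaAwareSplit(4, []int{5, 1, 1})) // [2 1 1]
}
```

In this sketch the result no longer depends on which group was picked as the best option, because the quota is applied while balancing rather than before and after it.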

[Resource quotas] Redistribute unclaimed capacity to similar node groups #9567
