Fix the equivalence.PodGroup's mutation during scale up simulations for skipped node groups #9827

Open
shaikenov wants to merge 1 commit into kubernetes:master from shaikenov:shaikenov-fix-eg-mutation-during-extrascaleup-sumulations

@shaikenov

Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

In #9346 there is a bug: the extra scale-up simulations for skipped node groups can mutate the state of the original equivalence pod group (eg), setting it to Schedulable when a skipped node group satisfies the pod predicates. Later code then treats this eg as schedulable, which is not true.

The fix is to run the extra SchedulablePodGroups() simulations on clones of the eg(s).
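
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not the cluster-autoscaler code: `PodGroup` is a simplified stand-in that only borrows the field names visible in this PR, and `simulate`, `cloneAll`, and the node group name `skipped-ng` are hypothetical names invented for the example.

```go
package main

import "fmt"

// PodGroup is a simplified, hypothetical stand-in for equivalence.PodGroup.
// Only the field names come from the PR; the types are assumptions.
type PodGroup struct {
	Name              string
	SchedulingErrors  map[string]string
	SchedulableGroups []string
	Schedulable       bool
}

// Clone returns a copy that shares no map or slice with the receiver.
func (g *PodGroup) Clone() *PodGroup {
	errs := make(map[string]string, len(g.SchedulingErrors))
	for k, v := range g.SchedulingErrors {
		errs[k] = v
	}
	return &PodGroup{
		Name:              g.Name,
		SchedulingErrors:  errs,
		SchedulableGroups: append([]string(nil), g.SchedulableGroups...),
		Schedulable:       g.Schedulable,
	}
}

// cloneAll clones every group so a simulation can mutate the result freely.
func cloneAll(groups []*PodGroup) []*PodGroup {
	out := make([]*PodGroup, 0, len(groups))
	for _, g := range groups {
		out = append(out, g.Clone())
	}
	return out
}

// simulate is a hypothetical stand-in for an extra simulation pass: it marks
// every group it is given as schedulable on nodeGroup, mutating it in place.
func simulate(groups []*PodGroup, nodeGroup string) {
	for _, g := range groups {
		g.Schedulable = true
		g.SchedulableGroups = append(g.SchedulableGroups, nodeGroup)
	}
}

func main() {
	original := []*PodGroup{{Name: "p1", SchedulingErrors: map[string]string{}}}

	// Simulating on clones leaves the original untouched.
	simulate(cloneAll(original), "skipped-ng")
	fmt.Println(original[0].Schedulable) // false

	// Simulating on the originals leaks the result into later decisions.
	simulate(original, "skipped-ng")
	fmt.Println(original[0].Schedulable) // true
}
```

The second call shows the shape of the bug: once the simulation runs on the shared pointers, any later reader of the original groups sees them as schedulable.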

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 16, 2026
@k8s-ci-robot

Contributor

This issue is currently awaiting triage.

If SIG Autoscaling contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2026
@k8s-ci-robot

Contributor

Hi @shaikenov. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-area Indicates that a PR should not merge because it lacks an area label. label Jun 16, 2026
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shaikenov
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 16, 2026
…ions.

In kubernetes#9346 there is a bug which allows the mutation of equivalence pod group (eg) state to become Schedulable if there is a skipped node group which satisfies the pod predicates. Later down the line this eg is considered schedulable which is not true.

The fix is to do the extra SchedulablePodGroups() simulations on the cloned egs.
@shaikenov shaikenov force-pushed the shaikenov-fix-eg-mutation-during-extrascaleup-sumulations branch from 41d3b51 to ac5472f Compare June 16, 2026 19:55
		SchedulableGroups: clonedGroups,
		Schedulable:       eg.Schedulable,
	}
}

Member

Can you add a unit test for this func, to catch any cases that the orchestrator_test.go change will miss?

Contributor

+1
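
As a rough sketch of what such a unit test could check (not the test from this PR), the example below verifies two properties of a clone function: the clone is equal to the original, and mutating the clone leaves the original unchanged. `PodGroup` is again a simplified stand-in, and `clonePodGroup` and `checkClone` are hypothetical names. The field-count guard is one possible way to force an update when a field is added later; it is an assumption of this sketch, not something the PR does.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodGroup is a simplified, hypothetical stand-in for equivalence.PodGroup.
type PodGroup struct {
	SchedulingErrors  map[string]string
	SchedulableGroups []string
	Schedulable       bool
}

// clonePodGroup is a hypothetical clone function that copies field by field.
func clonePodGroup(g *PodGroup) *PodGroup {
	errs := make(map[string]string, len(g.SchedulingErrors))
	for k, v := range g.SchedulingErrors {
		errs[k] = v
	}
	return &PodGroup{
		SchedulingErrors:  errs,
		SchedulableGroups: append([]string(nil), g.SchedulableGroups...),
		Schedulable:       g.Schedulable,
	}
}

// fixture builds a fresh PodGroup with every field populated.
func fixture() *PodGroup {
	return &PodGroup{
		SchedulingErrors:  map[string]string{"ng1": "insufficient cpu"},
		SchedulableGroups: []string{"ng2"},
		Schedulable:       false,
	}
}

// checkClone returns an error if clone drops data or shares state.
func checkClone(clone func(*PodGroup) *PodGroup) error {
	// Guard: fails when a field is added, prompting an update of the clone.
	if n := reflect.TypeOf(PodGroup{}).NumField(); n != 3 {
		return fmt.Errorf("PodGroup has %d fields; update clone and this check", n)
	}
	orig, want := fixture(), fixture()
	c := clone(orig)
	if !reflect.DeepEqual(c, want) {
		return fmt.Errorf("clone differs from original: %+v", c)
	}
	// Mutate every field of the clone, then confirm the original is intact.
	c.SchedulingErrors["ng3"] = "mutated"
	c.SchedulableGroups[0] = "mutated"
	c.Schedulable = true
	if !reflect.DeepEqual(orig, want) {
		return fmt.Errorf("mutating the clone changed the original: %+v", orig)
	}
	return nil
}

func main() {
	fmt.Println(checkClone(clonePodGroup)) // <nil>
}
```

In a real test file the body of `checkClone` would sit inside a `TestXxx(t *testing.T)` function and report through `t`.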

		SchedulingErrors:  clonedErrors,
		SchedulableGroups: clonedGroups,
		Schedulable:       eg.Schedulable,
	}

Member

What if a new field is added to the PodGroup struct in the future? A future author may not realize they also have to add the field here, and it will be silently dropped. Maybe you can start with a shallow copy of the struct and then overwrite SchedulingErrors and SchedulableGroups?

Contributor

@rrangith I think that might actually be less safe than forgetting to add the field here. If someone forgets to add a new field here, that should ideally fail fast and get caught by the tests; even if it slips through the UTs, it will generally be far more obvious where the problem is. With a shallow copy we might end up in the exact same situation we're fixing in this PR.
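
The trade-off in this thread can be sketched in a few lines of Go. This is an illustration under assumptions, not the PR's code: `PodGroup` is a simplified stand-in, and `shallowCopy` and `shallowThenOverwrite` are hypothetical helpers. A plain struct copy duplicates scalar fields but still shares maps and slice backing arrays, so any reference-typed field that is not explicitly overwritten stays aliased to the original.

```go
package main

import "fmt"

// PodGroup is a simplified, hypothetical stand-in for equivalence.PodGroup.
type PodGroup struct {
	SchedulingErrors  map[string]string
	SchedulableGroups []string
	Schedulable       bool
}

// shallowCopy copies the struct value only; maps and slices remain shared.
func shallowCopy(g *PodGroup) *PodGroup {
	c := *g
	return &c
}

// shallowThenOverwrite starts from a struct copy (so newly added scalar
// fields carry over) and then replaces the known reference-typed fields.
func shallowThenOverwrite(g *PodGroup) *PodGroup {
	c := *g
	c.SchedulingErrors = make(map[string]string, len(g.SchedulingErrors))
	for k, v := range g.SchedulingErrors {
		c.SchedulingErrors[k] = v
	}
	c.SchedulableGroups = append([]string(nil), g.SchedulableGroups...)
	return &c
}

// mutate changes every field of g in place.
func mutate(g *PodGroup) {
	g.Schedulable = true
	g.SchedulingErrors["ng2"] = "insufficient cpu"
	g.SchedulableGroups[0] = "changed"
}

func newGroup() *PodGroup {
	return &PodGroup{
		SchedulingErrors:  map[string]string{},
		SchedulableGroups: []string{"ng1"},
	}
}

func main() {
	a := newGroup()
	mutate(shallowCopy(a))
	// The bool is independent, but the map and slice writes leak through.
	fmt.Println(a.Schedulable, len(a.SchedulingErrors), a.SchedulableGroups[0]) // false 1 changed

	b := newGroup()
	mutate(shallowThenOverwrite(b))
	fmt.Println(b.Schedulable, len(b.SchedulingErrors), b.SchedulableGroups[0]) // false 0 ng1
}
```

Both positions show up here: the struct copy keeps scalar fields without listing them, while a map, slice, or pointer field added later would stay shared until someone remembers to overwrite it.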

Comment on lines +2419 to +2425
podsAwaitEvaluation := []string{}
for _, pod := range scaleUpStatus.PodsAwaitEvaluation {
	podsAwaitEvaluation = append(podsAwaitEvaluation, pod.Name)
}
// This assertion ensures that the skipped node group simulation did not mutate the Schedulable status of the original equivalence groups.
// Without cloning, the "partial scale-up successful" case would fail, as p1 would be marked as schedulable on ng1 and would be added to podsAwaitEvaluation.
assert.Empty(t, podsAwaitEvaluation)

Contributor

Correct me if I'm wrong, but isn't this just checking assert.Empty(t, scaleUpStatus.PodsAwaitEvaluation)?

Labels

area/cluster-autoscaler Issues or PRs related to the Cluster Autoscaler component cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

4 participants