Skip to content

test/integration: synchronize scheduler shutdown to fix metrics data race#138312

Open
mm4tt wants to merge 1 commit intokubernetes:masterfrom
mm4tt:fix-scheduler-perf-race
Open

test/integration: synchronize scheduler shutdown to fix metrics data race#138312
mm4tt wants to merge 1 commit intokubernetes:masterfrom
mm4tt:fix-scheduler-perf-race

Conversation

@mm4tt
Copy link
Copy Markdown
Contributor

@mm4tt mm4tt commented Apr 10, 2026

What type of PR is this?

/kind failing-test
/kind flake

What this PR does / why we need it:

This PR fixes a data race detected in TestSchedulerPerf (e.g., during preemption scenarios as reported in #137328).

The race occurred because legacyregistry.Reset() was being called at the end of a workload while the scheduler goroutine (launched via sched.Run(tCtx)) was still active. Even after workload evaluation is done, the scheduler could still perform a few more operations like updating metrics. If these updates happened concurrently with Reset(), it triggered the race detector.

Which issue(s) this PR is related to:

Fixes #137328.

Special notes for your reviewer:

Key changes:
  • Refactored test/integration/util.StartScheduler into StartSchedulerWithDone. The new function returns a <-chan struct{} that is closed when the scheduler's goroutine actually terminates.
  • Updated mustSetupCluster, setupTestCase, and setupClusterForWorkload in the scheduler_perf package to propagate this synchronization channel.
  • Updated RunIntegrationPerfScheduling and RunBenchmarkPerfScheduling to explicitly wait for the scheduler to stop (<-schedulerDone) after cancelling the context and before calling legacyregistry.Reset().

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


…race

Wait for the scheduler goroutine to exit before resetting global metrics.
Previously, metrics updates from a still-running scheduler could race
with legacyregistry.Reset() at the end of a performance test workload.

This change refactors StartScheduler into StartSchedulerWithDone to
provide a synchronization channel that is closed when the scheduler
actually stops, ensuring a clean teardown before registry cleanup.

Fixes kubernetes#137328
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Keywords which can automatically close issues and hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • d0f7cf3 test/integration: synchronize scheduler shutdown to fix metrics data race
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Please note that we're already in Code Freeze for the upcoming v1.36.0 release.

Adding the milestone to this PR is strictly prohibited without proper approval. If this PR needs to be included in the v1.36.0 release:

  1. Technical review: get the PR reviewed and approved as usual (/lgtm and /approve)
  2. Inclusion in release: ping @sig-release-leads on the #sig-release Slack channel and suggest to add the v1.36.0 milestone to the PR

We're also in Test Freeze for the release-1.36 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.36.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Fri Apr 10 05:33:33 UTC 2026.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@k8s-ci-robot k8s-ci-robot requested review from SataQiu and damemi April 10, 2026 08:17
@k8s-ci-robot k8s-ci-robot added area/test sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 10, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Apr 10, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mm4tt
Once this PR has been reviewed and has the lgtm label, please assign liggitt for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mm4tt
Copy link
Copy Markdown
Contributor Author

mm4tt commented Apr 10, 2026

/assign @macsko

@macsko
Copy link
Copy Markdown
Member

macsko commented Apr 10, 2026

/test pull-kubernetes-scheduler-perf
/test pull-kubernetes-integration-race

@mm4tt
Copy link
Copy Markdown
Contributor Author

mm4tt commented Apr 10, 2026

/retest

2 similar comments
@mm4tt
Copy link
Copy Markdown
Contributor Author

mm4tt commented Apr 10, 2026

/retest

@mm4tt
Copy link
Copy Markdown
Contributor Author

mm4tt commented Apr 10, 2026

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

DATA RACE (and panic?!): TestSchedulerPerf/PreemptionAsync/5Nodes_AsyncAPICallsEnabled

3 participants