fix: support multi-zone topology scheduling for persistent volumes #2743

Open
tallaxes wants to merge 12 commits into kubernetes-sigs:main from tallaxes:tallaxes/muli-zone-pv-scheduling

Conversation

@tallaxes
Contributor

Fixes #2742

Description

Fixes volume topology scheduling to correctly handle multiple topology terms in PersistentVolumes and StorageClasses.

  • Refactor VolumeTopology.getRequirements() to return NodeSelectorTerm slices instead of flat NodeSelectorRequirement slices to preserve OR semantics across multiple topology terms
  • Update getStorageClassRequirements() to process all allowed topologies, not just the first one
  • Update getPersistentVolumeRequirements() to process all node selector terms, not just the first one
  • Implement proper cartesian product computation of node affinity terms in the Inject() method
  • Add test coverage for multi-term topology scenarios
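
To illustrate why the terms need to stay separate, here is a minimal sketch of the matching semantics, not the code in this PR. `Requirement` and `Term` are hypothetical stand-ins for `v1.NodeSelectorRequirement` (restricted to the `In` operator) and `v1.NodeSelectorTerm`; the helper names are also assumptions. Requirements inside a term are ANDed, terms are ORed, so dropping all but the first term or flattening the terms into one list rejects nodes that should match.

```go
package main

import "fmt"

// Requirement is a hypothetical stand-in for v1.NodeSelectorRequirement,
// limited to the In operator for brevity.
type Requirement struct {
	Key    string
	Values []string
}

// Term is a hypothetical stand-in for v1.NodeSelectorTerm.
type Term []Requirement

// termMatches reports whether labels satisfy every requirement in the term (AND).
func termMatches(labels map[string]string, term Term) bool {
	for _, req := range term {
		val, ok := labels[req.Key]
		if !ok {
			return false
		}
		found := false
		for _, v := range req.Values {
			if v == val {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// anyTermMatches reports whether labels satisfy at least one term (OR).
func anyTermMatches(labels map[string]string, terms []Term) bool {
	for _, t := range terms {
		if termMatches(labels, t) {
			return true
		}
	}
	return false
}

func main() {
	terms := []Term{
		{{Key: "topology.kubernetes.io/zone", Values: []string{"us-east-1a"}}},
		{{Key: "topology.kubernetes.io/zone", Values: []string{"us-east-1b"}}},
	}
	node := map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}

	// All terms considered: the second term matches.
	fmt.Println(anyTermMatches(node, terms)) // true
	// Only the first term considered: the node is wrongly rejected.
	fmt.Println(anyTermMatches(node, terms[:1])) // false
	// Terms flattened into one AND list: no single zone can satisfy both.
	flat := Term{terms[0][0], terms[1][0]}
	fmt.Println(termMatches(node, flat)) // false
}
```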

How was this change tested?

  • Unit tests

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 24, 2025
@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 24, 2025
@coveralls

coveralls commented Dec 24, 2025

Pull Request Test Coverage Report for Build 22113451647

Details

  • 108 of 108 (100.0%) changed or added relevant lines in 5 files are covered.
  • 7 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 80.56%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 88.76% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 22077957548 | 0.07% |
| Covered Lines | 11918 |
| Relevant Lines | 14794 |

💛 - Coveralls

@tallaxes tallaxes marked this pull request as ready for review December 25, 2025 02:33
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 25, 2025
@tallaxes tallaxes changed the title from "fix: support multi-zone topology scheduling for persistent volumes (and storage class)" to "fix: support multi-zone topology scheduling for persistent volumes" Dec 25, 2025
@jmdeal jmdeal self-assigned this Jan 22, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 22, 2026
Comment thread pkg/controllers/provisioning/scheduling/volumetopology.go Outdated
@tallaxes tallaxes force-pushed the tallaxes/muli-zone-pv-scheduling branch from 60a02dd to cc7a264 Compare February 9, 2026 05:13
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tallaxes
Once this PR has been reviewed and has the lgtm label, please ask for approval from jmdeal. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 9, 2026
@tallaxes
Contributor Author

tallaxes commented Feb 9, 2026

Vulnerabilities flagged here and in other PRs addressed by #2844

Comment thread pkg/test/storage.go
Member

Rather than updating the defaults and affecting all existing tests, can we just add a specific test for this format? That way we don't affect our existing test coverage for multi-value requirement tests, but we still get the necessary additional coverage.

Contributor Author

Done; added a way to pass selector terms into options fce197f

Comment on lines +73 to +85
// Cross product: alternatives = alternatives X volAlts
var newAlts []scheduling.Requirements
for _, existing := range alternatives {
	for _, volReq := range volAlts {
		merged := scheduling.NewRequirements()
		if existing != nil {
			merged.Add(existing.Values()...)
		}
		merged.Add(volReq.Values()...)
		newAlts = append(newAlts, merged)
	}
}
alternatives = newAlts
Member

@jmdeal jmdeal Mar 6, 2026

There's a subtle issue here. Consider a case where we have two volumes, each associated with a storage class that has two AZs in its allowed topologies.

allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1b

We'll end up with the following requirements when we take the cross product:

  • {zone IN us-east-1a} ∩ {zone IN us-east-1a} = {zone IN us-east-1a}
  • {zone IN us-east-1a} ∩ {zone IN us-east-1b} = {zone DOES_NOT_EXIST}
  • {zone IN us-east-1b} ∩ {zone IN us-east-1a} = {zone DOES_NOT_EXIST}
  • {zone IN us-east-1b} ∩ {zone IN us-east-1b} = {zone IN us-east-1b}

In practice this should be fine since we don't expect there to be any instance types compatible with the topology.kubernetes.io/zone DoesNotExist requirement. There will be some no-op iterations, but that's just a performance concern.

You could imagine this being an issue for a label which isn't present on every instance type. Since this would result in some DoesNotExist operators, we could select an instance that doesn't have a value defined for that label when in reality it's not compatible with any instance since it has a requirement on a disjoint set of values.

I don't think any method in our requirements code produces this exact intersection today. We basically want to take the intersection for any overlapping keys and pass through any non-overlapping keys. If the overlapping keys have disjoint values, error and discard the term. Here's a contrived example I came up with:

# Volume A SC:
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.custom.csi/rack
        values: [rack-1]
  - matchLabelExpressions:
      - key: topology.custom.csi/rack
        values: [rack-2]

# Volume B SC:
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values: [us-east-1a]
      - key: topology.custom.csi/rack
        values: [rack-1]
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values: [us-east-1b]
      - key: topology.custom.csi/rack
        values: [rack-2]

  • {rack IN rack-1} ∩ {zone IN 1a, rack IN rack-1} = {zone IN 1a, rack IN rack-1}
  • {rack IN rack-1} ∩ {zone IN 1b, rack IN rack-2} = {zone IN 1b, rack DOES_NOT_EXIST}
    • Discarded because the rack requirements are disjoint
  • {rack IN rack-2} ∩ {zone IN 1a, rack IN rack-1} = {zone IN 1a, rack DOES_NOT_EXIST}
    • Discarded because the rack requirements are disjoint
  • {rack IN rack-2} ∩ {zone IN 1b, rack IN rack-2} = {zone IN 1b, rack IN rack-2}
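
The merge rule described above (intersect overlapping keys, pass through non-overlapping keys, discard a combination whose overlapping keys are disjoint) could look roughly like the sketch below. It is not based on the existing `scheduling.Requirements` API: `Alternative`, `intersect`, `merge`, and `crossProduct` are hypothetical names, and only `In`-style value sets are modeled.

```go
package main

import (
	"fmt"
	"sort"
)

// Alternative maps a label key to its allowed values. It is a hypothetical
// stand-in for scheduling.Requirements restricted to In-style requirements.
type Alternative map[string][]string

// intersect returns the sorted values present in both a and b.
func intersect(a, b []string) []string {
	seen := make(map[string]bool, len(a))
	for _, v := range a {
		seen[v] = true
	}
	var out []string
	for _, v := range b {
		if seen[v] {
			out = append(out, v)
			seen[v] = false // emit each common value once
		}
	}
	sort.Strings(out)
	return out
}

// merge intersects shared keys and passes through the rest. It returns
// ok=false when a shared key has disjoint value sets.
func merge(a, b Alternative) (Alternative, bool) {
	merged := Alternative{}
	for k, vals := range a {
		merged[k] = vals
	}
	for k, vals := range b {
		existing, shared := merged[k]
		if !shared {
			merged[k] = vals
			continue
		}
		common := intersect(existing, vals)
		if len(common) == 0 {
			return nil, false
		}
		merged[k] = common
	}
	return merged, true
}

// crossProduct merges every pair of alternatives and drops incompatible pairs.
func crossProduct(left, right []Alternative) []Alternative {
	var out []Alternative
	for _, l := range left {
		for _, r := range right {
			if m, ok := merge(l, r); ok {
				out = append(out, m)
			}
		}
	}
	return out
}

func main() {
	rack, zone := "topology.custom.csi/rack", "topology.kubernetes.io/zone"
	volA := []Alternative{
		{rack: {"rack-1"}},
		{rack: {"rack-2"}},
	}
	volB := []Alternative{
		{zone: {"us-east-1a"}, rack: {"rack-1"}},
		{zone: {"us-east-1b"}, rack: {"rack-2"}},
	}
	// Only the two compatible combinations survive.
	for _, alt := range crossProduct(volA, volB) {
		fmt.Println(alt)
	}
}
```

With the two storage classes from the example, four combinations are attempted and the two with disjoint rack values are dropped instead of degrading into a DoesNotExist requirement.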

Member

@jmdeal jmdeal Mar 7, 2026

Really, I think the underlying issue is that intersecting disjoint sets results in a DoesNotExist requirement. It's not a problem in practice, since wherever we do intersections we do compatibility checks first, but it leads to subtle issues like this. I'm not sure whether this behavior is actually relied on anywhere; if not, I think it's worth updating. That doesn't need to be scoped to this PR though.

Contributor Author

@tallaxes tallaxes Apr 16, 2026

Yeah, good point, and disjoint alternatives could survive in the new codepath. I pulled the fix into stacked PR tallaxes#1 for a smaller review; it prunes incompatible branches before merging, threads the filtered pod set through provisioning, and adds regression coverage. Let me know if this makes sense / is directionally correct, and I will merge it here.

volumeAlternatives = []scheduling.Requirements{nil}
}

// Try each volume alternative. Choosing a zone for volumes affects topology checks.
Member

I've noticed quite a few comments assume that topology will only be zones. Since we support arbitrary CSI drivers, we shouldn't make that assumption. I know at least one counter-example used by Karpenter today: storage classes for EKS Auto Mode's EBS CSI Driver should be configured with an eks.amazonaws.com/compute-type: auto selector in allowed topologies.

Contributor Author

Addressed 15e7404

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduling uses only the first term of PersistentVolume nodeAffinity

4 participants