fix: support multi-zone topology scheduling for persistent volumes #2743

tallaxes wants to merge 12 commits into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tallaxes. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
Vulnerabilities flagged here and in other PRs are addressed by #2844.
Rather than updating the defaults and affecting all existing tests, can we just add a specific test for this format? That way we don't affect our existing test coverage for multi-value requirement tests, but we still get the necessary additional coverage.
Done; added a way to pass selector terms into options in fce197f.
```go
// Cross product: alternatives = alternatives X volAlts
var newAlts []scheduling.Requirements
for _, existing := range alternatives {
	for _, volReq := range volAlts {
		merged := scheduling.NewRequirements()
		if existing != nil {
			merged.Add(existing.Values()...)
		}
		merged.Add(volReq.Values()...)
		newAlts = append(newAlts, merged)
	}
}
alternatives = newAlts
```
There's a subtle issue here. Consider a case where we have two volumes, each associated with a storage class that has two AZs in its allowed topologies:
```yaml
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1a
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east-1b
```

We'll end up with the following requirements when we take the cross product:
- {zone IN us-east-1a} ∩ {zone IN us-east-1a} = {zone IN us-east-1a}
- {zone IN us-east-1a} ∩ {zone IN us-east-1b} = {zone DOES_NOT_EXIST}
- {zone IN us-east-1b} ∩ {zone IN us-east-1a} = {zone DOES_NOT_EXIST}
- {zone IN us-east-1b} ∩ {zone IN us-east-1b} = {zone IN us-east-1b}
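The DoesNotExist outcome falls out of plain set intersection. A minimal sketch (not Karpenter's actual `scheduling.Requirements` implementation) of how intersecting two terms' value sets for the same key can come up empty:

```go
package main

import "fmt"

// intersect returns the values common to two requirement value sets.
// An empty result corresponds to a DoesNotExist requirement: no node
// can satisfy both terms for that key.
func intersect(a, b []string) []string {
	seen := map[string]bool{}
	for _, v := range a {
		seen[v] = true
	}
	out := []string{}
	for _, v := range b {
		if seen[v] {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// Same zone on both sides: the intersection keeps it.
	fmt.Println(intersect([]string{"us-east-1a"}, []string{"us-east-1a"})) // [us-east-1a]
	// Disjoint zones: empty intersection, i.e. DoesNotExist.
	fmt.Println(intersect([]string{"us-east-1a"}, []string{"us-east-1b"})) // []
}
```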
In practice this should be fine since we don't expect there to be any instance types compatible with the topology.kubernetes.io/zone DoesNotExist requirement. There will be some no-op iterations, but that's just a performance concern.
You could imagine this being an issue for a label which isn't present on every instance type. Since this would result in some DoesNotExist operators, we could select an instance that doesn't have a value defined for that label when in reality it's not compatible with any instance since it has a requirement on a disjoint set of values.
I don't think there's a method which creates this exact intersection with our requirements code today. We basically want to say take the intersection for any overlapping keys and pass through any non-overlapping keys. If the overlapping keys have disjoint values, error and discard the term. Here's a contrived example I came up with:
```yaml
# Volume A SC:
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.custom.csi/rack
        values: [rack-1]
  - matchLabelExpressions:
      - key: topology.custom.csi/rack
        values: [rack-2]
# Volume B SC:
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values: [us-east-1a]
      - key: topology.custom.csi/rack
        values: [rack-1]
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values: [us-east-1b]
      - key: topology.custom.csi/rack
        values: [rack-2]
```

- {rack IN rack-1} ∩ {zone IN 1a, rack IN rack-1} = {zone IN 1a, rack IN rack-1}
- {rack IN rack-1} ∩ {zone IN 1b, rack IN rack-2} = {zone IN 1b, rack DOES_NOT_EXIST}
  - Discarded because the rack requirements are disjoint
- {rack IN rack-2} ∩ {zone IN 1a, rack IN rack-1} = {zone IN 1a, rack DOES_NOT_EXIST}
  - Discarded because the rack requirements are disjoint
- {rack IN rack-2} ∩ {zone IN 1b, rack IN rack-2} = {zone IN 1b, rack IN rack-2}
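The merge semantics described above can be sketched as follows. This is an illustration over simplified `map[key] -> allowed values` requirements, not Karpenter's actual `scheduling.Requirements` API: overlapping keys are intersected, non-overlapping keys pass through unchanged, and a disjoint overlap discards the whole term.

```go
package main

import "fmt"

// mergeTerms intersects values on shared keys, passes through keys present
// on only one side, and reports ok == false when any shared key has
// disjoint values (the merged term should then be discarded).
func mergeTerms(a, b map[string][]string) (map[string][]string, bool) {
	merged := map[string][]string{}
	for k, av := range a {
		if bv, overlap := b[k]; overlap {
			common := intersect(av, bv)
			if len(common) == 0 {
				return nil, false // disjoint values on a shared key: discard
			}
			merged[k] = common
		} else {
			merged[k] = av // pass through keys only present in a
		}
	}
	for k, bv := range b {
		if _, overlap := a[k]; !overlap {
			merged[k] = bv // pass through keys only present in b
		}
	}
	return merged, true
}

func intersect(a, b []string) []string {
	seen := map[string]bool{}
	for _, v := range a {
		seen[v] = true
	}
	out := []string{}
	for _, v := range b {
		if seen[v] {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	rack1 := map[string][]string{"topology.custom.csi/rack": {"rack-1"}}
	zone1bRack2 := map[string][]string{
		"topology.kubernetes.io/zone": {"us-east-1b"},
		"topology.custom.csi/rack":    {"rack-2"},
	}
	// rack-1 vs rack-2 are disjoint, so the term is discarded.
	_, ok := mergeTerms(rack1, zone1bRack2)
	fmt.Println(ok) // false
}
```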
Really, I think this is an issue with the intersection of disjoint sets resulting in a DoesNotExist requirement. It's not an issue in practice, since wherever we do intersections we do compatibility checks first, but it leads to subtle issues like this. I'm not sure if this behavior is actually relied on anywhere; if not, I think it's worth updating. That doesn't need to be scoped to this PR though.
Yeah, good point, and disjoint alternatives could survive in the new codepath. I pulled the fix into stacked PR tallaxes#1 for a smaller review; it prunes incompatible branches before merging, threads the filtered pod set through provisioning, and adds regression coverage. Let me know if this makes sense / is directionally correct, and I will merge it here.
```go
	volumeAlternatives = []scheduling.Requirements{nil}
}

// Try each volume alternative. Choosing a zone for volumes affects topology checks.
```
I've noticed quite a few comments are making the assumption that topology will only be zones. Since we support arbitrary CSI drivers, we shouldn't make that assumption. I know at least one counter-example used by Karpenter today: storage classes for EKS Auto Mode's EBS CSI Driver should be configured with an `eks.amazonaws.com/compute-type: auto` selector in allowed topologies.
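For illustration, a selector on a non-zone topology key might look like the following. This is a sketch only; the provisioner name and exact fields are assumptions here, so consult the EKS Auto Mode documentation for the authoritative StorageClass definition:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: auto-ebs
provisioner: ebs.csi.eks.amazonaws.com
allowedTopologies:
  - matchLabelExpressions:
      - key: eks.amazonaws.com/compute-type
        values: ["auto"]
```

The point is that the scheduling code must treat `eks.amazonaws.com/compute-type` the same way it treats `topology.kubernetes.io/zone`: any label key can appear in allowed topologies.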
Fixes #2742
Description
Fixes volume topology scheduling to correctly handle multiple topology terms in PersistentVolumes and StorageClasses.
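The OR-across-terms, AND-within-a-term semantics the fix preserves can be sketched as follows. This is a minimal illustration over simplified `key -> allowed values` terms, not Karpenter's or Kubernetes' actual matching code:

```go
package main

import "fmt"

// term is a simplified NodeSelectorTerm: each key must match one of its values.
type term map[string][]string

// matchesTerms returns true if any term matches (terms are ORed).
func matchesTerms(nodeLabels map[string]string, terms []term) bool {
	for _, t := range terms { // OR across terms
		if matchesTerm(nodeLabels, t) {
			return true
		}
	}
	return false
}

// matchesTerm returns true only if every requirement in the term matches
// (requirements within a term are ANDed).
func matchesTerm(nodeLabels map[string]string, t term) bool {
	for key, values := range t { // AND within a term
		matched := false
		for _, v := range values {
			if nodeLabels[key] == v {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	terms := []term{
		{"topology.kubernetes.io/zone": {"us-east-1a"}},
		{"topology.kubernetes.io/zone": {"us-east-1b"}},
	}
	node := map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}
	// The second term matches, so the node is eligible.
	fmt.Println(matchesTerms(node, terms)) // true
}
```

Flattening the terms into a single requirement list loses the OR, which is the bug this PR fixes.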
- `VolumeTopology.getRequirements()` to return `NodeSelectorTerm` slices instead of flat `NodeSelectorRequirement` slices to preserve OR semantics across multiple topology terms
- `getStorageClassRequirements()` to process all allowed topologies, not just the first one
- `getPersistentVolumeRequirements()` to process all node selector terms, not just the first one
- `Inject()` method

How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.