Summary
The rollout controller only counts deleting bindings as "can be unavailable" if they are currently ready. Deleting bindings that are not ready are invisible to the availability math, causing incorrect maxUnavailable enforcement.
Problem
From pkg/controllers/rollout/controller.go, in the pickBindingsToRoll loop for both Unscheduled (line 422) and Bound (line 474) states:
} else if bindingReady {
    // it is being deleted, it can be removed from the cluster at any time, so it can be unavailable at any time
    canBeUnavailableBindings = append(canBeUnavailableBindings, binding)
}
A deleting binding that is not ready (e.g., the member agent failed cleanup, or an apply was in progress) is excluded from canBeUnavailableBindings. Since it is also not in readyBindings, it is invisible to calculateMaxToRemove.
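The gap described above can be reproduced with a minimal sketch. The `binding` struct and `classify` function are hypothetical stand-ins, not the controller's real types, and the sketch assumes that every ready binding lands in the ready list whether or not it is being deleted.

```go
package main

import "fmt"

// binding is a hypothetical stand-in for the controller's binding object.
type binding struct {
	name     string
	deleting bool // a deletion timestamp is set
	ready    bool // the binding is currently ready
}

// classify mirrors the branch quoted above: a deleting binding is added
// to canBeUnavailable only when it is also ready.
func classify(bindings []binding) (ready, canBeUnavailable []binding) {
	for _, b := range bindings {
		if b.ready {
			ready = append(ready, b)
		}
		if b.deleting && b.ready {
			canBeUnavailable = append(canBeUnavailable, b)
		}
	}
	return ready, canBeUnavailable
}

func main() {
	bindings := []binding{
		{name: "a", ready: true},
		{name: "b", deleting: true, ready: true},
		{name: "c", deleting: true, ready: false}, // ends up in neither list
	}
	ready, canBeUnavailable := classify(bindings)
	fmt.Println(len(ready), len(canBeUnavailable)) // prints "2 1"
}
```

Binding "c" is deleting but not ready, so it contributes to neither count and later calculations cannot see it.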
Dependency on #598
⚠️ This fix should be coordinated with #598 (Track cleanup of stale resources when clusters are de-selected).
If deleting bindings are counted as unavailable but a binding gets stuck deleting indefinitely (member agent down, network partition), the stuck binding permanently consumes maxUnavailable budget and blocks all rollout progress — a deadlock worse than the current behavior.
To safely fix this, one of the following is needed:
Fix both together — count deleting bindings as unavailable with a TTL so stuck deletions stop blocking after a timeout
Implement a bounded approach — count deleting bindings against the budget only up to a time limit
Proposed Solution
Count all deleting bindings as canBeUnavailable regardless of ready state, but with a TTL to prevent stuck deletions from permanently blocking rollout:
if !binding.GetDeletionTimestamp().IsZero() {
    if time.Since(binding.GetDeletionTimestamp().Time) < deletionTimeout {
        canBeUnavailableBindings = append(canBeUnavailableBindings, binding)
    }
    // If past timeout, don't count; treat as gone
}
Acceptance Criteria
Deleting bindings are counted as unavailable regardless of ready state
Stuck deletions do not permanently block rollout progress
maxUnavailable is correctly enforced during rolling updates with concurrent deletions
Integration test covering deleting-but-not-ready bindings in availability calculation
Example
Leaving deleting-but-not-ready bindings out of both lists makes the math wrong in both directions, depending on the scenario:
lowerBoundAvailable is artificially low → blocks legitimate removals
more bindings become unavailable than maxUnavailable permits
Scenario: 5 bindings, maxUnavailable: 1, target 5, 2 bindings deleting but not ready:
readyBindings = 3, canBeUnavailableBindings = 0 (deleting-not-ready bindings uncounted)
lowerBoundAvailable = 3 - 0 = 3
maxNumberToRemove = 3 - 4 = -1 → all removals blocked, even though the budget should allow some
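The arithmetic in the example can be checked with a short sketch. The `maxToRemove` helper is hypothetical, and the formula for the minimum available count (target minus maxUnavailable, giving 5 - 1 = 4) is inferred from the example's numbers rather than taken from the controller source.

```go
package main

import "fmt"

// maxToRemove reproduces the example's arithmetic. The minAvailable
// formula is an assumption inferred from the example (5 - 1 = 4).
func maxToRemove(ready, canBeUnavailable, target, maxUnavailable int) int {
	lowerBoundAvailable := ready - canBeUnavailable
	minAvailable := target - maxUnavailable
	return lowerBoundAvailable - minAvailable
}

func main() {
	// 5 bindings, 2 of them deleting but not ready, so only 3 are ready
	// and none are counted as "can be unavailable".
	fmt.Println(maxToRemove(3, 0, 5, 1)) // prints -1
}
```

A negative result means no binding may be removed in this round.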