[BUG] Deleting bindings not counted as unavailable in rollout availability calculation #602

@weng271190436

Description

Summary

The rollout controller only counts deleting bindings as "can be unavailable" if they are currently ready. Deleting bindings that are not ready are invisible to the availability math, causing incorrect maxUnavailable enforcement.

Problem

From pkg/controllers/rollout/controller.go, in the pickBindingsToRoll loop for both Unscheduled (line 422) and Bound (line 474) states:

} else if bindingReady {
    // it is being deleted, it can be removed from the cluster at any time, so it can be unavailable at any time
    canBeUnavailableBindings = append(canBeUnavailableBindings, binding)
}

A deleting binding that is not ready (e.g., member agent failed cleanup, apply was in progress) is excluded from canBeUnavailableBindings. Since it is also not in readyBindings, it is invisible to calculateMaxToRemove:

lowerBoundAvailable = len(readyBindings) - len(canBeUnavailableBindings)
maxNumberToRemove   = lowerBoundAvailable - minAvailableNumber

This causes the math to be wrong in both directions depending on the scenario:

  • Too conservative: lowerBoundAvailable is artificially low → blocks legitimate removals
  • Too aggressive: missing unavailable count → allows more removals than maxUnavailable permits

Example

5 bindings, maxUnavailable = 1, target = 5, with 2 bindings deleting but not ready:

  • readyBindings = 3, canBeUnavailableBindings = 0 (deleting-not-ready bindings uncounted)
  • lowerBoundAvailable = 3 - 0 = 3
  • maxNumberToRemove = 3 - 4 = -1 → all removals blocked, even though the budget should allow some

Dependency on #598

⚠️ This fix should be coordinated with #598 (Track cleanup of stale resources when clusters are de-selected).

If deleting bindings are counted as unavailable but a binding gets stuck deleting indefinitely (member agent down, network partition), the stuck binding permanently consumes maxUnavailable budget and blocks all rollout progress — a deadlock worse than the current behavior.

To safely fix this, one of the following is needed:

  1. Fix #598 ([BUG] Track cleanup of stale resources when clusters are de-selected) first — detect and handle stuck deletions (timeout/eviction)
  2. Fix both together — count deleting bindings as unavailable with a TTL so stuck deletions stop blocking after a timeout
  3. Implement a bounded approach — count deleting bindings against the budget only up to a time limit

Proposed Solution

Count all deleting bindings as canBeUnavailable regardless of ready state, but with a TTL to prevent stuck deletions from permanently blocking rollout:

if !binding.GetDeletionTimestamp().IsZero() {
    if time.Since(binding.GetDeletionTimestamp().Time) < deletionTimeout {
        canBeUnavailableBindings = append(canBeUnavailableBindings, binding)
    }
    // If past timeout, don't count — treat as gone
}

Acceptance Criteria

  • Deleting bindings are counted as unavailable regardless of ready state
  • Stuck deletions do not permanently block rollout progress
  • maxUnavailable is correctly enforced during rolling updates with concurrent deletions
  • Integration test covering deleting-but-not-ready bindings in availability calculation
  • Coordinated with #598 ([BUG] Track cleanup of stale resources when clusters are de-selected) for stuck deletion handling
