[Feature Request] Support configurable batch/percentage rollout strategy (maxUnavailable) for large-scale clusters #470

@sangheee

Description

We are currently operating a large-scale Milvus cluster with over 100 QueryNodes and more than 100 million documents.
While the operator works reliably, we are facing a significant operational bottleneck during rolling updates (configuration changes or image updates).

As identified in the source code, the rolling-update strategy is hardcoded to `MaxUnavailable: 0` and `MaxSurge: 1`:

```go
if useRollingUpdate {
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 0},
			MaxSurge:       &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
		},
	}
}
```

```go
// planScaleForRollout, if not hpa, return nil
func (c *DeployControllerBizUtilImpl) planScaleForRollout(mc v1beta1.Milvus, currentDeployment, lastDeployment *appsv1.Deployment) scaleAction {
	currentDeployReplicas := getDeployReplicas(currentDeployment)
	lastDeployReplicas := getDeployReplicas(lastDeployment)
	currentReplicas := currentDeployReplicas + lastDeployReplicas
	expectedReplicas := int(ReplicasValue(c.component.GetReplicas(mc.Spec)))
	if compareDeployResourceLimitEqual(currentDeployment, lastDeployment) {
		switch {
		case currentReplicas > expectedReplicas:
			if lastDeployReplicas > 0 {
				// continue rollout by scaling in the last deployment
				return scaleAction{deploy: lastDeployment, replicaChange: -1}
			}
			// scale in is not allowed during a rollout
			return noScaleAction
		case currentReplicas == expectedReplicas:
			if lastDeployReplicas == 0 {
				// stable state
				return noScaleAction
			}
			// continue rollout by scaling out the current deployment
			return scaleAction{deploy: currentDeployment, replicaChange: 1}
		default:
			// case currentReplicas < expectedReplicas
			// scale out
			return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentReplicas}
		}
	} else {
		// Resource is changed.
		// If lastDeployReplicas has not been scaled down to 0, we first scale up currentDeployReplicas to the maximum of expectedReplicas and lastDeployReplicas.
		// This ensures that during the subsequent scale-down, pods will not experience out-of-memory (OOM) issues due to load balancing.
		// We only begin scaling down lastDeployReplicas once currentDeployReplicas is no less than lastDeployReplicas.
		// When lastDeployReplicas reaches 0, we ensure currentDeployReplicas is at its expected value.
		if lastDeployReplicas > 0 {
			if currentDeployReplicas < lastDeployReplicas || currentDeployReplicas < expectedReplicas {
				// scale the current deployment's replicas to the max of lastDeployReplicas and expectedReplicas
				if lastDeployReplicas < expectedReplicas {
					return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentDeployReplicas}
				}
				return scaleAction{deploy: currentDeployment, replicaChange: lastDeployReplicas - currentDeployReplicas}
			}
			// continue rollout by scaling in the last deployment
			return scaleAction{deploy: lastDeployment, replicaChange: -1}
		}
		if currentDeployReplicas > expectedReplicas {
			// scale the current deployment's replicas down to expected
			return scaleAction{deploy: currentDeployment, replicaChange: -1}
		} else if currentDeployReplicas < expectedReplicas {
			// scale the current deployment's replicas up to expected
			// This branch seems unlikely to occur.
			return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentDeployReplicas}
		}
		return noScaleAction
	}
}
```

The planScaleForRollout logic adjusts replicas one by one (old deployment -1, current deployment +1).
Furthermore, while a rolling update is in progress, the reconcile loop is hardcoded to requeue at unhealthySyncInterval/2, i.e. every 15s (unhealthySyncInterval is 30s).

For a cluster with just 35 pods, the theoretical minimum rollout time is therefore 35 pods * 2 reconcile steps * 15s = 1050s (17.5 minutes).
In our production environment with 100+ nodes and large data volumes, the recovery time for each QueryNode significantly extends this window, making deployments take several hours.
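The arithmetic above can be sketched as a small stand-alone Go snippet (the function name and the fixed 15s requeue interval are illustrative assumptions, not operator code):

```go
package main

import "fmt"

// simulateRollout estimates the lower bound for the current one-by-one
// rollout: each replaced pod costs two reconcile cycles (one to scale out
// the new deployment by +1, one to scale in the old deployment by -1),
// and each cycle waits one requeue interval.
func simulateRollout(pods, requeueSeconds int) (cycles, seconds int) {
	cycles = pods * 2
	return cycles, cycles * requeueSeconds
}

func main() {
	cycles, seconds := simulateRollout(35, 15)
	fmt.Printf("35 pods -> %d reconcile cycles, %ds\n", cycles, seconds) // 70 cycles, 1050s
	cycles, seconds = simulateRollout(100, 15)
	fmt.Printf("100 pods -> %d reconcile cycles, %ds\n", cycles, seconds) // 200 cycles, 3000s
}
```

Even this best case of 50 minutes for 100 pods ignores QueryNode recovery time, which dominates in practice.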

Describe the solution you'd like
I am aware of the discussion in #459 regarding the removal of twoDeployMode.
However, since that transition may take time and might require Kubernetes 1.34+ for certain native features, I would like to propose an interim enhancement for the current twoDeployMode.

I suggest allowing users to configure a rolloutStrategy within the Milvus CRD:

```yaml
components:
  rollingMode: 2
  queryNode:
    rolloutStrategy:
      maxUnavailable: "20%" # should support both percentages (e.g., "20%") and integers (e.g., 5)
```

- `DeploymentStrategy.RollingUpdate.MaxUnavailable` is set using this maxUnavailable value.
- The planScaleForRollout logic could use this maxUnavailable value to calculate a larger replicaChange step, allowing multiple pods to be updated in a single reconcile cycle.
- Making the requeue/sync intervals (like unhealthySyncInterval) configurable by the user would provide much-needed flexibility for different cluster scales.
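To make the step calculation concrete, here is a minimal stdlib-only sketch of how a maxUnavailable value could be resolved into a replicaChange step. The helper name resolveStep is hypothetical; in the operator itself, intstr.GetScaledValueFromIntOrPercent from k8s.io/apimachinery would likely be used instead. The clamp to a minimum of 1 is a design assumption so the rollout always makes progress:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveStep converts a maxUnavailable value ("20%" or "5") into a
// concrete step size for expectedReplicas total pods. Percentages round
// down, mirroring Kubernetes' handling of maxUnavailable, and the result
// is clamped to at least 1 so a rollout can never stall.
func resolveStep(maxUnavailable string, expectedReplicas int) (int, error) {
	s := strings.TrimSpace(maxUnavailable)
	var step int
	if strings.HasSuffix(s, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(s, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", maxUnavailable, err)
		}
		step = expectedReplicas * pct / 100 // round down
	} else {
		n, err := strconv.Atoi(s)
		if err != nil {
			return 0, fmt.Errorf("invalid integer %q: %w", maxUnavailable, err)
		}
		step = n
	}
	if step < 1 {
		step = 1 // never stall the rollout
	}
	return step, nil
}

func main() {
	for _, tc := range []struct {
		val   string
		total int
	}{{"20%", 100}, {"20%", 7}, {"5", 100}} {
		step, _ := resolveStep(tc.val, tc.total)
		fmt.Printf("maxUnavailable=%s, replicas=%d -> step=%d\n", tc.val, tc.total, step)
	}
}
```

With maxUnavailable: "20%" on 100 QueryNodes, planScaleForRollout could then move 20 replicas per reconcile cycle instead of 1, cutting the best-case rollout time by roughly the same factor.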

I would love to hear your thoughts on this proposal. If the maintainers agree with this direction, I am more than happy to implement this feature and submit a Pull Request.
Thanks! 😊
