We are currently operating a large-scale Milvus cluster with over 100 QueryNodes and more than 100 million documents.
While the operator works reliably, we are facing a significant operational bottleneck during rolling updates (configuration changes or image updates).
As identified in the source code (milvus-operator/pkg/controllers/components.go, lines 561 to 569 at 9181cb9):

```go
if useRollingUpdate {
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &intstr.IntOrString{Type: intstr.Int, IntVal: 0},
			MaxSurge:       &intstr.IntOrString{Type: intstr.Int, IntVal: 1},
		},
	}
}
```
And milvus-operator/pkg/controllers/deploy_ctrl_util.go, lines 438 to 493 at 9181cb9:

```go
// planScaleForRollout: if not hpa, return nil
func (c *DeployControllerBizUtilImpl) planScaleForRollout(mc v1beta1.Milvus, currentDeployment, lastDeployment *appsv1.Deployment) scaleAction {
	currentDeployReplicas := getDeployReplicas(currentDeployment)
	lastDeployReplicas := getDeployReplicas(lastDeployment)

	currentReplicas := currentDeployReplicas + lastDeployReplicas
	expectedReplicas := int(ReplicasValue(c.component.GetReplicas(mc.Spec)))
	if compareDeployResourceLimitEqual(currentDeployment, lastDeployment) {
		switch {
		case currentReplicas > expectedReplicas:
			if lastDeployReplicas > 0 {
				// continue rollout by scaling in the last deployment
				return scaleAction{deploy: lastDeployment, replicaChange: -1}
			}
			// scale in is not allowed during a rollout
			return noScaleAction
		case currentReplicas == expectedReplicas:
			if lastDeployReplicas == 0 {
				// stable state
				return noScaleAction
			}
			// continue rollout by scaling out the current deployment
			return scaleAction{deploy: currentDeployment, replicaChange: 1}
		default:
			// case currentReplicas < expectedReplicas
			// scale out
			return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentReplicas}
		}
	} else {
		// Resource is changed.
		// If lastDeployReplicas has not been scaled down to 0, we first scale up
		// currentDeployReplicas to the maximum of expectedReplicas and lastDeployReplicas.
		// This ensures that during the subsequent scale-down, pods will not hit
		// out-of-memory (OOM) issues due to load balancing.
		// We only begin scaling down lastDeployReplicas once currentDeployReplicas
		// is no less than lastDeployReplicas.
		// When lastDeployReplicas reaches 0, we ensure currentDeployReplicas is at
		// its expected value.
		if lastDeployReplicas > 0 {
			if currentDeployReplicas < lastDeployReplicas || currentDeployReplicas < expectedReplicas {
				// scale the current deployment's replicas to max(lastDeployReplicas, expectedReplicas)
				if lastDeployReplicas < expectedReplicas {
					return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentDeployReplicas}
				}
				return scaleAction{deploy: currentDeployment, replicaChange: lastDeployReplicas - currentDeployReplicas}
			}
			// continue rollout by scaling in the last deployment
			return scaleAction{deploy: lastDeployment, replicaChange: -1}
		}
		if currentDeployReplicas > expectedReplicas {
			// scale the current deployment's replicas down to expected
			return scaleAction{deploy: currentDeployment, replicaChange: -1}
		} else if currentDeployReplicas < expectedReplicas {
			// scale the current deployment's replicas up to expected
			// This branch seems unlikely to occur.
			return scaleAction{deploy: currentDeployment, replicaChange: expectedReplicas - currentDeployReplicas}
		}
		return noScaleAction
	}
}
```
The planScaleForRollout logic adjusts replicas one by one (old deployment -1, current deployment +1).
Furthermore, the reconcile loop is hardcoded to requeue at unhealthySyncInterval/2 (unhealthySyncInterval is 30s, so every 15s) while a rolling update is in progress.
For a cluster with just 35 pods, the theoretical minimum rollout time is therefore 35 * 2 * 15s = 1050s.
In our production environment with 100+ nodes and large data volumes, the recovery time of each QueryNode significantly extends this window, making deployments take several hours.
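As a rough illustration of the arithmetic above (assuming the hardcoded 15s requeue and two reconcile cycles per moved batch; the `batch` parameter is hypothetical, since the operator currently always moves one pod at a time):

```go
package main

import "fmt"

// rolloutSeconds estimates the minimum twoDeployMode rollout duration:
// each batch of pods needs two reconcile cycles (scale in old, scale out new),
// and the loop requeues every unhealthySyncInterval/2 = 15 seconds.
// batch is the number of pods moved per cycle (currently hardcoded to 1).
func rolloutSeconds(pods, batch int) int {
	const requeueSeconds = 15 // unhealthySyncInterval (30s) / 2
	batches := (pods + batch - 1) / batch // ceil(pods / batch)
	return batches * 2 * requeueSeconds
}

func main() {
	fmt.Println(rolloutSeconds(35, 1))   // current behavior: 1050s
	fmt.Println(rolloutSeconds(100, 1))  // 100 nodes: 3000s, before QueryNode recovery time
	fmt.Println(rolloutSeconds(100, 20)) // with a hypothetical 20% batch: 150s
}
```

Even this lower bound ignores QueryNode segment-loading time, which dominates in practice.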
Describe the solution you'd like
I am aware of the discussion in #459 regarding the removal of twoDeployMode.
However, since that transition may take time and might require Kubernetes 1.34+ for certain native features, I would like to propose an interim enhancement for the current twoDeployMode.
I suggest allowing users to configure a rolloutStrategy within the Milvus CRD:

```yaml
components:
  rollingMode: 2
  queryNode:
    rolloutStrategy:
      maxUnavailable: "20%" # should support both percentages (e.g., "20%") and integers (e.g., 5)
```
- The DeploymentStrategy.RollingUpdate.MaxUnavailable would be set from this maxUnavailable value.
- The planScaleForRollout logic could use this maxUnavailable value to calculate a larger replicaChange step, allowing multiple pods to be updated in a single reconcile cycle.
- Making the requeue/sync intervals (such as unhealthySyncInterval) configurable by the user would provide much-needed flexibility for different cluster scales.
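To sketch the second point: the step could be resolved once per reconcile and substituted for the hardcoded -1/+1. The helper below is hypothetical (not existing operator code); in the real implementation, k8s.io/apimachinery's intstr.GetScaledValueFromIntOrPercent provides the same int-or-percent semantics, so this stdlib-only version is just for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stepFromMaxUnavailable resolves a user-supplied maxUnavailable ("20%" or "5")
// against the expected replica count and returns how many pods
// planScaleForRollout could move per reconcile cycle.
func stepFromMaxUnavailable(maxUnavailable string, expectedReplicas int) (int, error) {
	var step int
	if strings.HasSuffix(maxUnavailable, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(maxUnavailable, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", maxUnavailable, err)
		}
		step = expectedReplicas * pct / 100 // round down, as Deployment maxUnavailable does
	} else {
		n, err := strconv.Atoi(maxUnavailable)
		if err != nil {
			return 0, fmt.Errorf("invalid integer %q: %w", maxUnavailable, err)
		}
		step = n
	}
	if step < 1 {
		step = 1 // always make progress, matching the current one-by-one behavior
	}
	return step, nil
}

func main() {
	step, _ := stepFromMaxUnavailable("20%", 100)
	fmt.Println(step) // 20
}
```

With a resolved step, branches like `scaleAction{deploy: lastDeployment, replicaChange: -1}` could become `replicaChange: -step`, capped so replicas never cross the expected count or drop below zero.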
I would love to hear your thoughts on this proposal. If the maintainers agree with this direction, I am more than happy to implement this feature and submit a Pull Request.
Thanks! 😊