Is your feature request related to a problem? Please describe
Right now, OpenSearch has APIs that give users some control over shard allocation and traffic based on node attributes, but they're all pretty low-level. With some combination of shard allocation filters and weighted routing, you can almost drain a node before taking it down for deployment. Then, if you're lucky, after you've deployed to some nodes, you may be able to bring them back up without routing traffic to them. I'm not sure you can send warming traffic to those nodes once you've dialed them down with weighted routing (since AFAIK, there's no "operator bypass").
Here are the challenges that someone currently faces when deploying to an OpenSearch cluster:
- Slow deployment due to node-at-a-time rollout, if we're trying to minimize shard movement.
- In-flight requests may fail as nodes go down. (Probably won't be customer visible, since the coordinator will retry on another replica.)
- The cluster flips yellow, which creates noise and potentially masks unexpected node failures.
- No validation of nodes before they begin to receive customer traffic.
- High latency on initial requests to newly-deployed nodes.
- High latency on nodes replicating shards during rebalance.
- Primary shards can get allocated to upgraded nodes before we're sure indexing logic has been validated on the upgraded nodes. Once primary shards run on upgraded nodes, we cannot roll back (assuming we're upgrading to a new OpenSearch version).
Operators have a bunch of knobs they can turn to try to minimize these problems when deploying, but what if we could just tell OpenSearch "I want to deploy to these nodes" and let it do the right thing?
Describe the solution you'd like
I would like to add "deployment APIs" to OpenSearch that walk the cluster through different phases in the deployment lifecycle. I think of this as a state machine, but others may have a different mental model. The following is the rough flow that I imagine:
To be clear, I do not want to turn OpenSearch into a deployment system. Process management is not OpenSearch's job (though it is arguably the responsibility of https://github.com/opensearch-project/opensearch-k8s-operator). Instead, what I'm proposing is adding APIs that a deployment system (like the k8s-operator) can call to make deployments graceful. I'm assuming that the deployment system is integrated with some sort of monitoring system. (I've now worked for two big tech companies and the "customizable deployment system that is integrated with the monitoring system" setup has been common to both.)
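To make the state machine framing concrete, here's a minimal sketch. The state names come from this proposal; the exact transition set (e.g. whether ROLLBACK can return directly to steady state) is my assumption, not a settled design, and nothing here exists in OpenSearch today:

```python
# Hypothetical deployment lifecycle state machine -- state names and
# transitions are from this proposal, not from any existing OpenSearch API.
STEADY, DRAIN, SHADOW, DIALUP, ROLLBACK, ROLLBACK_SHADOW = (
    "STEADY", "DRAIN", "SHADOW", "DIALUP", "ROLLBACK", "ROLLBACK_SHADOW"
)

# Allowed transitions: forward through the deployment, or bail out to rollback.
TRANSITIONS = {
    STEADY: {DRAIN},
    DRAIN: {SHADOW, ROLLBACK},
    SHADOW: {DIALUP, ROLLBACK},
    DIALUP: {STEADY, ROLLBACK},
    ROLLBACK: {ROLLBACK_SHADOW, STEADY},  # assumed; rollback exit path is open design
    ROLLBACK_SHADOW: {STEADY},
}

def transition(current: str, target: str) -> str:
    """Validate and apply one deployment-state transition."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```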
Walking through it:
- In steady state, we're just running as usual. There's no special behavior for shard allocation and operation routing.
- When we start a deployment, the deployment system will first DRAIN the nodes that will be the target of the deployment by issuing an API call. This can be based on node attribute filters. This will stop future search requests from being routed to shards on the nodes and will demote any primary shards on the nodes.
- The external deployment system will take down the nodes and deploy to them, or replace the nodes with new ones that match the node attribute filters. When the nodes come up, they are still in the DRAIN state and do not receive traffic right away. They are eligible to hold replica shards, but not primary shards.
- Once the deployment system sees that any shard recovery has stopped, it transitions the nodes to SHADOW mode by issuing another API call. At this point, coordinators will select the shards on the nodes with normal weight, but will also route traffic to other shard copies. That is, if there are 3 copies of a shard in total, with one on a SHADOW node, then the shadow copy will still be picked 1/3 of the time, but we'll also send the request to one of the other copies (with a 50-50 weight between them). We return the result from the "real" node's shard copy, but we can monitor errors/latency from the SHADOW node.
- After some time in SHADOW mode (assuming no errors and latency seems fine), the deployment system can start to DIALUP the deployed nodes. This assigns a non-zero probability of serving real customer traffic from the deployed nodes. The deployment system gradually increases this probability to 100%, which is almost indistinguishable from the steady state. The exception is that we still don't move primary shards to the deployed nodes.
- After some amount of bake time at 100% dialup, we can feel confident that the newly deployed nodes handle search traffic properly and the replicas handle indexing logic correctly (unless we're doing segment replication, in which case we haven't tested the indexing logic). Then we can finish the deployment and transition back to steady state before moving on to the next batch of nodes in the cluster -- for the next batch(es), we can significantly reduce the bake time.
- If we discover a problem at any point in the DRAIN/SHADOW/DIALUP flow, the deployment system can transition the system to ROLLBACK, which is functionally the same as DRAIN. To avoid serving traffic from cold shards after rolling back, we can still warm up by moving to ROLLBACK_SHADOW (which is functionally equivalent to SHADOW).
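The SHADOW-mode selection rule can be sketched as follows. This is illustrative logic only, not actual OpenSearch routing code; `pick_copies` and the copy/node shapes are invented for the example:

```python
import random

# Illustrative sketch of SHADOW-mode routing: copies are picked with normal
# (uniform) weight, but when a shadow copy is chosen the request is duplicated
# to one of the remaining real copies, whose response is the one returned to
# the customer. Assumes at least one copy lives on a non-shadow node.
def pick_copies(copies, shadow_nodes, rng=random):
    """Return (serving_copy, shadow_copy_or_None) for one search request."""
    chosen = rng.choice(copies)            # normal 1/N weighting over all copies
    if chosen["node"] not in shadow_nodes:
        return chosen, None                # ordinary request, no shadowing
    real = [c for c in copies if c["node"] not in shadow_nodes]
    return rng.choice(real), chosen        # uniform among the remaining real copies
```

Over many requests, the copy on the SHADOW node still sees roughly its normal 1/3 share of traffic, but customers are only ever served from real copies, so errors/latency on the shadow copy can be observed without customer impact.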
Additionally, I would like to propose the following behaviors during deployment:
- The cluster minimizes shard movement related to the deployed nodes. Since we can assume that the nodes under deployment will be back soon, we can avoid moving shards off of them. This will not be true if the deployment works via node replacement.
- If shards are unallocated because they are supposed to be on nodes that are currently deploying, we should still consider the cluster to be green. If another node goes down (i.e. not a deploying node), resulting in unallocated shards, then we should go yellow or red.
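A minimal sketch of that proposed health rule, assuming we can tell which node an unassigned shard was supposed to live on (the tuple shape here is invented for illustration; this is not how OpenSearch computes health today):

```python
# Proposed health semantics: unassigned shards whose home nodes are currently
# deploying do not degrade cluster health; any other unassigned shard still
# turns the cluster yellow (replica) or red (primary).
def cluster_health(unassigned_shards, deploying_nodes):
    """unassigned_shards: iterable of (shard_id, is_primary, home_node)."""
    unexpected = [is_primary for (_sid, is_primary, node) in unassigned_shards
                  if node not in deploying_nodes]
    if any(unexpected):
        return "red"          # a primary is missing for a non-deployment reason
    return "yellow" if unexpected else "green"
```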
Related component
ShardManagement:Placement
Describe alternatives you've considered
We can try to fix the gaps in the current "low-level" shard placement and traffic routing APIs. Instead of a single "deployment" API, this could be a series of cluster and index settings that let us control where primary shards can be allocated and how search traffic is routed to nodes. Shadowing could be added as a standalone feature.
Additional context
While I've focused on "rolling" style deployments, where we take down a subset of nodes, update them, and bring them back up, I believe this same approach offers value for folks who add new nodes to the cluster, move shards from old nodes to new nodes, then delete the old nodes. You can avoid the "cold start" problem on the new nodes.
The trick is that you "pre-DRAIN" the new nodes before you bring them up. With the proposed API, that should all work. Instead of moving shards, you should ideally increase the replica count so that shards "expand" onto the new nodes. Then you can SHADOW to warm up + validate the new nodes before they see any customer traffic. Finally, we'd need to tweak the DIALUP semantics to distribute customer traffic across the old and new nodes in a way that keeps them similarly "hot".
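One possible shape for those tweaked DIALUP semantics, sketched under the assumption that per-request node selection is proportional to normalized per-node weights (the function and its parameters are hypothetical):

```python
# Hypothetical DIALUP weighting for the expand-onto-new-nodes case: each new
# node's weight ramps from 0 to 1 while old nodes stay at 1, and traffic is
# split in proportion to normalized weights, so at 100% dialup old and new
# nodes are equally hot.
def routing_weights(old_nodes, new_nodes, dialup_fraction):
    """Return each node's share of search traffic at a given dialup fraction."""
    weights = {n: 1.0 for n in old_nodes}
    weights.update({n: dialup_fraction for n in new_nodes})
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}
```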
Essentially, you should be able to do this kind of deployment without customers even knowing that a deployment has happened -- the latency graph should look smooth.