Gradual Rollover and Rollback Support

# Gradual Rollover and Rollback Support

## Overview

Gradual Rollover enables users to incrementally shift production traffic from their source cluster to OpenSearch — starting at 1%, gradually increasing to 100% — with continuous health monitoring and instant rollback capability at every stage.

## The Problem

Today, migrations require a "big bang" cutover — at some point, all traffic must switch from source to target simultaneously. For mission-critical systems (e-commerce search, financial analytics, security monitoring), this creates unacceptable risk:

- **No gradual validation**: You cannot verify target behavior with real production traffic at small scale before committing
- **No instant rollback**: If issues are discovered post-cutover, reverting requires another disruptive migration
- **All-or-nothing risk**: A single issue with the target cluster can cause complete service degradation

## What This Delivers

### Traffic Splitting Engine
- **Gradual routing**: Programmable traffic distribution from source to target (1% → 5% → 10% → 25% → 50% → 100%)
- **Per-route control**: Different traffic percentages for different index patterns or query types
- **Session stickiness**: Ensure individual users see consistent behavior during rollover
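
One way the splitting and stickiness bullets could fit together is deterministic hash bucketing, sketched below. This is only an illustration under stated assumptions, not the planned design: the function names, the `rules` list of `(index_pattern, percent)` pairs, and the 10,000-bucket resolution are all hypothetical.

```python
import hashlib
from fnmatch import fnmatchcase

BUCKETS = 10_000  # hypothetical resolution: one bucket = 0.01% of sessions


def bucket_for(session_id: str) -> int:
    """Map a session ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def target_percent_for(index: str, rules: list[tuple[str, float]], default: float) -> float:
    """Return the target percentage of the first rule whose pattern matches the index."""
    for pattern, percent in rules:
        if fnmatchcase(index, pattern):
            return percent
    return default


def route(session_id: str, index: str, rules: list[tuple[str, float]], default_percent: float) -> str:
    """Route a request to "target" if its session bucket falls inside the target share."""
    percent = target_percent_for(index, rules, default_percent)
    threshold = percent / 100 * BUCKETS
    return "target" if bucket_for(session_id) < threshold else "source"
```

Because the bucket depends only on the session ID, a session that is routed to the target at 5% is still routed to the target at 25%, so raising the percentage never flips a user back to the source.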

### Dual-Write Coordination
- **Synchronized writes**: Keep source and target clusters in sync during the rollover period
- **Conflict resolution**: Handle write conflicts when both clusters are accepting writes
- **Write-path validation**: Verify that writes to the target produce identical results
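
Write-path validation could, for instance, compare the index API responses returned by the two clusters field by field. The sketch below assumes the responses are already parsed into dictionaries; the helper name and the set of ignored fields (values that may legitimately differ between clusters) are illustrative assumptions, not a decided policy.

```python
from typing import Any

# Assumption: these response fields may legitimately differ between clusters.
DEFAULT_IGNORED = frozenset({"_shards", "_seq_no", "_primary_term"})
MISSING = "<missing>"  # placeholder for a field present on only one side


def diff_write_responses(
    source: dict[str, Any],
    target: dict[str, Any],
    ignored: frozenset[str] = DEFAULT_IGNORED,
) -> dict[str, tuple[Any, Any]]:
    """Return {field: (source_value, target_value)} for every field that differs."""
    diffs: dict[str, tuple[Any, Any]] = {}
    for key in sorted((source.keys() | target.keys()) - ignored):
        source_value = source.get(key, MISSING)
        target_value = target.get(key, MISSING)
        if source_value != target_value:
            diffs[key] = (source_value, target_value)
    return diffs
```

An empty result means the two writes agree on every compared field; any non-empty result could be surfaced as a validation failure.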

### Health Monitoring and Automatic Pause
- **Continuous health checks**: Monitor latency, error rates, and result quality on both clusters
- **Automatic pause**: If target health degrades beyond configurable thresholds, automatically halt traffic increase
- **Alerting integration**: Configurable alarms and notifications for rollover milestones and issues
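
The automatic pause could be driven by a comparison of target health against source health. The following is a minimal sketch assuming two metrics (p99 latency and error rate) and two hypothetical thresholds; the class names, field names, and default values are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class PauseThresholds:
    max_latency_ratio: float = 1.5      # target p99 may be at most 1.5x source p99
    max_error_rate_delta: float = 0.01  # target error rate may exceed source by 1 point


def pause_reasons(
    source: HealthSnapshot,
    target: HealthSnapshot,
    thresholds: PauseThresholds = PauseThresholds(),
) -> list[str]:
    """Return the thresholds the target has breached; an empty list means healthy."""
    reasons: list[str] = []
    if source.p99_latency_ms > 0:
        ratio = target.p99_latency_ms / source.p99_latency_ms
        if ratio > thresholds.max_latency_ratio:
            reasons.append("latency")
    if target.error_rate - source.error_rate > thresholds.max_error_rate_delta:
        reasons.append("error_rate")
    return reasons
```

Comparing against the source rather than against fixed limits keeps the check meaningful when both clusters slow down together, for example during a traffic spike.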

### Automated Rollback
- **Instant revert**: One-command (or automatic) rollback to 100% source traffic if issues are detected
- **State preservation**: Rollback preserves all data and state, enabling retry after issue resolution
- **Partial rollback**: Option to reduce target traffic percentage rather than full revert
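
The difference between a full and a partial rollback can be expressed as a small function over the stage list. This sketch reuses the example progression from the Traffic Splitting Engine section; the function name and the rule that a partial rollback steps down to the next lower stage are assumptions.

```python
# Example progression taken from the Traffic Splitting Engine section.
STAGES = (1, 5, 10, 25, 50, 100)


def rollback_percent(current: int, stages: tuple[int, ...] = STAGES, partial: bool = False) -> int:
    """Return the target traffic percentage to apply after a rollback.

    A full rollback sends 100% of traffic to the source (0% to the target).
    A partial rollback steps down to the highest stage below the current one.
    """
    if not partial:
        return 0
    lower = [stage for stage in stages if stage < current]
    return max(lower) if lower else 0
```

For example, `rollback_percent(25, partial=True)` steps back to 10, while `rollback_percent(25)` reverts to 0.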

### Rollout Orchestration
- **Multi-stage deployment management**: Define rollout stages with gates between them
- **Time-based progression**: Automatically advance to next stage after configurable soak period with healthy metrics
- **Manual override**: Operators can pause, advance, or roll back at any stage
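
Stage gates with a soak period could be modeled as a pure decision function that an orchestrator calls on each tick. The sketch below is one possible shape, with hypothetical names throughout (`Stage`, `Action`, `next_action`, `manual_hold`); it only decides what to do and leaves applying the decision to the caller.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HOLD = "hold"
    ADVANCE = "advance"
    PAUSE = "pause"


@dataclass(frozen=True)
class Stage:
    target_percent: int
    soak_seconds: int  # how long metrics must stay healthy before advancing


def next_action(
    stage_index: int,
    stages: list[Stage],
    seconds_in_stage: float,
    healthy: bool,
    manual_hold: bool = False,
) -> Action:
    """Decide what the orchestrator should do at the current stage."""
    if not healthy:
        return Action.PAUSE      # health gate failed: stop increasing traffic
    if manual_hold:
        return Action.HOLD       # operator override takes precedence over the timer
    if stage_index >= len(stages) - 1:
        return Action.HOLD       # final stage reached: nothing to advance to
    if seconds_in_stage >= stages[stage_index].soak_seconds:
        return Action.ADVANCE    # soak period completed with healthy metrics
    return Action.HOLD
```

Keeping the decision free of side effects makes the gate logic easy to unit test and to replay against the rollover history.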

### Observability Dashboard
- **Real-time traffic distribution**: Visual display of current traffic split between source and target
- **Comparative metrics**: Side-by-side latency, throughput, and error rate comparison
- **Rollover history**: Timeline of all rollover actions, health events, and decisions

## Value

- **Zero-downtime migration**: Users can migrate mission-critical systems without any service interruption, opening migration opportunities that were previously impossible
- **Risk elimination**: Gradual rollover with instant rollback means migrations are reversible at every stage. Because there is no "point of no return," the risk calculus for users changes fundamentally
- **Faster decision-making**: Users who have been evaluating migration for months or years can finally proceed because the risk profile is now acceptable
- **Applies to all sources**: Gradual rollover works for Elasticsearch, Solr, and OpenSearch source migrations — it is a universal capability that increases the value of every other migration feature

## Target Timeline
- **Q2 2026**: Architecture design and traffic splitting engine (April-June)
- **Q3 2026**: Dual-write coordination and health monitoring (July-September)
- **Q4 2026**: Automated rollback, orchestration, and observability dashboard (October-December)


## Dependencies
- Migration Assistant 3.0 GA
- Validation at Scale (for health monitoring during rollover)
- Replayer and Capture infrastructure

## Jira Epics
- [MIGRATIONS-2875](https://opensearch.atlassian.net/browse/MIGRATIONS-2875): Gradual Switchover (Part 1)
- [MIGRATIONS-2877](https://opensearch.atlassian.net/browse/MIGRATIONS-2877): Gradual Switchover (Part 2)
- [MIGRATIONS-2914](https://opensearch.atlassian.net/browse/MIGRATIONS-2914): Gradual Switchover (Part 3)

## Related Issues
- #1092 - In-Flight Behavioral and Performance Validation
- #1072 - Auto-scaling Up and Down for Live Capture and Backfill
- #2444 - [RFC] OpenSearch Migration Companion: A Fully Autonomous Migration Experience


Gradual Rollover and Rollback Support #2518