Gradual Rollover and Rollback Support
Overview
Gradual Rollover lets users shift production traffic from their source cluster to OpenSearch incrementally, starting at 1% and increasing to 100%, with continuous health monitoring and instant rollback available at every stage.
The Problem
Today, migrations require a "big bang" cutover: at some point, all traffic must switch from source to target at once. For mission-critical systems (e-commerce search, financial analytics, security monitoring), this creates unacceptable risk:
- No gradual validation: You cannot verify target behavior with real production traffic at small scale before committing
- No instant rollback: If issues are discovered post-cutover, reverting requires another disruptive migration
- All-or-nothing risk: A single issue with the target cluster can cause complete service degradation
What This Delivers
Traffic Splitting Engine
- Gradual routing: Programmable traffic distribution from source to target in stages (for example 1%, 5%, 10%, 25%, 50%, then 100%)
- Per-route control: Different traffic percentages for different index patterns or query types
- Session stickiness: Ensure individual users see consistent behavior during rollover
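One way to combine gradual routing, per-route control, and session stickiness is to hash each session ID into a fixed bucket and send the session to the target when its bucket falls below the current percentage. The sketch below is illustrative only: the names `bucket_for`, `route`, `percent_for_index`, and the `ROUTE_PERCENTAGES` table are hypothetical and are not part of any existing Migration Assistant API.

```python
import hashlib
from fnmatch import fnmatchcase

BUCKETS = 10_000  # resolution of the split: 0.01% per bucket

# Hypothetical per-route table: first matching index pattern wins.
ROUTE_PERCENTAGES = [("logs-*", 25.0), ("*", 5.0)]


def percent_for_index(index: str) -> float:
    """Return the target percentage configured for an index name."""
    for pattern, percent in ROUTE_PERCENTAGES:
        if fnmatchcase(index, pattern):
            return percent
    return 0.0


def bucket_for(session_id: str) -> int:
    """Map a session ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(session_id: str, target_percent: float) -> str:
    """Return "target" or "source" for this session at the given percentage."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    # The same session always lands in the same bucket, so its routing only
    # changes when the percentage crosses that bucket's position.
    threshold = target_percent / 100 * BUCKETS
    return "target" if bucket_for(session_id) < threshold else "source"
```

Because the bucket is derived from the session ID alone, raising the percentage only ever moves sessions from source to target; a session already on the target stays there as the rollover advances.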
Dual-Write Coordination
- Synchronized writes: Keep source and target clusters in sync during the rollover period
- Conflict resolution: Handle write conflicts when both clusters are accepting writes
- Write-path validation: Verify that writes to the target produce the same results as the corresponding writes to the source
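A minimal sketch of synchronized writes with write-path validation might look like the following. It assumes the source remains the system of record during rollover, that each cluster is reached through a caller-supplied function, and that responses are dictionaries; `dual_write`, `WriteFn`, and the `COMPARED_FIELDS` choice are hypothetical. Conflict resolution is deliberately left out.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: (document ID, document body) -> response dictionary.
WriteFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Response fields assumed to be comparable across clusters (illustrative choice).
COMPARED_FIELDS = ("_id", "result")


def dual_write(
    doc_id: str,
    body: Dict[str, Any],
    write_source: WriteFn,
    write_target: WriteFn,
) -> Dict[str, Any]:
    """Write to the source, mirror the write to the target, and compare."""
    # The source write is authoritative; if it raises, the caller sees the error.
    source_response = write_source(doc_id, body)
    try:
        target_response = write_target(doc_id, body)
    except Exception as exc:
        # A target failure is reported as drift instead of failing the request.
        return {
            "response": source_response,
            "in_sync": False,
            "reason": f"target write failed: {exc}",
        }
    mismatched = [
        name
        for name in COMPARED_FIELDS
        if source_response.get(name) != target_response.get(name)
    ]
    return {
        "response": source_response,
        "in_sync": not mismatched,
        "reason": f"mismatched fields: {mismatched}" if mismatched else "",
    }
```

Returning the source response unchanged keeps client behavior stable, while the `in_sync` flag gives health monitoring a signal to act on.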
Health Monitoring and Automatic Pause
- Continuous health checks: Monitor latency, error rates, and result quality on both clusters
- Automatic pause: If target health degrades beyond configurable thresholds, automatically halt further traffic increases
- Alerting integration: Configurable alarms and notifications for rollover milestones and issues
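The automatic pause decision can be sketched as a comparison of target metrics against source metrics. The metric names, the `Thresholds` defaults, and the `should_pause` function below are assumptions chosen for illustration; the actual thresholds and signals are meant to be configurable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSnapshot:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults: the target may be up to 1.5x slower than the
    # source at p99, and its error rate may exceed the source's by 0.01.
    max_latency_ratio: float = 1.5
    max_error_rate_delta: float = 0.01


def should_pause(
    source: HealthSnapshot,
    target: HealthSnapshot,
    limits: Thresholds = Thresholds(),
) -> bool:
    """Return True when the target is degraded relative to the source."""
    latency_degraded = (
        target.p99_latency_ms > source.p99_latency_ms * limits.max_latency_ratio
    )
    errors_degraded = (
        target.error_rate - source.error_rate > limits.max_error_rate_delta
    )
    return latency_degraded or errors_degraded
```

Comparing against the source, instead of against fixed absolute limits, keeps the check meaningful when overall load shifts during the day.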
Automated Rollback
- Instant revert: One-command (or automatic) rollback to 100% source traffic if issues are detected
- State preservation: Rollback preserves all data and state, enabling retry after issue resolution
- Partial rollback: Option to reduce target traffic percentage rather than full revert
Rollout Orchestration
- Multi-stage deployment management: Define rollout stages with gates between them
- Time-based progression: Automatically advance to the next stage after a configurable soak period with healthy metrics
- Manual override: Operators can pause, advance, or roll back at any stage
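Staged progression, automatic pause, and full or partial rollback can be modeled as a small state machine. The sketch below is one possible shape under stated assumptions: the `Rollout` class, its method names, and the `STAGES` tuple are hypothetical, time is passed in as plain seconds, and the `healthy` flag is assumed to come from the health checks described earlier.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

STAGES = (1, 5, 10, 25, 50, 100)  # target traffic percentages, in order


@dataclass
class Rollout:
    soak_seconds: float
    target_percent: int = STAGES[0]
    stage_started_at: float = 0.0
    paused: bool = False
    history: List[Tuple[float, str, int]] = field(default_factory=list)

    def _set(self, now: float, percent: int, action: str) -> None:
        self.target_percent = percent
        self.stage_started_at = now
        self.history.append((now, action, percent))

    def tick(self, now: float, healthy: bool) -> int:
        """Evaluate the gate once and return the current target percentage."""
        if not healthy:
            if not self.paused:
                self.paused = True
                self.history.append((now, "pause", self.target_percent))
            return self.target_percent
        soaked = now - self.stage_started_at >= self.soak_seconds
        higher = [s for s in STAGES if s > self.target_percent]
        if not self.paused and soaked and higher:
            self._set(now, higher[0], "advance")
        return self.target_percent

    def resume(self, now: float) -> None:
        """Operator action: clear the pause and restart the soak period."""
        self.paused = False
        self.stage_started_at = now
        self.history.append((now, "resume", self.target_percent))

    def rollback(self, now: float, to_percent: int = 0) -> None:
        """Full revert by default; pass a lower percentage for partial rollback."""
        if not 0 <= to_percent <= self.target_percent:
            raise ValueError("to_percent must not exceed the current percentage")
        self.paused = True  # stay put until an operator resumes
        self._set(now, to_percent, "rollback")
```

A pause is sticky in this sketch: once health degrades, healthy metrics alone do not restart progression, and an operator must call `resume`. The `history` list doubles as the kind of rollover timeline a dashboard could display.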
Observability Dashboard
- Real-time traffic distribution: Visual display of current traffic split between source and target
- Comparative metrics: Side-by-side latency, throughput, and error rate comparison
- Rollover history: Timeline of all rollover actions, health events, and decisions
Value
- Zero-downtime migration: Users can migrate mission-critical systems without any service interruption, opening migration opportunities that were previously impossible
- Risk elimination: Gradual rollover with instant rollback keeps migrations reversible at every stage. Because there is no "point of no return," users can weigh migration risk very differently
- Faster decision-making: Users who have been evaluating migration for months or years can finally proceed because the risk profile is now acceptable
- Applies to all sources: Gradual rollover works for Elasticsearch, Solr, and OpenSearch source migrations, so it is a universal capability that increases the value of every other migration feature
Target Timeline
- Q2 2026: Architecture design and traffic splitting engine (April-June)
- Q3 2026: Dual-write coordination and health monitoring (July-September)
- Q4 2026: Automated rollback, orchestration, and observability dashboard (October-December)
Dependencies
- Migration Assistant 3.0 GA
- Validation at Scale (for health monitoring during rollover)
- Replayer and Capture infrastructure
Jira Epics
Related Issues