Gradual Rollover and Rollback Support #2518

@sumobrian

Overview

Gradual Rollover enables users to incrementally shift production traffic from their source cluster to OpenSearch — starting at 1%, gradually increasing to 100% — with continuous health monitoring and instant rollback capability at every stage.

The Problem

Today, migrations require a "big bang" cutover — at some point, all traffic must switch from source to target simultaneously. For mission-critical systems (e-commerce search, financial analytics, security monitoring), this creates unacceptable risk:

  • No gradual validation: You cannot verify target behavior with real production traffic at small scale before committing
  • No instant rollback: If issues are discovered post-cutover, reverting requires another disruptive migration
  • All-or-nothing risk: A single issue with the target cluster can cause complete service degradation

What This Delivers

Traffic Splitting Engine

  • Gradual routing: Programmable traffic distribution from source to target (1% to 5% to 10% to 25% to 50% to 100%)
  • Per-route control: Different traffic percentages for different index patterns or query types
  • Session stickiness: Ensure individual users see consistent behavior during rollover
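
The routing behavior described above can be sketched with a stable hash of the session identifier. This is a minimal illustration, not the proposed engine: the names `bucket_for`, `target_percent_for`, and `choose_cluster`, and the use of glob-style index patterns, are assumptions made for the example. Because a session's bucket never changes, a session routed to the target at 5% stays on the target at 10% and beyond.

```python
import hashlib
from fnmatch import fnmatchcase
from typing import Dict

BUCKETS = 10_000  # resolution of 0.01 percentage points


def bucket_for(session_id: str) -> int:
    """Map a session id to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def target_percent_for(index: str, route_percents: Dict[str, float],
                       default_percent: float) -> float:
    """Return the percentage of the first matching index pattern, else the default."""
    for pattern, percent in route_percents.items():
        if fnmatchcase(index, pattern):
            return percent
    return default_percent


def choose_cluster(session_id: str, index: str,
                   route_percents: Dict[str, float],
                   default_percent: float) -> str:
    """Return "target" or "source" for one request (hypothetical helper)."""
    percent = target_percent_for(index, route_percents, default_percent)
    # Sessions whose bucket falls below the cutoff go to the target cluster.
    cutoff = round(percent * BUCKETS / 100)
    return "target" if bucket_for(session_id) < cutoff else "source"
```

Raising the percentage only widens the cutoff, so sessions never flip back to the source unless the percentage is lowered.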

Dual-Write Coordination

  • Synchronized writes: Keep source and target clusters in sync during the rollover period
  • Conflict resolution: Handle write conflicts when both clusters are accepting writes
  • Write-path validation: Verify that writes to the target produce the same results as the corresponding writes to the source
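
As a rough illustration of synchronized writes with write-path validation, the sketch below sends each write to both clusters and reports which response fields differ. Everything here is hypothetical: the `WriteFn` signature, the `dual_write` helper, the `result` field used for comparison, and the in-memory writer that stands in for a real cluster client. Conflict resolution and failure handling are deliberately left out, since the proposal does not specify a policy.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Tuple

# Assumed signature: (doc_id, document) -> response fields from the cluster.
WriteFn = Callable[[str, Mapping[str, Any]], Mapping[str, Any]]


@dataclass(frozen=True)
class DualWriteResult:
    source_response: Mapping[str, Any]
    target_response: Mapping[str, Any]
    mismatched_fields: Tuple[str, ...]

    @property
    def consistent(self) -> bool:
        return not self.mismatched_fields


def dual_write(doc_id: str, doc: Mapping[str, Any],
               write_source: WriteFn, write_target: WriteFn,
               compare_fields: Tuple[str, ...] = ("result",)) -> DualWriteResult:
    """Apply one write to both clusters and compare selected response fields."""
    source_response = write_source(doc_id, doc)
    target_response = write_target(doc_id, doc)
    mismatched = tuple(
        name for name in compare_fields
        if source_response.get(name) != target_response.get(name)
    )
    return DualWriteResult(source_response, target_response, mismatched)


def make_in_memory_writer(store: Dict[str, Mapping[str, Any]]) -> WriteFn:
    """Test double standing in for a real cluster client."""
    def write(doc_id: str, doc: Mapping[str, Any]) -> Mapping[str, Any]:
        result = "updated" if doc_id in store else "created"
        store[doc_id] = dict(doc)
        return {"result": result}
    return write
```

A mismatch such as `created` on the source but `updated` on the target would indicate that the two clusters had diverged before the write.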

Health Monitoring and Automatic Pause

  • Continuous health checks: Monitor latency, error rates, and result quality on both clusters
  • Automatic pause: If target health degrades beyond configurable thresholds, automatically halt further traffic increases
  • Alerting integration: Configurable alarms and notifications for rollover milestones and issues
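
One way to picture the automatic pause is a comparison of target metrics against source metrics with configurable thresholds. The sketch below is an assumption-laden example: the metric names, the default threshold values, and the `pause_reasons` function are invented for illustration and are not taken from the proposal.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterMetrics:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass(frozen=True)
class Thresholds:
    # Assumed defaults; real values would be set by the operator.
    max_latency_ratio: float = 1.5      # target p99 may be at most 1.5x source p99
    max_error_rate_delta: float = 0.01  # target may exceed source by 1 point


def pause_reasons(source: ClusterMetrics, target: ClusterMetrics,
                  thresholds: Thresholds) -> List[str]:
    """Return the threshold names the target violates; empty means healthy."""
    reasons = []
    if target.p99_latency_ms > source.p99_latency_ms * thresholds.max_latency_ratio:
        reasons.append("latency")
    if target.error_rate - source.error_rate > thresholds.max_error_rate_delta:
        reasons.append("error_rate")
    return reasons
```

A non-empty result would halt further traffic increases, and the returned reasons could feed the alerting integration.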

Automated Rollback

  • Instant revert: One-command (or automatic) rollback to 100% source traffic if issues are detected
  • State preservation: Rollback preserves all data and state, enabling retry after issue resolution
  • Partial rollback: Option to reduce target traffic percentage rather than full revert
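
The difference between a full and a partial rollback can be shown with a small state object. This is only a sketch under stated assumptions: `RolloverState` and `rollback` are hypothetical names, and the example treats rollback purely as a routing change that records what happened, so nothing is deleted and the rollover can be retried later.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RolloverState:
    target_percent: float
    # Each entry: (previous percent, new percent, reason).
    history: List[Tuple[float, float, str]] = field(default_factory=list)


def rollback(state: RolloverState, to_percent: float = 0.0,
             reason: str = "manual") -> RolloverState:
    """Lower the target share; 0.0 is a full revert to the source cluster."""
    if not 0.0 <= to_percent <= state.target_percent:
        raise ValueError("rollback can only lower the target percentage")
    state.history.append((state.target_percent, to_percent, reason))
    state.target_percent = to_percent
    return state
```

Calling `rollback(state, to_percent=10.0)` performs a partial rollback, while the default call returns all traffic to the source.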

Rollout Orchestration

  • Multi-stage deployment management: Define rollout stages with gates between them
  • Time-based progression: Automatically advance to the next stage after a configurable soak period with healthy metrics
  • Manual override: Operators can pause, advance, or roll back at any stage
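
The stage progression described above can be modeled as a small state machine. The sketch below is illustrative only: the `Rollout` class, its method names, and the one-hour default soak period are assumptions, and the stage percentages reuse the 1% to 100% sequence listed under the traffic splitting engine.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Rollout:
    stages: Tuple[int, ...] = (1, 5, 10, 25, 50, 100)
    soak_seconds: float = 3600.0  # assumed default soak period
    stage_index: int = 0
    stage_started_at: float = 0.0
    paused: bool = False

    @property
    def target_percent(self) -> int:
        return self.stages[self.stage_index]

    def advance(self, now: float) -> None:
        """Move to the next stage (also usable as a manual override)."""
        self.stage_index = min(self.stage_index + 1, len(self.stages) - 1)
        self.stage_started_at = now

    def resume(self, now: float) -> None:
        """Clear a pause and restart the soak timer for the current stage."""
        self.paused = False
        self.stage_started_at = now

    def tick(self, now: float, healthy: bool) -> int:
        """Evaluate the gate once and return the current target percentage."""
        if not healthy:
            self.paused = True  # automatic pause on degraded health
        elif not self.paused and now - self.stage_started_at >= self.soak_seconds:
            self.advance(now)
        return self.target_percent
```

In this sketch a pause is sticky: an operator must call `resume` before time-based progression continues, which keeps a briefly recovered target from advancing unattended.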

Observability Dashboard

  • Real-time traffic distribution: Visual display of current traffic split between source and target
  • Comparative metrics: Side-by-side latency, throughput, and error rate comparison
  • Rollover history: Timeline of all rollover actions, health events, and decisions

Value

  • Zero-downtime migration: Users can migrate mission-critical systems without any service interruption, opening migration opportunities that were previously impossible
  • Risk elimination: Gradual rollover with instant rollback makes migrations reversible at every stage, so there is no "point of no return" and users can weigh migration risk accordingly
  • Faster decision-making: Users who have been evaluating migration for months or years can finally proceed because the risk profile is now acceptable
  • Applies to all sources: Gradual rollover works for Elasticsearch, Solr, and OpenSearch source migrations — it is a universal capability that increases the value of every other migration feature

Target Timeline

  • Q2 2026: Architecture design and traffic splitting engine (April-June)
  • Q3 2026: Dual-write coordination and health monitoring (July-September)
  • Q4 2026: Automated rollback, orchestration, and observability dashboard (October-December)

Dependencies

  • Migration Assistant 3.0 GA
  • Validation at Scale (for health monitoring during rollover)
  • Replayer and Capture infrastructure

Jira Epics

Related Issues

Metadata

    Labels

    enhancement (New feature or request)

    Projects

    Status

    Q4'2026