Backfill Workflow

Andre Kurait edited this page Mar 16, 2026 · 14 revisions

This page describes how to perform document backfill using the Workflow CLI. The backfill workflow migrates documents from your source cluster to your target cluster using snapshot-based reindexing (Reindex-from-Snapshot, or RFS).

The backfill mental model

A backfill migration follows this sequence:

  1. Snapshot — Create a point-in-time snapshot of source indexes
  2. Register — Make the snapshot accessible to the migration tooling
  3. Metadata — Transfer index mappings, settings, and templates to the target
  4. RFS Load — Reindex documents from the snapshot to the target cluster
  5. Cleanup — Remove temporary coordination state

Each phase completes before the next begins. Approval gates between phases let you verify progress before continuing.
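The sequence above can be sketched as a simple driver loop. This is an illustrative model only, not the actual workflow engine: the phase names mirror the list above, and the approve callback stands in for the interactive approval gates the CLI presents between phases.

```python
# Illustrative model of the backfill phase sequence with approval gates.
# Phase names mirror the workflow; approve() stands in for the interactive
# approval the CLI presents between phases.

PHASES = ["snapshot", "register", "metadata", "rfs-load", "cleanup"]

def run_backfill(run_phase, approve):
    """Run each phase to completion, pausing for approval before the next."""
    completed = []
    for phase in PHASES:
        run_phase(phase)          # blocks until the phase finishes
        completed.append(phase)
        if phase != PHASES[-1] and not approve(phase):
            break                 # operator declined to continue
    return completed

# Example: approve everything, record the order phases ran in.
order = []
run_backfill(order.append, lambda p: True)
print(order)  # ['snapshot', 'register', 'metadata', 'rfs-load', 'cleanup']
```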

Configuration categories

Run workflow configure sample to see all available options for your version. The configuration covers these categories:

Source and target clusters

Define endpoints, versions, and authentication for each cluster.

Snapshot repositories

Configure where snapshots are stored (S3 bucket, path, region, IAM role for AWS managed sources).

Index allowlists

Control which indexes are included at each phase:

  • Snapshot creation allowlist
  • Metadata migration allowlist
  • Document backfill allowlist

Resource allocation

Tune parallelism and resource limits for the migration pods.
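The four categories can be pictured as one configuration object. The key names below are hypothetical, chosen only to illustrate the shape; run workflow configure sample to see the real field names for your version.

```python
# Hypothetical sketch of the configuration categories described above.
# All key names are illustrative -- the real schema comes from
# `workflow configure sample`.

config = {
    "source":   {"endpoint": "https://source:9200", "version": "ES 7.10"},
    "target":   {"endpoint": "https://target:9200", "version": "OS 2.x"},
    "snapshot": {"s3_bucket": "my-bucket", "s3_region": "us-east-1"},
    "allowlists": {
        "snapshot": ["regex:logs-.*"],   # snapshot creation
        "metadata": ["regex:logs-.*"],   # metadata migration
        "backfill": ["regex:logs-.*"],   # document backfill
    },
    "resources": {"rfs_workers": 4},
}

# Each phase reads only its own allowlist:
print(config["allowlists"]["backfill"])  # ['regex:logs-.*']
```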

Index allowlist syntax

Allowlist entries are matched as exact literal strings by default. Use the regex: prefix for pattern matching:

| Entry | Matches |
| --- | --- |
| my-index | Only "my-index" (exact match) |
| * | Only an index literally named "*" (not a wildcard) |
| regex:.* | All indexes (regex wildcard) |
| regex:logs-.* | "logs-app", "logs-web", etc. |
| regex:logs-.*-2024 | "logs-app-2024", "logs-web-2024", etc. |

Common mistake: Using * expecting it to match all indexes. Use regex:.* instead.
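The matching rule in the table can be sketched in a few lines. This is an illustrative model, not the tooling's actual matcher; in particular, anchored full-string regex matching is an assumption made to agree with the examples above.

```python
import re

# Sketch of the allowlist rule described above: entries are exact literals
# unless prefixed with "regex:". Full-string anchoring is an assumption
# consistent with the table's examples.

def allowlist_match(entry: str, index: str) -> bool:
    if entry.startswith("regex:"):
        # Anchored full match, so "regex:logs-.*" does not match "app-logs-x".
        return re.fullmatch(entry[len("regex:"):], index) is not None
    return entry == index

print(allowlist_match("my-index", "my-index"))      # True  (exact literal)
print(allowlist_match("*", "my-index"))             # False ("*" is not a wildcard)
print(allowlist_match("regex:.*", "my-index"))      # True  (regex wildcard)
print(allowlist_match("regex:logs-.*", "logs-app")) # True
```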

Using existing snapshots

If you already have a snapshot, reference it instead of creating a new one by setting externallyManagedSnapshot in your snapshot configuration. See workflow configure sample for the exact field path.

Verification

After the workflow completes, verify the migration:

Check document counts

Use the console's authenticated curl wrapper:

# Source cluster
console clusters curl source -- "/_cat/indices?v"

# Target cluster
console clusters curl target -- "/_cat/indices?v"

Compare specific indexes

console clusters curl target -- "/<index>/_count"
console clusters curl target -- "/<index>/_settings"

Test queries

Run representative queries against the target to verify data integrity.
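A count comparison like the one above is easy to script once the per-index counts are in hand. The sketch below assumes you have already collected counts from each cluster (for example via the console's curl wrapper and the /_count API); the dicts are stand-ins for those results.

```python
# Sketch of a per-index document-count comparison. The input dicts stand in
# for counts gathered from each cluster's /_count API.

def compare_counts(source_counts, target_counts):
    """Return indexes whose target count differs from the source count."""
    mismatches = {}
    for index, src in source_counts.items():
        tgt = target_counts.get(index, 0)   # missing on target counts as 0
        if src != tgt:
            mismatches[index] = (src, tgt)
    return mismatches

source = {"logs-app": 10_000, "logs-web": 5_000}
target = {"logs-app": 10_000, "logs-web": 4_998}
print(compare_counts(source, target))  # {'logs-web': (5000, 4998)}
```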

Error recovery

When a workflow fails

  1. Check status to identify the failed step:

    workflow status
  2. View logs for the failed step:

    workflow output
  3. Fix the underlying issue (configuration, permissions, cluster health, etc.)

  4. Resubmit:

    workflow submit

RFS checkpoints

RFS tracks progress at the shard level. If a backfill fails partway through:

  • Completed shards are recorded in the coordination index
  • Resubmitting resumes from the last checkpoint
  • Already-migrated documents are not re-processed

This means you don't lose progress on large migrations if a failure occurs.
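The resume behavior can be modeled in a few lines. This mirrors the checkpointing described above but is not the actual RFS code; the shard naming is illustrative.

```python
# Illustrative model of shard-level checkpointing: completed shards are
# recorded, and a resumed run processes only the shards that remain.

def resume_backfill(all_shards, completed, migrate):
    """Migrate only shards not already recorded as completed."""
    for shard in all_shards:
        if shard in completed:
            continue              # already migrated -- never re-processed
        migrate(shard)
        completed.add(shard)
    return completed

completed = {"index-a/0", "index-a/1"}        # recorded before the failure
shards = ["index-a/0", "index-a/1", "index-a/2", "index-a/3"]
migrated_now = []
resume_backfill(shards, completed, migrated_now.append)
print(migrated_now)  # ['index-a/2', 'index-a/3']
```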

Common failure causes

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| Snapshot creation fails | S3 permissions, missing IAM role | Check s3RoleArn for AWS managed sources |
| Metadata migration fails | Version incompatibility | Review Migration Paths for supported versions |
| RFS stalls | Target cluster overloaded | Reduce parallelism, check cluster health |
| Authentication errors | Invalid credentials | Verify Kubernetes secrets exist and contain correct values |

Parallelism and resource tuning

Parallelism

RFS runs multiple workers in parallel, each reading shard data directly from the snapshot in S3. Because workers read from object storage rather than the source cluster, scaling up workers places no additional load on the source. The practical constraints on parallelism are the target cluster's indexing capacity and the Kubernetes resources available to the workers.

Consider:

  • Target cluster indexing capacity (the usual bottleneck)
  • Available Kubernetes node resources (CPU, memory for workers)
  • S3 read throughput (rarely a bottleneck)

Start with defaults and increase if the target cluster has headroom.
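Since one worker processes one shard at a time, shard count caps useful parallelism; beyond that, the limit is whatever the target cluster and Kubernetes nodes can absorb. The sketch below is a back-of-envelope calculation with illustrative numbers, not a recommendation.

```python
# Back-of-envelope worker-count sketch: useful parallelism is bounded by
# shard count, Kubernetes pod capacity, and target-cluster headroom.
# All numbers are illustrative.

def useful_workers(shard_count, k8s_worker_capacity, target_headroom_workers):
    return min(shard_count, k8s_worker_capacity, target_headroom_workers)

# 40 shards, room for 16 worker pods, target can absorb ~10 workers' writes:
print(useful_workers(40, 16, 10))  # 10
# Only 5 shards: more than 5 workers buys nothing.
print(useful_workers(5, 16, 10))   # 5
```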

Resource limits

Workflow pods use default resource requests and limits. For large migrations, you may need to adjust:

  • CPU and memory for RFS workers
  • Storage for temporary data
  • Pod count limits in Argo Workflows

Migration duration factors

| Factor | Impact |
| --- | --- |
| Total data volume | Primary factor |
| Number of shards | Determines maximum parallelism (1 worker per shard) |
| Document size | Larger documents = slower indexing |
| Target cluster capacity | Indexing throughput is usually the bottleneck |
| S3 read throughput | Rarely a bottleneck; scales with worker count |
| Network bandwidth | Data transfer speed |
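The factors above combine into a rough duration estimate. The throughput figures below are illustrative assumptions; measure your own target cluster before relying on an estimate like this.

```python
# Rough duration estimate from the factors above. Per-worker throughput is
# an illustrative assumption, not a measured number.

def estimate_hours(total_gb, workers, gb_per_worker_hour):
    # Effective rate scales with workers until the target is saturated.
    return total_gb / (workers * gb_per_worker_hour)

# 2 TB, 8 workers, each sustaining ~25 GB/hour into the target:
print(round(estimate_hours(2048, 8, 25), 1))  # 10.2
```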

Monitoring during backfill

# Interactive TUI — view progress, approve steps, tail logs
workflow manage

# Check workflow status
workflow status

# Stream logs (use tab-completion on -l to discover available label filters)
workflow output --follow
