Add workflow pause, resume, and scale commands for K8s Deployments #2611

Draft
oeyh wants to merge 5 commits into opensearch-project:main from oeyh:workflow-pause-resume-scale

oeyh commented Apr 2, 2026

Description

Add workflow pause, workflow resume, and workflow scale subcommands for managing long-running K8s Deployments (backfill and replay) created by Argo Workflows.

  • workflow pause [TASK_NAMES...] — stores current replica count as a K8s annotation (migrations.opensearch.org/pre-pause-replicas) on the Deployment, then scales to 0
  • workflow resume [TASK_NAMES...] — reads the pre-pause replica count from the annotation, restores it, and removes the annotation
  • workflow scale [TASK_NAMES...] --replicas N — sets replica count; scaling to 0 sets the annotation (like pause), scaling to >0 clears it (like resume). Rejects scaling beyond the max defined in the scaleable label.
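
The annotation bookkeeping described in these bullets can be sketched roughly as follows. This is a minimal illustration and not the PR's `DeploymentService`: `FakeDeployment`, `pause`, `resume`, and `scale` are hypothetical names operating on an in-memory object instead of the K8s API, and keeping the saved count when pausing an already-paused Deployment is an assumption that the description does not confirm.

```python
from dataclasses import dataclass, field

# Annotation key taken from the PR description.
PRE_PAUSE_ANNOTATION = "migrations.opensearch.org/pre-pause-replicas"


@dataclass
class FakeDeployment:
    """Hypothetical in-memory stand-in for a K8s Deployment."""
    name: str
    replicas: int
    annotations: dict = field(default_factory=dict)


def pause(dep: FakeDeployment) -> None:
    """Record the current replica count in the annotation, then scale to 0."""
    # Assumption: if the Deployment is already paused, keep the saved count
    # rather than overwriting it with 0.
    if PRE_PAUSE_ANNOTATION not in dep.annotations:
        dep.annotations[PRE_PAUSE_ANNOTATION] = str(dep.replicas)
    dep.replicas = 0


def resume(dep: FakeDeployment) -> bool:
    """Restore the saved replica count and remove the annotation.

    Returns False when there is no saved count to restore.
    """
    saved = dep.annotations.pop(PRE_PAUSE_ANNOTATION, None)
    if saved is None:
        return False
    dep.replicas = int(saved)
    return True


def scale(dep: FakeDeployment, replicas: int) -> None:
    """Scaling to 0 behaves like pause; scaling above 0 clears the annotation."""
    if replicas == 0:
        pause(dep)
    else:
        dep.annotations.pop(PRE_PAUSE_ANNOTATION, None)
        dep.replicas = replicas
```

Keeping the saved count as a string mirrors the fact that K8s annotation values are strings; a real implementation would read and patch the Deployment through the K8s API instead of mutating a local object.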

Deployments are discovered via the label migrations.opensearch.org/scaleable. The label value encodes the max replica count: "0" for unlimited (backfill), or a positive integer like "1" (replayer). Presence of the label implies the Deployment is both pausable and scaleable. Task names support glob patterns (e.g., *.backfill). With no arguments, commands target all matching Deployments with a confirmation prompt (--yes to skip).
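
As a rough sketch of how the label value and glob patterns might be interpreted (the helper names `max_replicas`, `select_tasks`, and `check_scale` are hypothetical, and using `fnmatch` for glob matching is an assumption, not necessarily what the PR does):

```python
from fnmatch import fnmatchcase
from typing import Optional

# Label key taken from the PR description.
SCALEABLE_LABEL = "migrations.opensearch.org/scaleable"


def max_replicas(labels: dict) -> Optional[int]:
    """Return the max replica count, or None when the label value is "0" (unlimited).

    Raises KeyError if the Deployment does not carry the scaleable label.
    """
    value = int(labels[SCALEABLE_LABEL])
    return None if value == 0 else value


def select_tasks(names: list, patterns: list) -> list:
    """Filter task names by glob patterns; no patterns selects every task."""
    if not patterns:
        return list(names)
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


def check_scale(name: str, labels: dict, requested: int) -> None:
    """Reject a request that exceeds the max encoded in the label."""
    limit = max_replicas(labels)
    if limit is not None and requested > limit:
        raise ValueError(
            f"{name} cannot scale to {requested} replicas (max: {limit})"
        )
```

For example, `select_tasks(["source1.target1.backfill", "source1.target1.replayer"], ["*.backfill"])` keeps only the backfill task, and `check_scale` raises for a replayer labeled `"1"` when 3 replicas are requested.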

The pre-pause replica count is persisted as a K8s annotation on the Deployment itself, so it survives pod restarts and requires no external state store.

Issues Resolved

MIGRATIONS-3020

Testing

  • 48 unit tests covering DeploymentService logic, all CLI commands, and tab-completion
  • Manual testing against minikube with fake Deployments verifying full pause → resume round-trip, glob filtering, scale annotation management, max replica enforcement, and confirmation prompts

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@oeyh oeyh had a problem deploying to migrations-cicd-require-approval April 2, 2026 21:34 — with GitHub Actions Failure
codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 88.67188% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.35%. Comparing base (301de45) to head (e45f608).
⚠️ Report is 200 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| ...nsole_link/workflow/services/deployment_service.py | 82.05% | 14 Missing ⚠️ |
| ...link/workflow/commands/autocomplete_deployments.py | 85.71% | 7 Missing ⚠️ |
| ...nsole_link/console_link/workflow/commands/scale.py | 91.17% | 3 Missing ⚠️ |
| ...nsole_link/workflow/commands/deployment_helpers.py | 94.87% | 2 Missing ⚠️ |
| ...sole_link/console_link/workflow/commands/resume.py | 92.00% | 2 Missing ⚠️ |
| ...nsole_link/console_link/workflow/commands/pause.py | 96.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2611      +/-   ##
============================================
- Coverage     75.87%   72.35%   -3.52%     
- Complexity       64       65       +1     
============================================
  Files           664      702      +38     
  Lines         29435    32462    +3027     
  Branches       2338     2761     +423     
============================================
+ Hits          22333    23487    +1154     
- Misses         5931     7721    +1790     
- Partials       1171     1254      +83     
| Flag | Coverage Δ |
| gradle | 68.37% <ø> (-4.07%) ⬇️ |
| node | 92.51% <ø> (-0.04%) ⬇️ |
| python | 76.87% <88.67%> (-2.63%) ⬇️ |

@oeyh oeyh marked this pull request as draft April 2, 2026 22:09
  app: "replayer",
- "workflows.argoproj.io/workflow": makeDirectTypeProxy(args.workflowName)
+ "workflows.argoproj.io/workflow": makeDirectTypeProxy(args.workflowName),
+ "migrations.opensearch.org/pausable": "true"
Member

Should we consider making this "scaleable", maybe with a value for max scale: 1 for replayer, -1 for backfill (unlimited)?

Collaborator Author

Sure, makes sense.

What's the reason for limiting the replayer to 1 replica?

Collaborator

We just have a single replayer, and we are currently not vertically scaling it. We use the speedup factor to make it replay faster.

Collaborator Author

Updated the CR:

  • Use a scaleable label instead of the pausable label and encode the max replica count in its value. A K8s label value must start with an alphanumeric character, so "-1" is rejected; "0" is used to indicate unlimited replicas
  • Set backfill wait retry to 0 (unlimited)
  • Improve test coverage; also refactor to lower cognitive complexity to pass SonarQube

Collaborator Author

This is the K8s error for a "-1" label value:

error: invalid label value: "migrations.opensearch.org/scaleable=-1": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
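
The rule quoted in this error can be checked locally with the same regular expression. A minimal sketch, assuming a hypothetical helper named `is_valid_label_value`; the 63-character limit is the standard K8s label value limit and does not appear in the error text itself:

```python
import re

# Regex copied from the K8s error message above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if value is an acceptable K8s label value."""
    # K8s also caps label values at 63 characters.
    return len(value) <= 63 and LABEL_VALUE_RE.fullmatch(value) is not None


print(is_valid_label_value("-1"))  # False: must start with an alphanumeric character
print(is_valid_label_value("0"))   # True: the value chosen to mean "unlimited"
```

This is why the sentinel for "unlimited" is "0" and not "-1": the leading hyphen fails the first character class.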

oeyh added 3 commits April 3, 2026 04:16
Signed-off-by: Hai Yan <oeyh@amazon.com>
Signed-off-by: Hai Yan <oeyh@amazon.com>
@oeyh oeyh had a problem deploying to migrations-cicd-require-approval April 3, 2026 20:20 — with GitHub Actions Error
oeyh commented Apr 3, 2026

Some command line examples:

workflow pause (confirmation needed when no task name is provided)

$ workflow pause
The following Deployments will be paused:
  - source1.target1.backfill (5 replicas)
  - source1.target1.replayer (1 replicas)
Proceed? [y/N]: y
  ✓ Paused source1.target1.backfill (was 5 replicas)
  ✓ Paused source1.target1.replayer (was 1 replicas)

Paused 2 of 2 Deployment(s).

workflow resume (confirmation needed when no task name is provided)

$ workflow resume
The following Deployments will be resumed:
  - source1.target1.backfill (will restore to 5 replicas)
  - source1.target1.replayer (will restore to 1 replicas)
Proceed? [y/N]: y
  ✓ Resumed source1.target1.backfill to 5 replicas
  ✓ Resumed source1.target1.replayer to 1 replicas

Resumed 2 of 2 Deployment(s).

workflow scale --replicas 3 (rejected — replayer max is 1)

$ workflow scale --replicas 3
The following Deployments will be scaled to 3 replicas:
  - source1.target1.backfill (currently 5 replicas)
  - source1.target1.replayer (currently 1 replicas)
Error: source1.target1.replayer cannot scale to 3 replicas (max: 1)

workflow scale --replicas 1

$ workflow scale --replicas 1
The following Deployments will be scaled to 1 replicas:
  - source1.target1.backfill (currently 5 replicas)
  - source1.target1.replayer (currently 1 replicas)
Proceed? [y/N]: y
  ✓ Scaled source1.target1.backfill to 1 replicas
  ✓ Scaled source1.target1.replayer to 1 replicas

Scaled 2 of 2 Deployment(s).

workflow pause -y (skip confirmation)

$ workflow pause -y
The following Deployments will be paused:
  - source1.target1.backfill (1 replicas)
  - source1.target1.replayer (1 replicas)
  ✓ Paused source1.target1.backfill (was 1 replicas)
  ✓ Paused source1.target1.replayer (was 1 replicas)

Paused 2 of 2 Deployment(s).

workflow resume -y (skip confirmation)

$ workflow resume -y
The following Deployments will be resumed:
  - source1.target1.backfill (will restore to 1 replicas)
  - source1.target1.replayer (will restore to 1 replicas)
  ✓ Resumed source1.target1.backfill to 1 replicas
  ✓ Resumed source1.target1.replayer to 1 replicas

Resumed 2 of 2 Deployment(s).

Signed-off-by: Hai Yan <oeyh@amazon.com>
@oeyh oeyh requested a deployment to migrations-cicd-require-approval April 3, 2026 20:30 — with GitHub Actions Waiting
@oeyh oeyh marked this pull request as ready for review April 3, 2026 20:31
@AndreKurait AndreKurait marked this pull request as draft April 6, 2026 16:27