Add workflow pause, resume, and scale commands for K8s Deployments #2611
oeyh wants to merge 5 commits into opensearch-project:main
Signed-off-by: Hai Yan <oeyh@amazon.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2611 +/- ##
============================================
- Coverage 75.87% 72.35% -3.52%
- Complexity 64 65 +1
============================================
Files 664 702 +38
Lines 29435 32462 +3027
Branches 2338 2761 +423
============================================
+ Hits 22333 23487 +1154
- Misses 5931 7721 +1790
- Partials 1171 1254 +83
  app: "replayer",
- "workflows.argoproj.io/workflow": makeDirectTypeProxy(args.workflowName)
+ "workflows.argoproj.io/workflow": makeDirectTypeProxy(args.workflowName),
+ "migrations.opensearch.org/pausable": "true"
Should we consider making this "scaleable", maybe with a value for max scale: 1 for replayer, -1 for backfill (unlimited)?
Sure, makes sense.
What's the reason for limiting the replayer to 1 replica?
We just have a single replayer; we are currently not vertically scaling it. We use the speedup factor to make it replay faster.
Updated the CR:
- Use a scaleable label instead of the pausable label and encode the max replica count in its value. K8s label values must start and end with an alphanumeric character, so "-1" is rejected; "0" is used instead to indicate unlimited replicas
- Set backfill wait retry to 0 (unlimited)
- Improve test coverage; also refactor to lower cognitive complexity to pass SonarQube
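For illustration, the label encoding described above could be interpreted roughly as follows. This is a minimal sketch, not the code in this PR: the helper names `parse_max_replicas` and `validate_replicas` are hypothetical, and the choice to raise `ValueError` on a rejected request is an assumption.

```python
from typing import Mapping, Optional

# Label key taken from the PR; everything else here is illustrative.
SCALEABLE_LABEL = "migrations.opensearch.org/scaleable"


def parse_max_replicas(label_value: str) -> Optional[int]:
    """Return the max replica count, or None when the label means 'unlimited'."""
    limit = int(label_value)  # raises ValueError for non-integer values
    if limit < 0:
        raise ValueError(f"negative max replica count: {label_value!r}")
    # "0" stands in for unlimited because "-1" is not a valid label value.
    return None if limit == 0 else limit


def validate_replicas(labels: Mapping[str, str], requested: int) -> None:
    """Raise ValueError if the Deployment cannot be scaled to `requested`."""
    if SCALEABLE_LABEL not in labels:
        raise ValueError("Deployment is not labelled as scaleable")
    if requested < 0:
        raise ValueError("replica count must be >= 0")
    limit = parse_max_replicas(labels[SCALEABLE_LABEL])
    if limit is not None and requested > limit:
        raise ValueError(f"requested {requested} replicas, max is {limit}")
```

With this reading, a backfill Deployment labelled "0" accepts any non-negative replica count, while a replayer labelled "1" rejects anything above 1.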
This is the K8s error for "-1" label:
error: invalid label value: "migrations.opensearch.org/scaleable=-1": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
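The rejection can be reproduced locally with the regex quoted in the error message. The sketch below is only a local check, not part of this PR; the function name `is_valid_label_value` is hypothetical, and the 63-character cap is the general Kubernetes limit on label values rather than something stated in the error.

```python
import re

# Regex copied verbatim from the error message above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")
MAX_LABEL_VALUE_LENGTH = 63  # Kubernetes caps label values at 63 characters


def is_valid_label_value(value: str) -> bool:
    """Check a candidate label value against the validation regex and length cap."""
    if len(value) > MAX_LABEL_VALUE_LENGTH:
        return False
    return LABEL_VALUE_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["-1", "0", "1"]:
        print(repr(candidate), is_valid_label_value(candidate))
```

"-1" fails because the value must start with an alphanumeric character, while "0" and "1" pass, which is why "0" was chosen as the unlimited marker.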
Some command line examples:
Description
Add workflow pause, workflow resume, and workflow scale subcommands for managing long-running K8s Deployments (backfill and replay) created by Argo Workflows.
- workflow pause [TASK_NAMES...]: stores the current replica count as a K8s annotation (migrations.opensearch.org/pre-pause-replicas) on the Deployment, then scales to 0
- workflow resume [TASK_NAMES...]: reads the pre-pause replica count from the annotation, restores it, and removes the annotation
- workflow scale [TASK_NAMES...] --replicas N: sets the replica count; scaling to 0 sets the annotation (like pause), scaling to >0 clears it (like resume). Rejects scaling beyond the max defined in the scaleable label.
Deployments are discovered via the label migrations.opensearch.org/scaleable. The label value encodes the max replica count: "0" for unlimited (backfill), or a positive integer like "1" (replayer). Presence of the label implies the Deployment is both pausable and scaleable. Task names support glob patterns (e.g., *.backfill). With no arguments, commands target all matching Deployments with a confirmation prompt (--yes to skip).
The pre-pause replica count is persisted as a K8s annotation on the Deployment itself, so it survives pod restarts and requires no external state store.
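To make the annotation bookkeeping concrete, here is a rough in-memory model of the pause, resume, and scale transitions described above. It is a sketch under assumptions, not the implementation in this PR: a Deployment is modelled as a plain dict, the `task` key and the `select` helper are hypothetical, pausing an already-paused Deployment is assumed to keep the existing annotation, resuming without an annotation is assumed to be a no-op, and the max-replica check is omitted.

```python
import fnmatch
from typing import Dict, List

# Annotation key taken from the PR description.
PRE_PAUSE_ANNOTATION = "migrations.opensearch.org/pre-pause-replicas"


def pause(dep: Dict) -> None:
    """Remember the current replica count in an annotation, then scale to 0."""
    if dep["replicas"] > 0:
        # Assumption: an already-paused Deployment keeps its saved count.
        dep["annotations"][PRE_PAUSE_ANNOTATION] = str(dep["replicas"])
    dep["replicas"] = 0


def resume(dep: Dict) -> None:
    """Restore the saved replica count and remove the annotation."""
    saved = dep["annotations"].pop(PRE_PAUSE_ANNOTATION, None)
    if saved is not None:  # Assumption: nothing to do if never paused.
        dep["replicas"] = int(saved)


def scale(dep: Dict, replicas: int) -> None:
    """Scaling to 0 behaves like pause; scaling above 0 clears the annotation."""
    if replicas == 0:
        pause(dep)
    else:
        dep["annotations"].pop(PRE_PAUSE_ANNOTATION, None)
        dep["replicas"] = replicas


def select(deployments: List[Dict], patterns: List[str]) -> List[Dict]:
    """Match task names against glob patterns; no patterns selects everything."""
    if not patterns:
        return list(deployments)
    return [
        d for d in deployments
        if any(fnmatch.fnmatchcase(d["task"], p) for p in patterns)
    ]
```

Because the saved count lives on the Deployment object itself, a pause followed by a resume round-trips without any state held by the CLI process.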
Issues Resolved
MIGRATIONS-3020
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.