[Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy by jugal-chauhan · Pull Request #2721 · opensearch-project/opensearch-migrations

jugal-chauhan · 2026-04-15T02:08:22Z

Fix 1: Remove transient failureConditions from Kafka and cert-manager readiness waiters

The waitForKafkaClusterReady and waitForCertReady Argo resource steps had failureCondition expressions that matched transient startup states:

- Kafka: `failureCondition: "status.conditions.0.type == NotReady"`. Kafka briefly enters NotReady during normal Strimzi startup before transitioning to Ready.
- Cert-manager: `failureCondition: "status.conditions.0.status == False"`. Certificates start with status == False while being issued.

Argo resource steps continuously evaluate both successCondition and failureCondition against the live resource. When a failureCondition matches, the step fails immediately (exit code 64). With retryPolicy: "OnError", Argo does not retry failureCondition matches; it treats them as deliberate hard failures. As a result, the Kafka waiter failed on its first check during normal startup, and the failure cascaded: readKafkaConnectionProfile couldn't find the Kafka CR, createProxy never completed, and the capture-proxy/replayer pods were never scheduled.

Root Cause

These failureCondition expressions were incorrect — they matched transient states, not permanent failures. Per the Argo resource step model:

- If successCondition matches → step succeeds
- If failureCondition matches → step fails immediately
- If neither matches → step blocks and waits (no error, no retry needed)

The correct behavior is to let the step block until the successCondition is met. If the resource truly fails permanently, the step will time out via the Argo workflow's global timeout.
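The three outcomes listed above can be sketched as a small evaluator. This is a simplified model written for illustration, not Argo's implementation: the names `lookup`, `matches`, and `evaluate` are hypothetical, and the matcher only understands `path == value` and bare `path` (existence) clauses joined by commas.

```typescript
// Simplified model of the three outcomes of a resource step check.
// NOT Argo's code: only `path == value` and bare `path` clauses are supported,
// and commas are treated as logical AND.
type Outcome = "succeeded" | "failed" | "waiting";

// Walk a dotted path such as "status.conditions.0.type" through an object.
function lookup(resource: unknown, path: string): unknown {
  let current: unknown = resource;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function matches(resource: unknown, condition: string): boolean {
  return condition.split(",").every((clause) => {
    const [path, expected] = clause.split("==").map((part) => part.trim());
    const actual = lookup(resource, path);
    if (expected === undefined) return actual !== undefined; // existence clause
    return String(actual) === expected;
  });
}

function evaluate(
  resource: unknown,
  successCondition: string,
  failureCondition?: string,
): Outcome {
  // Failure is checked first here. The order only matters if both conditions
  // match at once, and it is an assumption of this sketch.
  if (failureCondition !== undefined && matches(resource, failureCondition)) {
    return "failed";
  }
  if (matches(resource, successCondition)) return "succeeded";
  return "waiting"; // neither matched: the step keeps blocking
}

// A certificate that is still being issued (illustrative data).
const issuing = { status: { conditions: [{ type: "Ready", status: "False" }] } };

// With the old failureCondition the step fails on this state.
const before = evaluate(
  issuing,
  "status.conditions.0.status == True",
  "status.conditions.0.status == False",
); // "failed"

// With only the successCondition the same state just means "keep waiting".
const after = evaluate(issuing, "status.conditions.0.status == True"); // "waiting"
```

The same resource state produces a hard failure or a wait depending only on whether the transient failureCondition is present, which is the behavior this fix relies on.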

Changes

Remove the transient failureCondition from both templates. The existing successCondition is sufficient:

| Template | File | Removed failureCondition | Remaining successCondition |
| --- | --- | --- | --- |
| `waitForKafkaClusterReady` | `resourceManagement.ts` | `status.conditions.0.type == NotReady` | `status.listeners, metadata.annotations.migration-configChecksum == <expected>` |
| `waitForCertReady` | `setupCapture.ts` | `status.conditions.0.status == False` | `status.conditions.0.status == True` |

No changes to retry strategies, no changes to apply/CRUD templates, no impact on the VAP approval flow.
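For orientation, the resulting shape of the cert waiter might look roughly like the following plain object. This is a hedged sketch: the `ResourceWaiter` interface, the `get` action, and the certificate name `example-cert` are assumptions made for illustration, and the repo's real builder API in `setupCapture.ts` is not reproduced here. Only the successCondition string is taken from this PR.

```typescript
// Hypothetical plain-object view of a resource waiter after the change.
// The interface and manifest are illustrative assumptions, not repo code.
interface ResourceWaiter {
  name: string;
  resource: {
    action: "get";
    manifest: string;
    successCondition: string;
    failureCondition?: string; // optional on purpose: transient states must not fail the step
  };
}

const waitForCertReady: ResourceWaiter = {
  name: "waitForCertReady",
  resource: {
    action: "get",
    // Placeholder manifest; the certificate name is an assumption.
    manifest: [
      "apiVersion: cert-manager.io/v1",
      "kind: Certificate",
      "metadata:",
      "  name: example-cert",
    ].join("\n"),
    // Taken from the PR: the only condition that remains.
    successCondition: "status.conditions.0.status == True",
    // No failureCondition: a status of False is the starting state,
    // so the step should keep waiting rather than fail.
  },
};
```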

Fix 2: Fix AOSS CDC pipeline deploying source and AOSS target in separate CDK calls

eksCdcAossIntegPipeline was calling awsDeployCluster.sh twice with the same --stage:

- First call deployed the ES source domain
- Second call deployed the AOSS collection

The second CDK call overwrote the first call's CloudFormation stack, leaving only the AOSS collection and no source domain — causing the test to fail when trying to connect to the source cluster.

Changes

Merge both clusters into a single awsDeployCluster.sh call with one context file containing both source and target entries. Bootstrap MA first (sequential, since the AOSS deploy needs env.maVpcId and env.eksClusterName from bootstrap), then deploy both clusters together in one CDK stack update.

Also bumped the overall pipeline timeout and test stage timeout to account for the sequential bootstrap + deploy.
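The overwrite and the merged call described above can be illustrated with a toy model. Everything in it is hypothetical: `ClusterEntry`, `deploy`, and the stage name `integ` are invented for the sketch and do not mirror `awsDeployCluster.sh` or the real context file schema. The only assumption carried over from the description is that a deploy for a given stage replaces whatever that stage's stack previously contained.

```typescript
// Toy model only: one "stack" per stage, and each deploy call replaces the
// stack contents with the clusters listed in that call's context.
interface ClusterEntry {
  role: "source" | "target";
  kind: string;
}

const stacks = new Map<string, ClusterEntry[]>();

function deploy(stage: string, clusters: ClusterEntry[]): void {
  // Same stage means the previous contents are replaced, not merged.
  stacks.set(stage, [...clusters]);
}

const source: ClusterEntry = { role: "source", kind: "ES domain" };
const target: ClusterEntry = { role: "target", kind: "AOSS collection" };

// Before the fix: two calls with the same stage. Only the target survives.
deploy("integ", [source]);
deploy("integ", [target]);
const afterTwoCalls = (stacks.get("integ") ?? []).map((c) => c.role);

// After the fix: one call whose context lists both clusters.
deploy("integ", [source, target]);
const afterOneCall = (stacks.get("integ") ?? []).map((c) => c.role);

console.log(afterTwoCalls); // [ 'target' ]
console.log(afterOneCall);  // [ 'source', 'target' ]
```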

Issues Resolved

N/A

Testing

Check List

New functionality includes testing
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>

codecov · 2026-04-15T02:09:47Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (8bd941c) to head (272edb5).

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2721   +/-   ##
=========================================
  Coverage     73.45%   73.46%           
  Complexity      106      106           
=========================================
  Files           721      721           
  Lines         33441    33441           
  Branches       2918     2915    -3     
=========================================
+ Hits          24564    24566    +2     
+ Misses         7537     7535    -2     
  Partials       1340     1340

| Flag | Coverage Δ | |
| --- | --- | --- |
| gradle | `69.89% <ø> (ø)` | |
| node | `90.97% <ø> (ø)` | |
| python | `77.77% <ø> (+0.02%)` | ⬆️ |


gregschohn · 2026-04-15T03:16:31Z

orchestrationSpecs/packages/migration-workflow-templates/src/workflowTemplates/setupCapture.ts

                },
                conditions: {
                    successCondition: "status.conditions.0.status == True",
-                    failureCondition: "status.conditions.0.status == False",


Do we know if status == False is a non-terminal state?

The failureCondition is harmful here because False is the starting state, not a terminal state. Without the failureCondition, the Argo resource step just blocks and polls every 5s until successCondition matches. This is native behavior from matchConditions(), which returns retryable when neither condition matches.

The failureCondition was causing an immediate failure on the first poll because the cert hadn't been issued yet.
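A rough simulation of that first-poll failure, under stated assumptions: `pollUntilReady` is a hypothetical helper, the observed status sequence is made-up data rather than output captured from a cluster, and running out of observations stands in for the workflow-level timeout.

```typescript
// Simulates successive polls of a Certificate's first condition status.
type PollResult = { outcome: "succeeded" | "failed" | "timedOut"; polls: number };

function pollUntilReady(
  observedStatuses: string[],
  successValue: string,
  failureValue?: string,
): PollResult {
  let polls = 0;
  for (const status of observedStatuses) {
    polls += 1;
    if (failureValue !== undefined && status === failureValue) {
      return { outcome: "failed", polls }; // hard failure, no further polling
    }
    if (status === successValue) {
      return { outcome: "succeeded", polls };
    }
    // Neither matched: keep waiting for the next poll.
  }
  // No more observations: stands in for the workflow timeout.
  return { outcome: "timedOut", polls };
}

// Illustrative sequence: the cert is not issued for the first two polls.
const observed = ["False", "False", "True"];

const withFailureCondition = pollUntilReady(observed, "True", "False");
const withoutFailureCondition = pollUntilReady(observed, "True");

console.log(withFailureCondition);    // { outcome: 'failed', polls: 1 }
console.log(withoutFailureCondition); // { outcome: 'succeeded', polls: 3 }
```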


Fix polling Argo resource templates to use retryPolicy Always

d881d7c

Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>

jugal-chauhan requested review from AndreKurait, gregschohn and sumobrian as code owners April 15, 2026 02:08

jugal-chauhan had a problem deploying to migrations-cicd April 15, 2026 02:08 — with GitHub Actions Error

AndreKurait added the run-cdc-tests This label will run CDC test relevant jobs on Jenkins label Apr 15, 2026

AndreKurait temporarily deployed to migrations-cicd April 15, 2026 02:23 — with GitHub Actions Inactive

AndreKurait had a problem deploying to migrations-cicd April 15, 2026 02:23 — with GitHub Actions Error

jugal-chauhan had a problem deploying to migrations-cicd April 15, 2026 03:10 — with GitHub Actions Error

Remove transient failureConditions from Kafka and cert readiness waiters

eb0dae6

Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>

jugal-chauhan force-pushed the add-k8s-polling-retry branch from 173d8be to eb0dae6 Compare April 15, 2026 03:12

jugal-chauhan temporarily deployed to migrations-cicd April 15, 2026 03:12 — with GitHub Actions Inactive

gregschohn reviewed Apr 15, 2026

jugal-chauhan enabled auto-merge April 15, 2026 04:30

jugal-chauhan mentioned this pull request Apr 15, 2026

Default capture proxy to TLS with self-signed cert #2717

Open

2 tasks

jugal-chauhan changed the title ~~Use retryPolicy Always for transient Argo resource polling~~ Remove transient failureConditions from Kafka and cert-manager readiness waiters Apr 15, 2026

jugal-chauhan added 2 commits April 15, 2026 00:41

Fix AOSS CDC pipeline: deploy source and AOSS target in single CDK call

3ccc10e

Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>

Merge remote-tracking branch 'origin/main' into add-k8s-polling-retry

272edb5

jugal-chauhan temporarily deployed to migrations-cicd April 15, 2026 05:42 — with GitHub Actions Inactive

jugal-chauhan deployed to migrations-cicd April 15, 2026 05:42 — with GitHub Actions Active

jugal-chauhan temporarily deployed to migrations-cicd April 15, 2026 05:42 — with GitHub Actions Inactive

jugal-chauhan changed the title ~~Remove transient failureConditions from Kafka and cert-manager readiness waiters~~ [Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy Apr 15, 2026

jugal-chauhan wants to merge 4 commits into opensearch-project:main from jugal-chauhan:add-k8s-polling-retry
