
[Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy#2721

Open
jugal-chauhan wants to merge 4 commits into opensearch-project:main from jugal-chauhan:add-k8s-polling-retry

jugal-chauhan (Collaborator) commented Apr 15, 2026

Fix 1: Remove transient failureConditions from Kafka and cert-manager readiness waiters

The waitForKafkaClusterReady and waitForCertReady Argo resource steps had failureCondition expressions that matched transient startup states:

  • Kafka: failureCondition: "status.conditions.0.type == NotReady" — Kafka briefly enters NotReady during normal Strimzi startup before transitioning to Ready.
  • Cert-manager: failureCondition: "status.conditions.0.status == False" — certificates start with status == False while being issued.

Argo resource steps continuously evaluate both successCondition and failureCondition against the live resource. When a failureCondition matches, the step fails immediately (exit code 64). With retryPolicy: "OnError", Argo does not retry failureCondition matches; it treats them as deliberate hard failures. As a result, the Kafka waiter failed on its first check during normal startup, and the failure cascaded: readKafkaConnectionProfile couldn't find the Kafka CR, createProxy never completed, and the capture-proxy/replayer pods were never scheduled.

Root Cause

These failureCondition expressions were incorrect — they matched transient states, not permanent failures. Per the Argo resource step model:

  • If successCondition matches → step succeeds
  • If failureCondition matches → step fails immediately
  • If neither matches → step blocks and waits (no error, no retry needed)

The correct behavior is to let the step block until the successCondition is met. If the resource truly fails permanently, the step will time out via the Argo workflow's global timeout.
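
The three outcomes above can be sketched as a small evaluator. This is a simplified model, not Argo's implementation: the names `lookup`, `matches`, and `evaluate` are hypothetical, only `path == value` and bare `path` (must exist) clauses are supported, and checking the failureCondition before the successCondition is an assumption of this sketch.

```typescript
type StepOutcome = "succeeded" | "failed" | "pending";

// Look up a dotted path such as "status.conditions.0.status" in a plain object.
function lookup(resource: unknown, path: string): unknown {
  let current: unknown = resource;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// True when every comma-separated clause of the expression holds.
function matches(resource: unknown, expression: string): boolean {
  return expression.split(",").every((clause) => {
    const idx = clause.indexOf("==");
    const path = (idx === -1 ? clause : clause.slice(0, idx)).trim();
    const expected = idx === -1 ? undefined : clause.slice(idx + 2).trim();
    const actual = lookup(resource, path);
    if (expected === undefined) return actual !== undefined; // bare path: must exist
    return actual !== undefined && String(actual) === expected;
  });
}

// One evaluation of a resource snapshot (hypothetical helper, simplified).
function evaluate(
  resource: unknown,
  successCondition: string,
  failureCondition?: string
): StepOutcome {
  if (failureCondition !== undefined && matches(resource, failureCondition)) {
    return "failed"; // fails immediately, even if the state is only transient
  }
  if (matches(resource, successCondition)) return "succeeded";
  return "pending"; // neither matched: the step keeps waiting
}

// A certificate that is still being issued starts with status "False".
const issuingCert = { status: { conditions: [{ type: "Ready", status: "False" }] } };

const withFailureCondition = evaluate(
  issuingCert,
  "status.conditions.0.status == True",
  "status.conditions.0.status == False"
);
const withoutFailureCondition = evaluate(issuingCert, "status.conditions.0.status == True");

console.log(withFailureCondition);    // "failed"
console.log(withoutFailureCondition); // "pending"
```

With the failureCondition in place, the normal starting state is reported as a hard failure; without it, the same snapshot is simply "not ready yet".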

Changes

Remove the transient failureCondition from both templates. The existing successCondition is sufficient:

| Template | File | Removed failureCondition | Remaining successCondition |
| --- | --- | --- | --- |
| waitForKafkaClusterReady | resourceManagement.ts | status.conditions.0.type == NotReady | status.listeners, metadata.annotations.migration-configChecksum == <expected> |
| waitForCertReady | setupCapture.ts | status.conditions.0.status == False | status.conditions.0.status == True |

No changes to retry strategies, no changes to apply/CRUD templates, no impact on the VAP approval flow.


Fix 2: Fix AOSS CDC pipeline deploying source and AOSS target in separate CDK calls

eksCdcAossIntegPipeline was calling awsDeployCluster.sh twice with the same --stage:

  1. First call deployed the ES source domain
  2. Second call deployed the AOSS collection

The second CDK call overwrote the first call's CloudFormation stack, leaving only the AOSS collection and no source domain — causing the test to fail when trying to connect to the source cluster.

Changes

Merge both clusters into a single awsDeployCluster.sh call with one context file containing both source and target entries. Bootstrap MA first (sequential, since the AOSS deploy needs env.maVpcId and env.eksClusterName from bootstrap), then deploy both clusters together in one CDK stack update.
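
A toy model of why two calls with the same --stage lose the source domain. It assumes that a deploy for a stage replaces that stage's stack with whatever the context file lists; `StackStore`, `ClusterEntry`, and the cluster and stage names are hypothetical and are not part of awsDeployCluster.sh or CDK.

```typescript
// Hypothetical stand-in for one entry in a deploy context file.
type ClusterEntry = { name: string; kind: "source" | "target" };

// Toy model: one stack per stage, and each deploy replaces its contents.
class StackStore {
  private stacks = new Map<string, ClusterEntry[]>();

  deploy(stage: string, clusters: ClusterEntry[]): void {
    this.stacks.set(stage, [...clusters]);
  }

  clusterNames(stage: string): string[] {
    return (this.stacks.get(stage) ?? []).map((c) => c.name);
  }
}

const source: ClusterEntry = { name: "es-source", kind: "source" };
const target: ClusterEntry = { name: "aoss-target", kind: "target" };

// Before: two deploys against the same stage, one cluster each.
const separate = new StackStore();
separate.deploy("integ", [source]);
separate.deploy("integ", [target]);
const afterSeparate = separate.clusterNames("integ"); // only the AOSS target remains

// After: one deploy whose context contains both entries.
const merged = new StackStore();
merged.deploy("integ", [source, target]);
const afterMerged = merged.clusterNames("integ"); // both clusters are present

console.log(afterSeparate, afterMerged);
```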

Also bumped the overall pipeline timeout and test stage timeout to account for the sequential bootstrap + deploy.

Issues Resolved

N/A

Testing

Check List

  • New functionality includes testing
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>
codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.46%. Comparing base (8bd941c) to head (272edb5).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2721   +/-   ##
=========================================
  Coverage     73.45%   73.46%           
  Complexity      106      106           
=========================================
  Files           721      721           
  Lines         33441    33441           
  Branches       2918     2915    -3     
=========================================
+ Hits          24564    24566    +2     
+ Misses         7537     7535    -2     
  Partials       1340     1340           
| Flag | Coverage Δ |
| --- | --- |
| gradle | 69.89% <ø> (ø) |
| node | 90.97% <ø> (ø) |
| python | 77.77% <ø> (+0.02%) ⬆️ |


},
conditions: {
successCondition: "status.conditions.0.status == True",
failureCondition: "status.conditions.0.status == False",
Collaborator


Do we know whether status == False is a non-terminal state?

Collaborator Author


The failureCondition is harmful here because False is the starting state, not a terminal state. Without the failureCondition, the Argo resource step simply blocks and polls every 5s until successCondition matches. This is native behavior: matchConditions() returns retryable when neither condition matches.

The failureCondition was causing an immediate fail on the first poll because the cert hadn't been issued yet.
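
A minimal simulation of the polling behavior described above, assuming a waiter that sees one status value per poll. `waitForStatus` and `PollResult` are hypothetical names; the real step sleeps between polls (about 5s per the comment above), which this sketch replaces with a fixed list of observed values. Running out of observations stands in for the workflow timeout.

```typescript
type PollResult = { outcome: "succeeded" | "failed" | "timed out"; polls: number };

// Walk through the statuses a waiter would observe, one per poll.
function waitForStatus(
  observed: string[],
  successValue: string,
  failureValue?: string
): PollResult {
  for (let i = 0; i < observed.length; i++) {
    const status = observed[i];
    if (failureValue !== undefined && status === failureValue) {
      return { outcome: "failed", polls: i + 1 };
    }
    if (status === successValue) {
      return { outcome: "succeeded", polls: i + 1 };
    }
    // Neither matched: a real step would sleep here and poll again.
  }
  return { outcome: "timed out", polls: observed.length };
}

// A certificate starts as "False" and becomes "True" once issued.
const certStatuses = ["False", "False", "True"];

const oldBehavior = waitForStatus(certStatuses, "True", "False");
const newBehavior = waitForStatus(certStatuses, "True");

console.log(oldBehavior); // outcome "failed" on poll 1
console.log(newBehavior); // outcome "succeeded" on poll 3
```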

jugal-chauhan enabled auto-merge April 15, 2026 04:30
jugal-chauhan changed the title from "Use retryPolicy Always for transient Argo resource polling" to "Remove transient failureConditions from Kafka and cert-manager readiness waiters" Apr 15, 2026
jugal-chauhan deployed to migrations-cicd with GitHub Actions (Active) April 15, 2026 05:42
jugal-chauhan changed the title from "Remove transient failureConditions from Kafka and cert-manager readiness waiters" to "[Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy" Apr 15, 2026

Labels

run-cdc-tests This label will run CDC test relevant jobs on Jenkins
