[Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy#2721
[Fix] remove transient failureConditions and fix AOSS pipeline cluster deploy#2721jugal-chauhan wants to merge 4 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2721 +/- ##
=========================================
Coverage 73.45% 73.46%
Complexity 106 106
=========================================
Files 721 721
Lines 33441 33441
Branches 2918 2915 -3
=========================================
+ Hits 24564 24566 +2
+ Misses 7537 7535 -2
Partials 1340 1340
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>
173d8be to
eb0dae6
Compare
| }, | ||
| conditions: { | ||
| successCondition: "status.conditions.0.status == True", | ||
| failureCondition: "status.conditions.0.status == False", |
There was a problem hiding this comment.
do we know if status == false is a non-terminal state?
There was a problem hiding this comment.
The failureCondition is harmful here because False is the starting state, not a terminal state. Without the failureCondition, the Argo resource step just blocks and polls every 5s until successCondition matches (this is native behavior from matchConditions() where it returns retryable when neither condition matches)
The failureCondition was causing an immediate fail on the first poll because the cert hadn't been issued yet.
Signed-off-by: Jugal Chauhan <jugaldc@amazon.com>
Fix 1: Remove transient failureConditions from Kafka and cert-manager readiness waiters
The
waitForKafkaClusterReadyandwaitForCertReadyArgo resource steps hadfailureConditionexpressions that matched transient startup states:failureCondition: "status.conditions.0.type == NotReady"— Kafka briefly entersNotReadyduring normal Strimzi startup before transitioning to Ready.failureCondition: "status.conditions.0.status == False"— certificates start withstatus == Falsewhile being issued.Argo resource steps continuously evaluate both
successConditionandfailureConditionagainst the live resource. When afailureConditionmatches, the step fails immediately (exit code 64). WithretryPolicy: "OnError", Argo does not retry failureCondition matches — it treats them as deliberate hard failures. This caused the Kafka waiter to fail on the first check during normal startup, which cascaded:readKafkaConnectionProfilecouldn't find the Kafka CR,createProxynever completed, and the capture-proxy/replayer pods were never scheduled.Root Cause
These
failureConditionexpressions were incorrect — they matched transient states, not permanent failures. Per the Argo resource step model:successConditionmatches → step succeedsfailureConditionmatches → step fails immediatelyThe correct behavior is to let the step block until the
successConditionis met. If the resource truly fails permanently, the step will time out via the Argo workflow's global timeout.Changes :
Remove the transient
failureConditionfrom both templates. The existingsuccessConditionis sufficient:waitForKafkaClusterReadyresourceManagement.tsstatus.conditions.0.type == NotReadystatus.listeners, metadata.annotations.migration-configChecksum == <expected>waitForCertReadysetupCapture.tsstatus.conditions.0.status == Falsestatus.conditions.0.status == TrueNo changes to retry strategies, no changes to apply/CRUD templates, no impact on the VAP approval flow.
Fix 2: Fix AOSS CDC pipeline deploying source and AOSS target in separate CDK calls
eksCdcAossIntegPipelinewas callingawsDeployCluster.shtwice with the same--stage:The second CDK call overwrote the first call's CloudFormation stack, leaving only the AOSS collection and no source domain — causing the test to fail when trying to connect to the source cluster.
Changes
Merge both clusters into a single
awsDeployCluster.shcall with one context file containing bothsourceandtargetentries. Bootstrap MA first (sequential, since the AOSS deploy needsenv.maVpcIdandenv.eksClusterNamefrom bootstrap), then deploy both clusters together in one CDK stack update.Also bumped the overall pipeline timeout and test stage timeout to account for the sequential bootstrap + deploy.
Issues Resolved
N/A
Testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.