Stabilize EKS production nodepool: WhenEmpty consolidation, 30m cooldown, 4xlarge cap #2714
Open
AndreKurait wants to merge 1 commit into opensearch-project:main from
Conversation
…ge cap

- workloadsNodePool: change consolidation from WhenEmptyOrUnderutilized to WhenEmpty with a 30m cooldown; cap instance size at 4xlarge
- valuesEks: enable useCustomKarpenterNodePool by default for EKS deploys
- aws-bootstrap.sh: add a --use-general-node-pool flag to opt out of the custom nodepool and use the EKS Auto Mode general-purpose pool instead
- resolveBootstrap.groovy: plumb the useGeneralNodePool option through to the --use-general-node-pool bootstrap flag
- All EKS Jenkins pipelines: pass useGeneralNodePool: true so CI tests continue using the general-purpose pool (no resource constraints)

Signed-off-by: Andre Kurait <andrekurait@gmail.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##               main    #2714   +/- ##
=========================================
  Coverage     72.51%   72.51%
  Complexity      106      106
=========================================
  Files           723      723
  Lines         33560    33560
  Branches       2911     2908       -3
=========================================
  Hits          24336    24336
  Misses         7913     7913
  Partials       1311     1311
jugal-chauhan (Collaborator) approved these changes on Apr 15, 2026:

Thanks for bringing in these changes!
Comment on lines +1069 to +1073:

    # Override custom nodepool when --use-general-node-pool is set
    NODEPOOL_HELM_FLAGS=""
    if [[ "$use_general_node_pool" == "true" ]]; then
        NODEPOOL_HELM_FLAGS="--set cluster.useCustomKarpenterNodePool=false"
    fi
Collaborator

Can we also add a mutual exclusion check here, to prevent both CLI flags from being used together?
if [[ "$disable_general_purpose_pool" == "true" && "$use_general_node_pool" == "true" ]]; then
echo "ERROR: --disable-general-purpose-pool and --use-general-node-pool are mutually exclusive"
exit 1
fi
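For illustration, the helm override from the diff and the reviewer's proposed mutual-exclusion check could be combined as a sketch like the one below. The function names are hypothetical; the flag variables and helm value follow the snippet above.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the reviewed logic in functions so both paths are testable.

# Fail if the two mutually exclusive CLI flags were both supplied.
validate_node_pool_flags() {
  local disable_general="$1" use_general="$2"
  if [[ "$disable_general" == "true" && "$use_general" == "true" ]]; then
    echo "ERROR: --disable-general-purpose-pool and --use-general-node-pool are mutually exclusive" >&2
    return 1
  fi
}

# Derive the helm flag exactly as the reviewed snippet does: only set the
# override when the caller opted into the general-purpose pool.
nodepool_helm_flags() {
  local use_general="$1"
  if [[ "$use_general" == "true" ]]; then
    echo "--set cluster.useCustomKarpenterNodePool=false"
  fi
}

# Example: valid combination emits the helm override for a later
# `helm install ... $NODEPOOL_HELM_FLAGS` call.
validate_node_pool_flags "false" "true" && nodepool_helm_flags "true"
```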
Description

Hardens the custom Karpenter NodePool (general-work-pool) for production EKS deployments to prevent unexpected node churn and runaway instance scaling.

Problem

The current nodepool uses WhenEmptyOrUnderutilized consolidation with a 5-minute cooldown and no instance size cap. In production, this can cause unexpected node churn and runaway instance scaling.

Changes
NodePool hardening (workloadsNodePool.yaml):

- Consolidation policy WhenEmptyOrUnderutilized → WhenEmpty: nodes only scale down when fully drained
- Consolidation cooldown 300s → 30m: prevents rapid node cycling
- New eks.amazonaws.com/instance-size constraint limiting instances to medium through 4xlarge

Default for EKS (valuesEks.yaml):

- useCustomKarpenterNodePool: true — the conservative custom nodepool is now the default for all EKS deployments

Opt-out for CI (aws-bootstrap.sh):

- New --use-general-node-pool flag that overrides cluster.useCustomKarpenterNodePool=false via helm, reverting to the EKS Auto Mode general-purpose pool for environments that don't need production-safe settings

Jenkins pipeline plumbing:
- resolveBootstrap.groovy: new useGeneralNodePool option that passes --use-general-node-pool to the bootstrap script
- All EKS pipelines (eksIntegPipeline, eksAOSSIntegPipeline, eksBYOSIntegPipeline, eksSolutionsCFNTest): pass useGeneralNodePool: true so CI tests continue using the unconstrained general-purpose pool

Testing

- Verified the general-work-pool NodePool has the correct settings
- Verified deploying with --use-general-node-pool uses the general-purpose pool with no constraints

Net Effect
Production deployments get a conservative, stable nodepool by default. Jenkins CI tests opt out and continue using the general-purpose pool with no resource constraints.
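The NodePool hardening described under Changes might look roughly like the fragment below. This is a sketch assuming the Karpenter v1 NodePool schema; the exact field layout of workloadsNodePool.yaml and the concrete size list are not shown in this PR, so treat the values here as illustrative.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-work-pool
spec:
  disruption:
    consolidationPolicy: WhenEmpty   # was WhenEmptyOrUnderutilized
    consolidateAfter: 30m            # was 300s
  template:
    spec:
      requirements:
        # New constraint: cap node size at 4xlarge (size list illustrative)
        - key: eks.amazonaws.com/instance-size
          operator: In
          values: ["medium", "large", "xlarge", "2xlarge", "4xlarge"]
```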