Stabilize EKS production nodepool: WhenEmpty consolidation, 30m cooldown, 4xlarge cap #2714

Open
AndreKurait wants to merge 1 commit into opensearch-project:main from AndreKurait:stable-prod-nodepool

@AndreKurait
Member

Description

Hardens the custom Karpenter NodePool (general-work-pool) for production EKS deployments to prevent unexpected node churn and runaway instance scaling.

Problem

The current nodepool uses WhenEmptyOrUnderutilized consolidation with a 5-minute cooldown and no instance size cap. In production, this can cause:

  • Nodes being consolidated while workloads are still running (underutilized != unused)
  • Karpenter spinning up very large instances (8xlarge+) that are expensive and slow to provision
  • Frequent node replacement cycles that disrupt long-running migration tasks

Changes

NodePool hardening (workloadsNodePool.yaml):

  • Consolidation policy: WhenEmptyOrUnderutilized → WhenEmpty (nodes only scale down when fully drained)
  • Consolidation cooldown: 300s → 30m (prevents rapid node cycling)
  • Instance size cap: added eks.amazonaws.com/instance-size constraint limiting to medium through 4xlarge

Default for EKS (valuesEks.yaml):

  • useCustomKarpenterNodePool: true — the conservative custom nodepool is now the default for all EKS deployments

Opt-out for CI (aws-bootstrap.sh):

  • New --use-general-node-pool flag that sets cluster.useCustomKarpenterNodePool=false via Helm, reverting to the EKS Auto Mode general-purpose pool for environments that don't need production-safe settings
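
A minimal sketch of how the opt-out flag could be turned into the Helm override. Only the flag name, the `use_general_node_pool` and `NODEPOOL_HELM_FLAGS` variables, and the Helm value come from this PR; the function name `nodepool_helm_flags` and the parsing loop are assumptions for illustration, not the actual contents of aws-bootstrap.sh.

```shell
# Hypothetical helper: prints the extra Helm arguments implied by the CLI args.
# The real script parses many more flags; they are ignored in this sketch.
nodepool_helm_flags() {
  use_general_node_pool="false"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --use-general-node-pool) use_general_node_pool="true" ;;
      *) ;;  # unrelated bootstrap flags and their values
    esac
    shift
  done

  # Default is empty, so the chart value from valuesEks.yaml applies.
  NODEPOOL_HELM_FLAGS=""
  if [ "$use_general_node_pool" = "true" ]; then
    NODEPOOL_HELM_FLAGS="--set cluster.useCustomKarpenterNodePool=false"
  fi
  printf '%s\n' "$NODEPOOL_HELM_FLAGS"
}

nodepool_helm_flags --use-general-node-pool  # prints: --set cluster.useCustomKarpenterNodePool=false
nodepool_helm_flags                          # prints an empty line (no override)
```

Leaving the variable empty by default means a production deploy passes no override at all and picks up whatever valuesEks.yaml sets.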

Jenkins pipeline plumbing:

  • resolveBootstrap.groovy: new useGeneralNodePool option that passes --use-general-node-pool to the bootstrap script
  • All 4 EKS Jenkins pipelines (eksIntegPipeline, eksAOSSIntegPipeline, eksBYOSIntegPipeline, eksSolutionsCFNTest): pass useGeneralNodePool: true so CI tests continue using the unconstrained general-purpose pool

Testing

  • Deploy to a test EKS cluster and verify general-work-pool NodePool has correct settings
  • Verify Jenkins tests pass with --use-general-node-pool (uses general-purpose pool, no constraints)
  • Verify production deploy without the flag uses the hardened custom nodepool
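
The first check above could be scripted roughly as follows. This is a sketch under assumptions: it presumes the chart renders the settings into the standard Karpenter NodePool fields `spec.disruption.consolidationPolicy` and `spec.disruption.consolidateAfter`, and the function names `check_nodepool_field` and `verify_general_work_pool` are hypothetical. It needs `kubectl` pointed at the test cluster.

```shell
# Compares one NodePool field against an expected value.
# $1 = NodePool name, $2 = jsonpath expression, $3 = expected value
check_nodepool_field() {
  actual="$(kubectl get nodepool "$1" -o "jsonpath=$2")" || return 2
  if [ "$actual" = "$3" ]; then
    echo "OK   $2 = $actual"
  else
    echo "FAIL $2 = $actual (expected $3)" >&2
    return 1
  fi
}

# Checks the two disruption settings described in this PR.
verify_general_work_pool() {
  check_nodepool_field general-work-pool '{.spec.disruption.consolidationPolicy}' WhenEmpty &&
    check_nodepool_field general-work-pool '{.spec.disruption.consolidateAfter}' 30m
}
```

Calling `verify_general_work_pool` after a deploy without `--use-general-node-pool` should return 0 if the hardened settings were applied; the instance-size requirement would need a separate check.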

Net Effect

Production deployments get a conservative, stable nodepool by default. Jenkins CI tests opt out and continue using the general-purpose pool with no resource constraints.

…ge cap

- workloadsNodePool: change consolidation from WhenEmptyOrUnderutilized to
  WhenEmpty with 30m cooldown, cap instance size at 4xlarge
- valuesEks: enable useCustomKarpenterNodePool by default for EKS deploys
- aws-bootstrap.sh: add --use-general-node-pool flag to opt out of custom
  nodepool and use EKS Auto Mode general-purpose pool instead
- resolveBootstrap.groovy: plumb useGeneralNodePool option through to
  --use-general-node-pool bootstrap flag
- All EKS Jenkins pipelines: pass useGeneralNodePool: true so CI tests
  continue using the general-purpose pool (no resource constraints)

Signed-off-by: Andre Kurait <andrekurait@gmail.com>
@codecov

codecov bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.51%. Comparing base (775b16d) to head (788023c).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #2714   +/-   ##
=========================================
  Coverage     72.51%   72.51%           
  Complexity      106      106           
=========================================
  Files           723      723           
  Lines         33560    33560           
  Branches       2911     2908    -3     
=========================================
  Hits          24336    24336           
  Misses         7913     7913           
  Partials       1311     1311           
Flag     Coverage Δ
gradle   68.92% <ø> (ø)
node     90.97% <ø> (ø)
python   76.59% <ø> (ø)

Collaborator

@jugal-chauhan jugal-chauhan left a comment

Thanks for bringing in these changes!

Comment on lines +1069 to +1073
# Override custom nodepool when --use-general-node-pool is set
NODEPOOL_HELM_FLAGS=""
if [[ "$use_general_node_pool" == "true" ]]; then
  NODEPOOL_HELM_FLAGS="--set cluster.useCustomKarpenterNodePool=false"
fi
Collaborator

Can we also add a mutual exclusion check here, to prevent both CLI flags from being used together?

if [[ "$disable_general_purpose_pool" == "true" && "$use_general_node_pool" == "true" ]]; then
  echo "ERROR: --disable-general-purpose-pool and --use-general-node-pool are mutually exclusive"
  exit 1
fi
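
One way to make the suggested check easy to exercise in isolation is to wrap it in a function. This is only a sketch: the function name `validate_nodepool_flags` and the positional-argument interface are assumptions, and writing the error to stderr is a choice of this example rather than something the suggestion specifies.

```shell
# Hypothetical wrapper around the suggested check.
# $1 = value of disable_general_purpose_pool, $2 = value of use_general_node_pool
# Returns 0 when the combination is allowed, 1 when both flags are set.
validate_nodepool_flags() {
  if [ "$1" = "true" ] && [ "$2" = "true" ]; then
    echo "ERROR: --disable-general-purpose-pool and --use-general-node-pool are mutually exclusive" >&2
    return 1
  fi
  return 0
}

validate_nodepool_flags true false && echo "ok"                  # prints: ok
validate_nodepool_flags true true 2>/dev/null || echo "rejected"  # prints: rejected
```

In the bootstrap script the call site would then be something like `validate_nodepool_flags "$disable_general_purpose_pool" "$use_general_node_pool" || exit 1`, placed before any Helm flags are built.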
