Stabilize EKS production nodepool: WhenEmpty consolidation, 30m cooldown, 4xlarge cap by AndreKurait · Pull Request #2714 · opensearch-project/opensearch-migrations

AndreKurait · 2026-04-14T17:19:53Z

Description

Hardens the custom Karpenter NodePool (general-work-pool) for production EKS deployments to prevent unexpected node churn and runaway instance scaling.

Problem

The current nodepool uses WhenEmptyOrUnderutilized consolidation with a 5-minute cooldown and no instance size cap. In production, this can cause:

Nodes being consolidated while workloads are still running (underutilized != unused)
Karpenter spinning up very large instances (8xlarge+) that are expensive and slow to provision
Frequent node replacement cycles that disrupt long-running migration tasks

Changes

NodePool hardening (workloadsNodePool.yaml):

Consolidation policy: WhenEmptyOrUnderutilized → WhenEmpty — nodes only scale down when fully drained
Consolidation cooldown: 300s → 30m — prevents rapid node cycling
Instance size cap: added eks.amazonaws.com/instance-size constraint limiting to medium through 4xlarge
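The three settings above might map onto a Karpenter NodePool manifest roughly as follows. This is a minimal sketch, not the contents of workloadsNodePool.yaml: the `render_nodepool` helper, the `karpenter.sh/v1` API version, the `nodeClassRef` values, and the exact list of sizes between medium and 4xlarge are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a NodePool manifest reflecting the settings
# described in this PR. Field values not named in the PR are assumptions.
render_nodepool() {
  cat <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-work-pool
spec:
  disruption:
    # Only consider nodes that no longer run workload pods.
    consolidationPolicy: WhenEmpty
    # Wait 30 minutes before acting on a consolidation candidate.
    consolidateAfter: 30m
  template:
    spec:
      nodeClassRef:  # assumed: the EKS Auto Mode default NodeClass
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:
        - key: eks.amazonaws.com/instance-size
          operator: In
          # Assumed list; the PR only says "medium through 4xlarge".
          values: ["medium", "large", "xlarge", "2xlarge", "4xlarge"]
EOF
}

render_nodepool
```

Piping the output into `kubectl apply -f -` would be one way to try the manifest on a test cluster.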

Default for EKS (valuesEks.yaml):

useCustomKarpenterNodePool: true — the conservative custom nodepool is now the default for all EKS deployments

Opt-out for CI (aws-bootstrap.sh):

New --use-general-node-pool flag that overrides cluster.useCustomKarpenterNodePool=false via helm, reverting to the EKS Auto Mode general-purpose pool for environments that don't need production-safe settings
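The opt-out can be pictured as a flag that turns into a single helm override. The sketch below assumes a hypothetical `parse_nodepool_flags` helper; the real argument handling in aws-bootstrap.sh may be structured differently, and only the flag name and the `--set` value come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: parse_nodepool_flags is a hypothetical helper, not the
# actual argument handling in aws-bootstrap.sh.
parse_nodepool_flags() {
  local use_general_node_pool="false"
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --use-general-node-pool) use_general_node_pool="true" ;;
      *) ;;  # other bootstrap flags are ignored in this sketch
    esac
    shift
  done

  # Emit the helm override only when the opt-out flag was given.
  if [[ "$use_general_node_pool" == "true" ]]; then
    echo "--set cluster.useCustomKarpenterNodePool=false"
  fi
}

# With the flag: prints the override that disables the custom nodepool.
parse_nodepool_flags --use-general-node-pool
# Without the flag: prints nothing, so the valuesEks.yaml default applies.
parse_nodepool_flags
```

Emitting nothing in the default case means a production deploy picks up `useCustomKarpenterNodePool: true` from the values file without any extra arguments.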

Jenkins pipeline plumbing:

resolveBootstrap.groovy: new useGeneralNodePool option that passes --use-general-node-pool to the bootstrap script
All 4 EKS Jenkins pipelines (eksIntegPipeline, eksAOSSIntegPipeline, eksBYOSIntegPipeline, eksSolutionsCFNTest): pass useGeneralNodePool: true so CI tests continue using the unconstrained general-purpose pool

Testing

Deploy to a test EKS cluster and verify general-work-pool NodePool has correct settings
Verify Jenkins tests pass with --use-general-node-pool (uses general-purpose pool, no constraints)
Verify production deploy without the flag uses the hardened custom nodepool
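The first verification step could be scripted along these lines. This is a sketch under assumptions: `check_nodepool_settings` is a hypothetical helper, and it assumes the API server returns the duration exactly as written (`30m`).

```shell
#!/usr/bin/env bash
# Hypothetical check: compares observed NodePool settings with the values
# this PR describes. Arguments: <consolidationPolicy> <consolidateAfter>.
check_nodepool_settings() {
  local policy="$1" after="$2"
  if [[ "$policy" != "WhenEmpty" ]]; then
    echo "unexpected consolidationPolicy: $policy" >&2
    return 1
  fi
  if [[ "$after" != "30m" ]]; then
    echo "unexpected consolidateAfter: $after" >&2
    return 1
  fi
  echo "general-work-pool settings match"
}

# On a live cluster the inputs could come from kubectl, for example:
#   policy="$(kubectl get nodepool general-work-pool \
#     -o jsonpath='{.spec.disruption.consolidationPolicy}')"
#   after="$(kubectl get nodepool general-work-pool \
#     -o jsonpath='{.spec.disruption.consolidateAfter}')"
check_nodepool_settings "WhenEmpty" "30m"
```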

Net Effect

Production deployments get a conservative, stable nodepool by default. Jenkins CI tests opt out and continue using the general-purpose pool with no resource constraints.

…ge cap

- workloadsNodePool: change consolidation from WhenEmptyOrUnderutilized to WhenEmpty with 30m cooldown, cap instance size at 4xlarge
- valuesEks: enable useCustomKarpenterNodePool by default for EKS deploys
- aws-bootstrap.sh: add --use-general-node-pool flag to opt out of custom nodepool and use EKS Auto Mode general-purpose pool instead
- resolveBootstrap.groovy: plumb useGeneralNodePool option through to --use-general-node-pool bootstrap flag
- All EKS Jenkins pipelines: pass useGeneralNodePool: true so CI tests continue using the general-purpose pool (no resource constraints)

Signed-off-by: Andre Kurait <andrekurait@gmail.com>

codecov · 2026-04-14T17:22:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.51%. Comparing base (775b16d) to head (788023c).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##               main    #2714   +/-   ##
=========================================
  Coverage     72.51%   72.51%           
  Complexity      106      106           
=========================================
  Files           723      723           
  Lines         33560    33560           
  Branches       2911     2908    -3     
=========================================
  Hits          24336    24336           
  Misses         7913     7913           
  Partials       1311     1311

Flag	Coverage Δ
gradle	`68.92% <ø> (ø)`
node	`90.97% <ø> (ø)`
python	`76.59% <ø> (ø)`


jugal-chauhan

Thanks for bringing in these changes!

jugal-chauhan · 2026-04-15T01:16:56Z

deployment/k8s/aws/aws-bootstrap.sh

+# Override custom nodepool when --use-general-node-pool is set
+NODEPOOL_HELM_FLAGS=""
+if [[ "$use_general_node_pool" == "true" ]]; then
+  NODEPOOL_HELM_FLAGS="--set cluster.useCustomKarpenterNodePool=false"
+fi


Can we also add a mutual exclusion check here, to prevent both CLI flags from being used together?

if [[ "$disable_general_purpose_pool" == "true" && "$use_general_node_pool" == "true" ]]; then
  echo "ERROR: --disable-general-purpose-pool and --use-general-node-pool are mutually exclusive"
  exit 1
fi

AndreKurait requested review from gregschohn, jugal-chauhan and sumobrian as code owners April 14, 2026 17:19

AndreKurait had a problem deploying to migrations-cicd April 14, 2026 17:20 — with GitHub Actions Error

AndreKurait added the run-eks-tests label Apr 14, 2026

AndreKurait temporarily deployed to migrations-cicd April 14, 2026 17:22 — with GitHub Actions Inactive

jugal-chauhan self-assigned this Apr 15, 2026

jugal-chauhan approved these changes Apr 15, 2026

AndreKurait wants to merge 1 commit into opensearch-project:main from AndreKurait:stable-prod-nodepool
