Description
What problem are you trying to solve?
Currently, Karpenter's drift remediation triggers node replacement immediately upon detecting drift (e.g. AMI change, NodePool spec update). There is no way to control the cadence or interval at which detected drift is acted upon.
In production environments, this causes unplanned pod disruptions outside of maintenance windows whenever a new AMI is released or a NodePool spec changes. Our current workaround is to manually toggle `disruption.budgets` to `nodes: "0"` to block all disruption, then open it back up during maintenance windows — which is operationally cumbersome and error-prone.
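For concreteness, the workaround we toggle today looks roughly like this (values are illustrative; the budget is manually edited before and after each maintenance window):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  disruption:
    budgets:
      # Set to "0" to block all voluntary disruption outside the window,
      # then manually flipped back (e.g. to "20%") during maintenance.
      - nodes: "0"
```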
What we want is a `drift.interval` (or similar) field on the NodePool disruption spec that defers drift remediation to a configured cadence, while allowing drift detection to continue running as normal. For example:
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  disruption:
    budgets:
      - nodes: "20%"
    drift:
      interval: 24h # only remediate drift once every 24 hours
```
This would allow teams to:
- Contain AMI drift remediation to planned maintenance windows
- Coordinate node replacements with application release schedules
- Reduce blast radius of large-scale drift events in production clusters
Omitting the field should preserve existing behavior (immediate remediation) for full backwards compatibility.
How important is this feature to you?
High. We operate multiple production EKS clusters and uncontrolled drift remediation timing is one of the main friction points when managing node lifecycle at scale. Without this, we are forced into manual operational toil every time an AMI update rolls out.