Description
What problem are you trying to solve?
Summary: We'd like an option on NodePools that resolves drift by terminating an instance before creating its replacement, while still respecting PDBs and node disruption budgets as usual.
Scenario: We run a set of GPU instances that are quite expensive to acquire, so we get them through On-Demand Capacity Reservations (ODCRs) associated with Karpenter NodePools. Because of cost, we keep little to no buffer on these instances. As a result, Karpenter can never resolve drift on these nodes automatically: it cannot launch a replacement instance (the capacity is simply not available) before shutting down the old one. A human or another system then has to step in and handle node replacement for just this instance type.
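To make the deadlock concrete, here is a minimal toy model of the scenario. It is only a sketch and does not use any Karpenter or AWS API: `Reservation`, `replace_launch_first`, `replace_terminate_first`, and the `budget` parameter are hypothetical names invented for illustration, and PDBs are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Toy stand-in for a fully used capacity reservation (hypothetical model)."""
    capacity: int
    nodes: list = field(default_factory=list)  # one AMI label per running node

    def launch(self, ami: str) -> bool:
        # A launch only succeeds if a reserved slot is free.
        if len(self.nodes) >= self.capacity:
            return False
        self.nodes.append(ami)
        return True

    def terminate(self, ami: str) -> None:
        # Remove one node running the given AMI, freeing its slot.
        self.nodes.remove(ami)


def replace_launch_first(pool: Reservation, old: str, new: str) -> int:
    """Launch the replacement first, then terminate the drifted node."""
    replaced = 0
    while old in pool.nodes:
        if not pool.launch(new):
            break  # no spare slot, so the rollout stalls here
        pool.terminate(old)
        replaced += 1
    return replaced


def replace_terminate_first(pool: Reservation, old: str, new: str, budget: int = 1) -> int:
    """Terminate up to `budget` drifted nodes, then launch into the freed slots."""
    replaced = 0
    while old in pool.nodes:
        batch = min(budget, pool.nodes.count(old))
        for _ in range(batch):
            pool.terminate(old)
        for _ in range(batch):
            if not pool.launch(new):
                return replaced
            replaced += 1
    return replaced


stalled = Reservation(capacity=3, nodes=["ami-old"] * 3)
rolled = Reservation(capacity=3, nodes=["ami-old"] * 3)
launch_first_count = replace_launch_first(stalled, "ami-old", "ami-new")
terminate_first_count = replace_terminate_first(rolled, "ami-old", "ami-new", budget=1)
```

With zero headroom, the launch-first ordering replaces nothing, while the terminate-first ordering cycles through every node one at a time, with `budget` playing the role of a node disruption budget.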
How important is this feature to you?
Currently, AMI updates for our rarely available GPU instances require a human in the loop, which is a significant operational pain point. For our other capacity, Karpenter significantly simplifies upgrades thanks to its excellent drift detection and resolution capabilities.
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment