fix: avoid selecting subnets with insufficient available IP address + test#7623

Closed
Summonair wants to merge 12 commits into aws:main from Summonair:merge-with-vacant

Conversation

@Summonair

Fixes #5234, #2921

Description
This is a feature by @Vacant2333 that makes launchInstance select the subnet with the most available IPs.
I added a test around it so we can merge it.
How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Merged with #7310.

@Summonair Summonair requested a review from a team as a code owner January 22, 2025 21:48
@netlify

netlify bot commented Jan 22, 2025

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 6942ac2
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/67ba4453472bdf000833ad44
😎 Deploy Preview https://deploy-preview-7623--karpenter-docs-prod.netlify.app

@Summonair
Author

Summonair commented Jan 22, 2025

Hey @Vacant2333 @saurav-agarwalla,
I merged #7310 into my test, and the CI should work now.

With that, we can close #7549 and #7310.

@Summonair Summonair changed the title Merge with vacant fix: avoid selecting subnets with insufficient available IP address + test Jan 22, 2025
@Summonair
Author

I added the missing import @saurav-agarwalla

@saurav-agarwalla
Contributor

Thanks for bringing all the changes together. I discussed this at length with the team yesterday and it seems like with this change, we risk breaking customers who might be using secondary networking with Karpenter. This isn't something that Karpenter is explicitly aware of today but it still supports it in the sense that we don't actively block anyone from using it.

Making Karpenter become aware of secondary networking is more of a design question so I don't think we will be able to resolve it in the scope of this PR.

That said, even with this fix, you might make the issue less likely to happen but the underlying root cause still won't be fixed (i.e. the subnet running out of IPs).

As the next step, I'd recommend the following:

  1. Figure out a way to ensure that IP exhaustion doesn't happen on the subnets (or maybe use secondary networking to mitigate the impact when this does happen). https://aws.github.io/aws-eks-best-practices/networking/ip-optimization-strategies/ has some references around how to prevent this.
  2. Attend the working group meetings if you want to discuss other solutions to this. This seems like a great problem to discuss with the community.

Please don't hesitate to engage me if you have any other questions.

@Vacant2333
Contributor

> Thanks for bringing all the changes together. I discussed this at length with the team yesterday and it seems like with this change, we risk breaking customers who might be using secondary networking with Karpenter. This isn't something that Karpenter is explicitly aware of today but it still supports it in the sense that we don't actively block anyone from using it.
>
> Making Karpenter become aware of secondary networking is more of a design question so I don't think we will be able to resolve it in the scope of this PR.
>
> That said, even with this fix, you might make the issue less likely to happen but the underlying root cause still won't be fixed (i.e. the subnet running out of IPs).
>
> As the next step, I'd recommend the following:
>
> 1. Figure out a way to ensure that IP exhaustion doesn't happen on the subnets (or maybe use secondary networking to mitigate the impact when this does happen). https://aws.github.io/aws-eks-best-practices/networking/ip-optimization-strategies/ has some references around how to prevent this.
> 2. Attend the working group meetings if you want to discuss other solutions to this. This seems like a great problem to discuss with the community.
>
> Please don't hesitate to engage me if you have any other questions.

In fact, I do not agree with the reasoning for not accepting this solution. The issue of IP exhaustion needs to be addressed from both the user’s and Karpenter’s perspectives. From Karpenter’s perspective, if it is already known that a particular subnet might not have any available IP addresses, Karpenter should not continue to use that subnet or create new nodes in that availability zone (AZ).
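To make the proposal concrete, here is a minimal sketch of the filtering idea in Go. The types and names are illustrative only, not Karpenter's actual implementation; it falls back to all candidates when no subnet fits, preserving today's behavior in the worst case:

```go
package main

import "fmt"

// Subnet is a hypothetical, simplified stand-in for the subnet data
// Karpenter tracks; field names are illustrative only.
type Subnet struct {
	ID           string
	AvailableIPs int
}

// filterByAvailableIPs keeps only subnets that can cover the predicted
// IP demand. If none qualify, it returns all candidates unchanged so the
// launch can still be attempted.
func filterByAvailableIPs(subnets []Subnet, predicted int) []Subnet {
	var ok []Subnet
	for _, s := range subnets {
		if s.AvailableIPs >= predicted {
			ok = append(ok, s)
		}
	}
	if len(ok) == 0 {
		return subnets
	}
	return ok
}

func main() {
	subnets := []Subnet{{"subnet-a", 2}, {"subnet-b", 120}}
	// Only subnet-b can cover a predicted demand of 30 IPs.
	fmt.Println(filterByAvailableIPs(subnets, 30))
}
```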

@maxforasteiro
Contributor

maxforasteiro commented Jan 24, 2025 via email

@Summonair
Author

I think of it like spot instances: if there are no spot instances available, you can either fail (if on-demand is not allowed) or fall back to on-demand. In this case, with three available subnets, we just need to fall back to one that has available IPs.


> Thanks for bringing all the changes together. I discussed this at length with the team yesterday and it seems like with this change, we risk breaking customers who might be using secondary networking with Karpenter. This isn't something that Karpenter is explicitly aware of today but it still supports it in the sense that we don't actively block anyone from using it.
> Making Karpenter become aware of secondary networking is more of a design question so I don't think we will be able to resolve it in the scope of this PR.
> That said, even with this fix, you might make the issue less likely to happen but the underlying root cause still won't be fixed (i.e. the subnet running out of IPs).
> As the next step, I'd recommend the following:
>
> 1. Figure out a way to ensure that IP exhaustion doesn't happen on the subnets (or maybe use secondary networking to mitigate the impact when this does happen). https://aws.github.io/aws-eks-best-practices/networking/ip-optimization-strategies/ has some references around how to prevent this.
> 2. Attend the working group meetings if you want to discuss other solutions to this. This seems like a great problem to discuss with the community.
>
> Please don't hesitate to engage me if you have any other questions.
>
> In fact, I do not agree with the reasoning for not accepting this solution. The issue of IP exhaustion needs to be addressed from both the user’s and Karpenter’s perspectives. From Karpenter’s perspective, if it is already known that a particular subnet might not have any available IP addresses, Karpenter should not continue to use that subnet or create new nodes in that availability zone (AZ).

couldn't agree more 👍🏼

@saurav-agarwalla
Contributor

I definitely agree that we need to improve the handling of this on Karpenter's side. My call out was regarding the fact that taking this PR in its current form will break customers who use secondary/custom networking since in those cases even if the node's subnet doesn't have free IPs, pods can still launch.

The other reason this is a slightly more complex problem and different from the spot one is that Karpenter can only estimate how many IPs are going to be used. Ultimately it is the kube-scheduler which schedules the pods so there's no guarantee that the pods will be scheduled on the node that Karpenter thinks they might be.

@Talbalash-legit
Contributor

> I definitely agree that we need to improve the handling of this on Karpenter's side. My call out was regarding the fact that taking this PR in its current form will break customers who use secondary/custom networking since in those cases even if the node's subnet doesn't have free IPs, pods can still launch.
>
> The other reason this is a slightly more complex problem and different from the spot one is that Karpenter can only estimate how many IPs are going to be used. Ultimately it is the kube-scheduler which schedules the pods so there's no guarantee that the pods will be scheduled on the node that Karpenter thinks they might be.

What if we introduce this feature as a configurable variable, disabled by default? This way, it won't impact users relying on their custom networking, but others can enable it if needed. Thoughts?

@saurav-agarwalla
Contributor

I discussed this in the community meeting today. The general consensus is that we need a design for this problem in order to decide what's the right way forward. Adding a configuration flag can be one of the solutions as part of that proposal.

@sftim

sftim commented Jan 27, 2025

(contributing to the future design discussion)

Rather than "restrict" or "avoid", how about an opt-in weighting approach? For example, make Karpenter $e^2:1$ more likely to prefer a zone and subnet where there are free IP addresses, but occasionally attempt a launch even where no address appears free. The retries ought to cover it.

@enxebre

enxebre commented Jan 27, 2025

I would expect this to be an opt-in/opt-out semantic (flag or API input) that lets the fix be non-disruptive for other networking scenarios.

@GnatorX

GnatorX commented Feb 8, 2025

I am not sure I understand the scenario where this "breaks" a user. Is it the situation where all subnets have fewer IPs than predictedIPsUsed? If this PR just did a second round of evaluation that considers the "not enough" subnets (from most available to least) as a last-ditch attempt before giving up, would that be good enough?

The outcome should be the same as the current behavior, but you get a better first try that targets only subnets that should fit, and it likely wouldn't need a config flag to flip on.
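Sketched out, that two-pass ordering could look like this (hypothetical, simplified types, not Karpenter's real code):

```go
package main

import (
	"fmt"
	"sort"
)

// Subnet is a hypothetical, simplified stand-in for the subnet data
// Karpenter tracks.
type Subnet struct {
	ID           string
	AvailableIPs int
}

// candidateOrder implements the two-pass idea: subnets predicted to fit
// come first; the "not enough" subnets follow as a last-ditch fallback,
// ordered from most to least available IPs. The worst case matches
// today's behavior, but the first attempt is better informed.
func candidateOrder(subnets []Subnet, predicted int) []Subnet {
	var fits, short []Subnet
	for _, s := range subnets {
		if s.AvailableIPs >= predicted {
			fits = append(fits, s)
		} else {
			short = append(short, s)
		}
	}
	sort.Slice(short, func(i, j int) bool {
		return short[i].AvailableIPs > short[j].AvailableIPs
	})
	return append(fits, short...)
}

func main() {
	subnets := []Subnet{{"subnet-a", 1}, {"subnet-b", 100}, {"subnet-c", 5}}
	// With a predicted demand of 10: subnet-b fits; subnet-c and subnet-a
	// follow as fallbacks in descending order of availability.
	for _, s := range candidateOrder(subnets, 10) {
		fmt.Println(s.ID, s.AvailableIPs)
	}
}
```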

@saurav-agarwalla
Contributor

> I am not sure I understand the scenario where it "breaks" a user. Is the situation where all subnets have less than predictedIPsUsed? If this PR just does a second round of evaluation where it considers (from the most available to the least) amongst the "not enough" subnets as a last ditch attempt before giving up would that be good enough?
>
> The outcome should be the same as the current behavior but you just get a better first try which aims for only subnets that should fit and likely shouldn't need a config flag to flip on

See some of the concerns in #7623 (comment).

But I agree with you and @sftim: some sort of a weighted approach seems like a reasonable way to solve this without breaking existing customers who use secondary networking.

@saurav-agarwalla
Contributor

@Summonair thanks for the update but like I mentioned above, the working group was against a flag-based approach without exploring other options (like the weighted-based approach suggested above). Were you able to explore other options and if so, can we get that reviewed with the community?

@Summonair
Author

> @Summonair thanks for the update but like I mentioned above, the working group was against a flag-based approach without exploring other options (like the weighted-based approach suggested above). Were you able to explore other options and if so, can we get that reviewed with the community?

I spent two weeks trying to find a clean, readable way to prefer one subnet over another, without much success. Please consider accepting this solution behind a flag.

@ronbutbul

ronbutbul commented Feb 27, 2025

> I've added the feature flag 👍🏼 👍🏼 @saurav-agarwalla @Vacant2333 @maxforasteiro @enxebre really hope this will do it has we really want this feature

Just so I know, is the flag currently available? I don't see it in the latest charts.
It would be very helpful if I could use this workaround.

@maxforasteiro
Contributor

> I've added the feature flag 👍🏼 👍🏼 @saurav-agarwalla @Vacant2333 @maxforasteiro @enxebre really hope this will do it has we really want this feature
>
> just so i know, is the flag available currently? or not. i do not see it in latest's charts. it would be very helpful if i could use this workaround.

It is not. Some people use a different set of subnets to spin up nodes and a different set of subnets to spin up pods, so basing node scheduling on the availability of IPs for pods will not work for them. Adding it under a feature flag was one option, but the maintainers did not accept it.

@ITBeyder

Please approve @Summonair's PR; it would really help at my company as well.

@ronbutbul

> I've added the feature flag 👍🏼 👍🏼 @saurav-agarwalla @Vacant2333 @maxforasteiro @enxebre really hope this will do it has we really want this feature
>
> just so i know, is the flag available currently? or not. i do not see it in latest's charts. it would be very helpful if i could use this workaround.
>
> it is not. Some people uses a different set of Subnets to spin up nodes and a different set of Subnets to spin up pods, so base the node scheduling on the availability of IPs for pods will not work for these people. Adding it under a feature flag was one option but the Maintainers did not accepted it.

Got it. It's unfortunate that the only current workaround is to manage the workload manually.
But if the avoidEmptySubnets=true flag is purely optional, what is really the problem?
Am I missing something?

@Summonair
Author

> I've added the feature flag 👍🏼 👍🏼 @saurav-agarwalla @Vacant2333 @maxforasteiro @enxebre really hope this will do it has we really want this feature
>
> just so i know, is the flag available currently? or not. i do not see it in latest's charts. it would be very helpful if i could use this workaround.
>
> it is not. Some people uses a different set of Subnets to spin up nodes and a different set of Subnets to spin up pods, so base the node scheduling on the availability of IPs for pods will not work for these people. Adding it under a feature flag was one option but the Maintainers did not accepted it.
>
> got you. its unfortunate that the only workaround currently is to maintain the workload manually. but if the flag avoidEmptySubnets=true is only optional, so what is really the problem? am i missing something?

I totally agree. It's optional and defaults to false; it can't hurt anyone and would help many others.

@ronbutbul

@saurav-agarwalla could you please approve?
Are we agreed that the flag is purely optional and cannot hurt anyone?

@saurav-agarwalla
Contributor

The flag is optional but we want to avoid blowing up these configuration options because it'll be hard to maintain and confuse everyone. There were some suggestions to either do a weight based approach or perform multiple passes with these being picked up in the second pass.

@Summonair were you able to try these?

If not, I'll try to prioritize working on this in the next few weeks.

@ronbutbul

> The flag is optional but we want to avoid blowing up these configuration options because it'll be hard to maintain and confuse everyone. There were some suggestions to either do a weight based approach or perform multiple passes with these being picked up in the second pass.
>
> @Summonair were you able to try these?
>
> If not, I'll try to prioritize working on this in the next few weeks.

Thank you.
I'm sure you already know, but this is one of the most common issues with Karpenter.
I hope you prioritize it and provide a solid solution 💯

@carl-reverb

FYI to those suffering from this issue, you may be able to reclaim up to 2x of your IP addresses by setting WARM_ENI_TARGET=0. The aws-vpc-cni managed addon sets this to WARM_ENI_TARGET=1 by default, which means on a node with ENIs supporting 15 IPs each, 30 IPs will be consumed because 'warm' means 'ENI with no assigned pods'. Meaning you keep an entire ENI full of addresses on standby just so you can have possibly slightly-faster pod scheduling, rather than waiting for an ENI to be provisioned if you need 16 addresses instead of 15.

doc reference
This one weird trick saved us from our /20 subnets' frequent IP address exhaustion.
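For reference, WARM_ENI_TARGET, WARM_IP_TARGET, and MINIMUM_IP_TARGET are environment variables on the aws-vpc-cni (aws-node) DaemonSet. The values below are illustrative only, and this command assumes a self-managed CNI; for the EKS managed add-on, set the same values through the add-on's configuration instead of editing the DaemonSet directly:

```shell
# Reduce idle IP reservations held by the aws-vpc-cni (aws-node) DaemonSet.
# Values are illustrative; tune them to your pod density.
kubectl set env daemonset aws-node -n kube-system \
  WARM_ENI_TARGET=0 \
  WARM_IP_TARGET=5 \
  MINIMUM_IP_TARGET=10
```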

@maxforasteiro
Contributor

> FYI to those suffering from this issue, you may be able to reclaim up to 2x of your IP addresses by setting WARM_ENI_TARGET=0. The aws-vpc-cni managed addon sets this to WARM_ENI_TARGET=1 by default, which means on a node with ENIs supporting 15 IPs each, 30 IPs will be consumed because 'warm' means 'ENI with no assigned pods'. Meaning you keep an entire ENI full of addresses on standby just so you can have possibly slightly-faster pod scheduling, rather than waiting for an ENI to be provisioned if you need 16 addresses instead of 15.
>
> doc reference This one weird trick saved us from our /20 subnets' frequent IP address exhaustion.

I would recommend setting WARM_IP_TARGET to at least 5, to keep a pool of warm IPs and avoid waiting for the CNI to allocate an IP every time a new pod is spun up.

@devinburnette

@saurav-agarwalla were you and team able to make a decision on the best design for this yet? I think the most important thing at this moment is reducing friction for everyone and this PR in its current form looks to do just that in a safe way. I don't think this one boolean alone is creating any additional unmanageable cruft. I'm sure you and team will continue to iterate and make this better in future versions, but please consider allowing this to be a temporary solution to make our lives a bit easier and buy you and the team more time to consider what the right solution should look like.

@Vacant2333
Contributor

@saurav-agarwalla @jonathan-innis
Can we go ahead and continue with this PR now?

@matanryngler

we really need this 😢

@saurav-agarwalla
Contributor

Just as an update, this hasn't dropped off our radar and we have been discussing this internally. I am handing this over to @jmdeal since my bandwidth has been a little constrained to be able to pay full attention to this but based on our discussions offline, we may have a path to getting this merged with some updates.

@jmdeal will share more details. Thanks @jmdeal for your help.

@jonathan-innis jonathan-innis added the needs-design Design required label Apr 17, 2025
@jmdeal
Contributor

jmdeal commented Apr 22, 2025

I'm aiming to give an update here by the beginning of next week, I want to make sure the rest of the maintainer team is aligned before providing feedback to minimize churn.

@sebas-w

sebas-w commented May 5, 2025

@jmdeal, any update?

@dhavaln-able

dhavaln-able commented May 13, 2025

@jmdeal any update on this? (#7623 (comment))

And do we have an estimate for when this PR will be merged and a new version released?

@jmdeal
Contributor

jmdeal commented May 14, 2025

I've reviewed with the team, and we do not see a good path forward for this at the moment. I've summarized our rationale and the potential steps forward here:

Summary of current state

Karpenter combines two sources to estimate the number of IPs available in a given subnet:

  • The value from DescribeSubnets
  • The minimum number of IPs consumed by the set of instances sent to CreateFleet (upper-bound based on maxPods value for the smallest instance)

When new values are discovered via DescribeSubnets, the tracked in-flight IPs are reset for that subnet. This estimate is not a lower-bound for the number of IPs in the subnet for two reasons:

  • Karpenter currently doesn't update the in-flight IPs based on the resolved instance from CreateFleet (the resolved maxPods value may be higher).
  • When IP values are reset based on the DescribeSubnets results, this discards any information about the maximum IP consumption by launched instances and only reflects the currently consumed IPs

Currently this information is only used to prioritize a subnet within a given zone, so it not being a strict lower-bound isn't problematic. However, that changes if we begin to use this estimate to filter out subnets entirely.
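A simplified sketch of the bookkeeping described above (hypothetical names, not Karpenter's real code), showing why the reset on resync makes the estimate optimistic rather than a lower bound:

```go
package main

import "fmt"

// subnetTracker combines the two sources described above: the last
// DescribeSubnets value, minus IPs reserved for in-flight launches.
type subnetTracker struct {
	lastDescribed int // AvailableIpAddressCount from DescribeSubnets
	inFlight      int // upper-bound IPs for launches since the last sync
}

// Refresh records a fresh DescribeSubnets value. Resetting inFlight
// discards what already-launched instances might still consume, which is
// why the resulting estimate is not a strict lower bound.
func (t *subnetTracker) Refresh(available int) {
	t.lastDescribed = available
	t.inFlight = 0
}

// RecordLaunch reserves the worst-case IP use for a CreateFleet call
// (maxPods of the smallest candidate instance).
func (t *subnetTracker) RecordLaunch(maxPods int) { t.inFlight += maxPods }

// Estimate is the tracked availability used to prioritize subnets.
func (t *subnetTracker) Estimate() int { return t.lastDescribed - t.inFlight }

func main() {
	t := &subnetTracker{}
	t.Refresh(100)
	t.RecordLaunch(29)
	fmt.Println(t.Estimate()) // 71: 100 minus the 29 reserved in flight
	t.Refresh(90)             // resync: in-flight reservations are dropped
	fmt.Println(t.Estimate()) // 90: reflects only currently consumed IPs
}
```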

What are the issues with the proposed solution?

Because the tracked IP count is not a lower-bound, the change made in this PR will not prevent cases where pods fail to start due to insufficient IPs. If Karpenter overestimates the number of IPs available in a subnet, the same issue will continue to occur. The current change may mitigate the issue in some cases, but due to the way we currently resync IP count from EC2 it will only be effective if nodes are consistently at max capacity.

Additionally, this change would be overly pessimistic for the majority of users. If node density is dictated by something other than maxPods (e.g. CPU or memory resources), this is going to overestimate the number of IPs that could be in use in a subnet. If this change was enabled for all users, this would result in scaling failures when they would have otherwise succeeded. This is the reason we resync IPs the way we do - it reflects the true IP consumption in the subnet.

What could we do in Karpenter to solve this issue?

If we ensure Karpenter's IP estimate is a lower-bound for the number of IPs available in a cluster, we should be able to prevent this issue altogether. By ensuring that the estimate is a lower-bound, we ensure that no combination of pods scheduled to nodes in the subnet would exceed the number of IPs available in the subnets. However, this would be far too pessimistic for most workloads and is not a behavior we would want to support in Karpenter - it's a massive footgun. Without carefully tuning maxPods values on a per-instance basis to reflect real pod density, this would result in scaling failures which would not have occurred otherwise. For this reason, we would not want to go forward with this solution.

What can we do?

We acknowledge that this is an issue Karpenter users face, but we don't believe there is an adequate solution in Karpenter alone. There are two reasons for this:

  • Karpenter is not the component responsible for binding pods to nodes - kube-scheduler is
  • Karpenter is not a good source of truth for IP availability. The EC2 APIs are eventually consistent and we will never have anything better than a prediction.

To solve this issue, I believe we need a solution which involves both the CNI and kube-scheduler. In an ideal world, the scheduler would not bind a pod to a node without the CNI first allocating an IP for that pod. Alternatively, we could have a construct for resource limits within a topology (and introduce a subnet topology key). I imagine either of these proposals would need to take the form of a KEP. This is an issue we would like to tackle, but we don't have timelines to share at the moment for when it would be prioritized.

There are also architectural changes that can be considered which wouldn't require changes to Karpenter or upstream Kubernetes:

  • Using custom networking to place pods in a separate, less constrained subnet
  • Tuning CNI configuration options to reduce unnecessary IP use (e.g. WARM_IP_TARGET, WARM_ENI_TARGET, and MINIMUM_IP_TARGET)
  • Single subnet nodepools with pod limits
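As a rough illustration of the last option, a NodePool can be pinned to a single subnet through its EC2NodeClass and capped with limits. This is an abridged sketch only: field names vary across Karpenter versions, required fields such as role and AMI selection are omitted, and the subnet ID is a placeholder.

```yaml
# Abridged sketch - consult the Karpenter docs for your version.
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: pinned-subnet
spec:
  subnetSelectorTerms:
    - id: subnet-0123456789abcdef0   # placeholder: pin nodes to one subnet
  # required fields such as role and amiSelectorTerms omitted
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: pinned-subnet-pool
spec:
  limits:
    cpu: "64"   # cap the pool so it cannot outgrow the subnet's IP space
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: pinned-subnet
```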

Next Steps

As I mentioned earlier, we do understand this is an issue facing some Karpenter users and architectural improvements aren't always an option. We do plan on working on this issue, but our focus will be addressing it upstream rather than implementing a mitigation within Karpenter.

While I don't see a great path for the approach in this PR, we would still review an RFC that addresses the shortcomings and sharp edges discussed above. Opening an RFC does not guarantee that we would move forward with the change - it would still need to be reviewed and approved - but it is the potential path forward for a mitigation within Karpenter.

@dhavaln-able

Can you please explain more about this 'our focus will be addressing it upstream rather than implementing a mitigation within Karpenter.' Thank you.

@jmdeal

@jmdeal
Contributor

jmdeal commented Dec 12, 2025

> Can you please explain more about this 'our focus will be addressing it upstream rather than implementing a mitigation within Karpenter.' Thank you.

To address this issue holistically, there really needs to be a change to kube-scheduler. That's what I mean by upstream. One idea I've been thinking of is a similar mechanism to binding conditions in DRA - the scheduler would have a mechanism to ensure an IP is preallocated before scheduling the pod. Maybe this mechanism could even use DRA directly.

Given that we won't be going forward with this solution, I'm going to close this PR out but leave the issue open.

@jmdeal jmdeal closed this Dec 12, 2025

Labels

needs-design Design required


Development

Successfully merging this pull request may close these issues.

Avoid subnets that don't have available IP Addresses