fix: avoid selecting subnets with insufficient available IP address + test#7623
Conversation
Signed-off-by: Vacant2333 <vacant2333@gmail.com>
✅ Deploy Preview for karpenter-docs-prod ready!
Hey @Vacant2333 @saurav-agarwalla

I added the missing import @saurav-agarwalla
Thanks for bringing all the changes together. I discussed this at length with the team yesterday, and it seems that with this change we risk breaking customers who might be using secondary networking with Karpenter. This isn't something Karpenter is explicitly aware of today, but it still supports it in the sense that we don't actively block anyone from using it. Making Karpenter aware of secondary networking is more of a design question, so I don't think we will be able to resolve it in the scope of this PR.

That said, even with this fix, you might make the issue less likely to happen, but the underlying root cause still won't be fixed (i.e. the subnet running out of IPs). As the next step, I'd recommend the following:

- Figure out a way to ensure that IP exhaustion doesn't happen on the subnets (or maybe use secondary networking to mitigate the impact when this does happen). https://aws.github.io/aws-eks-best-practices/networking/ip-optimization-strategies/ has some references around how to prevent this.
- Attend the [working group](https://karpenter.sh/docs/contributing/working-group/) meetings if you want to discuss other solutions to this. This seems like a great problem to discuss with the community.

Please don't hesitate to engage me if you have any other questions.
In fact, I do not agree with the reasoning for not accepting this solution. The issue of IP exhaustion needs to be addressed from both the user's and Karpenter's perspectives. From Karpenter's perspective, if it is already known that a particular subnet might not have any available IP addresses, Karpenter should not continue to use that subnet or create new nodes in that availability zone (AZ).
I think of it like spot instances: if there are no spot instances available, you can either fail (if on-demand is not allowed) or fall back to on-demand. In this case, with 3 available subnets, we just need to fall back to one that has available IPs.
couldn't agree more 👍🏼
I definitely agree that we need to improve the handling of this on Karpenter's side. My callout was that taking this PR in its current form will break customers who use secondary/custom networking, since in those cases pods can still launch even if the node's subnet doesn't have free IPs. The other reason this is a slightly more complex problem, and different from the spot one, is that Karpenter can only estimate how many IPs are going to be used. Ultimately it is the kube-scheduler that schedules the pods, so there's no guarantee that the pods will be scheduled on the node Karpenter thinks they might be.
What if we introduce this feature as a configurable variable, disabled by default? This way, it won't impact users relying on their custom networking, but others can enable it if needed. Thoughts?
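A minimal sketch of the opt-in flag idea, assuming a hypothetical `FilterSubnetsByAvailableIPs` setting (the names and types are illustrative, not Karpenter's actual configuration): with the flag off, behavior is unchanged; with it on, subnets predicted to lack enough IPs are filtered out.

```go
package main

import "fmt"

// Settings is a hypothetical configuration struct, not Karpenter's real one.
type Settings struct {
	// FilterSubnetsByAvailableIPs is the proposed opt-in flag. Default
	// false preserves today's behavior for secondary-networking users.
	FilterSubnetsByAvailableIPs bool
}

type Subnet struct {
	ID           string
	AvailableIPs int
}

// eligibleSubnets returns all subnets when the flag is off, and only
// subnets with enough IPs for the predicted usage when it is on.
func eligibleSubnets(s Settings, subnets []Subnet, neededIPs int) []Subnet {
	if !s.FilterSubnetsByAvailableIPs {
		return subnets
	}
	var out []Subnet
	for _, sn := range subnets {
		if sn.AvailableIPs >= neededIPs {
			out = append(out, sn)
		}
	}
	return out
}

func main() {
	subnets := []Subnet{{"subnet-a", 2}, {"subnet-b", 300}}
	// Flag off: both subnets remain eligible.
	fmt.Println(len(eligibleSubnets(Settings{}, subnets, 58)))
	// Flag on: only subnet-b has enough IPs for 58 predicted pods.
	fmt.Println(len(eligibleSubnets(Settings{FilterSubnetsByAvailableIPs: true}, subnets, 58)))
}
```

Defaulting the flag to false is what keeps this non-disruptive for custom-networking users: nothing changes until someone opts in.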
I discussed this in the community meeting today. The general consensus is that we need a design for this problem in order to decide what's the right way forward. Adding a configuration flag can be one of the solutions as part of that proposal.
(contributing to the future design discussion) Rather than "restrict" or "avoid", how about an opt-in weighting approach? For example, make Karpenter
I would expect this to be an opt-in/opt-out semantic (flag or API input) that lets the fix be non-disruptive for other networking scenarios.
I am not sure I understand the scenario where it "breaks" a user. Is it the situation where all subnets have fewer than predictedIPsUsed? If this PR just did a second round of evaluation, considering the "not enough" subnets (from the most available to the least) as a last-ditch attempt before giving up, would that be good enough? The outcome should be the same as the current behavior, but you get a better first try that aims only for subnets that should fit, and it likely wouldn't need a config flag to flip on.
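The two-pass idea could look roughly like this (a sketch with hypothetical names, not Karpenter's actual code): pass one considers only subnets predicted to fit, and pass two falls back to the most-available subnet, so the outcome is never worse than today's behavior.

```go
package main

import (
	"fmt"
	"sort"
)

type Subnet struct {
	ID           string
	AvailableIPs int
}

// pickSubnet sketches the two-pass evaluation: first the most-available
// subnet that should fit the predicted IP usage; if none qualifies, fall
// back to whichever subnet has the most IPs, matching current behavior.
func pickSubnet(subnets []Subnet, predictedIPsUsed int) (Subnet, bool) {
	if len(subnets) == 0 {
		return Subnet{}, false
	}
	byAvailable := append([]Subnet(nil), subnets...)
	sort.Slice(byAvailable, func(i, j int) bool {
		return byAvailable[i].AvailableIPs > byAvailable[j].AvailableIPs
	})
	// Pass 1: only subnets predicted to have enough IPs.
	for _, sn := range byAvailable {
		if sn.AvailableIPs >= predictedIPsUsed {
			return sn, true
		}
	}
	// Pass 2: last-ditch attempt among the "not enough" subnets.
	return byAvailable[0], true
}

func main() {
	subnets := []Subnet{{"subnet-a", 10}, {"subnet-b", 120}, {"subnet-c", 40}}
	best, _ := pickSubnet(subnets, 58)
	fmt.Println(best.ID) // subnet-b: the only subnet that fits 58 predicted IPs
}
```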
See some of the concerns in #7623 (comment). But I agree with you and @sftim: some sort of weighted approach seems like a reasonable way to solve this without breaking existing customers who use secondary networking.
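A sketch of what the weighted approach might look like (hypothetical names; assumes a `predictedIPsUsed` estimate is available): subnets short on IPs are down-weighted rather than excluded, so secondary-networking users are never blocked outright.

```go
package main

import "fmt"

type Subnet struct {
	ID           string
	AvailableIPs int
}

// weight deprioritizes low-IP subnets instead of excluding them: every
// subnet keeps a nonzero weight, so customers whose pods draw IPs from
// elsewhere (secondary/custom networking) can still launch capacity.
func weight(sn Subnet, predictedIPsUsed int) float64 {
	if predictedIPsUsed <= 0 || sn.AvailableIPs >= predictedIPsUsed {
		return 1.0
	}
	// Partial credit proportional to how much of the predicted demand fits.
	return 0.1 + 0.9*float64(sn.AvailableIPs)/float64(predictedIPsUsed)
}

func main() {
	for _, sn := range []Subnet{{"subnet-a", 10}, {"subnet-b", 120}, {"subnet-c", 40}} {
		fmt.Printf("%s weight=%.2f\n", sn.ID, weight(sn, 58))
	}
}
```

The constants (0.1 floor, linear ramp) are arbitrary placeholders; an actual design would need to pick the weighting function deliberately.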
@Summonair thanks for the update, but like I mentioned above, the working group was against a flag-based approach without exploring other options (like the weighted approach suggested above). Were you able to explore other options, and if so, can we get that reviewed with the community?
I tried for 2 weeks to find a nice and readable way to prefer one subnet over another without great success; please consider this solution as a flag.

Just so I know, is the flag available currently or not?

It is not. Some people use one set of subnets to spin up nodes and a different set of subnets to spin up pods, so basing the node scheduling on the availability of IPs for pods will not work for them. Adding it under a feature flag was one option, but the maintainers did not accept it.
Please approve @Summonair's PR, it will really help at my company as well.
got you. |
I totally agree with you 100%: it's optional and defaults to false, so it can't hurt anyone and helps many others.

@saurav-agarwalla could you please approve?
The flag is optional, but we want to avoid blowing up these configuration options because it'll be hard to maintain and will confuse everyone. There were some suggestions to either do a weight-based approach or perform multiple passes, with these subnets being picked up in the second pass. @Summonair were you able to try these? If not, I'll try to prioritize working on this in the next few weeks.

Thank you.
FYI to those suffering from this issue, you may be able to reclaim up to 2x of your IP addresses by setting doc reference |
I would recommend setting the
@saurav-agarwalla were you and the team able to make a decision on the best design for this yet? I think the most important thing at this moment is reducing friction for everyone, and this PR in its current form looks to do just that in a safe way. I don't think this one boolean alone creates any additional unmanageable cruft. I'm sure you and the team will continue to iterate and make this better in future versions, but please consider allowing this as a temporary solution to make our lives a bit easier and buy the team more time to consider what the right solution should look like.

@saurav-agarwalla @jonathan-innis

we really need this 😢
Just as an update, this hasn't dropped off our radar and we have been discussing it internally. I am handing this over to @jmdeal since my bandwidth has been a little constrained to pay full attention to this, but based on our discussions offline, we may have a path to getting this merged with some updates. @jmdeal will share more details. Thanks @jmdeal for your help.

I'm aiming to give an update here by the beginning of next week; I want to make sure the rest of the maintainer team is aligned before providing feedback, to minimize churn.

@jmdeal, any update?

@jmdeal any update on this, and do we know the estimated time that this PR will be merged and a new version released?
I've reviewed with the team, and we do not see a good path forward for this at the moment. I've summarized our rationale and the potential steps forward here:

**Summary of current state**

Karpenter combines two sources to estimate the number of IPs available in a given subnet:
When new values are discovered via DescribeSubnets, the tracked in-flight IPs are reset for that subnet. This estimate is not a lower-bound for the number of IPs in the subnet for two reasons:
Currently this information is only used to prioritize a subnet within a given zone, so it not being a strict lower-bound isn't problematic. However, that changes if we begin to use this estimate to filter out subnets entirely.

**What are the issues with the proposed solution?**

Because the tracked IP count is not a lower-bound, the change made in this PR will not prevent cases where pods fail to start due to insufficient IPs. If Karpenter overestimates the number of IPs available in a subnet, the same issue will continue to occur. The current change may mitigate the issue in some cases, but due to the way we currently resync IP counts from EC2 it will only be effective if nodes are consistently at max capacity.

Additionally, this change would be overly pessimistic for the majority of users. If node density is dictated by something other than maxPods (e.g. CPU or memory resources), this is going to overestimate the number of IPs that could be in use in a subnet. If this change was enabled for all users, it would result in scaling failures where scaling would have otherwise succeeded. This is the reason we resync IPs the way we do - it reflects the true IP consumption in the subnet.

**What could we do in Karpenter to solve this issue?**

If we ensure Karpenter's IP estimate is a lower-bound for the number of IPs available in a cluster, we should be able to prevent this issue altogether. By ensuring that the estimate is a lower-bound, we ensure that no combination of pods scheduled to nodes in the subnet would exceed the number of IPs available in the subnets. However, this would be far too pessimistic for most workloads and is not a behavior we would want to support in Karpenter - it's a massive footgun without careful tuning.

**What can we do?**

We acknowledge that this is an issue Karpenter users face, but we don't believe there is an adequate solution in Karpenter alone. There are two reasons for this:
To solve this issue, I believe we need a solution which involves both the CNI and kube-scheduler. In an ideal world, the scheduler would not bind a pod to a node without the CNI first allocating an IP for that pod. Alternatively, we could have a construct for resource limits within a topology (and introduce a subnet topology key). I imagine either of these proposals would need to take the form of a KEP. This is an issue we would like to tackle, but we don't have timelines to share at the moment wrt when it would be prioritized.

There are also architectural changes that can be considered which wouldn't require changes to Karpenter or upstream Kubernetes:
**Next Steps**

As I mentioned earlier, we do understand this is an issue facing some Karpenter users, and architectural improvements aren't always an option. We do plan on working on this issue, but our focus will be addressing it upstream rather than implementing a mitigation within Karpenter. While I don't see a great path for the approach in this PR, we would still review an RFC that addresses the shortcomings and sharp edges discussed above. Opening an RFC does not guarantee that we would move forward with the change - it would still need to be reviewed and approved - but it is the potential path forward for a mitigation within Karpenter.
Can you please explain more about this: 'our focus will be addressing it upstream rather than implementing a mitigation within Karpenter.' Thank you.
To address this issue holistically, there really needs to be a change upstream, as discussed above. Given that we won't be going forward with this solution, I'm going to close this PR out but leave the issue open.
Fixes #5234 , #2921
Description
It's a feature: when launching an instance, select the subnet which has the most available IP count (by @Vacant2333). I added a test around it so we can merge it.
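A minimal sketch of the selection described, modeled loosely on EC2's `AvailableIpAddressCount` field (illustrative, not the actual diff): at launch time, pick the subnet with the most available IPs.

```go
package main

import "fmt"

type Subnet struct {
	ID                      string
	AvailableIPAddressCount int
}

// mostAvailable returns the subnet with the highest available IP count,
// the selection this PR applies when launching an instance.
func mostAvailable(subnets []Subnet) (Subnet, bool) {
	if len(subnets) == 0 {
		return Subnet{}, false
	}
	best := subnets[0]
	for _, sn := range subnets[1:] {
		if sn.AvailableIPAddressCount > best.AvailableIPAddressCount {
			best = sn
		}
	}
	return best, true
}

func main() {
	sn, _ := mostAvailable([]Subnet{{"subnet-a", 12}, {"subnet-b", 250}, {"subnet-c", 97}})
	fmt.Println(sn.ID) // subnet-b
}
```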
How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Merged with #7310.