Karpenter support for prioritizing AWS Capacity Reservations #3042

Open
dkheyman opened this issue Dec 15, 2022 · 9 comments · May be fixed by #5716
Labels
feature New feature or request needs-design Design required v1.x Issues prioritized for post-1.0

Comments

@dkheyman

dkheyman commented Dec 15, 2022

Tell us about your request

As an EKS platform engineer, I would like to reduce my platform cost and increase resiliency to capacity outages by provisioning EC2 instances through the EC2 Fleet API with Capacity Reservations prioritized. Capacity Reservations are a mechanism for us to reserve capacity for particular instance types in particular AZs; we use them to reduce cost and to make acquiring capacity at large scale more reliable.

I would like to be able to specify Capacity Reservations in the node template, e.g. AWSTemplate.spec.on-demand-options.CapacityReservations, together with use-capacity-reservations-first, in order to make sure the instances that launch land in the targeted capacity reservations in that account. Without this support, we could end up paying for both the reserved capacity and the On-Demand instances launched through EC2 Fleet.
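For illustration only, a rough sketch of what such a field could look like on the node template. None of these fields exist in Karpenter today; the names below are hypothetical and simply mirror EC2 Fleet's on-demand options, and the reservation IDs are placeholders:

```yaml
# Hypothetical AWSNodeTemplate fields, sketched for illustration -- not an
# existing Karpenter API. Reservation IDs are placeholders.
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: odcr-backed
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  onDemandOptions:                              # hypothetical field
    usageStrategy: use-capacity-reservations-first
    capacityReservations:                       # hypothetical: targeted ODCR IDs
      - cr-0123456789abcdef0
      - cr-0fedcba9876543210
```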

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

The problem is that this workload is critical and has a set SLA we need to meet. Capacity Reservations let us reach very large vCPU counts without worrying about capacity availability, and currently we cannot use Karpenter's scalability and other advantages at all.

Reserved Instances and open Capacity Reservations help, but they put the responsibility on the EKS platform engineer to continually monitor reservations and feed the AZs and empty reservation slots to the provisioner, which makes the application team less autonomous and adds complexity to managing the reservations. In addition, now that targeted reservations let specific components of an application draw on different reservation amounts within a single account, there is currently no way to specify a reservation ID in the provisioner either.

Are you currently working around this issue?

There is no workaround; we have no mechanism to use Karpenter because it does not support specifying Capacity Reservations. Cluster Autoscaler does not support this either, so there is currently no way of guaranteeing the use of CRs with EKS on AWS unless we own the EC2 Fleet calls ourselves.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@dkheyman dkheyman added the feature New feature or request label Dec 15, 2022
@spring1843
Contributor

We have something very similar explained here: https://karpenter.sh/preview/concepts/scheduling/#savings-plans-and-reserved-instances
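For reference, a minimal sketch of the pattern that page describes, assuming a v1alpha5 Provisioner and a reservation covering roughly 100 c4.large instances (all names and values illustrative): a higher-weight provisioner capped at the reserved capacity, with the default, lower-weight provisioner acting as the fallback once the limit is reached.

```yaml
# Sketch of the documented Reserved Instances pattern (v1alpha5 API).
# The limit roughly matches the reserved capacity: ~100 x c4.large = 200 vCPU.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: reserved
spec:
  weight: 50                      # evaluated before the default (weight 0) provisioner
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c4.large"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: "200"                  # stop at the reserved capacity; fall back afterwards
  providerRef:
    name: default                 # assumes an AWSNodeTemplate named "default"
```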

@dkheyman
Author

Ryan,

While I appreciate this reserved instance approach, this won’t work with “targeted” capacity reservations because those need to be “pointed to” by the EC2 Fleet API. The current approach you mentioned also won’t factor in availability of capacity reservations by AZ even if the reservation is of type “open”.

If I reserve 20 c4.large instances in each of 5 AZs, 100 in total, Karpenter can launch those 100 instances in any AZ without regard for placing them evenly across the AZs. Similarly, if I had 10 reservations available across 4 AZs, there’s no mechanism for Karpenter to place instances in those AZs accordingly.

Let me know if I'm missing something.

Thanks,
David

@FernandoMiguel
Contributor

> if I had 10 reservations available across 4 AZs, there’s no mechanism for Karpenter to place them in those AZs accordingly.

Your provisioner can specify which AZs to deploy to. Worst case, you could have a provisioner per AZ.
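For example, a per-AZ provisioner could look roughly like this (v1alpha5 API; the zone, instance type, and limit are illustrative and would need to track the reservation actually held in that AZ):

```yaml
# Sketch: one provisioner pinned to a single AZ, capped at the capacity
# reserved there (~20 c4.large = 40 vCPU). Repeat per AZ with its own limit.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: reserved-eu-central-1a
spec:
  weight: 50
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["eu-central-1a"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["c4.large"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      cpu: "40"
  providerRef:
    name: default                 # assumes an AWSNodeTemplate named "default"
```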

@dkheyman
Author

I understand that, but how do I know I have 10 reservations across 4 AZs? There's no way the provisioner would know those exist; as the infrastructure engineer, you would have to track how many are in each AZ, and that number can fluctuate as other workloads or components take up the reserved slots. Why not let EC2 Fleet figure it out, instead of putting the onus on the infrastructure engineers?

And this still does not address targeted Capacity Reservations. If you have multiple workloads, or components of a workload, in the same account, each with different reservations referenced by reservation ID, you are out of luck.

@yaroslav-nakonechnikov

To add some context: since we host our servers in Frankfurt, we see a lot of capacity issues.
We created Capacity Reservations and noticed that Karpenter ignores them: it creates instances of the requested instance type, but in a different AZ.
It would be great if there were a way to tell Karpenter to fill reserved capacity first, and only then fall back to cost- (or otherwise) optimized capacity.

@jeffspahr

I also need the ability to reference targeted On-Demand Capacity Reservations. We use ODCRs for instance types that are capacity constrained, where we need a guarantee of being able to launch those instances.

@billrayburn billrayburn added the v1 Issues requiring resolution by the v1 milestone label Sep 27, 2023
@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Nov 22, 2023
@garvinp-stripe
Contributor

garvinp-stripe commented Feb 14, 2024

I think it makes sense to support capacity reservations, but I am not sure it makes sense to implement a priority within the same NodePool. I think Karpenter already has a solution for this with weights: you would give your ODCR NodePool a higher weight, and if that fails you would fall into the next NodePool (maybe spot), and so on. I think the prioritization is difficult to support because an ODCR is tied to a launch template, which Karpenter creates behind the scenes; to support this fallback behavior, Karpenter would have to maintain a set of launch templates, which I suspect is too complicated. Note that Spot is supported because CreateFleet supports choosing a more optimized allocation strategy.
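A rough sketch of that weight-based fallback with the v1beta1 API (names, limits, and instance types are illustrative): a higher-weight NodePool sized and shaped like the ODCR, and a lower-weight spot NodePool that takes over once the first pool's limit is exhausted. Note that the higher-weight pool can only be shaped like the reservation (instance type, zone, capacity type); it cannot actually target a specific reservation, which is what this issue asks for.

```yaml
# Higher-weight NodePool shaped like the ODCR; the cpu limit approximates the
# reserved capacity so Karpenter falls through once it is consumed.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: odcr
spec:
  weight: 100
  limits:
    cpu: "200"
  template:
    spec:
      nodeClassRef:
        name: default             # assumes an EC2NodeClass named "default"
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c4.large"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Lower-weight fallback NodePool (spot).
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-fallback
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
```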

@ellistarn
Contributor

Fwiw, we maintain many launch templates under the hood per node pool. It's required for things like labels and architectures.

That said, we model "preferential fallback" using the weight feature, so I don't think it's crazy to reuse that here. However, LT combinatorics don't seem like they'll be a design constraint on this question.

@garvinp-stripe
Contributor

garvinp-stripe commented Feb 16, 2024

> Fwiw, we maintain many launch templates under the hood per node pool. It's required for things like labels and architectures.

Yeah, I realize I don't actually know that code very well, and I suspected I was wrong after writing this. But I agree that it's likely better to push that logic to weights instead. I am writing up a brief design for adding capacity reservation support to EC2NodeClasses, so I wanted to make sure we want to keep the LTs under a node class clean and minimal, used only for launching the right node and not for prioritization.

@garvinp-stripe garvinp-stripe linked a pull request Feb 23, 2024 that will close this issue