
docs: RFC Supporting ODCR in Karpenter #5716

Open · wants to merge 25 commits into base: main

Conversation

@garvinp-stripe (Contributor) commented Feb 23, 2024

Starts the fix but doesn't completely fix #3042

Description
Narrowly scoped design doc for supporting ODCR in Karpenter.

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented Feb 23, 2024

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit 6760e96
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/65d7e03e3d1e3f00089b777d


netlify bot commented Feb 23, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit fca1abf
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/665e39be4bf34d000850af24
😎 Deploy Preview https://deploy-preview-5716--karpenter-docs-prod.netlify.app

@garvinp-stripe garvinp-stripe marked this pull request as ready for review February 23, 2024 21:21
@garvinp-stripe garvinp-stripe requested a review from a team as a code owner February 23, 2024 21:21
@jonathan-innis (Contributor) left a comment

Nice doc! Some high-level comments

designs/odcr.md Outdated

_We are keeping the scope of this design very targeted so even if these could be things we eventually support, we aren't scoping them into this design_
- Supporting prioritization when launching nodes
- Supporting any pre-calculation of ODCR capacity utilization before node launch
Contributor

Why do we think that pre-calculation should be out of scope here? How do we handle the case where we try to launch into a capacity reservation that doesn't have any available instance types and it fails?

Contributor Author

I think it would be quite complicated to implement. Because checking whether capacity exists and then launching against it can't be done atomically, there is no guarantee that we won't run into capacity issues. I felt that allowing this to fail, and falling back to being unable to launch additional capacity from this node pool, would be reasonable and would push users toward using weights on different node pools to pick up any insufficient-capacity issues.

Contributor

Sure, I just wonder if we can do this optimistically, with ICE-ing as a back-stop. I worry that if we don't pre-check we'll just keep slamming the API because we think that the NodePool still has instance type availability every 3m.


Mm, that makes sense. I think it's reasonable to call the API to check. Do you think we should cache this as part of the existing instance type offering cache?

designs/odcr.md Outdated
## Non-Goals

_We are keeping the scope of this design very targeted so even if these could be things we eventually support, we aren't scoping them into this design_
- Supporting prioritization when launching nodes
Contributor

Doesn't Karpenter already support this through its NodePool "weight" concept?

Contributor Author

I meant within a single node pool. I can update

designs/odcr.md Outdated
Cons:
- This implementation is overly restrictive, making it difficult to use. It's possible that the restrictions make it too hard for users to use EC2NodeClasses effectively.

Both a pro and a con:
Contributor

In general, I think this is a good thing. When it comes to fallback and prioritization, I think this makes sense. You have a set of instance types that you prioritize for usage with the ODCR and then you can create a lower-weighted NodePool that has a broader set of instance types or perhaps uses the same instance type that you had for the ODCR but with spot and on-demand to try to get that type again just like from a normal request

Contributor Author

Agreed
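
For illustration, here is a minimal sketch of that weighted-fallback pattern using today's NodePool `weight` field. The NodePool and EC2NodeClass names are placeholders, and the assumption that `odcr-node-class` targets the reservation is the mechanism this RFC proposes, not an existing API:

```yaml
# Higher-weight NodePool evaluated first: restricted to the ODCR-backed EC2NodeClass.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: odcr-priority
spec:
  weight: 100                        # higher weight wins when multiple NodePools match
  template:
    spec:
      nodeClassRef:
        name: odcr-node-class        # placeholder: EC2NodeClass selecting the capacity reservation
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Lower-weight NodePool used as fallback with a broader capacity selection.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: fallback
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        name: default-node-class     # placeholder: regular EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```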


#### Labels

When a node is launched against a CapacityReservation we will expose Capacity Reservation
Contributor

Anytime we are thinking about setting a label, we have to think about whether a user would select against it. Are you anticipating that users would select against the capacity reservation ID? I could perhaps see that, but I think that the reservation setting is significantly less valuable. Were you thinking of just having this there for observability information?

In either case, I think that we should call out how we expect users to use these new labels if we are going to propose them here.


Agree on the setting; I think the ID is much more helpful than the setting.

Contributor

If you have a dedicated NodePool for an ODCR or set of ODCRs, it does seem like you would just want to label those nodes and have a node selector on that instead of the ODCR ID.

Contributor

How do I track that my nodeclaim is part of a targeted capacity reservation that no longer exists? If I have another targeted capacity reservation that has available capacity, I need to terminate my nodeclaim and bring up a new one on the unused capacity reservation.

designs/odcr.md Outdated

#### Status

We will expose this information in the NodeClaim's, EC2NodeClass's, and NodePool's Condition Status. _I know that currently NodeClasses and NodePool status do not have conditions, but I wanted to see if we are open to adding Conditions to these resources_
Contributor

Yes, absolutely open to adding conditions, see: kubernetes-sigs/karpenter#493

designs/odcr.md Outdated
Status:
  Conditions:
    Last Transition Time:  2024-02-22T03:46:42Z
    Status:                LimitExceeded
Contributor

Do you think that we need this status condition if Karpenter is aware of the available capacity and it can just know when to fallback to other NodePools since it ran out of capacity?

Contributor Author

I think so. From an operator's perspective it would be good to know that this nodeclass has currently hit its limit and we are going to fall back into other node pools.

@coveralls

Pull Request Test Coverage Report for Build 8025230145

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.579%

Totals Coverage Status
Change from base Build 8025066475: 0.0%
Covered Lines: 5053
Relevant Lines: 6119

💛 - Coveralls

garvinp-stripe and others added 3 commits March 20, 2024 10:55
Co-authored-by: Jonathan Innis <jonathan.innis.ji@gmail.com>
Co-authored-by: Jonathan Innis <jonathan.innis.ji@gmail.com>
@garvinp-stripe (Contributor Author)

Sorry for the slowdown. I will try to get back to this PR, but at the moment I got reshuffled at work, so I need some time to get back to this.

@jonathan-innis jonathan-innis self-assigned this Apr 16, 2024
designs/odcr.md Outdated
Comment on lines 48 to 52
capacityReservationSpec:
  capacityReservationPreference: open | none | None # Cannot be defined if capacityReservationTarget is specified
  capacityReservationTarget: # Cannot be defined if capacityReservationPreference is specified
    capacityReservationId: String | None
    capacityReservationResourceGroupArn: String | None
Contributor

Can we keep this similar to the existing design of how to find objects?
It seems selectorTerms is the current design (e.g.: https://github.com/aws/karpenter-provider-aws/blob/main/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml#L60-L102)

Suggested change (replacing the capacityReservationSpec block above):

capacityReservationSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
      environment: test
  - id: cr-1234567 # arn
  - preference: open | none | None
capacityReservationResourceGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "${CLUSTER_NAME}"
      environment: test
  - id: arn:aws:resource-groups:us-west-2:123456789012:group/my-resource-group
  - preference: open | none | None

Contributor

I'd be skeptical of the need to specify multiple ODCRs that we can select on in a single EC2NodeClass. What are you imagining the use-case is for wanting to specify multiple ODCRs in an EC2NodeClass? The only one I can think of is multi-AZ support. Other than that -- I figure you get an ODCR because you have a particular type of capacity that you are planning to use for your workload and you want to target that specifically. Do we have an example of how a pod would target this capacity? Would we imagine that these pods target instance types with preferences and then fall back?

Member

The use of multiple ODCRs is because additional capacity is (usually) allocated via a new ODCR. So if you are growing your capacity, you are doing so by adding additional ODCRs. This is why it's recommended to use the resource group, which acts as the container; as new capacity is allocated via additional ODCRs, those are associated to the resource group, which expands the pool of available capacity.

The use of the resource group helps to decouple the act of adding capacity from the consumers of the capacity (i.e. - I don't need to keep updating my manifests when capacity is added. Might be less of a concern for Karpenter, but can be quite disruptive and harder to consume with node groups that are utilizing launch templates)
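
To make that concrete, a sketch of an EC2NodeClass that selects the resource group rather than individual ODCRs, using the hypothetical `capacityReservationResourceGroupSelectorTerms` field from the suggestion above (not an existing API); growing capacity then means associating new ODCRs with the group, with no manifest change:

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: odcr-pool
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-example"                  # placeholder role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  # Hypothetical field per the suggestion above: select the resource group that
  # contains the ODCRs, so adding or removing ODCRs never touches this manifest.
  capacityReservationResourceGroupSelectorTerms:
    - id: arn:aws:resource-groups:us-west-2:123456789012:group/my-resource-group
```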

designs/odcr.md Outdated

The main failure scenario is when the Capacity Reservation limit is hit and no new nodes can be launched from any Capacity Reservation the launch template targets.

#### Status
Contributor

For this, we need documentation that we will update the status of the EC2NodeClass with the found capacity reservations / resource groups, similar to how we do it for AMIs or Subnets.
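
A hedged sketch of what that discovered-reservation status could look like, mirroring how the EC2NodeClass status already lists resolved subnets and AMIs; the field names are illustrative only, loosely following the `status.capacityReservations` shape and the EC2 DescribeCapacityReservations fields referenced later in this RFC:

```yaml
status:
  subnets:
    - id: subnet-0123456789abcdef0
      zone: us-west-2a
  # Illustrative only: reservations discovered by the selector terms.
  capacityReservations:
    - id: cr-1234567890abcdef0
      instanceType: m5.large
      availabilityZone: us-west-2a
      availableInstanceCount: 8
      totalInstanceCount: 10
```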

designs/odcr.md Outdated

Pros:
- This puts the onus of checking on the user to verify compatibility of their NodeClasses and requirements. It makes no assumption about what a Capacity Reservation can do nor what it can't do (open to checking if Capacity is at its limit prior to calling CreateFleet).
- This forces users who wish to leverage fallback and prioritization to use NodePool weights or Capacity Reservation Group rather than relying on EC2NodeClass.
Contributor

This only works for Capacity Reservations.
As far as I understand, Capacity Reservation Resource Groups have an automatic fallback to on-demand? For the initial design, are we for that reason only scoping in Capacity Reservations?

designs/odcr.md Outdated

## Background

In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate against the risk of being
Contributor

Suggested change
In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate against the risk of being
In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate the risk of

designs/odcr.md Outdated
## Background

In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate against the risk of being
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.
Contributor

Suggested change
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.
getting on-demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.

designs/odcr.md Outdated
## Background

In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate against the risk of being
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.
Contributor

Suggested change
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected or for reserving highly-desired instance types, like the `p5.48xlarge` or other large GPU instance types.

AWS also supports grouping Capacity Reservation into Capacity Reservation groups.
Both these entities are supported in Launch Template's CapacityReservationTarget [definitions](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-launchtemplate-capacityreservationtarget.html).

## Goals
Contributor

Does it make sense to lay out a use-cases section here for the need and usage of ODCRs? Knowing these use-cases will help set the context and also help us think about how we should design the solution

Member

One additional goal I might (strongly) suggest - the ability to define "fairness" in some form when an ODCR is shared amongst multiple nodepools/clusters. The scenario:

  • An ODCR for 500 instances is created
  • That ODCR is shared across x-number of AWS accounts (say 5)
  • Each account contains a cluster that should be entitled to roughly 1/x of the reservation at minimum (each cluster at minimum is guaranteed access to launch 100 instances)

The most basic form could be simply providing a max/ceiling on the number of instances a nodepool can launch from a reservation and let users set that value.

This is used in conjunction with the resource group - as more capacity is needed, more ODCRs are created and added into the "pool" of capacity, but that "pool" of capacity is shared amongst multiple Karpenter nodepools across clusters/accounts

Contributor

Can't you already do this by having an ODCR NodePool with max vCPUs configured?
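
For reference, a minimal sketch of that per-NodePool ceiling using the existing NodePool `limits` field (names are placeholders; whether vCPU is a good proxy is debated below):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: odcr-share
spec:
  # Existing Karpenter behavior: provisioning stops once the NodePool's total
  # allocated resources reach these limits (a ceiling, not a guaranteed share).
  limits:
    cpu: "400"                       # e.g. 100 instances x 4 vCPU each
  template:
    spec:
      nodeClassRef:
        name: odcr-node-class        # placeholder EC2NodeClass targeting the reservation
```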

Contributor

One additional goal I might (strongly) suggest - the ability to define "fairness" in some form when an ODCR is shared amongst multiple nodepools/clusters. The scenario:

  • An ODCR for 500 instances is created
  • That ODCR is shared across x-number of AWS accounts (say 5)
  • Each account contains a cluster that should be entitled to roughly 1/x of the reservation at minimum (each cluster at minimum is guaranteed access to launch 100 instances)

The most basic form could be simply providing a max/ceiling on the number of instances a nodepool can launch from a reservation and let users set that value.

This is used in conjunction with the resource group - as more capacity is needed, more ODCRs are created and added into the "pool" of capacity, but that "pool" of capacity is shared amongst multiple Karpenter nodepools across clusters/accounts

What you are describing is outside of the current capabilities of AWS itself. Once those capabilities are available, we can update or create another design document to incorporate this.

Contributor

Can't you already do this by having an ODCR NodePool with max vCPUs configured?

Agreed, seems like a quick win, but is management overhead for now

Member

What you are describing is outside of the current capabilities of AWS itself

Which capability(s) exactly? You can share ODCRs across accounts

Can't you already do this by having an ODCR NodePool with max vCPUs configured?

In theory, yes - but for most of the workloads that are utilizing GPUs via ODCRs, you aren't using vCPU as a proxy for your GPU allotment. p5s have double the vCPU capacity of p4s - so not a great proxy value

Not saying every use case has to be considered and implemented on the first go - but it seems we are missing the mark on ODCRs and how they are intended to be used:

  • ODCRs should be added to a resource group and treated like a pool of compute resources. This pool of compute will expand/contract by adding/removing ODCRs
  • ODCRs are not always 1:1 to a cluster/account/team (they rarely are 1:1) - again, its a pool of resources and that pool may need to be divided/shared amongst clusters/accounts/teams
  • If you have a fixed pool of available resources, you will inevitably need to consider priority and fairness in terms of resource allocation. On day 1 you may have two teams who each consume half the pool by verbally agreeing to use x number of instances. As workloads grow in size, as is the norm for ML workloads, teams will need more resources and now you need to give the whole pool of resources to one team at a time but still ensure teams have equal time with the pool of resources

This is an RFC so not looking for every configuration or use case to be provided initially, but just want to ensure we are looking at the use cases and workloads and thinking about the implementation holistically - at least ensuring we have considered those common use cases and can support them in the future without too much churn/disruption to the API


Those cases were considered; however, we didn't choose to implement them on the first pass for a few reasons.

  • Resource groups for capacity reservations make the problem considerably more complicated because of their fallback logic to OD if the Capacity Reservation Group (CRG)'s ODCRs can't satisfy the request. This requires more complicated reconciliation, because at this time we assume any ODCR costs zero and has a fixed limit of instances. That isn't to say this won't be supported, but to move this project forward we needed to narrow in on a basic use case before we tackle the harder ones.
  • Multi-account usage is supported in the selectors; however, you are right that we lack any fairness. Once again, because we didn't get any early feedback from users, we didn't implement any fairness between "shared" ODCRs.

Contributor

Please join the Karpenter working group on June 6th at 2PM PST to discuss (https://karpenter.sh/docs/contributing/working-group/).

designs/odcr.md Outdated
- Support associating ODCR to EC2NodeClass
- Define Karpenter's behavior when launching nodes into Capacity Reservation
- Define Karpenter's behavior when encountering errors when attempting to launch nodes into Capacity Reservation
- Define Karpenter's behavior when consolidating nodes in Capacit Reservation
Contributor

Suggested change
- Define Karpenter's behavior when consolidating nodes in Capacit Reservation
- Define Karpenter's behavior when consolidating nodes in Capacity Reservation

designs/odcr.md Outdated
##### Consolidating Capacity Reserved Instances

During consolidation, pricing does matter as it affects which candidate will be [prioritized](https://github.com/kubernetes-sigs/karpenter/blob/75826eb51589e546fffb594bfefa91f3850e6c82/pkg/controllers/disruption/consolidation.go#L156). Since all capacity instances are paid ahead of time, their cost is already incurred. Users would likely want to prioritize filling their capacity reservation first, then fall back into other instances. Because of this, reserved instances should likely show up as 0-dollar pricing when we calculate instance pricing. Since each candidate is tied to a NodePool and NodeClass, we should be able to safely override the pricing per node under a capacity reservation.
Contributor

I'd view this as a special type of "offering" in the InstanceTypes returned by the CloudProvider. This is an ODCR offering that has price 0.

Member

Since the cost is already a sunk cost (bought and paid for up front, no savings to be had by consolidating), we still need to consider consolidation from the perspective of shared resources. Meaning, there are other consumers (nodepools) that will potentially want access to these resources if they are available


##### Consolidating into Capacity Reserved Instances

If we track Capacity Reservation usage, we can optimize the cluster configuration by moving non-Capacity Reserved instances into
Contributor

I think this should just work out-of-the-box if the user has prioritization set up with weights on their NodePools and we model the pricing correctly. I'd rather just assume that this works and then discuss what the interaction is going to look like.

Member

Based on some of the usage patterns we've seen so far - a form of priority and fairness needs to be considered. We don't want one nodepool, or a few nodepools, consuming the majority of the reservation and leaving other nodepools starved for resources because they are unable to provision instances

Contributor

What do you mean "move"? Do you mean between the NodePools? Because if the ODCR is open, it will automatically bind even if you didn't target it.


I think this meant multiple node pools having the same targeted ODCR.

#### Labels

When a node is launched against a CapacityReservation we will expose Capacity Reservation
information as labels `karpenter.k8s.aws/capacity-reservation-id`. This is helpful for users to identify those nodes that are being used
Contributor

Open question: If we're tying one capacity reservation ID to an EC2NodeClass, we may not need this label at all because we can just use something like karpenter.k8s.aws/ec2nodeclass to know which CR is being used.

Member

I think this is bringing in a conflicting detail that most users would not need to be aware of:

  1. If we add this label, then we are creating a 1:1 of capacity reservation to nodepool; this conflicts with the statements above about using resource groups where multiple ODCR IDs are added to the group, and the group is what is used for creating capacity, not the ODCRs directly
  2. This is not a "characteristic" a workload would typically look for in terms of label selectors. Workloads will look to target things such as OS platform, device manufacturer or model, etc. I would be curious to know the use cases where consumers are aware of the ODCR IDs and why they would use those for selectors over other characteristics provided by the instance details (i.e. - think NFD or GFD)

Contributor

Is there a 1 ODCR per NodePool restriction? If so, that's not clear to me. What if your CR selector terms match multiple ODCRs??

Contributor

The label is attached to the NodeClaim and Node, not the NodePool or EC2NodeClass. It's a fact that gets added to the NodeClaim when it was brought up within a capacity reservation.
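
As a sketch of how a workload could consume that node-level label once it exists (the label key is the one proposed in this RFC; the reservation ID is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reserved-workload
spec:
  # Schedule only onto nodes Karpenter launched into this specific reservation.
  nodeSelector:
    karpenter.k8s.aws/capacity-reservation-id: cr-1234567890abcdef0
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "infinity"]
```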

designs/odcr.md Outdated

#### Status

We will expose this information in the NodeClaim's, EC2NodeClass's, and NodePool's Condition Status. _I know that currently NodeClasses and NodePool status do not have conditions, but I wanted to see if we are open to adding Conditions to these resources_
Contributor

I think we may only need/want to expose this in the EC2NodeClass status condition. On top of that, this gets trickier depending on if we are talking about supporting a single capacity reservation or multiple. Obviously, only supporting one means that we can create a status condition to represent that one, but what does it mean when we have multiple? Do we just document the list of specified capacity reservations that aren't available?

designs/odcr.md Outdated
We will avoid updating [unavailableOfferingCache](https://github.com/aws/karpenter-provider-aws/blob/main/pkg/providers/instance/instance.go#L239C41-L239C58) because the pool is different from the rest of AWS. However, we may want to create a new unavailable offering cache keyed against Capacity Reservations. _Not sure if we want to support this during the first iteration_

## Open Questions
- The UX of adding a Capacity Reservation feels wrong because NodeClasses previously didn't fully restrict instance types, but with a Capacity Reservation it kind of does. There isn't a good primitive in Karpenter to expose these kinds of restrictions (I think?), specifically around preflight or static analysis that tells you your NodePool may not be able to launch any nodes because of X, Y and Z in your EC2NodeClass. I believe this is already an issue where, if a NodeClass selects for x86 architecture AMIs but the NodePool allows for ARM architecture instance types, Karpenter may just quietly never spawn ARM instances?
Contributor

There's a primitive. When you are resolving the GetInstanceTypes() call, you can just tell it not to return certain instance types based on the EC2NodeClass used. We do have some slight mutation behavior around some EC2NodeClasses specified today already; for example, when using Windows, this scopes down the supported instance types that we return.

In AWS [ODCR](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-reservations.html) allows users to reserve compute capacity to mitigate against the risk of being
unabled to get on demand capacity. This is very helpful during seasonal holidays where higher traffic is expected.

### Capacity Reservations
Member

FYI - there are (currently) two forms of reservation - on-demand capacity reservations (ODCR) and ML capacity block reservations (CBR) - either the terminology used here will need to be updated to be specific to ODCR, or the implementation should consider both forms of reservations

Contributor

Agreed that it's worth considering, but it does also add the complication that capacity blocks are also their own capacity type (i.e. like Spot and OD).

@GnatorX commented Jun 3, 2024

The doc is specific to ODCR so for now we won't support CBR

Contributor

Please join the Karpenter working group on June 6th at 2PM PST to discuss (https://karpenter.sh/docs/contributing/working-group/).

designs/odcr.md Outdated
However in the case where a Capacity Reservation Group is used it will fall back into On Demand nodes when the Reservation Group limit is hit.

We won't introduce any other prioritization nor fallback when launching nodes against EC2NodeClasses with a Capacity Reservation. This means that it is the user's responsibility to ensure the Capacity Reservation associated to an EC2NodeClass includes the instance types, availability zone, and platform the EC2NodeClass should launch.
This also means that we will not allow an EC2NodeClass to create Spot instances when the user specified a Capacity Reservation.
Member

why not?

designs/odcr.md Outdated
    Status:                LimitExceeded
    Type:                  CapacityReservation
```
The condition will reset if new nodes were able to launch and the Status will return to `Available`. It may also be helpful to show the
Member

What about when new ODCR ID(s) are added to the resource group - shouldn't this trigger an update?

garvinp-stripe and others added 2 commits May 21, 2024 09:19

- Define Karpenter's behavior when encountering errors when attempting to launch nodes into Capacity Reservation
- Define Karpenter's behavior when consolidating nodes to Capacity Reservation when available

## Non-Goals
Contributor

May want to add Capacity Blocks to this list. I think Karpenter should support Capacity Blocks, but you may want to separate that out (or else explicitly call it out in the goals).

@GnatorX commented Jun 3, 2024

Is CBR part of ODCR? The doc's title is specific to ODCR so I didn't think we would include any info around CBR because that isn't an ODCR

Member

CBR is a separate reservation process from ODCR. I think the main point is just being explicitly clear about what this RFC is addressing, and also clear in the terminology used throughout to avoid any confusion between the two forms. For example, Capacity Reservation is used generically throughout; it's not specific enough to disambiguate which form of capacity reservation is meant.

Contributor

Capacity Blocks are a type of Capacity Reservation. It makes more sense for the API to be around "capacity reservations" (which is what the EC2 APIs are around) versus "ODCRs", which is a shorthand for a capacity reservation that is typed to on-demand. I see that your RFC calls out "capacityReservationSelectorTerms", which does not sound ODCR specific to me. As a result, it does not seem obvious to me that Capacity Blocks are out of scope, hence my comment that it should probably be in your non-goal list and is probably something to think about in the design (even if you don't implement it now) because I'm guessing some folks will be interested in Capacity Blocks support.


  # dedicated - The Capacity Reservation is created on single-tenant hardware that is dedicated to a single Amazon Web Services account.
  tenancy: String | None
status:
  capacityReservations:
Contributor

Are you purposely excluding start/end date? What about platform, owner id, reservation type, tags, create date, fleet id?

Contributor

Sure, we can add all of them, basically the whole struct.

totalInstanceCount: Integer
```

This follows closely (though it does not implement all fields) how EC2 [DescribeCapacityReservations](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeCapacityReservations.html) can filter.
Contributor

I assume this is in reference to the selector terms? So I assume you're purposely excluding instance-match-criteria, placement-group-arn, and start/end dates?

Contributor

instanceMatchCriteria is only needed if we support something other than targeted reservations for this first implementation

Contributor

I don't see a use case where you filter based on start/end date for the selector terms, as by default we only filter for active ones.


### Supporting new capacity-type "capacity-reservation" for NodePool

Adding a new `karpenter.sh/capacity-type: capacity-reservation` allows us to have an EC2NodeClass that does not automatically fall back to `on-demand` if `capacity-reservation` is not available.
Contributor

I would strongly suggest you incorporate the reservation type into this so you can differentiate an ODCR from a Capacity Block.

Are we sure capacity-type is the right way to model this? Right now, capacity-type aligns with EC2 instance metadata for market type. This would be a notable divergence from that existing philosophy. I'm wondering if we can't add a new top-level requirement instead of overloading capacity-type?
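
To make the trade-off concrete, a sketch of a NodePool that only requests the new value; note that `capacity-reservation` is a value proposed by this RFC, not one Karpenter supports today, and (per the comment above) it could instead become a separate requirement key:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: reserved-only
spec:
  template:
    spec:
      nodeClassRef:
        name: odcr-node-class              # placeholder EC2NodeClass selecting the ODCR
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["capacity-reservation"] # proposed value; omitting on-demand avoids automatic fallback
```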

Comment on lines +171 to +172
During consolidation, pricing does matter as it affects which candidate will be [prioritized](https://github.com/kubernetes-sigs/karpenter/blob/75826eb51589e546fffb594bfefa91f3850e6c82/pkg/controllers/disruption/consolidation.go#L156). Since all capacity instances are paid ahead of time, their cost is already incurred. Users would likely want to prioritize filling their capacity reservation first, then fall back into other instances. Because of this, reserved instances should likely show up as 0-dollar pricing when we calculate instance pricing. Since each candidate is tied to a NodePool and EC2NodeClass, we should be able to safely override the pricing per node under a capacity reservation.
Contributor

What about if the ODCR gets canceled? The instance will still be running but no longer "0 dollars" - will Karpenter poll to account for this?




designs/odcr.md Outdated

_We are keeping the scope of this design very targeted so even if these could be things we eventually support, we aren't scoping them into this design_
- Supporting open Capacity Reservations. _The behaviour of nodes indirectly linked to an open ODCR can cause rotation, adding unnecessary node rotation into the cluster_
- Supporting changes in scaling behavior when ODCR is associated to a NodeClass. _We won't bring up N nodes to match an N node capacity reservation_
Contributor

Is the idea here that we'd be too reactive? I'd like to understand a bit more why this is hard, and shouldn't be in the MVP.

Contributor

If we do this, that means the ODCRs cannot be shared; this should be an additional feature, if really needed, outside of this initial approach.

metadata:
  name: example-node-class
spec:
  capacityReservationSelectorTerms:
Contributor

The filtering of DescribeCapacityReservations isn't great, so I filter mostly after the fact using the properties available in the object
https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/ec2@v1.162.1/types#CapacityReservation

designs/odcr.md Outdated
Comment on lines 76 to 80
# The platform of operating system for which the Capacity Reservation reserves capacity
id: String | None
# The type of operating system for which the Capacity Reservation reserves capacity
instancePlatform: String | None
# The type of operating system for which the Capacity Reservation reserves capacity
Contributor

probably worth updating these comments

Contributor

fixed

Successfully merging this pull request may close these issues.

Karpenter support for prioritizing AWS Capacity Reservations