
feat: RFC Implementation Supporting ODCR #6198

Open · wants to merge 1 commit into main

Conversation

tvonhacht-apple (Contributor) commented May 14, 2024

This is a collaboration on implementing RFC #5716, Supporting ODCRs.

Progress

Description

Supporting associating ODCRs with EC2NodeClass

Adds a new field capacityReservationSelectorTerms to EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: example-node-class
spec:
  capacityReservationSelectorTerms:
    - # The Availability Zone of the Capacity Reservation
      availabilityZone: String | None
      # The ID of the Capacity Reservation
      id: String | None
      # The type of operating system for which the Capacity Reservation reserves capacity
      instancePlatform: String | None
      # The instance type for which the Capacity Reservation reserves capacity
      instanceType: String | None
      # The ID of the Amazon Web Services account that owns the Capacity Reservation
      ownerId: String | None
      # Tags is a map of key/value tags used to select capacity reservations
      # Specifying '*' for a value selects all values for a given tag key.
      tags: Map | None
      # Indicates the tenancy of the Capacity Reservation.
      # A Capacity Reservation can have one of the following tenancy 'default' or 'dedicated':
      #   default - The Capacity Reservation is created on hardware that is shared with other Amazon Web Services accounts.
      #   dedicated - The Capacity Reservation is created on single-tenant hardware that is dedicated to a single Amazon Web Services account.
      tenancy: String | None

This closely follows how EC2 DescribeCapacityReservations can filter, though not all of its filter fields are implemented.
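
For illustration, a minimal sketch (not the PR's actual code) of how a single selector term could map onto an ec2:DescribeCapacityReservations call with aws-sdk-go; the abbreviated selector struct and its field set are assumptions:

package odcr

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// Abbreviated stand-in for the selector term; the real type has more fields.
type capacityReservationSelectorTerm struct {
	ID               string
	InstanceType     string
	AvailabilityZone string
}

func selectReservations(api ec2iface.EC2API, term capacityReservationSelectorTerm) (*ec2.DescribeCapacityReservationsOutput, error) {
	input := &ec2.DescribeCapacityReservationsInput{}
	if term.ID != "" {
		// An explicit reservation ID is passed directly rather than as a filter.
		input.CapacityReservationIds = []*string{aws.String(term.ID)}
	}
	if term.InstanceType != "" {
		input.Filters = append(input.Filters, &ec2.Filter{
			Name:   aws.String("instance-type"),
			Values: []*string{aws.String(term.InstanceType)},
		})
	}
	if term.AvailabilityZone != "" {
		input.Filters = append(input.Filters, &ec2.Filter{
			Name:   aws.String("availability-zone"),
			Values: []*string{aws.String(term.AvailabilityZone)},
		})
	}
	return api.DescribeCapacityReservations(input)
}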

Karpenter will perform validation against the spec to ensure there are no violations prior to creating the LaunchTemplates.

Supporting new capacity-type capacity-reservation

Adding a new karpenter.sh/capacity-type: capacity-reservation allows us to have an EC2NodeClass that does not automatically fall back to on-demand when capacity-reservation capacity is not available.

  • only allow capacity-reservations:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - capacity-reservation

  • capacity-reservation with on-demand fallback (falls back to on-demand if capacity-reservation capacity is not available):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - capacity-reservation
        - on-demand

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tvonhacht-apple tvonhacht-apple requested a review from a team as a code owner May 14, 2024 05:57
netlify bot commented May 14, 2024

Deploy Preview for karpenter-docs-prod canceled.

🔨 Latest commit: 5471bc4
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/664c35ed65311e0008595635

@@ -154,12 +157,63 @@ func (r Resolver) Resolve(ctx context.Context, nodeClass *v1beta1.EC2NodeClass,
maxPods: int(instanceType.Capacity.Pods().Value()),
}
})

zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
capacityReservations := []v1beta1.CapacityReservation{}
tvonhacht-apple (author) commented:

I think we can handle this within the if below; we don't need to create this variable beforehand.
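
For illustration only, the suggestion amounts to something like this (a sketch, not the PR's code):

if capacityType == "capacity-reservation" {
	capacityReservations := []v1beta1.CapacityReservation{}
	// ... populate and consume capacityReservations entirely within this branch ...
}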

@@ -96,6 +96,9 @@ func (c *CloudProvider) Create(ctx context.Context, nodeClaim *corev1beta1.NodeC
}
instance, err := c.instanceProvider.Create(ctx, nodeClass, nodeClaim, instanceTypes)
if err != nil {
if cloudprovider.IsInsufficientCapacityError(err) {
Reviewer comment:

If we already get an ICE error back, we shouldn't have to wrap the error again. When we do a check down the line, it should be able to identify that the error is an ICE error, as long as one of the wrapped errors is.
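
As a generic illustration of the point (standard library only, not Karpenter's actual ICE helper), wrapping with %w preserves the error chain, so a single check at the top still matches:

package main

import (
	"errors"
	"fmt"
)

var errInsufficientCapacity = errors.New("insufficient capacity")

func launch() error {
	return errInsufficientCapacity
}

func create() error {
	if err := launch(); err != nil {
		// No re-wrapping into a new ICE error is needed; %w keeps it detectable.
		return fmt.Errorf("creating instance: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(errors.Is(create(), errInsufficientCapacity)) // prints: true
}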


zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
capacityReservations := []v1beta1.CapacityReservation{}
if capacityType == "capacity-reservation" {
Reviewer comment:

If we select a capacity reservation NodeClaim, should we just best-effort the NodeClaim launch here and then have it get deleted with an ICE error from Fleet if there isn't any available capacity? The next iteration of GetInstanceTypes should have the updated capacity reservation availability, so we shouldn't try to launch with the same offering again on the second attempt.

zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
capacityReservations := []v1beta1.CapacityReservation{}
if capacityType == "capacity-reservation" {
for _, capacityReservation := range nodeClass.Status.CapacityReservations {
Reviewer comment:

lo.Filter?
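
A sketch of what the lo.Filter version could look like; the predicate (available capacity remaining) is an assumption, since the full loop body isn't shown here:

// Assumes: import "github.com/samber/lo"
capacityReservations := lo.Filter(nodeClass.Status.CapacityReservations,
	func(cr v1beta1.CapacityReservation, _ int) bool {
		return cr.AvailableInstanceCount > 0
	})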

return nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("trying to resolve capacity-reservation but no capacity reservations are available"))
}
}

for params, instanceTypes := range paramsToInstanceTypes {
Reviewer comment:

When we group paramsToInstanceTypes above, should we also group these with the capacity reservations in mind? This would allow us to keep the same logic on L178, where we iterate through the params and instance types. I think this may work better as well, since there should only be specific instance types that are valid for a given capacity reservation.
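
One possible shape of that idea, with hypothetical field names, would be to fold the reservation into the grouping key so each group only carries instance types valid for it:

// Hypothetical sketch: extend the key used to build paramsToInstanceTypes.
type launchParams struct {
	capacityType          string
	zone                  string
	capacityReservationID string // empty when not launching into a reservation
}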

@@ -71,6 +71,12 @@ cat << EOF > controller-policy.json
"Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
"Sid": "EKSClusterEndpointLookup"
},
{
"Effect": "Allow",
"Action": "eks:DescribeCapacityReservations",
Reviewer comment:

Suggested change:
- "Action": "eks:DescribeCapacityReservations",
+ "Action": "ec2:DescribeCapacityReservations",

@@ -48,6 +48,7 @@ var (
"UnfulfillableCapacity",
"Unsupported",
"InsufficientFreeAddressesInSubnet",
"ReservationCapacityExceeded",
Reviewer comment:

When we return this type of ICE error, should we short-circuit and update the capacity reservation that we launched with in-place, so we don't have to wait for another iteration of the capacity reservation polling to update the instance availability?

If we didn't want to directly update this to 0, we could also use this as a trigger to re-call DescribeCapacityReservations since we know that something has changed since we made the launch decision originally
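
A sketch of the in-place short-circuit, with hypothetical helper and variable names (isReservationCapacityExceeded, launchedReservationID, and the ID field are not from the PR):

if isReservationCapacityExceeded(err) {
	for i := range nodeClass.Status.CapacityReservations {
		if nodeClass.Status.CapacityReservations[i].ID == launchedReservationID {
			// Zero the cached availability so the next scheduling loop
			// won't pick this reservation before the poller refreshes it.
			nodeClass.Status.CapacityReservations[i].AvailableInstanceCount = 0
		}
	}
}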

AvailableInstanceCount int `json:"availableInstanceCount"`
// Instance Match Criteria of the Capacity Reservation
// +required
InstanceMatchCriteria string `json:"instanceMatchCriteria"`
Reviewer comment:

What is instance match criteria?

InstanceMatchCriteria string `json:"instanceMatchCriteria"`
// Instance Platform of the Capacity Reservation
// +required
InstancePlatform string `json:"instancePlatform"`
Reviewer comment:

What's the instance platform?

InstanceType string `json:"instanceType"`
// Owner Id of the Capacity Reservation
// +required
OwnerID string `json:"ownerId"`
Reviewer comment:

Do we need to add OwnerID here?

@@ -175,6 +179,39 @@ type AMISelectorTerm struct {
Owner string `json:"owner,omitempty"`
}

// CapacityReservationSelectorTerm defines selection logic for a Capacity Reservation used by Karpenter to launch nodes.
// If multiple fields are used for selection, the requirements are ANDed.
type CapacityReservationSelectorTerm struct {
Reviewer comment:

How many of these selector fields do we think we truly need for the initial run? We can always add more fields later, but the second we introduce something, we're stuck with it. At a high level, I can see tags, id, and instanceType making sense. Do we think the other fields would be commonly used?
