feat: RFC Implementation Supporting ODCR #6198
Conversation
@@ -154,12 +157,63 @@ func (r Resolver) Resolve(ctx context.Context, nodeClass *v1beta1.EC2NodeClass,
			maxPods: int(instanceType.Capacity.Pods().Value()),
		}
	})

	zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
	capacityReservations := []v1beta1.CapacityReservation{}
I think we can handle this within the if below; we don't need to create this variable beforehand.
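A minimal sketch of that suggestion, assuming the slice is only consumed inside the capacity-reservation branch shown in the diff; the variable names follow the quoted code, but the body is illustrative, not the PR's final code.

```go
if capacityType == "capacity-reservation" {
	// Declared only where it is consumed, per the comment above.
	capacityReservations := []v1beta1.CapacityReservation{}
	for _, cr := range nodeClass.Status.CapacityReservations {
		capacityReservations = append(capacityReservations, cr) // plus whatever filtering the PR applies
	}
	// ... use capacityReservations to resolve launch parameters ...
}
```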
@@ -96,6 +96,9 @@ func (c *CloudProvider) Create(ctx context.Context, nodeClaim *corev1beta1.NodeC
	}
	instance, err := c.instanceProvider.Create(ctx, nodeClass, nodeClaim, instanceTypes)
	if err != nil {
		if cloudprovider.IsInsufficientCapacityError(err) {
If we already get an ICE error back, we shouldn't have to wrap the error again. When we do a check down the line, it should be able to identify that the error is an ICE error, so long as one of the wrapped errors is.
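A generic, self-contained sketch of why re-wrapping is unnecessary (not Karpenter's actual implementation): `errors.Is`/`errors.As` walk the `%w` chain, so a check like `cloudprovider.IsInsufficientCapacityError` can still recognize an ICE error that was wrapped with extra context further up the stack.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for an insufficient-capacity error type.
type insufficientCapacityError struct{ msg string }

func (e *insufficientCapacityError) Error() string { return e.msg }

// Stand-in for a check such as cloudprovider.IsInsufficientCapacityError.
func isInsufficientCapacityError(err error) bool {
	ice := &insufficientCapacityError{}
	return errors.As(err, &ice)
}

func main() {
	base := &insufficientCapacityError{msg: "no capacity"}
	wrapped := fmt.Errorf("creating instance: %w", base) // context added once with %w
	fmt.Println(isInsufficientCapacityError(wrapped))    // true: no second wrap needed
}
```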
	zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
	capacityReservations := []v1beta1.CapacityReservation{}
	if capacityType == "capacity-reservation" {
If we select a capacity reservation NodeClaim, should we just best-effort the NodeClaim launch here and then have it get deleted with an ICE error from Fleet if there isn't any available capacity? The next iteration of GetInstanceTypes should have the updated capacity reservation availability, so we shouldn't try to launch with the same offering again on the second attempt.
	zones := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...).Get(v1.LabelTopologyZone)
	capacityReservations := []v1beta1.CapacityReservation{}
	if capacityType == "capacity-reservation" {
		for _, capacityReservation := range nodeClass.Status.CapacityReservations {
lo.Filter?
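A minimal sketch of the `lo.Filter` idea, assuming the loop is just selecting reservations that match the claim's zone requirements and still have capacity; the predicate and the `AvailabilityZone` field are illustrative assumptions, not the PR's code.

```go
// Illustrative predicate only; the real condition should mirror whatever the
// loop in the quoted diff checks.
capacityReservations := lo.Filter(nodeClass.Status.CapacityReservations, func(cr v1beta1.CapacityReservation, _ int) bool {
	return cr.AvailableInstanceCount > 0 && zones.Has(cr.AvailabilityZone)
})
```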
			return nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("trying to resolve capacity-reservation but no available capacity reservations available"))
		}
	}
	for params, instanceTypes := range paramsToInstanceTypes {
When we group paramsToInstanceTypes above, should we also group these with the capacity reservations in mind? This would allow us to keep the same logic on L178, where we iterate through the params and instance types. I think this may work better as well, since there should only be specific instance types that are valid for a given capacity reservation.
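A rough sketch of that grouping idea under an assumed `launchParams` key type; the `ID` field on the reservation and the exact matching rule are illustrative stand-ins, not the PR's types.

```go
// Bucket instance types per capacity reservation so the later loop over
// paramsToInstanceTypes can keep its existing shape (illustrative sketch).
type launchParams struct {
	capacityType          string
	capacityReservationID string // empty for plain on-demand/spot launches
}

paramsToInstanceTypes := map[launchParams][]*cloudprovider.InstanceType{}
for _, it := range instanceTypes {
	for _, cr := range nodeClass.Status.CapacityReservations {
		if cr.InstanceType == it.Name {
			key := launchParams{capacityType: "capacity-reservation", capacityReservationID: cr.ID}
			paramsToInstanceTypes[key] = append(paramsToInstanceTypes[key], it)
		}
	}
}
```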
@@ -71,6 +71,12 @@ cat << EOF > controller-policy.json
            "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
            "Sid": "EKSClusterEndpointLookup"
        },
        {
            "Effect": "Allow",
            "Action": "eks:DescribeCapacityReservations",
"Action": "eks:DescribeCapacityReservations", | |
"Action": "ec2:DescribeCapacityReservations", |
@@ -48,6 +48,7 @@ var (
		"UnfulfillableCapacity",
		"Unsupported",
		"InsufficientFreeAddressesInSubnet",
		"ReservationCapacityExceeded",
When we return this type of ICE error, should we short-circuit and update the capacity reservation that we launched with in-place, so we don't have to wait for another iteration of the capacity reservation polling to update the instance availability? If we didn't want to directly update this to 0, we could also use this as a trigger to re-call DescribeCapacityReservations, since we know that something has changed since we made the launch decision originally.
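A hedged sketch of the short-circuit option: on a ReservationCapacityExceeded launch failure, zero out the cached availability for the reservation we targeted so the next scheduling pass skips it. The `ID` field and the helper name are assumptions for illustration only.

```go
// Illustrative only: mark the targeted reservation as exhausted in the cached
// status instead of waiting for the next DescribeCapacityReservations poll.
func markReservationExhausted(nodeClass *v1beta1.EC2NodeClass, reservationID string) {
	for i := range nodeClass.Status.CapacityReservations {
		if nodeClass.Status.CapacityReservations[i].ID == reservationID {
			nodeClass.Status.CapacityReservations[i].AvailableInstanceCount = 0
		}
	}
}
```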
	AvailableInstanceCount int `json:"availableInstanceCount"`
	// Instance Match Criteria of the Capacity Reservation
	// +required
	InstanceMatchCriteria string `json:"instanceMatchCriteria"`
What is instance match criteria?
	InstanceMatchCriteria string `json:"instanceMatchCriteria"`
	// Instance Platform of the Capacity Reservation
	// +required
	InstancePlatform string `json:"instancePlatform"`
What's the instance platform?
	InstanceType string `json:"instanceType"`
	// Owner Id of the Capacity Reservation
	// +required
	OwnerID string `json:"ownerId"`
Do we need to add OwnerID here?
@@ -175,6 +179,39 @@ type AMISelectorTerm struct {
	Owner string `json:"owner,omitempty"`
}

// CapacityReservationSelectorTerm defines selection logic for a Capacity Reservation used by Karpenter to launch nodes.
// If multiple fields are used for selection, the requirements are ANDed.
type CapacityReservationSelectorTerm struct {
How many of these selector fields do we think we truly need for the initial run? We can always add more fields later, but the second we introduce something, we're stuck with it. High-level, I can see tags, id, and instance type making sense. Do we think the other fields would be commonly used?
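A hedged sketch of what the trimmed-down selector could look like if only tags, id, and instance type were kept for the initial run; the field names mirror the other selector terms in the file but are illustrative, not the PR's final shape.

```go
// Illustrative trimmed selector, not the PR's final API.
type CapacityReservationSelectorTerm struct {
	// Tags is a map of key/value tags used to select capacity reservations.
	// Specifying '*' for a value selects all values for a given tag key.
	Tags map[string]string `json:"tags,omitempty"`
	// ID is the EC2 capacity reservation id.
	ID string `json:"id,omitempty"`
	// InstanceType is the instance type of the capacity reservation.
	InstanceType string `json:"instanceType,omitempty"`
}
```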
This is a collaboration on implementing the RFC #5716: Supporting ODCRs.
Progress
capacity-reservation
Description
Supporting associating ODCR to EC2NodeClass
Add a new field, `capacityReservationSelectorTerms`, to `EC2NodeClass`. This follows closely (though it does not implement all fields) how EC2 DescribeCapacityReservations can filter. Karpenter will perform validation against the spec to ensure there isn't any violation prior to creating the LaunchTemplates.
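For context, a minimal sketch of the underlying EC2 call these selector terms roughly map onto, using the AWS SDK for Go v1; the filter names and values shown are examples, not the exact fields this PR implements.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	// Example filters only; selector terms would be translated into filters like these.
	out, err := svc.DescribeCapacityReservations(&ec2.DescribeCapacityReservationsInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("instance-type"), Values: aws.StringSlice([]string{"m5.large"})},
			{Name: aws.String("state"), Values: aws.StringSlice([]string{"active"})},
		},
	})
	if err != nil {
		panic(err)
	}
	for _, cr := range out.CapacityReservations {
		fmt.Println(aws.StringValue(cr.CapacityReservationId), aws.Int64Value(cr.AvailableInstanceCount))
	}
}
```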
Supporting new capacity-type capacity-reservation
Adding a new `karpenter.sh/capacity-type: capacity-reservation` allows us to have an EC2NodeClass that does not automatically fall back to `on-demand` if `capacity-reservation` is not available.

How was this change tested?
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.