-
Notifications
You must be signed in to change notification settings - Fork 833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add vt1 support and codify trn1 instances #5991
Conversation
✅ Deploy Preview for karpenter-docs-prod ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
@@ -89,6 +89,7 @@ var ( | |||
ResourceNVIDIAGPU v1.ResourceName = "nvidia.com/gpu" | |||
ResourceAMDGPU v1.ResourceName = "amd.com/gpu" | |||
ResourceAWSNeuron v1.ResourceName = "aws.amazon.com/neuron" | |||
ResourceXilinxAccelerator v1.ResourceName = "xilinx.com/fpga-xilinx_u30_gen3x4_base_1-0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how rare it is to get this instance type, but to validate the resource name in general we should try to get an E2E test setup for this like we have with the habana instance
@@ -450,7 +450,8 @@ func filterExoticInstanceTypes(instanceTypes []*cloudprovider.InstanceType) []*c | |||
if !resources.IsZero(it.Capacity[v1beta1.ResourceAWSNeuron]) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how long this list is, I wonder if the v1beta1
labels package should just have this list in there and then we don't continually have to remember to add to this list
if strings.HasPrefix(*info.InstanceType, "trn1") { | ||
requirements.Get(v1beta1.LabelInstanceAcceleratorName).Insert(lowerKabobCase("Inferentia")) | ||
// AWS Neuron Accelerators | ||
if info.NeuronInfo != nil && len(info.NeuronInfo.NeuronDevices) == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick sanity check: We always expect this slice of info to have length 1? Is it possible that it could have some length other than 1 ever?
| karpenter | \>= 0.21 | \>= 0.21 | \>= 0.25 | \>= 0.28 | \>= 0.28 | \>= 0.31 | \>= 0.34.0 | | ||
| KUBERNETES | 1.23 | 1.24 | 1.25 | 1.26 | 1.27 | 1.28 | 1.29 | | ||
|------------|---------|---------|---------|---------|---------|---------|---------| | ||
| karpenter | 0.21.x+ | 0.21.x+ | 0.25.x+ | 0.28.x+ | 0.28.x+ | 0.31.x+ | 0.34.x+ | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this probably got updated by the call to docgen
, but if something is changing with this, we should check what is happening with our compatibility script. @engedaam This looks like someone merged in a change expecting the script to match what was merged, but the script hasn't been updated to reflect the new format. Can you confirm and take a look?
@@ -18191,7 +18509,7 @@ below are the resources available with some assumptions and after the instance o | |||
|--|--| | |||
|karpenter.k8s.aws/instance-accelerator-count|1| | |||
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws| | |||
|karpenter.k8s.aws/instance-accelerator-name|inferentia| | |||
|karpenter.k8s.aws/instance-accelerator-name|trainium| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So these were just flat-out wrong?
@@ -7263,7 +7583,6 @@ below are the resources available with some assumptions and after the instance o | |||
|--|--| | |||
|karpenter.k8s.aws/instance-accelerator-count|6| | |||
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws| | |||
|karpenter.k8s.aws/instance-accelerator-name|inferentia| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got dropped. Do we know why?
@@ -7205,7 +7527,6 @@ below are the resources available with some assumptions and after the instance o | |||
|--|--| | |||
|karpenter.k8s.aws/instance-accelerator-count|1| | |||
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws| | |||
|karpenter.k8s.aws/instance-accelerator-name|inferentia| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got dropped. Do we know why?
@@ -7234,7 +7555,6 @@ below are the resources available with some assumptions and after the instance o | |||
|--|--| | |||
|karpenter.k8s.aws/instance-accelerator-count|1| | |||
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws| | |||
|karpenter.k8s.aws/instance-accelerator-name|inferentia| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This got dropped. Do we know why?
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
Fixes #N/A
Description
This takes away the hard-coded trainium logic, and ingests it from the AWS SDK. Does the same for Xilinx resources
How was this change tested?
still needs test fixes
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.