Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add vt1 support and codify trn1 instances #5991

Closed
wants to merge 1 commit into from

Conversation

njtran
Copy link
Contributor

@njtran njtran commented Apr 6, 2024

Fixes #N/A

Description
This takes away the hard-coded trainium logic, and ingests it from the AWS SDK. Does the same for Xilinx resources

How was this change tested?
still needs test fixes

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@njtran njtran requested a review from a team as a code owner April 6, 2024 00:06
@njtran njtran requested a review from engedaam April 6, 2024 00:06
Copy link

netlify bot commented Apr 6, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit edc492a
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/661091ef9ecb0d0008fdf021
😎 Deploy Preview https://deploy-preview-5991--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@njtran njtran marked this pull request as draft April 6, 2024 00:06
@@ -89,6 +89,7 @@ var (
ResourceNVIDIAGPU v1.ResourceName = "nvidia.com/gpu"
ResourceAMDGPU v1.ResourceName = "amd.com/gpu"
ResourceAWSNeuron v1.ResourceName = "aws.amazon.com/neuron"
ResourceXilinxAccelerator v1.ResourceName = "xilinx.com/fpga-xilinx_u30_gen3x4_base_1-0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how rare it is to get this instance type, but to validate the resource name in general we should try to get an E2E test setup for this like we have with the habana instance

@@ -450,7 +450,8 @@ func filterExoticInstanceTypes(instanceTypes []*cloudprovider.InstanceType) []*c
if !resources.IsZero(it.Capacity[v1beta1.ResourceAWSNeuron]) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering how long this list is, I wonder if the v1beta1 labels package should just have this list in there and then we don't continually have to remember to add to this list

if strings.HasPrefix(*info.InstanceType, "trn1") {
requirements.Get(v1beta1.LabelInstanceAcceleratorName).Insert(lowerKabobCase("Inferentia"))
// AWS Neuron Accelerators
if info.NeuronInfo != nil && len(info.NeuronInfo.NeuronDevices) == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick sanity check: We always expect this slice of info to have length 1? Is it possible that it could have some length other than 1 ever?

| karpenter | \>= 0.21 | \>= 0.21 | \>= 0.25 | \>= 0.28 | \>= 0.28 | \>= 0.31 | \>= 0.34.0 |
| KUBERNETES | 1.23 | 1.24 | 1.25 | 1.26 | 1.27 | 1.28 | 1.29 |
|------------|---------|---------|---------|---------|---------|---------|---------|
| karpenter | 0.21.x+ | 0.21.x+ | 0.25.x+ | 0.28.x+ | 0.28.x+ | 0.31.x+ | 0.34.x+ |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this probably got updated by the call to docgen, but if something is changing with this, we should check what is happening with our compatibility script. @engedaam This looks like someone merged in a change expecting the script to match what was merged, but the script hasn't been updated to reflect the new format. Can you confirm and take a look?

@@ -18191,7 +18509,7 @@ below are the resources available with some assumptions and after the instance o
|--|--|
|karpenter.k8s.aws/instance-accelerator-count|1|
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws|
|karpenter.k8s.aws/instance-accelerator-name|inferentia|
|karpenter.k8s.aws/instance-accelerator-name|trainium|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So these were just flat-out wrong?

@@ -7263,7 +7583,6 @@ below are the resources available with some assumptions and after the instance o
|--|--|
|karpenter.k8s.aws/instance-accelerator-count|6|
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws|
|karpenter.k8s.aws/instance-accelerator-name|inferentia|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got dropped. Do we know why?

@@ -7205,7 +7527,6 @@ below are the resources available with some assumptions and after the instance o
|--|--|
|karpenter.k8s.aws/instance-accelerator-count|1|
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws|
|karpenter.k8s.aws/instance-accelerator-name|inferentia|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got dropped. Do we know why?

@@ -7234,7 +7555,6 @@ below are the resources available with some assumptions and after the instance o
|--|--|
|karpenter.k8s.aws/instance-accelerator-count|1|
|karpenter.k8s.aws/instance-accelerator-manufacturer|aws|
|karpenter.k8s.aws/instance-accelerator-name|inferentia|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got dropped. Do we know why?

Copy link
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

Copy link
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants