Resource limitation for the sidecar container on Autopilot #35

Open
bhack opened this issue Jun 4, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

bhack commented Jun 4, 2023

Looking at the default PyTorch example in this repository, I see some incompatibilities with the Autopilot minimum resource requests [1].
I think we will have many problems allocating sidecar resources if Autopilot enforces such high minimum limits.

annotations:
  gke-gcsfuse/volumes: "true"
  gke-gcsfuse/cpu-limit: "10"
  gke-gcsfuse/memory-limit: 40Gi
  gke-gcsfuse/ephemeral-storage-limit: 20Gi

[1] https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests

bhack commented Jun 4, 2023

Please also consider that Autopilot has officially been the default and recommended GKE mode since April.

bhack commented Jun 8, 2023

@songjiaxun Would you prefer to have this on https://issuetracker.google.com?

@songjiaxun
Collaborator

Thanks for the question. I admit that the PyTorch example may not work in Autopilot clusters. I am actively working on the AI/ML application tests and will update the example YAML soon.

@bhack are you a Googler by any chance? Could you DM me with more context?

bhack commented Jun 9, 2023

I've sent you a DM.

It is not only PyTorch. It will not work in any real DL scenario, because on large nodes the sidecar's resource limit will be at most 2 CPU and 14GB of memory.
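
For illustration, here is a rough sketch of the example's annotations lowered to that cap. It assumes the same annotation keys as the example above and treats 14GB as `14Gi`. Whether Autopilot accepts these values as written or overrides them further is not confirmed here, and the ephemeral storage annotation is left out because no cap for it is mentioned in this thread.

```yaml
# Sketch only: values taken from the cap quoted above, not from tested configuration.
annotations:
  gke-gcsfuse/volumes: "true"
  gke-gcsfuse/cpu-limit: "2"        # example requested "10"
  gke-gcsfuse/memory-limit: 14Gi    # example requested 40Gi; 14GB written as 14Gi (assumption)
```

With limits this low, the sidecar gets a fifth of the CPU and roughly a third of the memory that the example was written for.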

songjiaxun added the "enhancement" (New feature or request) label Jul 14, 2023
songjiaxun changed the title from "Autopilot resources/perf" to "Resource limitation for the sidecar container on Autopilot using GPU: 2 CPU and 14GB Memory" Jul 14, 2023
bhack commented Apr 11, 2024

I think we have regressed a bit here.

Autopilot now accepts unlimited/burstable resources on the sidecar: #61

But it is "secretly" overriding them with minimal resources.
From a usability point of view this is very confusing: users get no direct notification about the override, so they may expect to be running in a burstable context.

Manually scaling up the sidecar CPU resources prevents the pod from being scheduled on Autopilot (e.g. >6000m on H100):

Violations details: {"[denied by autogke-no-node-updates]":["Operation on nodes with changes in addition to cordon is not allowed in Autopilot."]}
Requested by user: 'system:serviceaccount:gpu-operator:node-feature-discovery', groups: 'system:serviceaccounts,system:serviceaccounts:gpu-operator,system:authenticated'.

bhack changed the title from "Resource limitation for the sidecar container on Autopilot using GPU: 2 CPU and 14GB Memory" to "Resource limitation for the sidecar container on Autopilot" Apr 11, 2024