
Workbench: Tolerations for specific Pods (GPU) #447

Open
kenchrcum opened this issue Dec 8, 2023 · 4 comments
@kenchrcum

Hi everybody, we are currently trying to deploy Workbench on our Kubernetes cluster via Helm. Everything works fine, but we have some GPU nodes that should be reserved for Workbench GPU Sessions. We have no problems starting the GPU Sessions, but we can't get the nodes "reserved" for these sessions.
We are trying to do this by tainting the nodes, but we can't get the toleration applied exclusively to the GPU sessions. After reading through the chart and other repo issues, it seems that it is only possible to set tolerations for all sessions of a Workbench server. We hoped placement constraints would help us solve this, but they don't work as expected, since they match node labels rather than taints.
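For reference, this is roughly the node-side setup we have in mind (a generic Kubernetes sketch; the node name and label are just examples):

```yaml
# Sketch of a GPU node that should be reserved for GPU sessions.
# A label (as used by placement constraints / nodeSelector) only *attracts*
# matching pods; the taint is what keeps all other pods off the node
# unless they carry a matching toleration.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                  # example node name
  labels:
    workbench/gpu: "true"           # example label for placement constraints
spec:
  taints:
    - key: nvidia-gpu
      value: server
      effect: NoSchedule
```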
Is there any chance to make this work? Are we just missing some documentation or is this totally out of scope?

Thanks in advance for any help or suggestion :)

@iamsarat

+1

@iamsarat

We need a way to reserve GPU nodes exclusively for GPU resource requests, and the current configuration doesn't support this.

@colearendt
Member

Thanks for reporting this! I think you are right that this is less than ideal. If you are trying to set a toleration exclusively on a GPU session, that is something that may be possible by customizing templates. Customizing templates is generally a pretty advanced feature (and can definitely be tedious / annoying across chart versions), but it should be able to get you going here!

Can you share an example of a toleration as you would expect it to be defined on the pod that is launched? I should be able to mock up some helm values that can work with that input!

@kenchrcum
Author

Sorry for the long delay and thank you for your reply.

One taint we would set on the GPU nodes is, for example, nvidia-gpu=server:NoSchedule, and we would need to set the corresponding toleration on GPU Workbench sessions only.
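For completeness, this is the toleration as we'd expect it on the GPU session pod spec (plain Kubernetes YAML; how to attach it only to GPU sessions via the chart is the part we are missing):

```yaml
# Toleration matching the taint nvidia-gpu=server:NoSchedule,
# as it would need to appear in the GPU session pod spec
tolerations:
  - key: "nvidia-gpu"
    operator: "Equal"
    value: "server"
    effect: "NoSchedule"
```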
