
How can I deploy distributed training on Kubernetes clusters with torch.distributed.launch? #560

Open
ThomaswellY opened this issue Jun 5, 2023 · 3 comments

ThomaswellY commented Jun 5, 2023

I have been using the mmpretrain project (https://github.com/open-mmlab/mmpretrain), which provides a large set of classification training scripts. However, those scripts use torch.distributed.launch to start distributed training. Is there any way, with the Kubeflow operators, to start this kind of distributed training on a k8s cluster?
PS: I have looked for help in training-operator and pytorch-operator, but I can't see an obvious solution.
Thanks in advance ~ any hints would be helpful.

alculquicondor (Collaborator)

Are you in the wrong repo?
This repo is about MPI. PyTorch is supported in https://github.com/kubeflow/training-operator
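
For what it's worth, a torch.distributed.launch script can usually run there as a PyTorchJob, since the training-operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into every pod. Below is a rough, untested sketch; the image, config path, and replica counts are placeholders, and it assumes one process per pod so the pod-level WORLD_SIZE/RANK can stand in for the launcher's node count and node rank:

```yaml
# Sketch of a PyTorchJob for the training-operator (kubeflow.org/v1).
# Image, config path, and replica counts below are placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mmpretrain-dist
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the operator expects this container name
              image: my-registry/mmpretrain:latest   # placeholder image
              command: ["sh", "-c"]
              args:
                # Forward the operator-injected env vars as launcher flags;
                # with one process per pod, WORLD_SIZE/RANK double as the
                # node count and node rank.
                - >
                  python -m torch.distributed.launch
                  --nnodes=$WORLD_SIZE --node_rank=$RANK
                  --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT
                  --nproc_per_node=1
                  tools/train.py configs/resnet/resnet50_8xb32_in1k.py
                  --launcher pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mmpretrain:latest   # placeholder image
              command: ["sh", "-c"]
              args:
                - >
                  python -m torch.distributed.launch
                  --nnodes=$WORLD_SIZE --node_rank=$RANK
                  --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT
                  --nproc_per_node=1
                  tools/train.py configs/resnet/resnet50_8xb32_in1k.py
                  --launcher pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1
```

The --launcher pytorch flag tells mmpretrain's tools/train.py to initialize torch.distributed from the environment that the launcher sets up.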

tenzen-y (Member) commented Jun 5, 2023

@ThomaswellY If you want to run torchrun, you should open an issue in the training-operator repo. If you want to run distributed PyTorch training with mpirun, we can answer your questions in this repo.

Which command do you mean?

ThomaswellY (Author) commented Jun 6, 2023

@alculquicondor @tenzen-y
I was looking for how to modify the original script, which uses torch.distributed.launch to start training, so that training is started with mpirun under the mpi-operator.
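
Concretely, I imagine something like the MPIJob below (a rough, untested sketch; the image, config path, hostname, and process count are placeholders). If I read mmengine correctly, mmpretrain's tools/train.py already accepts --launcher mpi and derives RANK/WORLD_SIZE from the OMPI_COMM_WORLD_* variables that mpirun sets, but it still expects MASTER_ADDR and MASTER_PORT for the rendezvous, so those are exported via mpirun -x:

```yaml
# Sketch of an MPIJob for the mpi-operator (kubeflow.org/v2beta1).
# Image, config path, hostname, and counts below are placeholders.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mmpretrain-mpi
spec:
  slotsPerWorker: 1          # one training process per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: my-registry/mmpretrain:latest   # placeholder image
              command:
                - mpirun
                - -np
                - "2"              # total processes = workers x slots
                - --bind-to
                - none
                # mmengine's MPI path still needs a rendezvous address;
                # point it at a resolvable worker hostname (placeholder):
                - -x
                - MASTER_ADDR=mmpretrain-mpi-worker-0
                - -x
                - MASTER_PORT=29500
                - python
                - tools/train.py
                - configs/resnet/resnet50_8xb32_in1k.py   # placeholder
                - --launcher
                - mpi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: my-registry/mmpretrain:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With this shape, torch.distributed.launch is not involved at all: mpirun spawns one python process per slot on the workers, and the script picks up its rank from the MPI environment.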
