
PyTorch Support #493

Open
bradmiro opened this issue Dec 14, 2020 · 9 comments

Labels
question Further information is requested

@bradmiro
Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to successfully update TensorFlow locally, I cannot get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:

  File "mnist_distributed.py", line 230, in <module>
    main()
  File "mnist_distributed.py", line 225, in main
    init_process(args)
  File "mnist_distributed.py", line 185, in init_process
    distributed.init_process_group(
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
    backend = Backend(backend)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.

Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
Dataproc 2.0 (Hadoop 3.2.1)

Config:

<configuration>
 <property>
  <name>tony.application.name</name>
  <value>PyTorch</value>
 </property>
 <property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
 </property>
 <property>
  <name>tony.worker.instances</name>
  <value>2</value>
 </property>
 <property>
  <name>tony.worker.memory</name>
  <value>4g</value>
 </property>
 <property>
  <name>tony.ps.instances</name>
  <value>1</value>
 </property>
 <property>
  <name>tony.ps.memory</name>
  <value>2g</value>
 </property>
 <property>
  <name>tony.application.framework</name>
  <value>pytorch</value>
 </property>
 <property>
  <name>tony.worker.gpus</name>
  <value>1</value>
 </property>
</configuration>

The cluster has 1 master and 2 workers, with 2 NVIDIA Tesla T4s. However, every combination of configuration I have tried up to this point results in the same error. Any advice would be greatly appreciated!
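
For reference, this is roughly the shape of the call I would expect the example to need (a minimal sketch, assuming the old script passes backend='tcp' and that the standard env:// rendezvous variables are populated by the launcher):

import os
import torch.distributed as distributed

# The 'tcp' backend was removed in recent PyTorch releases; 'gloo' is the
# CPU replacement and 'nccl' the GPU one. Sketch only: MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE are the standard variables read by
# init_method='env://', which I am assuming the launcher sets.
distributed.init_process_group(
    backend='gloo',        # was 'tcp' in the old example
    init_method='env://',
    rank=int(os.environ['RANK']),
    world_size=int(os.environ['WORLD_SIZE']),
)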

@oliverhu
Member

@gogasca any idea? I guess we need to update the PyTorch example script. I don't see TonY or GCP being the issue here.

@bradmiro
Author

Great observation, and I believe you are correct: here it shows the tcp backend being used. Adding --backend gloo or --backend nccl (on a GPU cluster) to --task_params changed the error message, so it looks like the example just needs a refresh.
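
For anyone following along, the submit command looked roughly like this (a sketch based on the TonY examples; the jar name, paths, and venv are placeholders from my setup, and the relevant part is just passing --backend through --task_params):

java -cp "`hadoop classpath --glob`:tony-cli-0.4.0-all.jar" \
  com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=venv.zip \
  --src_dir=src \
  --executes=mnist_distributed.py \
  --task_params="--backend gloo" \
  --conf_file=tony.xml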

@oliverhu
Member

@bradmiro would you mind contributing a patch to fix that?

@bradmiro
Author

Sure, I can look into this.

@bradmiro
Author

bradmiro commented Dec 16, 2020

@oliverhu are there special considerations that need to be taken into account re: TonY for use with PyTorch? The error seems to be related to properly configuring init_process_group.

The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189

Changing the backend to gloo throws "connection refused" errors at runtime.
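
A quick probe like the following might help narrow down where the "connection refused" comes from (a sketch; MASTER_ADDR and MASTER_PORT are the variables the env:// rendezvous reads, and I am assuming they are set in each task container):

import os
import socket

# Run inside each task container before init_process_group to check
# whether the rendezvous endpoint is reachable at all.
addr = os.environ.get('MASTER_ADDR', 'localhost')
port = int(os.environ.get('MASTER_PORT', '29500'))
try:
    with socket.create_connection((addr, port), timeout=5):
        print('reached %s:%d' % (addr, port))
except OSError as e:
    print('cannot reach %s:%d -> %s' % (addr, port, e))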

@oliverhu
Member

That should not matter; all of those backends should work 🤔 Have you tried other backends?

@bradmiro
Author

The mpi backend does not work without an MPI installation, and we don't include one by default in the Dataproc image.

nccl does not seem to work either, but I am also testing on a cluster that only has GPUs allocated to the workers, not the master. The TensorFlow job seemed to work with GPUs attached just to the workers, but I am creating a fresh cluster with a GPU attached to the master node as well.

@bradmiro
Author

nccl error with GPUs attached to all machines:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

This might be a PyTorch thing; I can probably look into it more early next week. Unsure about gloo as well.
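
In case it helps, NCCL's own debug logging usually says more than "invalid usage". These are standard NCCL environment variables; a sketch that sets them in the script before init_process_group:

import os

# Standard NCCL debug knobs; NCCL reads them at initialization time.
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'ALL'   # optional, very verbose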

@oliverhu
Member

mpi won't work because it requires SSH across workers, which is not supported by default in Hadoop distributions.

nccl and gloo should both work, though, at a glance. We use TensorFlow internally, so I don't have much insight here, but anything not using MPI should work.
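
If it helps separate TonY from PyTorch, a single-machine gloo smoke test like this (a sketch, independent of TonY) should succeed; if it does, the "connection refused" points at the cluster environment rather than at PyTorch itself:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Local rendezvous: both processes live on the same host.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)   # default op is SUM, so both ranks end up with 1.0
    print('rank %d: all_reduce -> %.1f' % (rank, t.item()))
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(run, args=(2,), nprocs=2)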

@zuston added the question label on Mar 12, 2022