
PyTorch Support #493

Open
bradmiro opened this issue Dec 14, 2020 · 9 comments

Labels
question Further information is requested

@bradmiro
Hi there, I'm working on an update to the TonY installation script for GCP Dataproc. While I have been able to successfully update TensorFlow locally, I cannot get the PyTorch example working. It does not work on 0.4 (the most recent version you explicitly mention supporting) or on 1.7.1, the most recent release. I get the following error:

  File "mnist_distributed.py", line 230, in <module>
    main()
  File "mnist_distributed.py", line 225, in main
    init_process(args)
  File "mnist_distributed.py", line 185, in init_process
    distributed.init_process_group(
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 413, in init_process_group
    backend = Backend(backend)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1607810694534_0006/container_1607810694534_0006_01_000003/venv/pytorch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 76, in __new__
    raise ValueError("TCP backend has been deprecated. Please use "
ValueError: TCP backend has been deprecated. Please use Gloo or MPI backend for collective operations on CPU tensors.

Latest attempt:
PyTorch 1.7.1
torchvision 0.8.2
TonY 0.4.0
Dataproc 2.0 (Hadoop 3.2.1)

Config:

<configuration>
 <property>
  <name>tony.application.name</name>
  <value>PyTorch</value>
 </property>
 <property>
  <name>tony.application.security.enabled</name>
  <value>false</value>
 </property>
 <property>
  <name>tony.worker.instances</name>
  <value>2</value>
 </property>
 <property>
  <name>tony.worker.memory</name>
  <value>4g</value>
 </property>
 <property>
  <name>tony.ps.instances</name>
  <value>1</value>
 </property>
 <property>
  <name>tony.ps.memory</name>
  <value>2g</value>
 </property>
 <property>
  <name>tony.application.framework</name>
  <value>pytorch</value>
 </property>
 <property>
  <name>tony.worker.gpus</name>
  <value>1</value>
 </property>
</configuration>

The cluster has 1 master and 2 workers, with 2 NVIDIA Tesla T4s. However, every combination of configuration I have tried up to this point results in the same error. Any advice would be greatly appreciated!
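
For reference, this is roughly the shape of the call I would expect the example to need (a minimal sketch, assuming the old script passes backend='tcp' and that the standard env:// rendezvous variables are populated by the launcher):

import os
import torch.distributed as distributed

# The 'tcp' backend was removed in recent PyTorch releases; 'gloo' is the
# CPU replacement and 'nccl' the GPU one. Sketch only: MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE are the standard variables read by
# init_method='env://', which I am assuming the launcher sets.
distributed.init_process_group(
    backend='gloo',        # was 'tcp' in the old example
    init_method='env://',
    rank=int(os.environ['RANK']),
    world_size=int(os.environ['WORLD_SIZE']),
)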

@oliverhu
Member

@gogasca any idea? I guess we need to update the PyTorch example script. I don't see TonY or GCP being the issue here.

@bradmiro
Author

Great observation, and I believe you are correct: here it shows the tcp backend being used. Adding --backend gloo or --backend nccl (on a GPU cluster) to --task_params changed the error message, so it looks like the example just needs a refresh.
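
For anyone following along, the submit command looked roughly like this (a sketch based on the TonY examples; the jar name, paths, and venv are placeholders from my setup, and the relevant part is just passing --backend through --task_params):

java -cp "`hadoop classpath --glob`:tony-cli-0.4.0-all.jar" \
  com.linkedin.tony.cli.ClusterSubmitter \
  --python_venv=venv.zip \
  --src_dir=src \
  --executes=mnist_distributed.py \
  --task_params="--backend gloo" \
  --conf_file=tony.xml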

@oliverhu
Member

@bradmiro would you mind contributing a patch to fix that?

@bradmiro
Author

Sure, I can look into this.

@bradmiro
Author

bradmiro commented Dec 16, 2020

@oliverhu are there special considerations that need to be taken into account re: TonY for use with PyTorch? The error seems to be related to properly configuring init_process_group.

The current code is this: https://github.com/linkedin/TonY/blob/master/tony-examples/mnist-pytorch/mnist_distributed.py#L184-L189

Changing the backend to gloo throws "connection refused" errors at runtime.
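
A quick probe like the following might help narrow down where the "connection refused" comes from (a sketch; MASTER_ADDR and MASTER_PORT are the variables the env:// rendezvous reads, and I am assuming they are set in each task container):

import os
import socket

# Run inside each task container before init_process_group to check
# whether the rendezvous endpoint is reachable at all.
addr = os.environ.get('MASTER_ADDR', 'localhost')
port = int(os.environ.get('MASTER_PORT', '29500'))
try:
    with socket.create_connection((addr, port), timeout=5):
        print('reached %s:%d' % (addr, port))
except OSError as e:
    print('cannot reach %s:%d -> %s' % (addr, port, e))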

@oliverhu
Member

That should not matter; all of those backends should work 🤔 Have you tried other backends?

@bradmiro
Author

The mpi backend does not work without an MPI installation, and we don't include one by default in the Dataproc image.

nccl does not seem to work either, but I am also testing on a cluster that only has GPUs allocated to the workers, not the master. The TensorFlow job seemed to work with GPUs attached just to the workers, but I am creating a fresh cluster with a GPU attached to the master node as well.

@bradmiro
Author

nccl error with GPUs attached to all machines:

RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

This might be a PyTorch thing; I can probably look into it more early next week. Unsure about gloo as well.
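
In case it helps, NCCL's own debug logging usually says more than "invalid usage". These are standard NCCL environment variables; a sketch that sets them in the script before init_process_group:

import os

# Standard NCCL debug knobs; NCCL reads them at initialization time.
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'ALL'   # optional, very verbose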

@oliverhu
Member

mpi won't work because it requires SSH across workers, which is not supported by default in Hadoop distributions.

nccl and gloo should both work, though, at a glance. We use TensorFlow internally, so I don't have much insight here, but anything not using MPI should work.
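
If it helps separate TonY from PyTorch, a single-machine gloo smoke test like this (a sketch, independent of TonY) should succeed; if it does, the "connection refused" points at the cluster environment rather than at PyTorch itself:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Local rendezvous: both processes live on the same host.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    t = torch.ones(1) * rank
    dist.all_reduce(t)   # default op is SUM, so both ranks end up with 1.0
    print('rank %d: all_reduce -> %.1f' % (rank, t.item()))
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(run, args=(2,), nprocs=2)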

@zuston added the question label on Mar 12, 2022