DistributedDataParallel #382

Open
dgm2 opened this issue Jun 6, 2022 · 2 comments

dgm2 commented Jun 6, 2022

It seems that a DistributedDataParallel (DDP) PyTorch setup is not supported in OT, specifically for the emd2 computation.
Are there any workaround ideas for making this work, or any example of multi-GPU setups with OT?

Ideally, I would like to make OT work with this torch setup:
https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py

Many thanks

Example of a failed DDP run:

  ot.emd2(a, b, dist)
  File "/python3.8/site-packages/ot/lp/__init__.py", line 468, in emd2
    nx = get_backend(M0, a0, b0)
  File "/python3.8/site-packages/ot/backend.py", line 168, in get_backend
    return TorchBackend()
  File "/python3.8/site-packages/ot/backend.py", line 1517, in __init__
    self.__type_list__.append(torch.tensor(1, dtype=torch.float32, device='cuda'))
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

My current workaround is to change

    self.__type_list__.append(torch.tensor(1, dtype=torch.float32, device='cuda'))

to

    self.__type_list__.append(torch.tensor(1, dtype=torch.float32, device=device_id))

passing the device id into the backend and reinstalling OT from source.
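
For context, here is a minimal sketch of the kind of per-rank setup I have in mind (not my exact code; the sizes, data and process-group backend are placeholders):

    import os
    import torch
    import torch.distributed as dist
    import ot


    def run(rank, world_size):
        # One process per GPU, as in the linked DDP example.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        device = torch.device(f"cuda:{rank}")

        n = 128
        a = torch.full((n,), 1.0 / n, device=device)  # uniform source weights
        b = torch.full((n,), 1.0 / n, device=device)  # uniform target weights
        x = torch.randn(n, 2, device=device)
        y = torch.randn(n, 2, device=device)
        M = ot.dist(x, y)  # cost matrix, stays on this rank's GPU

        # The first call instantiates POT's TorchBackend, whose __init__
        # creates a probe tensor with device='cuda'. Unless
        # torch.cuda.set_device(rank) was called, that resolves to the
        # process's default device (cuda:0), which is where the
        # "all CUDA-capable devices are busy or unavailable" error can
        # surface, e.g. on GPUs running in exclusive-process mode.
        loss = ot.emd2(a, b, M)
        print(f"rank {rank}: emd2 = {float(loss):.6f}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)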

rflamary (Collaborator) commented Jun 7, 2022

Hello @dgm2,

Does this workaround work? Note that the list is there mainly for debugging and tests (so that we can run them on all available devices), so I'm a bit surprised if this is the only bottleneck for running POT with DDP.

We are obviously interested in your contribution if you manage to make it work properly (we do not have multiple GPUs, so it is a bit hard to implement and debug on our side). The device_id should probably be detected automatically when using get_backend and at backend creation; the backends should not need parameters to remain practical to use.
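
For illustration, the automatic detection could look something like the sketch below, which derives the CUDA devices from the tensors passed to get_backend instead of the bare 'cuda' string (a rough sketch only; the helper name is hypothetical and this is not the actual backend code):

    import torch


    def _torch_type_list(*tensors):
        # Float32/float64 probes on CPU, as in the current backend.
        type_list = [
            torch.tensor(1, dtype=torch.float32),
            torch.tensor(1, dtype=torch.float64),
        ]
        # Add one probe per distinct CUDA device actually used by the inputs,
        # so a DDP rank that only touches cuda:<rank> never allocates on cuda:0.
        cuda_devices = {t.device for t in tensors if t.is_cuda}
        for dev in cuda_devices:
            type_list.append(torch.tensor(1, dtype=torch.float32, device=dev))
            type_list.append(torch.tensor(1, dtype=torch.float64, device=dev))
        return type_list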

ncassereau-idris (Contributor) commented

Hello @dgm2,
Could you provide us with the exact code you used to get this error?
I ran https://github.com/pytorch/examples/blob/main/distributed/ddp/main.py with 4 GPUs and ot.emd2 as the loss function, yet did not get any error; everything seems to have run smoothly whether the distribution was launched with torch or with Slurm.
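
For reference, an ot.emd2-based loss for that example can look like the sketch below (the emd2_loss helper name and the uniform weights are illustrative choices, not the exact code I ran):

    import torch
    import ot


    def emd2_loss(outputs, targets):
        # Treat each row of the output/target batch as a point with uniform
        # weight and use the squared Euclidean cost between the two clouds.
        n = outputs.shape[0]
        device = outputs.device
        a = torch.full((n,), 1.0 / n, device=device)
        b = torch.full((n,), 1.0 / n, device=device)
        M = ot.dist(outputs, targets)  # pairwise squared Euclidean costs
        # With the torch backend, emd2 returns a scalar that supports
        # backward(), so it can be used directly as a training loss.
        return ot.emd2(a, b, M)


    # In the training step, this replaces the example's original loss:
    # loss = emd2_loss(ddp_model(inputs), labels); loss.backward()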
