
Running multi-gpu training #35

Open
joe-sht opened this issue Feb 27, 2024 · 3 comments

joe-sht commented Feb 27, 2024

How do I run training on multiple GPUs? As far as I can see, training runs on a single GPU.


Suhail commented Mar 3, 2024

I am also curious. The error I get is this:

```
Traceback (most recent call last):
  File "/root/research/suhail/magvit2/train.py", line 27, in <module>
    trainer = VideoTokenizerTrainer(
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/pytorch_custom_utils/accelerate_utils.py", line 95, in __init__
    _orig_init(self, *args, **kwargs)
  File "<@beartype(magvit2_pytorch.trainer.VideoTokenizerTrainer.__init__) at 0x7f20aa90b910>", line 314, in __init__
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/magvit2_pytorch/trainer.py", line 203, in __init__
    self.has_multiscale_discrs = self.model.has_multiscale_discrs
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'has_multiscale_discrs'
```

ChatGPT:
The error indicates that the attribute `has_multiscale_discrs` is being accessed on a `DistributedDataParallel` object, which does not have that attribute. This is a common issue when using PyTorch's DistributedDataParallel (DDP) wrapper for distributed training. DDP wraps your model, replicates it across multiple GPUs, and manages the distribution of data and the gathering of results; however, attribute lookups on the wrapper only resolve through `nn.Module` (parameters, buffers, and submodules), so plain Python attributes and custom methods of the underlying model are not forwarded unless exposed explicitly.
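
For illustration, here is a minimal reproduction of that forwarding behavior (a sketch using `nn.DataParallel` so it runs in a single process without a distributed setup; DDP resolves attributes the same way):

```python
import torch.nn as nn

class Tokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)         # registered submodule: forwarded
        self.has_multiscale_discrs = False  # plain Python attribute: not forwarded

model = Tokenizer()
wrapped = nn.DataParallel(model)

print(model.has_multiscale_discrs)           # fine on the bare model
print(wrapped.module.has_multiscale_discrs)  # fine: .module is the original model
print(wrapped.has_multiscale_discrs)         # AttributeError, as in the traceback above
```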

@madebyollin

IIRC one needs to replace direct `self.model.whatever` accesses with something like `(self.model.module if isinstance(self.model, (nn.DataParallel, nn.parallel.DistributedDataParallel)) else self.model).whatever` when using PyTorch DDP, potentially via a helper function (sketched below).
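
A minimal sketch of such a helper (the name `unwrap` is hypothetical, not from this repo):

```python
import torch.nn as nn

def unwrap(model: nn.Module) -> nn.Module:
    # DDP and DataParallel both keep the original model at .module
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# then, inside the trainer, accesses like
#   self.has_multiscale_discrs = self.model.has_multiscale_discrs
# would become
#   self.has_multiscale_discrs = unwrap(self.model).has_multiscale_discrs
```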


ziyannchen commented Apr 10, 2024

The code uses accelerate to handle DDP automatically.
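
For reference, roughly how this looks with accelerate (a sketch; the `nn.Linear` here is a stand-in for the actual tokenizer model): multi-GPU runs are launched with `accelerate launch train.py` after selecting GPUs via `accelerate config`, and `Accelerator.prepare` applies the DDP wrapping in each process:

```python
from accelerate import Accelerator
import torch.nn as nn

accelerator = Accelerator()

model = nn.Linear(8, 8)             # stand-in for the actual tokenizer model
model = accelerator.prepare(model)  # wrapped in DDP when launched with multiple processes

# custom attributes still have to be read from the unwrapped model:
plain_model = accelerator.unwrap_model(model)
```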
