
Running multi-gpu training #35

Open
joe-sht opened this issue Feb 27, 2024 · 3 comments

joe-sht commented Feb 27, 2024

How do I run training on multiple GPUs? As far as I can see, training runs on a single GPU.


Suhail commented Mar 3, 2024

I am also curious. The error I get is this:

```
Traceback (most recent call last):
  File "/root/research/suhail/magvit2/train.py", line 27, in <module>
    trainer = VideoTokenizerTrainer(
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/pytorch_custom_utils/accelerate_utils.py", line 95, in __init__
    _orig_init(self, *args, **kwargs)
  File "<@beartype(magvit2_pytorch.trainer.VideoTokenizerTrainer.__init__) at 0x7f20aa90b910>", line 314, in __init__
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/magvit2_pytorch/trainer.py", line 203, in __init__
    self.has_multiscale_discrs = self.model.has_multiscale_discrs
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'has_multiscale_discrs'
```

ChatGPT:
The error indicates that the attribute `has_multiscale_discrs` is being accessed on a `DistributedDataParallel` object, which does not have that attribute. This is a common issue when using PyTorch's DistributedDataParallel (DDP) wrapper for distributed training. DDP wraps your model, replicates it across multiple GPUs, and manages the distribution of data and the gathering of results; however, attribute lookups on the wrapper only resolve through `nn.Module` (parameters, buffers, and submodules), so plain Python attributes and custom methods of the underlying model are not forwarded unless exposed explicitly.
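
For illustration, here is a minimal reproduction of that forwarding behavior (a sketch using `nn.DataParallel` so it runs in a single process without a distributed setup; DDP resolves attributes the same way):

```python
import torch.nn as nn

class Tokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)         # registered submodule: forwarded
        self.has_multiscale_discrs = False  # plain Python attribute: not forwarded

model = Tokenizer()
wrapped = nn.DataParallel(model)

print(model.has_multiscale_discrs)           # fine on the bare model
print(wrapped.module.has_multiscale_discrs)  # fine: .module is the original model
print(wrapped.has_multiscale_discrs)         # AttributeError, as in the traceback above
```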

@madebyollin

IIRC one needs to replace direct `self.model.whatever` accesses with something like `(self.model.module if isinstance(self.model, (nn.DataParallel, nn.parallel.DistributedDataParallel)) else self.model).whatever` when using PyTorch DDP, potentially via a helper function (sketched below).
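
A minimal sketch of such a helper (the name `unwrap` is hypothetical, not from this repo):

```python
import torch.nn as nn

def unwrap(model: nn.Module) -> nn.Module:
    # DDP and DataParallel both keep the original model at .module
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# then, inside the trainer, accesses like
#   self.has_multiscale_discrs = self.model.has_multiscale_discrs
# would become
#   self.has_multiscale_discrs = unwrap(self.model).has_multiscale_discrs
```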


ziyannchen commented Apr 10, 2024

The code uses accelerate to handle DDP automatically.
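
For reference, roughly how this looks with accelerate (a sketch; the `nn.Linear` here is a stand-in for the actual tokenizer model): multi-GPU runs are launched with `accelerate launch train.py` after selecting GPUs via `accelerate config`, and `Accelerator.prepare` applies the DDP wrapping in each process:

```python
from accelerate import Accelerator
import torch.nn as nn

accelerator = Accelerator()

model = nn.Linear(8, 8)             # stand-in for the actual tokenizer model
model = accelerator.prepare(model)  # wrapped in DDP when launched with multiple processes

# custom attributes still have to be read from the unwrapped model:
plain_model = accelerator.unwrap_model(model)
```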
