Running out of memory #74

Open
andrewmoise opened this issue Aug 11, 2022 · 1 comment · May be fixed by #75

Comments

@andrewmoise

Any advice on how to deal with running out of GPU memory? I'm just getting started with pytorch / this package, and this is what happens when I try an initial test run using 7000 steps (57000 training images, size 128x128, on a GPU with 15GB memory):

>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>> 
>>> model = Unet(
...     dim = 64,
...     dim_mults = (1, 2, 4, 8)
... ).cuda()
>>> 
>>> diffusion = GaussianDiffusion(
...     model,
...     image_size = 128,
...     timesteps = 1000,   # number of steps                                           
...     loss_type = 'l1'    # L1 or L2                                                  
... ).cuda()
>>> trainer = Trainer(
...     diffusion,
...     'training-set-2',
...     train_batch_size = 32,
...     train_lr = 2e-5,
...     train_num_steps = 7000,         # total training steps                          
...     gradient_accumulate_every = 2,    # gradient accumulation steps                 
...     ema_decay = 0.995,                # exponential moving average decay            
...     amp = True                        # turn on mixed precision                     
... )
>>> 
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00,  1.90it/s]
loss: 0.2902:  14%|███▊                       | 1001/7000 [55:22<5:31:53,  3.32s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
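
For anyone hitting the same wall, here is a minimal sketch of the usual first workaround, reusing the same Unet / GaussianDiffusion / Trainer arguments as above (the values are illustrative, not tested on this exact setup): lower train_batch_size and raise gradient_accumulate_every so the effective batch size (batch size × accumulation steps = 64) stays the same while per-step activation memory drops roughly in proportion. The PYTORCH_CUDA_ALLOC_CONF hint from the error message is also included, with an example value.

import os

# Optional: follow the allocator hint from the error message. This has to be set
# before CUDA is first used in the process; the value here is only an example.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
).cuda()

trainer = Trainer(
    diffusion,
    'training-set-2',
    train_batch_size = 8,             # was 32: roughly 4x less activation memory per step
    train_lr = 2e-5,
    train_num_steps = 7000,
    gradient_accumulate_every = 8,    # was 2: keeps the effective batch size at 8 * 8 = 64
    ema_decay = 0.995,
    amp = True                        # mixed precision stays on
)

trainer.train()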
@andrewmoise (Author)

Update: I modified Trainer.train() to delete intermediate data (the loss history and the sampled images) once it's done with them, and training has now survived past the point where it was running out of memory before. I'll play with it a little more and then send a PR if that sounds okay.
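
For reference, a rough, hypothetical sketch of that idea (the names below are illustrative; this is not the actual diff in the linked PR): keep only Python floats in the loss history and drop references to per-step GPU tensors as soon as they have been consumed, so the CUDA caching allocator can reuse that memory on later steps.

import torch

def train_loop(step_fn, sample_fn, save_fn, num_steps, sample_every = 1000):
    # step_fn: runs forward + backward + optimizer step, returns the scalar loss tensor
    # sample_fn: runs the (slow) sampling loop, returns a batch of images
    # save_fn: writes the sampled images to disk
    loss_history = []
    for step in range(num_steps):
        loss = step_fn()
        loss_history.append(loss.item())   # store a Python float, not the GPU tensor
        del loss                           # drop the tensor so its memory can be reused

        if step > 0 and step % sample_every == 0:
            sampled_images = sample_fn()
            save_fn(sampled_images, step)
            del sampled_images             # don't keep the sampled batch resident on the GPU
            torch.cuda.empty_cache()       # optional: return cached blocks to the driver
    return loss_history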

andrewmoise linked a pull request (#75) on Aug 12, 2022 that will close this issue