Any advice on how to deal with running out of GPU memory? I'm just getting started with PyTorch and with this package, and this is what happens when I try an initial test run of 7000 steps (57,000 training images at 128x128, on a GPU with 15 GB of memory):
>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>>
>>> model = Unet(
...     dim = 64,
...     dim_mults = (1, 2, 4, 8)
... ).cuda()
>>>
>>> diffusion = GaussianDiffusion(
...     model,
...     image_size = 128,
...     timesteps = 1000,    # number of steps
...     loss_type = 'l1'     # L1 or L2
... ).cuda()
>>> trainer = Trainer(
...     diffusion,
...     'training-set-2',
...     train_batch_size = 32,
...     train_lr = 2e-5,
...     train_num_steps = 7000,          # total training steps
...     gradient_accumulate_every = 2,   # gradient accumulation steps
...     ema_decay = 0.995,               # exponential moving average decay
...     amp = True                       # turn on mixed precision
... )
>>>
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00, 1.90it/s]
loss: 0.2902: 14%|███▊ | 1001/7000 [55:22<5:31:53, 3.32s/it]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
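The error message itself points at two knobs: shrinking the per-step batch while raising gradient_accumulate_every (so the effective batch size stays the same), and setting the fragmentation option it mentions before CUDA is initialised. A rough sketch with the same setup as above; the batch size of 8 and the 128 MiB split size are untested guesses, not known-good values for this dataset:

import os

# Must be set before torch initialises CUDA; 128 MiB is an arbitrary example,
# and only helps when reserved memory is much larger than allocated memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(dim = 64, dim_mults = (1, 2, 4, 8)).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,
    loss_type = 'l1'
).cuda()

trainer = Trainer(
    diffusion,
    'training-set-2',
    train_batch_size = 8,            # smaller per-step batch (guess, tune to fit)
    gradient_accumulate_every = 8,   # 8 * 8 = 64, same effective batch as 32 * 2
    train_lr = 2e-5,
    train_num_steps = 7000,
    ema_decay = 0.995,
    amp = True
)

trainer.train()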
Update: I modified Trainer.train() to delete intermediate data (the loss history and the sampled-image tensors) once it's done with them, and training has now made it past the point where it ran out of memory before. I'll play with it a little more and then send a PR, if that sounds okay.
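Roughly the kind of pattern meant by "delete intermediate data once it's done with it" -- a minimal sketch, not the actual patch to Trainer.train(); the helper names, optimizer handling, and diffusion.sample() call here are illustrative rather than the library's internals:

import torch
from torchvision import utils

# Minimal sketch of the idea: keep only Python floats in the loss history, and
# drop CUDA tensors as soon as they are no longer needed so the allocator can
# reuse that memory instead of letting it pile up.

def train_step(diffusion, opt, batch):
    loss = diffusion(batch)
    loss.backward()
    opt.step()
    opt.zero_grad()
    loss_value = loss.item()   # store the float, not the CUDA tensor
    del loss                   # release the tensor before the next step
    return loss_value

def sample_and_save(diffusion, path, n = 25):
    images = diffusion.sample(batch_size = n)
    utils.save_image(images, path, nrow = 5)
    del images                 # free the sampled batch before training resumes
    torch.cuda.empty_cache()   # optional: hand cached blocks back to CUDA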