Any advice on how to deal with running out of GPU memory? I'm just getting started with PyTorch and with this package, and this is what happens when I try an initial test run of 7000 steps (57,000 training images at 128x128, on a GPU with 15 GB of memory):
>>> from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
>>>
>>> model = Unet(
...     dim = 64,
...     dim_mults = (1, 2, 4, 8)
... ).cuda()
>>>
>>> diffusion = GaussianDiffusion(
...     model,
...     image_size = 128,
...     timesteps = 1000,    # number of steps
...     loss_type = 'l1'     # L1 or L2
... ).cuda()
>>> trainer = Trainer(
...     diffusion,
...     'training-set-2',
...     train_batch_size = 32,
...     train_lr = 2e-5,
...     train_num_steps = 7000,          # total training steps
...     gradient_accumulate_every = 2,   # gradient accumulation steps
...     ema_decay = 0.995,               # exponential moving average decay
...     amp = True                       # turn on mixed precision
... )
>>>
>>> trainer.train()
sampling loop time step: 100%|██████████████████| 1000/1000 [08:45<00:00, 1.90it/s]
loss: 0.2902: 14%|███▊ | 1001/7000 [55:22<5:31:53, 3.32s/it]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/lib/python3.7/site-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 823, in train
self.accelerator.backward(loss)
File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 884, in backward
loss.backward(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.56 GiB total capacity; 13.02 GiB already allocated; 84.44 MiB free; 13.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
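The error message itself points at two knobs: shrinking the per-step batch while raising gradient_accumulate_every (so the effective batch size stays the same), and setting the fragmentation option it mentions before CUDA is initialised. A rough sketch with the same setup as above; the batch size of 8 and the 128 MiB split size are untested guesses, not known-good values for this dataset:

import os

# Must be set before torch initialises CUDA; 128 MiB is an arbitrary example,
# and only helps when reserved memory is much larger than allocated memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(dim = 64, dim_mults = (1, 2, 4, 8)).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,
    loss_type = 'l1'
).cuda()

trainer = Trainer(
    diffusion,
    'training-set-2',
    train_batch_size = 8,            # smaller per-step batch (guess, tune to fit)
    gradient_accumulate_every = 8,   # 8 * 8 = 64, same effective batch as 32 * 2
    train_lr = 2e-5,
    train_num_steps = 7000,
    ema_decay = 0.995,
    amp = True
)

trainer.train()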
Update: I modified Trainer.train() to delete intermediate data (the loss history and the sampled-image tensors) once it's done with them, and training has now made it past the point where it ran out of memory before. I'll play with it a little more and then send a PR, if that sounds okay.
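Roughly the kind of pattern meant by "delete intermediate data once it's done with it" -- a minimal sketch, not the actual patch to Trainer.train(); the helper names, optimizer handling, and diffusion.sample() call here are illustrative rather than the library's internals:

import torch
from torchvision import utils

# Minimal sketch of the idea: keep only Python floats in the loss history, and
# drop CUDA tensors as soon as they are no longer needed so the allocator can
# reuse that memory instead of letting it pile up.

def train_step(diffusion, opt, batch):
    loss = diffusion(batch)
    loss.backward()
    opt.step()
    opt.zero_grad()
    loss_value = loss.item()   # store the float, not the CUDA tensor
    del loss                   # release the tensor before the next step
    return loss_value

def sample_and_save(diffusion, path, n = 25):
    images = diffusion.sample(batch_size = n)
    utils.save_image(images, path, nrow = 5)
    del images                 # free the sampled batch before training resumes
    torch.cuda.empty_cache()   # optional: hand cached blocks back to CUDA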