
Local GPU can't reach full utilization with small sample sizes #263

Open

leiqianstat opened this issue Oct 20, 2023 · 5 comments

leiqianstat commented Oct 20, 2023

My data is a 32×320 matrix: 32 samples, each with 320 dimensions. Locally on a 4090, each iteration takes 20 s, with CPU usage at 99% and GPU usage at 1%. When I increase the sample size to 1,000 or 10,000, I get 20 iterations per second, with CPU at 99% and GPU at 99%. When I ran the same example with n=32 and p=320 on a Kaggle P100, it ran at 3 iterations per second, with CPU at 99% and GPU at 99%.
I don't understand why the local GPU is so much slower than Kaggle at n=32. I hope this can be fixed; here is my code.

import torch
from denoising_diffusion_pytorch import Unet1D, GaussianDiffusion1D, Trainer1D, Dataset1D

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Unet1D(
    dim=64,
    dim_mults=(1, 2, 4, 8),
    channels=1
).to(device)

diffusion = GaussianDiffusion1D(
    model,
    seq_length=320,
    timesteps=100,
    objective='pred_v'
).to(device)


data = torch.randn(32, 320)               # 32 samples, 320 dimensions (placeholder data)
training_seq = data.unsqueeze(1).float()  # add a channel dim -> shape (32, 1, 320)
dataset = Dataset1D(training_seq)

trainer = Trainer1D(
    diffusion,
    dataset=dataset,
    train_batch_size=64,
    train_lr=8e-5,
    train_num_steps=500,         # total training steps
    gradient_accumulate_every=2,    # gradient accumulation steps
    ema_decay=0.995,                # exponential moving average decay
)

trainer.train()
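
For diagnosis (a sketch, not part of the original report; assumes CUDA is available), one can confirm device placement and time a single forward/backward pass outside the Trainer, which separates model compute from data-loading overhead:

import time

print(next(model.parameters()).device)          # expect: cuda:0

x = training_seq.to(device)                     # the (32, 1, 320) batch on the GPU
torch.cuda.synchronize()                        # flush pending GPU work before timing
start = time.time()
loss = diffusion(x)                             # GaussianDiffusion1D's forward returns the training loss
loss.backward()
torch.cuda.synchronize()
print(f"one step: {time.time() - start:.3f}s")

If this single step is fast while Trainer1D iterations are slow, the bottleneck is in data loading rather than in the model.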
leiqianstat (Author) commented
Hi @lucidrains, could you help me see what the problem is?

reinterpret-cast commented Apr 3, 2024

I have a similar question: I tried with a 2080 Ti (12 GB) but it went OOM immediately. When I reduced the dataset to 10 images it at least started to train, but very slowly, and it barely used the CPU or GPU at all. Do we know what hardware, image counts, and batch sizes are needed to utilize the hardware properly?

kidintwo3 commented
I can't seem to run Unet1D on a local GPU either. Unet2D picks up the GPU properly through Accelerate, but even though the device is set to "cuda:0", Unet1D only uses the CPU after a few seconds of GPU activity.

reinterpret-cast commented
I found one reason for the slowness/idling: on Windows, the DataLoader does not work if it is configured with parallelism. It keeps spawning new worker processes that live only for a short period; Windows cannot fork, so every worker is re-created from scratch, which kills throughput. It would be nice if a warning were printed for Windows users.
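
A possible workaround (a sketch, not from this thread; the worker count of 4 is an arbitrary example) is to choose num_workers based on the platform when building a DataLoader yourself:

import platform
from torch.utils.data import DataLoader

# Windows has no fork(): each DataLoader worker is spawned as a fresh
# process that re-imports the script, which is very slow for short-lived workers.
num_workers = 0 if platform.system() == "Windows" else 4

dl = DataLoader(dataset, batch_size=64, shuffle=True,
                pin_memory=True, num_workers=num_workers)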

kidintwo3 commented Apr 24, 2024

> I found one reason for the slowness/idling: on Windows, the DataLoader does not work if it is configured with parallelism. […] It would be nice if a warning were printed for Windows users.

Removing num_workers seems to fix it. The offending line in the library is:

dl = DataLoader(dataset, batch_size = train_batch_size, shuffle = True, pin_memory = True, num_workers = cpu_count())
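
For reference, the fixed call (a sketch, assuming the rest of Trainer1D is unchanged) simply drops the argument; num_workers defaults to 0, i.e. single-process loading:

dl = DataLoader(dataset, batch_size = train_batch_size, shuffle = True, pin_memory = True)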
