
Local GPU can't reach full utilization with small sample sizes #263

Open

leiqianstat opened this issue Oct 20, 2023 · 5 comments

leiqianstat commented Oct 20, 2023

My data is a 32×320 matrix: 32 samples, each with 320 dimensions. Locally on a 4090, each iteration takes 20 s, with CPU usage at 99% and GPU usage at 1%. When I increase the sample size to 1,000 or 10,000, I get 20 iterations per second, with CPU at 99% and GPU at 99%. When I ran the same example with n=32 and p=320 on a Kaggle P100, it ran at 3 iterations per second, with CPU at 99% and GPU at 99%.
I don't understand why the local GPU is so much slower than Kaggle at n=32. I hope this can be fixed; here is my code.

import torch
from denoising_diffusion_pytorch import Unet1D, GaussianDiffusion1D, Trainer1D, Dataset1D

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Unet1D(
    dim=64,
    dim_mults=(1, 2, 4, 8),
    channels=1
).to(device)

diffusion = GaussianDiffusion1D(
    model,
    seq_length=320,
    timesteps=100,
    objective='pred_v'
).to(device)


data = torch.randn(32, 320)               # 32 samples, 320 dimensions (placeholder data)
training_seq = data.unsqueeze(1).float()  # add a channel dim -> shape (32, 1, 320)
dataset = Dataset1D(training_seq)

trainer = Trainer1D(
    diffusion,
    dataset=dataset,
    train_batch_size=64,
    train_lr=8e-5,
    train_num_steps=500,         # total training steps
    gradient_accumulate_every=2,    # gradient accumulation steps
    ema_decay=0.995,                # exponential moving average decay
)

trainer.train()
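
For diagnosis (a sketch, not part of the original report; assumes CUDA is available), one can confirm device placement and time a single forward/backward pass outside the Trainer, which separates model compute from data-loading overhead:

import time

print(next(model.parameters()).device)          # expect: cuda:0

x = training_seq.to(device)                     # the (32, 1, 320) batch on the GPU
torch.cuda.synchronize()                        # flush pending GPU work before timing
start = time.time()
loss = diffusion(x)                             # GaussianDiffusion1D's forward returns the training loss
loss.backward()
torch.cuda.synchronize()
print(f"one step: {time.time() - start:.3f}s")

If this single step is fast while Trainer1D iterations are slow, the bottleneck is in data loading rather than in the model.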
leiqianstat (Author) commented
Hi @lucidrains, could you help me see what the problem is?

reinterpret-cast commented Apr 3, 2024

I have a similar question: I tried with a 2080 Ti (12 GB) but it went OOM immediately. When I reduced the dataset to 10 images it at least started to train, but very slowly, and it barely used the CPU or GPU at all. Do we know what hardware, image counts, and batch sizes are needed to utilize the hardware properly?

kidintwo3 commented
I can't seem to run Unet1D on a local GPU either. Unet2D picks up the GPU properly through Accelerate, but even though the device is set to "cuda:0", Unet1D only uses the CPU after a few seconds of GPU activity.

reinterpret-cast commented
I found one reason for the slowness/idling: on Windows, the DataLoader does not work if it is configured with parallelism. It keeps spawning new worker processes that live only for a short period; Windows cannot fork, so every worker is re-created from scratch, which kills throughput. It would be nice if a warning were printed for Windows users.
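
A possible workaround (a sketch, not from this thread; the worker count of 4 is an arbitrary example) is to choose num_workers based on the platform when building a DataLoader yourself:

import platform
from torch.utils.data import DataLoader

# Windows has no fork(): each DataLoader worker is spawned as a fresh
# process that re-imports the script, which is very slow for short-lived workers.
num_workers = 0 if platform.system() == "Windows" else 4

dl = DataLoader(dataset, batch_size=64, shuffle=True,
                pin_memory=True, num_workers=num_workers)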

kidintwo3 commented Apr 24, 2024

> I found one reason for the slowness/idling: on Windows, the DataLoader does not work if it is configured with parallelism. […] It would be nice if a warning were printed for Windows users.

Removing num_workers seems to fix it. The offending line in the library is:

dl = DataLoader(dataset, batch_size = train_batch_size, shuffle = True, pin_memory = True, num_workers = cpu_count())
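
For reference, the fixed call (a sketch, assuming the rest of Trainer1D is unchanged) simply drops the argument; num_workers defaults to 0, i.e. single-process loading:

dl = DataLoader(dataset, batch_size = train_batch_size, shuffle = True, pin_memory = True)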
