Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training gg colab: RuntimeError: No such operator aten::cudnn_convolution_backward_weight #631

Open
minhduc01168 opened this issue Dec 14, 2023 · 0 comments

Comments

@minhduc01168
Copy link

2023-12-14 09:24:39.480344: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-14 09:24:39.480399: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-14 09:24:39.480449: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-14 09:24:41.346063: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Training for 500 kimg...

Traceback (most recent call last):
File "/content/drive/MyDrive/WIP/stylegan3/train.py", line 286, in
main() # pylint: disable=no-value-for-parameter
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/content/drive/MyDrive/WIP/stylegan3/train.py", line 281, in main
launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
File "/content/drive/MyDrive/WIP/stylegan3/train.py", line 96, in launch_training
subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
File "/content/drive/MyDrive/WIP/stylegan3/train.py", line 47, in subprocess_fn
training_loop.training_loop(rank=rank, **c)
File "/content/drive/MyDrive/WIP/stylegan3/training/training_loop.py", line 278, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
File "/content/drive/MyDrive/WIP/stylegan3/training/loss.py", line 111, in accumulate_gradients
loss_Dgen.mean().mul(gain).backward()
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/content/drive/MyDrive/WIP/stylegan3/torch_utils/ops/conv2d_gradfix.py", line 144, in backward
grad_weight = Conv2dGradWeight.apply(grad_output, input)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/content/drive/MyDrive/WIP/stylegan3/torch_utils/ops/conv2d_gradfix.py", line 173, in forward
return torch._C._jit_get_operation(name)(weight_shape, grad_output, input, padding, stride, dilation, groups, *flags)
RuntimeError: No such operator aten::cudnn_convolution_backward_weight
Google Colab: T4 GPU
!nvidia-smi: CUDA 12.0
pytorch 2.1.0+cu118
I ran the following command for training: # Fine-tune StyleGAN3-R for MetFaces-U using 1 GPU, starting from the pre-trained FFHQ-U pickle.
!python /content/drive/MyDrive/WIP/stylegan3/train.py --outdir=~/training-runs --cfg=stylegan3-r --data=/content/drive/MyDrive/WIP/stylegan3/datasets/Face_Celeb-1024x1024.zip
--gpus=1 --batch=16 --batch-gpu=8 --gamma=6.6 --mirror=1 --kimg=500 --snap=5
--resume=https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-ffhqu-1024x1024.pkl
Hoping to get some help. I've been searching but still haven't found a solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant