UserWarning: Plan failed with a cudnnException #121834
Comments
Lowering the input resolution a bit in another run, I no longer see #121504 (comment) (as documented in that ticket), but I do see these extra messages in the log:

/tmp/torchinductor_root/py/cpylrdfke46tta45o5xnxi77ex3ja2o5vdxsbjbcnp66kgd7vwqd.py:615: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at /opt/conda/conda-bld/pytorch_1710229288018/work/aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
buf3 = extern_kernels.convolution(buf0, buf1, stride=(1, 1), padding=(1, 1), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
/tmp/torchinductor_root/py/cpylrdfke46tta45o5xnxi77ex3ja2o5vdxsbjbcnp66kgd7vwqd.py:644: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at /opt/conda/conda-bld/pytorch_1710229288018/work/aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
buf13 = extern_kernels.convolution(buf10, buf11, stride=(1, 1), padding=(1, 1), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
/tmp/torchinductor_root/rd/crdjdt7nq5zpiv2qjswdnkkjyqhawkfoeb5jm6t3lthc3dr3vmbq.py:615: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at /opt/conda/conda-bld/pytorch_1710229288018/work/aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
buf3 = extern_kernels.convolution(buf0, buf1, stride=(1, 1), padding=(1, 1), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
@ezyang do you know who the right person to look at this is? The warning is coming from the eager convolution kernel.
It sounds like something is wrong with our size/stride meta. cc @eellison
I think you are linking the wrong inductor code. There is no conv in that output. Can you please include a full repro?
I have many generated inductor code files.
If you want to reproduce it from the source code instead, @williamwen42 already has some instructions at
If you run with
You can find it here:
Looking at the dump here, the warning happens prior to dynamo tracing, so I'm not sure that it is a fake tensor (or pt2) issue.
But I could confirm that without decorating with

P.S. I meant without decorating.
I got this warning after updating to PyTorch 2.3.0 today. Reproduction code:

```python
import torch
import torch.nn as nn

conv_no_warn = nn.Conv2d(8, 3, kernel_size=3, stride=1, padding=0).eval().cuda()
conv_warn = nn.Conv2d(8, 1, kernel_size=3, stride=1, padding=0).eval().cuda()

x = torch.rand((1, 8, 546, 392)).cuda()

with torch.inference_mode(), torch.autocast(device_type="cuda"):
    # No warning
    conv_no_warn(x)
    # UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED
    conv_warn(x)
```
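For anyone bisecting builds, here is a minimal sketch (assuming the repro above and a CUDA-enabled build) that detects the warning programmatically instead of eyeballing stderr; `warnings.catch_warnings(record=True)` collects what is emitted so the message can be asserted on:

```python
import warnings

import torch
import torch.nn as nn

conv = nn.Conv2d(8, 1, kernel_size=3, stride=1, padding=0).eval().cuda()
x = torch.rand((1, 8, 546, 392)).cuda()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # bypass the default once-per-location filter
    with torch.inference_mode(), torch.autocast(device_type="cuda"):
        conv(x)

# True on affected builds (e.g. 2.3.0), False once the cuDNN fallback is silent again.
hit = any("Plan failed with a cudnnException" in str(w.message) for w in caught)
print("cuDNN plan warning raised:", hit)
```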
Seeing this consistently with
This can be seen here:
https://github.com/pytorch/builder/actions/runs/8819486240/job/24210862873#step:11:4098

This was happening in the nightly on March 13:

However, it was fixed on March 14:
I had used

Edit:

```python
import torch
import torch.nn as nn

conv_warn = nn.Conv2d(768, 96, kernel_size=(1, 1), stride=(1, 1)).eval().cuda()
x = torch.rand((1, 768, 39, 28)).cuda()

with torch.inference_mode(), torch.autocast(device_type="cuda"):
    # UserWarning: Plan failed with a cudnnException: CUDN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED
    conv_warn(x)
```

In my current environment, this code consistently produces warnings, but the previous code stopped producing warnings after I ran it with
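If the warning is too noisy in the meantime, one heavy-handed workaround is to route convolutions away from cuDNN entirely. This is only a sketch under the assumption that the warning originates in the cuDNN v8 plan path; disabling cuDNN can make convolutions noticeably slower:

```python
import torch

# Assumption: the warning comes from cuDNN's plan finalization, so falling back
# to the native (non-cuDNN) convolution kernels avoids it, at a performance cost.
torch.backends.cudnn.enabled = False
```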
Also encountering this in eager mode in our unit tests while upgrading to torch 2.3.
With PyTorch 2.3.0 and ultralytics yolov8, same problem.
With PyTorch 2.3.0 and ultralytics yolov8 I'm getting:

UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ..\aten\src\ATen\native\cudnn\Conv_v8.cpp:919.)
I've encountered the same issue while using the latest official Docker image pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime. Here is the warning message:

/opt/conda/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at /opt/conda/conda-bld/pytorch_1712608935911/work/aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

System Info:

The specific training model is the official resnet18. It is worth noting that this error did not affect the progress of training.
We are looking to get this resolved in 2.3.1, and yes, the warning alone should not affect the results of training. It is basically saying that the first selected cuDNN algorithm could not run the workload; in that case the next selected cuDNN algorithm will be tried.
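Until the fix ships, a sketch for silencing just this message rather than all UserWarnings (assuming the message text stays stable across builds; `message` is a regex matched against the start of the warning text):

```python
import warnings

# Match only the cuDNN plan-failure message so other UserWarnings still surface.
warnings.filterwarnings(
    "ignore",
    message="Plan failed with a cudnnException",
    category=UserWarning,
)
```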
I encountered the same problem.

nvcc: NVIDIA (R) Cuda compiler driver
NVIDIA-SMI 550.67, Driver Version: 550.67, CUDA Version: 12.4
Ubuntu 22.04.1 LTS

UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

I don't know if it will affect the final execution result of the program.
Closing this since the cherry-pick PR has been posted: #125790
I had the same issue:

torch/nn/modules/conv.py:952: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
I have also just gotten this error. I don't believe I saw it yesterday; maybe it resurfaced?
In my case the torch version is not even 2.3; it is 2.2.1, and the issue started appearing yesterday. Can it be related to an internal problem in my GPU?

Error: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)

Why is this issue closed, since the error is still not solved?
Update:

venv/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
I get this, but it doesn't seem to interfere with training.
Driver Version: 550.54.15, CUDA Version: 12.4
I get the same UserWarning with PyTorch 2.3.0 and CUDA 11.8.
I have a similar problem (to some extent). I got the error while using resnet3d, which takes a sequence of video frames with batch size 16 and sequence length 16, on the following versions of torch and cudnn:
The issue was solved when I downgraded torch to 2.2.2.
Validated with 2.3.1 rc: https://github.com/pytorch/builder/actions/runs/9288567412/job/25566399559
I can confirm this is an issue with 2.3.0. I have a nix flake.lock pinning torch and torchaudio to 2.3.0+cu121, and a separate one pinning them to 2.2.2+cu121. When I run

The error:
Downgrading to PyTorch 2.2.2 solved the issue.
Does this cause any performance degradation?
It should not cause performance degradation, as the failing config will be skipped after the first iteration.
I have not noticed any performance degradation yet. You only get the warning at the first iteration anyway.
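One way to sanity-check that claim yourself (a rough sketch; the shapes are borrowed from the repro above and timings are machine-specific) is to time the first call, where the failing plan is tried and discarded, separately from the steady state:

```python
import time

import torch
import torch.nn as nn

conv = nn.Conv2d(768, 96, kernel_size=(1, 1)).eval().cuda()
x = torch.rand((1, 768, 39, 28)).cuda()

def timed(fn):
    # Synchronize around the call so we measure GPU work, not just kernel launch.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return time.perf_counter() - t0

with torch.inference_mode(), torch.autocast(device_type="cuda"):
    first = timed(lambda: conv(x))  # the warning (if any) fires here
    steady = min(timed(lambda: conv(x)) for _ in range(10))  # cached plan reused

print(f"first: {first * 1e3:.3f} ms, steady state: {steady * 1e3:.3f} ms")
```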
🐛 Describe the bug
Compiling this forward:
https://github.com/yoxu515/aot-benchmark/blob/paot/networks/engines/aotv3_engine.py#L35-L110

I got this warning.
Error logs
And after a few inputs I got:
#121504 (comment)
/cc @ezyang @gchanan @zou3519 @kadeng @csarofeen @ptrblck @xwang233 @msaroufim @bdhirsh @anijain2305 @chauhang @williamwen42
Attached is the generated inductor code:
c5nhr6q2xpuk52rh5thx56utuj6tjvxobjgjbd3rsdjvwggjys3d.py.txt
Minified repro
No response
Versions
The latest official pytorch-nightly image.