CUDA does not attempt to reclaim memory when return code is CUDNN_STATUS_EXECUTION_FAILED #649

Open
egeonat opened this issue Dec 24, 2020 · 0 comments

While training a Knet model, I got a "CUDNN_STATUS_EXECUTION_FAILED" error thrown by CUDA.jl. Further inspection showed that CUDA.jl only attempts to reclaim memory when the error code is "CUDNN_STATUS_ALLOC_FAILED". This can be seen in the @check macro, which is responsible for attempting memory reclamation and throwing API errors:

https://github.com/JuliaGPU/CUDA.jl/blob/b3228085bc6bf87a0feb5885fc636f352d0e3f0e/lib/cudnn/error.jl#L28

Replacing that line with the following code ends up fixing my problem:

res = @retry_reclaim err -> isequal(err, CUDNN_STATUS_ALLOC_FAILED) ||
                            isequal(err, CUDNN_STATUS_EXECUTION_FAILED) begin
    $(esc(ex))
end

I believe this should ideally be fixed in the CUDA and CUDNN packages. They incorrectly assume that memory reclamation only needs to be attempted when the error code is "CUDNN_STATUS_ALLOC_FAILED", but they also return "CUDNN_STATUS_EXECUTION_FAILED" for issues that could be fixed by reclaiming memory. Until the issue is fixed there, it also affects Knet functionality, so I think a temporary workaround could be beneficial.
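To illustrate the idea without depending on CUDA.jl internals, here is a minimal, self-contained sketch of the retry-after-reclaim pattern. Everything in it is hypothetical: the `DemoStatus` enum merely stands in for the cuDNN status codes, and `should_reclaim`, `retry_with_reclaim`, and `max_retries` are names made up for this example, not part of CUDA.jl or Knet. It only shows how widening the retry predicate changes which failures trigger a reclaim.

```julia
# Hypothetical stand-ins for cuDNN status codes (NOT the real CUDA.jl definitions).
@enum DemoStatus DEMO_SUCCESS DEMO_ALLOC_FAILED DEMO_EXECUTION_FAILED DEMO_BAD_PARAM

# Statuses after which a memory reclaim is worth attempting.
# The proposal in this issue amounts to adding the second condition.
should_reclaim(status::DemoStatus) =
    status == DEMO_ALLOC_FAILED || status == DEMO_EXECUTION_FAILED

# Call `f`; while it returns a status accepted by `should_retry`, run `reclaim`
# and call `f` again, for at most `max_retries` extra attempts.
# Returns the last status observed.
function retry_with_reclaim(f, should_retry, reclaim; max_retries::Int = 2)
    status = f()
    attempts = 0
    while should_retry(status) && attempts < max_retries
        reclaim()
        attempts += 1
        status = f()
    end
    return status
end

# Usage: an operation that fails once with EXECUTION_FAILED, then succeeds.
calls = Ref(0)
reclaims = Ref(0)
flaky() = (calls[] += 1; calls[] == 1 ? DEMO_EXECUTION_FAILED : DEMO_SUCCESS)

result = retry_with_reclaim(flaky, should_reclaim, () -> (reclaims[] += 1))
println(result, " after ", calls[], " calls and ", reclaims[], " reclaim(s)")
```

With a predicate that only accepts the allocation failure, the same `flaky` operation would return its first failing status without any reclaim attempt, which mirrors the behavior described above.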
