
libcudart.so.10.2: cannot open shared object file: No such file or directory #88

Open
joecomerisnotavailable opened this issue Mar 7, 2023 · 3 comments

Comments

@joecomerisnotavailable

After I installed the cauchy-mult extension as described in the README, I noticed that I still get the warning

CUDA extension for cauchy multiplication not found. Install by going to extensions/cauchy/ and running python setup.py install. This should speed up end-to-end training by 10-50%

when importing the S4 model. By tracing the warning message back to the source code and trying the import

from extensions.cauchy.cauchy import cauchy_mult

directly, I found that the exception causing the failure is

libcudart.so.10.2: cannot open shared object file: No such file or directory
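The generic warning hides the underlying exception. The following is a minimal sketch of an optional-import pattern that keeps that exception visible. It assumes only the module path quoted above. `load_optional` is a hypothetical helper, not code from the S4 repository:

```python
import importlib
import warnings


def load_optional(module_name, attr):
    """Return `attr` from `module_name`, or None after warning with the real error.

    Hypothetical helper: unlike a fixed "extension not found" message, the warning
    includes the original exception, e.g. a missing libcudart.so.X.Y.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as exc:
        warnings.warn(f"{module_name}.{attr} unavailable: {exc!r}")
        return None


# Module path taken from the import above; None means fall back to another backend.
cauchy_mult = load_optional("extensions.cauchy.cauchy", "cauchy_mult")
```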

This is all in a fresh conda environment with pytorch==1.13.1 and cudatoolkit=11.7 (installed by default on the AWS AMI I'm using). It looks like a CUDA version mismatch (10.2 != 11.7), but from reading other issues I've seen that the code has been tested successfully with CUDA 11.1 and 11.3, and I didn't see any requirement that the CUDA version be <= 11.3. In any case, I'm not sure where the 10.2 is coming from.
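One way to find where a 10.2 requirement comes from is to compare two versions. The first is the CUDA runtime that the compiled extension links against, as reported by `ldd` on Linux. The second is the CUDA version PyTorch was built for, available as `torch.version.cuda`. The helpers in the minimal sketch below are hypothetical, and you would need to supply the path to the extension's `.so` for your own install. One possible cause, not confirmed in this thread, is that the extension was compiled against a system CUDA toolkit (the one `torch.utils.cpp_extension.CUDA_HOME` points to) that differs from the conda cudatoolkit.

```python
import re
import subprocess


def ldd(so_path):
    """Run `ldd` (Linux only) on a shared object and return its text output."""
    result = subprocess.run(["ldd", so_path], capture_output=True, text=True, check=True)
    return result.stdout


def cudart_deps(ldd_output):
    """Return [(soname, resolved_path_or_None)] for each libcudart line in `ldd` output."""
    deps = []
    for line in ldd_output.splitlines():
        m = re.match(r"\s*(libcudart\.so[\w.]*)\s+=>\s+(\S+)", line)
        if m:
            # ldd prints "=> not found" when the loader cannot resolve the library.
            path = None if m.group(2) == "not" else m.group(2)
            deps.append((m.group(1), path))
    return deps


def soname_version(soname):
    """'libcudart.so.10.2' -> '10.2'; None for an unversioned soname."""
    return soname.split(".so.", 1)[1] if ".so." in soname else None


# Usage (path is hypothetical; point it at the compiled extension's .so):
#   import torch
#   for name, path in cudart_deps(ldd("/path/to/extension.so")):
#       print(name, path, "needs", soname_version(name), "torch has", torch.version.cuda)
```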

The pykeops approach will probably be OK for my purposes, but I wonder whether there's a simple fix for this.

Thanks

@albertfgu
Contributor

PyTorch 1.13 deprecated support for CUDA 10.2 (https://pytorch.org/blog/PyTorch-1.13-release/).

I'm actually trying to figure out the best way to deal with this myself, since I have some development environments still on CUDA 10.2. I tried installing PyTorch 1.12 but ran into an issue (possibly unrelated). My working conda environment on this machine is still on PyTorch 1.10 or 1.11, which works fine.
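Based on the deprecation above, a quick check for the problematic combination might look like the following minimal sketch. Relying only on the cited release note, it assumes that PyTorch 1.13 and newer should not be paired with an extension that needs CUDA 10.2. The helper names are hypothetical.

```python
def release_tuple(torch_version):
    """'1.13.1+cu117' -> (1, 13); ignores the patch level and local tag."""
    major, minor = torch_version.split("+", 1)[0].split(".")[:2]
    return int(major), int(minor)


def needs_older_torch(torch_version, ext_cuda):
    """True if the extension needs CUDA 10.2 but torch is 1.13 or newer."""
    return ext_cuda == "10.2" and release_tuple(torch_version) >= (1, 13)


# Usage with versions mentioned in this issue:
print(needs_older_torch("1.13.1", "10.2"))  # True
print(needs_older_torch("1.11.0", "10.2"))  # False
```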

@jchia
Contributor

jchia commented Apr 26, 2023

Did you try importing torch first, with a version of the torch package built for the same CUDA version as the CUDA kernel package you are trying to import?

For example, in my venv on a Linux machine, I have torch 2.0.0+cu118 installed, and my CUDA kernel package (structured-kernels) was built with CUDA 11.8. If I import structured_kernels on its own, I get "ImportError: libc10.so: cannot open shared object file: No such file or directory", but if I import torch first, there is no problem. I believe this works because the torch package ships its own libraries (including CUDA ones), and once torch has loaded them, structured_kernels can use them too. If the CUDA versions don't match, this won't work.
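The import-order workaround can be sketched as follows. The sketch assumes standard pip wheels whose version string carries a `+cuXXX` tag, as in `2.0.0+cu118`. Conda builds may report a bare version, so `torch.version.cuda` is the more direct check there. `cuda_from_torch_version` and `import_after` are hypothetical helpers.

```python
import importlib
import re


def cuda_from_torch_version(torch_version):
    """Map a wheel version like '2.0.0+cu118' to '11.8'; None without a +cuXXX tag."""
    m = re.search(r"\+cu(\d+)$", torch_version)
    if m is None:
        return None
    digits = m.group(1)
    # The last digit is the minor version: cu118 -> 11.8, cu102 -> 10.2.
    return f"{digits[:-1]}.{digits[-1]}"


def import_after(first, then):
    """Import `first` (e.g. torch, whose shared libraries such as libc10.so get
    loaded into the process), then import and return `then`."""
    importlib.import_module(first)
    return importlib.import_module(then)


# Usage with the names from the comment above:
#   structured_kernels = import_after("torch", "structured_kernels")
```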

@albertfgu
Contributor

@joecomerisnotavailable Have you been able to resolve this, or are you still having issues? I've since moved entirely off CUDA 10.2; it's too outdated.

@jchia Now that you mention it, I think I've seen something similar when importing things in the REPL. IIRC, the import would sometimes fail the first time but work the second. But if the extension is installed correctly, the end-to-end training code should work.
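To check whether an import failure is order-dependent in the way described above, you could use a diagnostic sketch like the following, which retries once and keeps the first error. `import_with_retry` is a hypothetical helper. It only reports the behavior and does not fix an incorrectly installed extension.

```python
import importlib


def import_with_retry(name, attempts=2):
    """Import `name`, retrying on ImportError; return (module, earlier_errors)."""
    errors = []
    for _ in range(max(attempts, 1)):
        try:
            return importlib.import_module(name), errors
        except ImportError as exc:
            errors.append(exc)  # keep each failure so it can be inspected later
    raise ImportError(f"import of {name!r} failed {len(errors)} times") from errors[-1]


# Usage (module name is an assumption): a non-empty `errs` after success means
# the first attempt failed but a later one worked.
#   mod, errs = import_with_retry("extensions.cauchy.cauchy")
```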


3 participants