GPU variant has issues recognizing the GPU. #1035
Comments
@derneuere
Once I removed that and ran nvidia-smi, CUDA showed the version that's installed. You might want to update the guide and remove those deploy settings if that's the case for everyone.
UPDATE:
@derneuere Any chance of getting this fixed? I do not want to go back to CPU if this will be fixed soon, but right now this thing is totally broken.
dlib is compiled against a specific version of CUDA, which in this case is CUDA 11.7.1 with cuDNN 8. It complains that "forward compatibility" was attempted and failed, which means the host system likely has old drivers. Either the graphics card is old or the drivers are old. The graphics card cannot be the reason, as I develop on a system with a 1050 Ti Max-Q, which works fine. Please update the driver or change the deploy part. On my system I use
I can't make dlib compatible with multiple versions: compiling it at runtime would lead to a half-hour start-up time, and replacing it with something more flexible is not doable for me at the moment due to time constraints.
As you can see in my previous comment, if I add that part to the compose file, CUDA is not detected inside the container.
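For reference, "that part" is the GPU reservation block from the setup guide; a typical block, following the Docker Compose documentation, looks roughly like this (the service name and device count below are illustrative, not necessarily what the LibrePhotos guide uses):

```yaml
services:
  backend:                        # hypothetical service name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia      # requires the NVIDIA Container Toolkit on the host
              count: 1
              capabilities: [gpu]
```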
Hmm, I will try to bump everything to CUDA 12. According to the docs, it should be backwards compatible. Let's see if that actually works.
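A quick way to see which CUDA and cuDNN versions the installed PyTorch build was compiled against is a one-liner like this (a minimal sketch, assuming python3 and torch are available inside the container):

```bash
# Prints the CUDA version PyTorch was built with and the cuDNN version it links against
python3 -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version())"
```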
Cool, let me know if I can help.
Alright, I pushed a fix. It should be available in half an hour. Let me know if that fixes the issue for you :)
Is this on dev or stable?
Only on dev for now :)
Ah, can I pull just gpu-dev by adding -dev to it in the docker compose file?
Yes, it works the same way as the other image :)
Sadly, I've been trying to download that image for two days now; it just hangs and times out. I need to restart and hope it fully downloads.
with the latest dev GPU image.
@derneuere any idea how we move past this?
I can't reproduce this, and I am pretty sure that this issue is not on my side. Do other GPU-accelerated images work for you? Currently the only bug I can reproduce is #1056
Last time I tested the CUDA test container it worked; let me verify that now.
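For reference, the NVIDIA CUDA test container mentioned here is usually run like this (the image tag is illustrative; any recent CUDA base image should behave the same):

```bash
# Requires the NVIDIA Container Toolkit; should print the same table as nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```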
Here's the output; it looks like it is working inside Docker.
I added the parts back to the docker compose file like the guide says, and this is what I get now:
When I connect to the container and run nvidia-smi, it outputs correctly:
@derneuere After last night's update to the backend, things changed.
Also, a few things like these:
You'll notice the top two are in GMT and the last one is GMT+2. That might cause issues if you're comparing both and setting a timeout based on the difference. These messages repeat a few times. Side note:
Wasn't JXL fixed?
JXL is handled by thumbnail-service / ImageMagick and not by vips. Can you look into the log files for face-service and thumbnail-service and post any errors here?
Regarding JXL: why is it erroring out if it's not supposed to handle these files?
Here's what I get when loading the latest backend:
Still the same error: the backend can't find the GPU. I think this has something to do with Docker or PyTorch and not with LibrePhotos. Can you look for similar issues and check whether other containers that support GPU acceleration work?
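One way to narrow down whether PyTorch (rather than Docker) is the layer that fails is to run this inside the backend container (a sketch, assuming python3 and torch are on the PATH there):

```bash
# True means PyTorch can initialize CUDA; False points at the driver/runtime layer rather than LibrePhotos
python3 -c "import torch; print(torch.cuda.is_available())"
```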
If I see CUDA correctly in the test container from NVIDIA, is that enough to rule out the infrastructure? Or is there another test that will definitively prove this?
Forget what I said, I just ran:
Here's my env:
Yet, as mentioned, CUDA is seen.
I added nvidia-smi to the start of the entrypoint, because Bing Copilot suggested I run nvidia-smi before any command. Here's the output.
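A sketch of the kind of change described, not the actual LibrePhotos entrypoint:

```bash
#!/bin/bash
# Debug addition at the top of entrypoint.sh: dump the GPU/driver state
# before any other command runs, so it shows up early in the container logs.
nvidia-smi || echo "nvidia-smi failed inside the container"

# ...original entrypoint commands continue below...
```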
I think this is the relevant issue from pytorch: pytorch/pytorch#49081. I added nvidia modprobe to the container. Let's see if that works.
Still this. Are you initializing the CUDA drivers and visible devices before the code runs? It does not look like my pull request, which would have let me see the result of nvidia-smi in these logs, was merged.
After the modprobe and the pull request merge, still the same issue.
It should be backwards compatible, and we need this version as PyTorch uses the same one. I also have a 1050 Ti with CUDA 12.2 and driver version 535.129.03, and it works. The CUDA drivers should be installed on the host system; the Docker image needs the base image from NVIDIA, which we already use. Can you check whether different drivers are available for your system? My system:
My env looks like this:
I added export CUDA_VISIBLE_DEVICES=0 to the entrypoint.sh; maybe that will make a difference.
Here's my output inside the container:
That would just mean no devices would be registered.
export CUDA_VISIBLE_DEVICES=0 means that the 0th device will be visible, which in your list is your only GPU.
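In other words, the value is a list of device indices, not a count; a minimal illustration (assuming python3 and torch are installed):

```bash
# "0" exposes the first GPU; an empty value hides all GPUs
CUDA_VISIBLE_DEVICES=0  python3 -c "import torch; print(torch.cuda.device_count())"  # 1 on a working single-GPU setup
CUDA_VISIBLE_DEVICES="" python3 -c "import torch; print(torch.cuda.device_count())"  # 0, nothing visible
```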
The naming is a bit confusing then.
The question is, maybe it's such a unique case that it works when you test, but a desktop card requires something different?
Added it manually to my entrypoint; it did not help.
I used your configuration for the GPU. This also works on my machine.
I also executed your HostCuda script and it passed:
Can you try this suggested fix on your host machine? pytorch/pytorch#49081 (comment)
Do you mean this:
Because that returns: So my host is set up like yours (at least according to the prerequisite tests) and the compose file is the same. What else could be different?
Alright, just execute the second part. I am not basing the debug commands on the documentation, as it is usually incomplete, but on the pytorch issue on GitHub, which usually gives better pointers on how to fix the error. I just use Kubuntu 22.04; do you use something unique like Arch?
sudo modprobe nvidia_uvm
Nope, Debian bookworm.
This sounds like the GPU drivers are not actually installed correctly, according to this Ask Ubuntu post: https://askubuntu.com/questions/1413512/syslog-error-modprobe-fatal-module-nvidia-not-found-in-directory-lib-module
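A few generic host-side checks for that (package names vary by distribution; these assume a Debian-style system):

```bash
lsmod | grep nvidia              # are the nvidia / nvidia_uvm kernel modules loaded?
cat /proc/driver/nvidia/version  # driver version as seen by the kernel (fails if the module is not loaded)
dpkg -l | grep -i nvidia-driver  # which NVIDIA driver packages are installed
```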
Bug Report
log files
Description of issue:
When scanning in the new GPU docker variant, I get the following errors:
Also attached.
message.txt
How can we reproduce it:
Have Docker with an NVIDIA GPU (in my case a 1050); it does not get recognized in the server stats.
Please provide additional information: