
Enable the proprietary NVIDIA driver #116

Open
tfmoraes opened this issue Apr 16, 2019 · 23 comments · May be fixed by #1407
Labels: 1. Bug (Something isn't working), 5. Help Wanted (Extra attention is needed)

Comments

@tfmoraes

First, great project!

If I'm using the NVIDIA proprietary driver, OpenGL software (like Blender) doesn't work inside the toolbox container. I tried installing the proprietary driver inside the container; it installs, but the OpenGL software still doesn't work. Is it necessary to install anything else, or to set some environment variable?

Thanks!

@Findarato

Findarato commented May 6, 2019

Toolbox is a container; you would have to map your graphics card inside, or do things the way nvidia-docker does.

The reply further down #116 (comment) works perfectly.

@tfmoraes
Author

tfmoraes commented May 6, 2019

@Findarato You mean add something like --volume /dev/nvidia0:/dev/nvidia0 and other /dev files?
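For illustration, this is roughly what that would look like when creating a container by hand (a sketch only; the exact set of /dev/nvidia* nodes and the image name are just examples):

#!/bin/sh
# Sketch: expose the NVIDIA device nodes to a manually created container.
# The list of /dev/nvidia* files depends on the driver and the number of GPUs.
podman create --name nvidia-test \
  --volume /dev/nvidia0:/dev/nvidia0 \
  --volume /dev/nvidiactl:/dev/nvidiactl \
  --volume /dev/nvidia-modeset:/dev/nvidia-modeset \
  registry.fedoraproject.org/fedora-toolbox:35 sleep infinity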

@tpopela
Collaborator

tpopela commented May 21, 2019

So to have the NVIDIA stuff working inside the Toolbox I had to do this (inspired by https://github.com/thewtex/docker-opengl-nvidia):

  1. You have to patch the Toolbox to bind mount /dev/nvidia0 and /dev/nvidiactl into the Toolbox and set up the X11 things - see tpopela@40231e8

  2. Download the NVIDIA proprietary drivers on the host:

#!/bin/sh

# Get your current host nvidia driver version, e.g. 340.24
nvidia_version=$(cat /proc/driver/nvidia/version | head -n 1 | awk '{ print $8 }')

# We must use the same driver in the image as on the host
if test ! -f nvidia-driver.run; then
  nvidia_driver_uri=http://us.download.nvidia.com/XFree86/Linux-x86_64/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run
  wget -O ~/nvidia-driver.run $nvidia_driver_uri
fi
  3. Install the drivers while inside the Toolbox:
#!/bin/sh

sudo dnf install -y glx-utils kmod libglvnd-devel || exit 1
sudo sh ~/nvidia-driver.run -a -N --ui=none --no-kernel-module || exit 1
glxinfo | grep "OpenGL version"

@tfmoraes
Author

@tpopela it worked. Thanks!

@tpopela
Collaborator

tpopela commented May 23, 2019

I'm glad it worked! But there was a mistake that could lead to malfunctions after the host is restarted - you will need to apply tpopela@3db450a on top of the previous patch.

debarshiray referenced this issue in zerotri/toolbox May 23, 2019
Things like the proprietary NVIDIA driver need access to devices
directly inside the /dev directory (e.g., /dev/nvidia0 and
/dev/nvidiactl), and since such devices can come and go at runtime they
cannot be bind mounted individually. Instead, the entire directory
needs to be made available.

https://github.com/debarshiray/toolbox/issues/116
@debarshiray
Member

debarshiray commented May 23, 2019

@tpopela We might be able to get away without bind mounting /tmp/.X11-unix. These days the X.org server listens on an abstract UNIX socket and a UNIX socket on the file system. The former doesn't work if you have a network namespace, but the Toolbox container doesn't have one (because of podman create --net host), and that's why X applications work. The latter is located at /tmp/.X11-unix and is used by Flatpak containers because those have network namespaces.
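For illustration (not part of the original comment), one way to see both sockets on a typical host:

#!/bin/sh
# List the X11 UNIX sockets: the abstract one shows up with a leading '@',
# while the filesystem one lives under /tmp/.X11-unix.
ss -xl | grep -i x11
ls -l /tmp/.X11-unix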


@tpopela
Collaborator

tpopela commented May 24, 2019

Ah OK @debarshiray! Thank you for the clarification. I can confirm that not bind mounting /tmp/.X11-unix doesn't change anything and the integration still works (I tried running Blender here).

There is maybe one small change now that we are bind mounting the whole /dev: Blender now looks for nvcc (the CUDA compiler) in PATH and can't find it.

@tfmoraes
Author

With the merge of https://github.com/debarshiray/toolbox/pull/119 this issue may be closed, since NVIDIA's proprietary driver is working now. It's just necessary to install the NVIDIA driver once inside the toolbox container; @tpopela's scripts help with the driver installation. @tpopela, you also have to install the CUDA Toolkit. To make it install, I passed the parameters --override and --toolkit. After installing the CUDA Toolkit, Blender shows me the option to render using CUDA. But unfortunately CUDA doesn't work with GCC 9 :(
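For reference, a sketch of that CUDA Toolkit installation step (the runfile name is illustrative; use whichever installer was downloaded from NVIDIA):

#!/bin/sh
# Install only the CUDA toolkit (no driver) inside the toolbox container,
# overriding the installer's compiler/OS checks.
sudo sh ~/cuda-toolkit.run --override --toolkit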

@tpopela
Collaborator

tpopela commented May 27, 2019

Actually I would leave this open (but I will leave that up to Rishi), as we were thinking with @debarshiray about leaking the NVIDIA host drivers to the container, so there will be no need to manually install the drivers in the container. We have a working WIP solution for it.

@tfmoraes
Author

That would be great!

@debarshiray
Member

debarshiray commented Jun 6, 2019

we were thinking with @debarshiray about leaking the NVIDIA host drivers to the
container, so there will be no need to manually install the drivers in the container.

Yes, I agree that this will be the right thing to do. OpenGL drivers have a kernel module and some user-space components (e.g., shared libraries) that talk to each other. In NVIDIA's case the interface between these two components isn't stable, and hence the user-space bits inside the container must match the kernel module on the host. The two can go out of sync if your host is lagging behind the container or vice versa.

The problem with leaking the files into the container is maintaining a list of those files somewhere because they vary from version to version. This would be vastly simpler if there was a well known nvidia directory somewhere on the host that could be bind mounted because then we wouldn't have to worry about the names and locations of the individual files themselves. Unfortunately that's not the case.

Looking around, I found Flatpak's solution to be a reasonable compromise. In short, it invents and enforces this well known nvidia directory. It expects distributors of the host OS to put all the user-space files in /var/lib/flatpak/extension/org.freedesktop.Platform.GL.host/x86_64/1.4 and that's implemented by modifying the package shipping the NVIDIA driver.

With that done, we'd need to figure out where to place these files inside the container and how to point the container's runtime environment at them.
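To make the idea concrete, a rough sketch (only the Flatpak extension path on the host is the well-known directory mentioned above; the in-container mount point and the ld.so.conf entry are made-up names):

#!/bin/sh
# On the host, when creating the container:
podman create --name gl-test \
  --volume /var/lib/flatpak/extension/org.freedesktop.Platform.GL.host/x86_64/1.4:/usr/lib64/host-nvidia:ro \
  registry.fedoraproject.org/fedora-toolbox:35 sleep infinity

# Later, inside the container, point the dynamic linker at those libraries:
#   echo /usr/lib64/host-nvidia | sudo tee /etc/ld.so.conf.d/host-nvidia.conf
#   sudo ldconfig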

@debarshiray reopened this Jun 6, 2019
@garyedwards

NVIDIA have their own solution for this, nvidia-container-runtime-hook, which works very well with podman, triggered by an OCI prestart hook. I just run into an issue at the moment when using --uidmap, resulting in losing permission to run ldconfig:

could not start /sbin/ldconfig: mount operation failed: /proc: operation not permitted

It may be better for toolbox to try and integrate with this existing tool rather than maintaining another implementation.
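For anyone trying this, the setup is roughly the following (a sketch; the hook directory is the one the toolkit conventionally installs into, and may differ by distribution):

#!/bin/sh
# With nvidia-container-toolkit installed, its OCI prestart hook lets podman
# expose the GPU based on these environment variables.
podman run --rm \
  --hooks-dir /usr/share/containers/oci/hooks.d \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  registry.fedoraproject.org/fedora-toolbox:35 nvidia-smi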

@garyedwards

Issue relating to the uidmap permission problem:

NVIDIA/libnvidia-container#49

@andreldmonteiro

andreldmonteiro commented Nov 28, 2019

I was trying to run Steam in the toolbox (bug #343). I didn't patch the toolbox; Steam runs and OpenGL works, but Vulkan doesn't seem to work. I tried vkmark and Rise of the Tomb Raider on Steam.

Any ideas how to get it to work?

@HarryMichal added this to Needs triage in Priority Board Jul 28, 2020
@tfmoraes
Author

tfmoraes commented Aug 1, 2020

I saw that Singularity containers fix this problem without libnvidia-container. They use a list of needed files.

@HarryMichal added the 1. Bug (Something isn't working) and 5. Help Wanted (Extra attention is needed) labels Sep 10, 2020
@HarryMichal moved this from Needs triage to Low priority in Priority Board Sep 10, 2020
@Ayush1325

So what is the status of using NVIDIA GPU drivers in a container in 2021?
I can see that /dev/nvidia0 and /dev/nvidiactl are mounted.
However, I cannot install the NVIDIA drivers successfully. The install proceeds normally, but checking with modinfo -F version nvidia gives an error:
modinfo: ERROR: Module alias nvidia not found.
And the NVIDIA Container Toolkit is not officially supported on Fedora, so it doesn't seem like a good idea to use it with Fedora Silverblue.

@loganmc10

The latest version of toolbox (0.0.99.3) exposes the host filesystem at /run/host. I believe it should be possible to create a Containerfile something like this:

FROM registry.fedoraproject.org/fedora-toolbox:35

RUN ln -s /run/host/usr/share/vulkan/icd.d/nvidia_icd.json /usr/share/vulkan/icd.d/nvidia_icd.json && \
    ln -s /run/host/usr/lib64/libGLX_nvidia.so.0 /usr/lib64/libGLX_nvidia.so.0

This would expose the host userspace driver to the container. I don't have an NVIDIA machine to test on at the moment, but I assume that would do it? The above example should hopefully work for Vulkan; I'm not exactly sure whether some extra file would need to be linked for OpenGL.
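In case it helps, a purely speculative sketch of what the extra OpenGL/EGL pieces might look like, run inside the container (libGLX_nvidia.so.0 above already covers GLX; EGL would presumably need its vendor library and ICD file, and the versioned libnvidia-* libraries they depend on may need the same treatment):

#!/bin/sh
# Speculative: link the NVIDIA EGL vendor library and its ICD file from the
# host, mirroring the Vulkan example above.
sudo mkdir -p /usr/share/glvnd/egl_vendor.d
sudo ln -s /run/host/usr/lib64/libEGL_nvidia.so.0 /usr/lib64/libEGL_nvidia.so.0
sudo ln -s /run/host/usr/share/glvnd/egl_vendor.d/10_nvidia.json \
    /usr/share/glvnd/egl_vendor.d/10_nvidia.json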

@Ayush1325

OK, so with the latest toolbox, I can install the NVIDIA drivers fine. On running nvidia-smi I get the correct output as well. However, the modinfo -F version nvidia command doesn't seem to work, so I am not sure whether the drivers are actually working.

@whs-dot-hk

So do you mean that reinstalling the NVIDIA driver inside the container is to fix ldconfig? I remember there is a step to rerun ldconfig.

Reference: https://docs.01.org/clearlinux/latest/zh_CN/tutorials/nvidia.html#configure-alternative-software-paths
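The ldconfig step referenced there amounts to refreshing the linker cache inside the container once the driver libraries are in place, e.g.:

#!/bin/sh
sudo ldconfig
ldconfig -p | grep -i nvidia   # check that the NVIDIA libraries are now visible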

@Findarato

So to have the NVIDIA stuff working inside the Toolbox I had to do this (inspired by https://github.com/thewtex/docker-opengl-nvidia):

1. You have to patch the Toolbox to bind mount the /dev/nvidia0 and /dev/nvidiactl to the Toolbox and setup the X11 things - see [tpopela@40231e8](https://github.com/tpopela/toolbox/commit/40231e8591d70065199c0df9b6811c2f9e9d7269)

2. Download the NVIDIA proprietary drivers on the host:
#!/bin/sh

# Get your current host nvidia driver version, e.g. 340.24
nvidia_version=$(cat /proc/driver/nvidia/version | head -n 1 | awk '{ print $8 }')

# We must use the same driver in the image as on the host
if test ! -f nvidia-driver.run; then
  nvidia_driver_uri=http://us.download.nvidia.com/XFree86/Linux-x86_64/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run
  wget -O ~/nvidia-driver.run $nvidia_driver_uri
fi
3. Install the drivers while being inside the Toolbox:
#!/bin/sh

sudo dnf install -y glx-utils kmod libglvnd-devel || exit 1
sudo sh ~/nvidia-driver.run -a -N --ui=none --no-kernel-module || exit 1
glxinfo | grep "OpenGL version"

Just adding this worked for me too. I hope that with the OSS version of their driver it will just work out of the box, like all the AMD cards do.

@debarshiray changed the title from "Nvidia proprietary driver" to "Enable the proprietary NVIDIA driver" Sep 10, 2022
@3dsf

3dsf commented Nov 27, 2022

OK, so with the latest toolbox, I can install the NVIDIA drivers fine. On running nvidia-smi I get the correct output as well. However, the modinfo -F version nvidia command doesn't seem to work, so I am not sure whether the drivers are actually working.

@Ayush1325
Yes, the drivers are working, as I can compile with nvcc.
And yes, modinfo -F version nvidia does not work within the container.

I used the NVIDIA Fedora 35 repo (nvidia-driver and cuda) for both the host (F37) and the container (F35, for a matching gcc version). Beyond that, I added the NVIDIA bin folder to PATH and set LD_LIBRARY_PATH for each install.
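Concretely, that amounts to something like the following (the /usr/local/cuda prefix is an assumption; use whatever prefix the repo packages actually installed to):

#!/bin/sh
# Add the CUDA toolchain to the environment inside the container.
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
nvcc --version   # quick sanity check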

@mjlbach
Contributor

mjlbach commented Mar 25, 2023

What needs to be done for this?

  1. If you don't mind requiring users to install nvidia-container-toolkit:
 podman run --rm -it --privileged --security-opt=label=disable -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu
  2. If you want something entirely independent, you can mount the relevant NVIDIA driver files into the container in a manner similar to:
  3. I don't think installing the NVIDIA driver inside the container is a sustainable solution, because the host and container drivers have to match.

I personally feel option 1 is more sustainable, but it's pretty simple (two appended environment variables and a host executable check for nvidia-container-toolkit). Would a PR for one of these options be accepted, @debarshiray, or should this just be documented?

@debarshiray
Member

What needs to be done for this?

[...]

would a PR for one of these options be accepted @debarshiray or should this be documented?

Did you see my comment above? Unless there's a problem with it, I still prefer the unmanaged Flatpak extension option.

I finally got myself some NVIDIA hardware to play with this.

I see that the Container Device Interface requires installing the NVIDIA Container Toolkit.

As far as I can make out, the nvidia-container-toolkit or nvidia-container-toolkit-base packages are only available from NVIDIA's own repositories right now. For example, I am on Fedora 39, and even though they are supposed to be free software, I see them neither in Fedora proper nor RPMFusion, but RPMFusion does have NVIDIA's proprietary driver.

Is there anything else other than NVIDIA that uses the Container Device Interface?

I would like to understand the situation a bit better. Ultimately I want to make it as smooth as possible for the user to enable the NVIDIA proprietary driver. That becomes a problem if one needs to enable multiple different unofficial repositories, at least on Fedora.

I will start by reviving the pull request from @TingPing against negativo17's RPM for the proprietary NVIDIA driver, but against RPMFusion, because that's the implementation Fedora Workstation promotes these days. If nothing else, it will immediately help Flatpak because those containers will always have access to the driver. We can add the same plumbing to Toolbx and benefit similarly.
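For context, the CDI flow that the NVIDIA Container Toolkit enables looks roughly like this (commands as documented by NVIDIA; the spec path is the conventional default):

#!/bin/sh
# Generate a CDI specification for the installed driver, then ask podman to
# expose all GPUs described by it.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all \
  registry.fedoraproject.org/fedora-toolbox:39 nvidia-smi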
