
Missing EGL ICD when using --nv #2210

Open
yghorbal opened this issue May 6, 2024 · 2 comments


yghorbal commented May 6, 2024

Version of Apptainer

apptainer version 1.2.5

Expected behavior

When running with --nv, Apptainer needs to map the EGL ICD file /usr/share/glvnd/egl_vendor.d/10_nvidia.json, just as the NVIDIA Container Toolkit does per NVIDIA/nvidia-docker#1520 (comment).
Otherwise, EGL initialisation fails (as seen with eglinfo).

Actual behavior

/usr/share/glvnd/egl_vendor.d/10_nvidia.json is not mapped, and eglinfo fails.

Steps to reproduce this behavior

  • Install Apptainer on a host with an NVIDIA GPU and the proprietary drivers.
  • Create a SIF file: apptainer build --fakeroot eglinfo.sif eglinfo.recipe
$ cat eglinfo.recipe
Bootstrap: docker
From: ubuntu:22.04
%post
    apt-get update -y
    apt-get install -y mesa-utils
    apt-get clean all
%runscript
    exec "${1+"$@"}"
  • Run apptainer run --nv eglinfo.sif eglinfo (without the mapping):
[...]
Device platform:
eglinfo: eglInitialize failed
[...]
  • Run apptainer run --nv -B /usr/share/glvnd/egl_vendor.d/10_nvidia.json eglinfo.sif eglinfo (manually binding the ICD JSON):
[...]
Device platform:
EGL API version: 1.5
EGL vendor string: NVIDIA
EGL version string: 1.5
EGL client APIs: OpenGL_ES OpenGL
EGL extensions string:
[...]

What OS/distro are you running

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

How did you install Apptainer

from source

DrDaveD added this to the 1.3.2 milestone May 7, 2024

GodloveD commented May 8, 2024

Looking at this and trying to gather more information.

First, I will note that 10_nvidia.json does seem to be part of the driver: if I extract a .run file and look at the contents, it's there. So it does seem like a good candidate to add into the container automatically.

However, we don't currently have a method for adding arbitrary files; we can only add libraries and binaries. To add libraries, we search /etc/ld.so.cache for the appropriate library locations. To add binaries, I believe we just search the $PATH. I don't think that we should assume the location of the driver installation, so the question becomes: how do we locate (in a performant way) the 10_nvidia.json file on a particular system?
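For illustration, the existing library lookup can be approximated in shell. This is a hypothetical helper, not Apptainer's actual code: it parses `ldconfig -p`-style cache listings, which is roughly the query --nv relies on for NVIDIA libraries.

```shell
# Hypothetical helper (not Apptainer's implementation): print resolved
# paths for a library name prefix from `ldconfig -p`-style output,
# i.e. lines like "libEGL_nvidia.so.0 (libc6,x86-64) => /usr/lib64/libEGL_nvidia.so.0"
lib_paths() {
    # $1 = library name prefix; reads cache listing on stdin
    awk -v lib="$1" 'index($1, lib) == 1 { print $NF }'
}

# On a real host with the proprietary driver, this would list the
# cached NVIDIA EGL vendor library paths:
ldconfig -p 2>/dev/null | lib_paths libEGL_nvidia
```

Note that nothing comparable exists for 10_nvidia.json, since ld.so.cache only indexes shared libraries; that is exactly the gap described above.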

I'm tempted to suggest that this is rarely needed (I think this is the first such request I'm aware of) and that it should therefore be something the user binds if they need it. Perhaps fix with documentation? Unsure.

Any idea how the NVIDIA container toolkit tackles this issue?

yghorbal (Author) commented
Hi,

Sorry for the long delay! We install drivers through the RPM packages provided by NVIDIA (not the .run installer).
The file 10_nvidia.json is shipped in the nvidia-driver-libs package (a dependency of the nvidia-driver package).
The NVIDIA Container Toolkit has a concept of an aggregated set of config paths; these locations are searched for the list of files to be mounted (10_nvidia.json is not the only one). See https://github.com/NVIDIA/nvidia-container-toolkit/blob/f13f1bdba4ae34506f301e53e01644791771c4d4/internal/discover/graphics.go#L66
The config locations searched can be user-supplied and default to /etc, /usr/local/share, and /usr/share, per https://github.com/NVIDIA/nvidia-container-toolkit/blob/f13f1bdba4ae34506f301e53e01644791771c4d4/internal/lookup/root/root.go#L59
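The search described above can be sketched in shell. The function name and structure here are illustrative only, not the toolkit's actual code; the default roots and the egl_vendor.d path come from the links above and this issue.

```shell
# Illustrative sketch: probe each config root (toolkit defaults:
# /etc, /usr/local/share, /usr/share) for the NVIDIA EGL ICD and
# print the first match.
find_egl_icd() {
    for root in "$@"; do
        candidate="$root/glvnd/egl_vendor.d/10_nvidia.json"
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

find_egl_icd /etc /usr/local/share /usr/share || echo "no EGL ICD found"
```

On the RHEL 8 host above, this would find /usr/share/glvnd/egl_vendor.d/10_nvidia.json installed by nvidia-driver-libs; a similar probe in Apptainer could locate the file without assuming a single driver installation layout.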
