
NVIDIA Jetson AGX, singularity exec --nv Could not find any nv files on this host! #2805

Open
vlk-jan opened this issue Apr 5, 2024 · 3 comments
Labels
bug Something isn't working

Comments

vlk-jan commented Apr 5, 2024

Version of Singularity

$ singularity --version
singularity-ce version 3.8.0

Describe the bug
When running the Singularity image on an NVIDIA Jetson AGX, Singularity cannot find any nv files.

To Reproduce
Steps to reproduce the behavior:
We use the Singularity image from https://github.com/vras-robotour/deploy on an NVIDIA Jetson.
Running the following command in the deploy directory produces the output below:

$ ./scripts/start_singularity.sh --nv

=========== STARTING SINGULARITY CONTAINER ============

INFO: Singularity is already installed.
INFO: Updating repository to the latest version.
Already up to date.
INFO: Mounting /snap directory.
INFO: Starting Singularity container from image robotour_arm64.simg.
INFO: Could not find any nv files on this host!
INFO: The catkin workspace is already initialized.

================== UPDATING PACKAGES ==================

INFO: Updating the package naex to the latest version.
Already up to date.
INFO: Updating the package robotour to the latest version.
Already up to date.
INFO: Updating the package map_data to the latest version.
Already up to date.
INFO: Updating the package test_package to the latest version.
Already up to date.

=======================================================

INFO: Starting interactive bash while sourcing the workspace.

Expected behavior
Expected behavior is one where the nv files are found and we would be able to use PyTorch with CUDA.
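
As a minimal sketch of the check we expect to pass (assuming python3 and a CUDA-enabled PyTorch wheel are installed in the image; image name as in the log above):

$ singularity exec --nv robotour_arm64.simg python3 -c "import torch; print(torch.cuda.is_available())"

With the nv files found and bound, this should print True.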

OS / Linux Distribution

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Installation Method
Installed using the steps detailed here: https://docs.sylabs.io/guides/3.8/admin-guide/installation.html.

Additional context
We have nvidia-container-cli installed:

$ nvidia-container-cli --version
version: 0.9.0+beta1
build date: 2019-06-24T22:00+00:00
build revision: 77c1cbc2f6595c59beda3699ebb9d49a0a8af426
build compiler: aarch64-linux-gnu-gcc-7 7.4.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -g3 -D JETSON=TRUE -DNDEBUG -std=gnu11 -O0 -g3 -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ nvidia-container-cli list --binaries --libraries
/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.440.18
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-eglcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glcore.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-tls.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libnvidia-glsi.so.32.5.1
/usr/lib/aarch64-linux-gnu/tegra/libGLX_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv2_nvidia.so.2
/usr/lib/aarch64-linux-gnu/tegra-egl/libGLESv1_CM_nvidia.so.1
$ nvidia-container-cli list --ipcs

The strace output of ./scripts/start_singularity.sh is available here.

@vlk-jan vlk-jan added the bug Something isn't working label Apr 5, 2024

tri-adam commented Apr 5, 2024

Hi @vlk-jan, thanks for the report. On the surface of it, this looks similar to #1850. As noted there, the NVIDIA Container CLI is no longer used on Tegra-based systems. There is some hope that the new --oci mode introduced in Singularity 4.x might help with this, but it has not been confirmed. If you're able to give that a go and report back, it would be appreciated. Thanks!
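
If you are able to try it, a rough sketch of such a test could look like the following (assuming Singularity 4.x with the --oci flag and a python3 + PyTorch install inside the image; untested on Tegra):

$ singularity exec --oci --nv robotour_arm64.simg python3 -c "import torch; print(torch.cuda.is_available())"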


vlk-jan commented Apr 5, 2024

Hi, thanks for your swift reply.

I do have some updates.

Similarity to previous issue
I do agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some libraries are exported, since we are using quite an old version of nvidia-container-cli, which is why I opened a new issue instead of commenting on the old one.
When trying to reproduce the problem on a Jetson Orin, as opposed to the Jetson Xavier where it was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Odd behavior in binding nv libraries
After some more digging, I found that although the script prints Could not find any nv files on this host!, all of the libraries listed by nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd.
The log from the execution with the -v and -d flags is available here. Line 17 shows the could-not-find message, and lines 136-149 show that the libraries are added and later mounted.
PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and were unable to fix that.
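
To double-check what actually ends up inside the container, a quick sketch of the commands that can be used (nothing here beyond what is already reported above; torch.version.cuda is None for CPU-only wheels):

$ singularity exec --nv robotour_arm64.simg ls /.singularity.d/libs/
$ singularity exec --nv robotour_arm64.simg python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"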

Singularity 4.1
I tried installing Singularity 4.1 on the Jetson but was unsuccessful. The problem seems to be with libfuse-dev: on Ubuntu 18.04 only libfuse2 is available, and a manual installation of libfuse3 failed for some reason.
I may try that again later, but because of that I do not yet have any feedback on the --oci mode for you.
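
For reference, a rough, untested sketch of building libfuse3 from source on Bionic (assuming meson and ninja are installed, e.g. via pip3):

$ git clone https://github.com/libfuse/libfuse.git
$ cd libfuse
$ meson setup build
$ ninja -C build
$ sudo ninja -C build install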


tri-adam commented Apr 9, 2024

Similarity to previous issue I do agree that it seems similar to #1850; however, I believe the problem there was that no libraries were exported at all. In our case some libraries are exported, since we are using quite an old version of nvidia-container-cli, which is why I opened a new issue instead of commenting on the old one. When trying to reproduce the problem on a Jetson Orin, as opposed to the Jetson Xavier where it was first encountered, we also saw that no libraries were provided (with a fresh install of the nvidia-container package).

Ah, that makes sense. It looks like this was deprecated in v1.10.0 of the NVIDIA Container Toolkit (NVIDIA/nvidia-container-toolkit#90 (comment)), so as you say, that wouldn't be what you're hitting.

Odd behavior in binding nv libraries After some more digging, I found that although the script prints Could not find any nv files on this host!, all of the libraries listed by nvidia-container-cli list --binaries --libraries are bound into the /.singularity.d/libs/ directory, which seems odd. The log from the execution with the -v and -d flags is available here. Line 17 shows the could-not-find message, and lines 136-149 show that the libraries are added and later mounted. PyTorch inside the container still does not support CUDA, but that is probably a problem on our side, as we were using the wrong wheel and were unable to fix that.

Taking a quick scan through the code of that version of Singularity, it looks like that message is only emitted when no binaries or ipcs are found:

files := make([]string, len(bins)+len(ipcs))
if len(files) == 0 {
	sylog.Infof("Could not find any %s files on this host!", gpuPlatform)
} else {

The libraries are handled separately:

if len(libs) == 0 {
	sylog.Warningf("Could not find any %s libraries on this host!", gpuPlatform)
	sylog.Warningf("You may need to manually edit %s", gpuConfFile)
} else {
	engineConfig.SetLibrariesPath(libs)
}

So that looks like it's functioning as expected based on the output you shared from nvidia-container-cli list --binaries --libraries.
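
To confirm that this is the branch being hit, the two lists that feed that message can be queried on their own; based on the output you shared, both should come back empty:

$ nvidia-container-cli list --binaries
$ nvidia-container-cli list --ipcs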
