Default Nvidia CDI spec location on rootless kit seems to be unaccessible #47676

Open
LukasIAO opened this issue Apr 4, 2024 · 1 comment
Labels: area/rootless, kind/bug, status/0-triage

Comments

LukasIAO commented Apr 4, 2024

Description

I originally opened an issue on the nvidia-container-toolkit repo, but we figured it may be better placed here.

Original issue: NVIDIA/nvidia-container-toolkit#434 @elezar

The Issue
When testing rootless Docker 26.0.0 with the NVIDIA Container Toolkit and CDI support, CDI injection fails, presumably because Docker cannot find nvidia.yaml.

docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.

By default, Docker looks in the following locations:

 CDI spec directories:
  /etc/cdi
  /var/run/cdi

Unlike the rootful version, however, rootless Docker is unable to access the specs there.
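
To narrow down whether the default directories are missing, unreadable, or simply empty, a small check like the following may help. It is only a sketch: the function name check_cdi_dir is made up for this example, and it runs as the invoking user on the host, so it does not necessarily reflect what the rootless daemon sees inside its own namespaces.

```bash
#!/usr/bin/env bash
# Report whether a CDI spec directory exists, is readable by the current
# user, and how many spec files (*.yaml, *.json) it contains.
# check_cdi_dir is a hypothetical helper, not part of Docker or nvidia-ctk.
check_cdi_dir() {
  local dir="$1" f count=0
  if [ ! -d "$dir" ]; then
    echo "$dir: missing"
  elif [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
    echo "$dir: not readable"
  else
    for f in "$dir"/*.yaml "$dir"/*.json; do
      # An unmatched glob stays literal and fails the -f test.
      if [ -f "$f" ] && [ -r "$f" ]; then
        count=$((count + 1))
      fi
    done
    echo "$dir: readable, $count spec file(s)"
  fi
}

# The two default locations listed in this report.
for dir in /etc/cdi /var/run/cdi; do
  check_cdi_dir "$dir"
done
```

If both directories show up as readable with at least one spec file on the host while injection still fails, that would point at the daemon's view of the filesystem rather than at plain file permissions.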

We tested this by moving the specs to another directory and specifying the new location in the Docker daemon.json:

{
    "features": {
        "cdi": true
    },
    "cdi-spec-dirs": ["/home/username/.docker/cdi/", "/home/username/.docker/run/cdi/"],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

docker info then reports:

CDI spec directories:
  /home/username/.docker/cdi/
  /home/username/.docker/run/cdi/

This seems to have solved the issue.
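
For anyone wanting to try the same workaround, the relocation step could be scripted roughly as below. This is a sketch under assumptions: copy_cdi_specs is a hypothetical helper, it copies rather than moves the files, and the destination in the example comment simply mirrors the layout used above.

```bash
#!/usr/bin/env bash
# Copy readable CDI spec files (*.yaml, *.json) from a source directory
# into a per-user directory that the rootless daemon can be pointed at.
# copy_cdi_specs is a hypothetical helper name used only for this sketch.
copy_cdi_specs() {
  local src="$1" dest="$2" f copied=0
  mkdir -p "$dest" || return 1
  for f in "$src"/*.yaml "$src"/*.json; do
    # An unmatched glob stays literal and fails the -f test.
    if [ -f "$f" ] && [ -r "$f" ]; then
      cp "$f" "$dest/" && copied=$((copied + 1))
    fi
  done
  echo "copied $copied spec file(s) to $dest"
}

# Example (not executed here):
#   copy_cdi_specs /etc/cdi "$HOME/.docker/cdi"
```

The destination still has to be listed under "cdi-spec-dirs" in daemon.json as an absolute path, since JSON values are not shell-expanded. A copy also will not track later changes to the original spec, so it would need to be refreshed whenever the spec is regenerated.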

Reproduce

  1. Install rootless Docker 26.0.0 via the install script.
  2. Install the NVIDIA Container Toolkit according to the documentation.
  3. Run nvidia-ctk runtime configure --runtime=docker --cdi.enabled --config=$HOME/.config/docker/daemon.json to enable CDI mode for rootless Docker.
  4. Check the CDI spec directories reported by docker info.
  5. Run a container with native CDI injection: docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
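
For step 4, the relevant lines can be pulled out of the docker info output with a small filter. The sketch below assumes the text layout shown in this report (a "CDI spec directories:" line followed by indented absolute paths); cdi_dirs_from_info is a made-up name, and the sample input is an excerpt in that layout rather than live output.

```bash
#!/usr/bin/env bash
# Print the directories listed under "CDI spec directories:" in the
# human-readable `docker info` output read from stdin.
# cdi_dirs_from_info is a hypothetical helper; it relies on the indented
# text layout shown in this report, not on a stable machine format.
cdi_dirs_from_info() {
  awk '
    /^ *CDI spec directories:/ { in_block = 1; next }
    in_block && /^ +\// { sub(/^ +/, ""); print; next }
    in_block { in_block = 0 }
  '
}

# Sample input in the same layout as the report.
cdi_dirs_from_info <<'EOF'
 Plugins:
  Volume: local
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
EOF
```

Against a running daemon this would be used as docker info | cdi_dirs_from_info.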

Expected behavior

We expected rootless Docker either to perform native CDI injection by reading nvidia.yaml from its default location, or to give an indication that the default location is inaccessible in rootless mode:

/.config/docker$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)

docker version

Client:
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:16:45 2024
 OS/Arch:           linux/amd64
 Context:           rootless

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:18:14 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.13
  GitCommit:        7c3aca7a610df76212171d200ca3811ff6096eb8
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 rootlesskit:
  Version:          2.0.2
  ApiVersion:       1.1.1
  NetworkDriver:    vpnkit
  PortDriver:       builtin
  StateDir:         /run/user/1010/dockerd-rootless
 vpnkit:
  Version:          7f0eff0dd99b576c5474de53b4454a157c642834

docker info

Client:
 Version:    26.0.0
 Context:    rootless
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.5.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 3
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /home/ver23371/.docker/cdi/
  /home/ver23371/.docker/run/cdi/
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  rootless
  cgroupns
 Kernel Version: 5.15.0-1047-nvidia
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: DGX-Station-A100-920-23487-2530-0R0
 ID: 48ae789a-3d2d-43d8-841a-9a34c9bdc46e
 Docker Root Dir: /home/ver23371/.local/share/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support

Additional Info

No response

@LukasIAO LukasIAO added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/0-triage labels Apr 4, 2024
elezar (Contributor) commented Apr 4, 2024

/cc

@AkihiroSuda AkihiroSuda added the area/rootless Rootless mode label Apr 6, 2024