GPU: Error response from daemon: invalid volume specification #1564

Open
mzernovx opened this issue Oct 12, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@mzernovx

mzernovx commented Oct 12, 2023

Environment:

  • kubernetes 1.27.3
  • docker v20.10.20

Steps to reproduce:

  • Set up Intel Device Plugins
  • Create any pod with gpu.intel.com/i915 resource allocated
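
For readers who want to reproduce this, a minimal sketch of such a pod follows. It only builds a manifest that requests the `gpu.intel.com/i915` resource named in the step above; the pod name, image, and `sleep` command are hypothetical placeholders, not taken from the report.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests Intel GPU resources.

    `name` and `image` are illustrative placeholders (assumptions).
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["sleep", "3600"],
                    # Extended resources such as gpu.intel.com/i915 are
                    # requested through the container's resource limits.
                    "resources": {"limits": {"gpu.intel.com/i915": str(gpus)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(gpu_pod_manifest("gpu-test", "busybox"), indent=2))
```

Saved under a hypothetical name such as `make_pod.py`, the output could be piped to `kubectl apply -f -`.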

Expected behaviour: pod running

Actual behaviour:
pod in CreateContainerError state
Warning Failed 2m49s (x12 over 5m3s) kubelet Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:b7:00.0-card:/dev/dri/by-path/pci-0000:b7:00.0-card:ro'

Likely caused by this commit: 943e34f

@tkatila tkatila added the bug Something isn't working label Oct 12, 2023
@tkatila
Contributor

tkatila commented Oct 12, 2023

Thanks for reporting this. Did you verify that it's only on docker runtime?

@tkatila
Contributor

tkatila commented Oct 12, 2023

The change that is causing this was introduced in version 0.26.1. You can work around it by using 0.26.0 in the meantime.

@mythi
Contributor

mythi commented Oct 12, 2023

I remember we have had similar cases with volume mounts where the paths contained colons and docker was used. Is docker mandatory here, or could a proper CRI runtime be used?

@mzernovx
Author

@tkatila I can confirm that with containerd it's working fine.

@mzernovx
Author

I remember we have had similar cases with volume mounts where the paths contained colons and docker was used. Is docker mandatory here, or could a proper CRI runtime be used?

BMRA/VMRA uses docker as a default container runtime.

@eero-t
Contributor

eero-t commented Oct 12, 2023

docker v20.10.20

That's a bit old. Oldest Docker version listed e.g. in Ubuntu packages site is v20.10.21, and Ubuntu 20.04 LTS updates are already at 24.0.5: https://packages.ubuntu.com/focal-updates/docker.io

Have you tried any newer Docker version?

kubernetes 1.27.3
...
BMRA/VMRA uses docker as a default container runtime.

They could consider updating that default, as Kubernetes deprecated its built-in Docker support (dockershim) in v1.20 and removed it in v1.24: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/

@tkatila
Contributor

tkatila commented Oct 13, 2023

Have you tried any newer Docker version?

I tried a newer version and it reproduces with it:

$ dpkg --list | grep Docker
ii  docker-buildx-plugin                             0.11.2-1~ubuntu.22.04~jammy                 amd64        Docker Buildx cli plugin.
ii  docker-ce                                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker: the open-source application container engine
ii  docker-ce-cli                                    5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Rootless support for Docker.
ii  docker-compose-plugin                            2.21.0-1~ubuntu.22.04~jammy                 amd64        Docker Compose (V2) plugin for the Docker CLI.

Pod fails with:

  Warning  Failed     8s (x2 over 9s)  kubelet            Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:00:02.0-card:/dev/dri/by-path/pci-0000:00:02.0-card:ro'

Docker Engine is listed among the container runtimes in the k8s docs (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker), which suggests it's still "ok" to use it.

But to me this is a bug with the docker engine as it works fine with containerd and cri-o. My thought process for this is:

  1. File a bug for the docker engine about it not being able to mount paths with :.
  2. In https://github.com/intel/container-experience-kits, stick with the 0.26.0 GPU plugin for docker installations.
  3. If/when the docker engine bug is resolved, update the GPU plugin to the latest version

I do not want to remove the "by-path" mounting as it's required by distributed training. And adding some cli arg or env variable to temporarily disable it feels icky.

@tkatila
Contributor

tkatila commented Oct 13, 2023

It seems that a colon in volumes/binds is a known issue:
docker/docker-py#2041
moby/moby#39293
moby/moby#22825
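
To illustrate why the colons matter, here is a deliberately simplified model of a `source:target[:mode]` bind spec that is split on colons. This is an assumption made for illustration only, not Docker's actual parser, and the function names are hypothetical.

```python
from typing import List


def split_bind_spec(spec: str) -> List[str]:
    """Naively split a 'source:target[:mode]' bind spec on colons."""
    return spec.split(":")


def looks_valid(spec: str) -> bool:
    """Under this simplified model, a spec has exactly 2 or 3 fields."""
    return len(split_bind_spec(spec)) in (2, 3)


# The spec from the error message in this issue.
bad = (
    "/dev/dri/by-path/pci-0000:b7:00.0-card:"
    "/dev/dri/by-path/pci-0000:b7:00.0-card:ro"
)
# A spec whose paths contain no colons.
good = "/dev/dri/card0:/dev/dri/card0:ro"

if __name__ == "__main__":
    print(len(split_bind_spec(bad)), looks_valid(bad))
    print(len(split_bind_spec(good)), looks_valid(good))
```

Each PCI address in the by-path name contributes two extra colons, so the spec no longer divides cleanly into source, target, and mode.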

@mzernovx
Author

Looks like there's a workaround to use --mount arg with Docker but there's no clear way to utilize this from the side of Kubernetes.
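
A minimal sketch of that `--mount` form, assuming Docker's documented `type=bind,source=...,target=...,readonly` key/value syntax. The helper name is hypothetical, and this covers only a direct `docker run` invocation, not anything the kubelet would do.

```python
from typing import List


def mount_arg(source: str, target: str, read_only: bool = True) -> List[str]:
    """Build a `docker run --mount` argument pair for a bind mount.

    Fields are comma-separated key=value pairs, so colons inside the
    paths are not field separators (commas in paths would still need care).
    """
    fields = ["type=bind", f"source={source}", f"target={target}"]
    if read_only:
        fields.append("readonly")
    return ["--mount", ",".join(fields)]


if __name__ == "__main__":
    path = "/dev/dri/by-path/pci-0000:b7:00.0-card"
    print(" ".join(mount_arg(path, path)))
```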

The most suitable fix for this bug seems to be avoiding using /dev/dri/by-path/xxx as they are basically symlinks to devices in /dev/dri
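
The symlink relationship can be sketched as follows. The example builds a throwaway directory that mimics the `/dev/dri` layout (the file names are assumptions) and resolves the by-path entry to its colon-free target. It only illustrates the idea; it is not the plugin's behaviour, and an earlier comment notes that the by-path mount is required for distributed training.

```python
import os
import tempfile


def resolve_device_path(path: str) -> str:
    """Follow symlinks and return the canonical path of a device node."""
    return os.path.realpath(path)


def demo() -> str:
    """Mimic /dev/dri in a temp dir and resolve a by-path symlink."""
    with tempfile.TemporaryDirectory() as root:
        card = os.path.join(root, "card0")
        open(card, "w").close()  # stand-in for the real device node

        by_path = os.path.join(root, "by-path")
        os.mkdir(by_path)
        link = os.path.join(by_path, "pci-0000:b7:00.0-card")
        os.symlink("../card0", link)  # relative link, as under /dev/dri

        resolved = resolve_device_path(link)
        assert resolved == os.path.realpath(card)
        return os.path.basename(resolved)


if __name__ == "__main__":
    print(demo())
```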

@mythi
Contributor

mythi commented Oct 13, 2023

The most suitable fix for this bug seems to be avoiding using /dev/dri/by-path/xxx

Is avoiding docker not an option?

@mzernovx
Author

mzernovx commented Oct 13, 2023

Is avoiding docker not an option?

@mythi BMRA/VMRA still uses docker as a "primary" container runtime. The product is built around customers and their needs, so avoiding Docker is not an option for us.

Downgrading Intel DP to 0.26.0 can be considered as a workaround, but not a fix.


4 participants