
CDI: Add mount-nvidia-binaries and mount-nvidia-docker-1-directories options #290979

Merged

Conversation

@ereslibre (Member) commented Feb 23, 2024

Add three options to hardware.nvidia-container-toolkit:

  • mounts: a list of mounts, allowing arbitrary host paths to be mounted in CDI-enabled containers.

  • mount-nvidia-binaries: this option allows users to avoid mounting NVIDIA binaries in the container.

  • mount-nvidia-docker-1-directories: this option allows users to avoid mounting /usr/local/nvidia/lib{,64} in containers.
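
For illustration, a configuration using these options might look like the sketch below; the enable flag and the mounts submodule field names (hostPath/containerPath) are assumptions for the sake of the example, not copied from the module source:

    { ... }:
    {
      hardware.nvidia-container-toolkit = {
        enable = true;                              # assumption: an enable flag exists alongside these options
        mount-nvidia-binaries = false;              # keep host NVIDIA binaries (nvidia-smi & co.) out of containers
        mount-nvidia-docker-1-directories = false;  # skip the legacy /usr/local/nvidia/lib{,64} mounts
        mounts = [
          # assumed field names; bind-mount an arbitrary host path into CDI-enabled containers
          { hostPath = "/path/on/host"; containerPath = "/path/in/container"; }
        ];
      };
    }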

Related: #290609

Description of changes

Things done

I have tested that podman run --device=nvidia.com/gpu=all -d -p 11434:11434 --name ollama ollama/ollama works as expected with this change and is able to use all the GPUs in my system without any further changes.

My NixOS configuration sets virtualisation.containers.cdi.dynamic.nvidia.enable = true;.

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.


@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from 5dbec5a to 3e0dc82 on February 24, 2024
@ereslibre changed the title from "CDI: Add nvidia-docker 1 directories" to "CDI: Add nvidia-docker 1.0 directories" on Feb 24, 2024
@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch 2 times, most recently from e1c958f to e30ae50 on February 24, 2024
@ereslibre (Member Author) commented Feb 24, 2024

If upstream produces a broken image

They are not producing a broken image. This is common practice when using CUDA in container images. With nvidia-docker, you would:

  1. Set LD_LIBRARY_PATH to /usr/local/nvidia/lib{,64}
  2. Something (likely a runtime wrapper) mounts the required drivers and libraries from the host at /usr/local/nvidia/lib{,64}
  3. Your container hopefully runs

Later on (I don't know the exact timeline), they changed the story a little bit:

  1. You generate a CDI spec, which gives you the ability to inject OCI hooks and mount additional directories from the host.
  2. With CDI, you mount the relevant drivers and binaries from the host into the container.
  3. An OCI hook injected by CDI updates /etc/ld.so.cache inside the container, adding the discovered libcuda.so.* and friends.
  4. Your container hopefully runs
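
(For context, a heavily trimmed sketch of what such a generated CDI spec can look like; the paths, spec version and hook binary location are illustrative only, not taken from the actual generator:)

    {
      "cdiVersion": "0.5.0",
      "kind": "nvidia.com/gpu",
      "devices": [
        { "name": "all",
          "containerEdits": { "deviceNodes": [ { "path": "/dev/nvidia0" } ] } }
      ],
      "containerEdits": {
        "mounts": [
          { "hostPath": "/run/opengl-driver/lib/libcuda.so.545.29.06",
            "containerPath": "/run/opengl-driver/lib/libcuda.so.545.29.06",
            "options": [ "ro", "nosuid", "nodev", "bind" ] }
        ],
        "hooks": [
          { "hookName": "createContainer",
            "path": "/usr/bin/nvidia-ctk",
            "args": [ "nvidia-ctk", "hook", "update-ldcache",
                      "--folder", "/run/opengl-driver/lib" ] }
        ]
      }
    }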

It's not ideal, and it's probably not the most stable thing, but it's what we have. Many, many container images rely on this as of today. Our interest as NixOS developers is to find the balance between a perfect implementation and letting people run their stuff while getting in their way as little as possible.

@SomeoneSerge (Contributor)

If they actually enforce that these paths exist, they are broken:) If they were not broken, their LD_LIBRARY_PATH would be a no-op and you'd not get an error because the ld.so would still look up the ld.so.cache

@ereslibre (Member Author) commented Feb 24, 2024

If they actually enforce that these paths exist, they are broken:) If they were not broken, their LD_LIBRARY_PATH would be a no-op and you'd not get an error because the ld.so would still look up the ld.so.cache

They are not broken, again. They just decided that the container images would point LD_LIBRARY_PATH at the place where the runtime wrapper will mount the host libraries (or augment ld.so.conf.d with the same paths). See the official example of building a CUDA image from nvidia:

https://gitlab.com/nvidia/container-images/cuda/-/blob/e3ff10eab3a1424fe394899df0e0f8ca5a410f0f/dist/12.3.1/ubi9/base/Dockerfile#L40-41

https://gitlab.com/nvidia/container-images/cuda/-/blob/e3ff10eab3a1424fe394899df0e0f8ca5a410f0f/dist/12.3.1/ubi9/base/Dockerfile#L44

The reality is that these images exist in the wild, a lot of them. We cannot fix them; they are already built and published. We can adapt to this reality and make our users' lives a bit easier, that's all.

@SomeoneSerge (Contributor) commented Feb 24, 2024

They are not broken, again. They just decided that the container images would point LD_LIBRARY_PATH at the place where the runtime wrapper will mount the host libraries (or augment ld.so.conf.d with the same paths). See the official example of building a CUDA image from nvidia:

The LD_LIBRARY_PATH set up by the container shouldn't matter at all, we have already informed the dynamic loader about the drivers' location through other means

EDIT: I don't disagree we should provide for a smooth experience, I just think we're mistaken about why and how the failure occurs (if it does), and thus about the method of addressing it

@ereslibre (Member Author) commented Feb 24, 2024

The LD_LIBRARY_PATH set up by the container shouldn't matter at all, we have already informed the dynamic loader about the drivers' location through other means

For the context of this PR, it doesn't matter whether it's LD_LIBRARY_PATH or ld.so.conf.d. Side note: I think applications depending on CUDA tend to do a dlopen, which is affected by LD_LIBRARY_PATH. I am not so sure that it should not matter at all; I tend to think otherwise.

In any case, I did show both:

  1. How a CUDA container image should be built according to nvidia:
    1. Setting LD_LIBRARY_PATH
    2. Modifying /etc/ld.so.conf.d
  2. nvidia-docker doing this very same thing

@SomeoneSerge (Contributor)

For the context of this PR, it doesn't matter whether it's LD_LIBRARY_PATH or ld.so.conf.d. Side note: I think applications depending on CUDA tend to do a dlopen, which is affected by LD_LIBRARY_PATH. I am not so sure that it should not matter at all; I tend to think otherwise.

The dlopen should work out of the box regardless of LD_LIBRARY_PATH, because our CDI hook already takes care of this.

Does your example (ollama) actually fail without these extra mounts?
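
(Aside: assuming the image ships a glibc ldconfig, one quick way to see what the loader can already resolve inside a CDI-enabled container is to dump the linker cache:)

    podman run --rm --device=nvidia.com/gpu=all ubuntu:22.04 \
      sh -c 'ldconfig -p | grep -E "libcuda|libnvidia-ml"'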

@ereslibre (Member Author)

The dlopen should work out of the box regardless of LD_LIBRARY_PATH, because our CDI hook already takes care of this.

No, if the code in the container does a dlopen it's already on its own. Only LD_LIBRARY_PATH will matter.

@SomeoneSerge (Contributor)

 No, if the code in the container does a dlopen it's already on its own. Only LD_LIBRARY_PATH will matter.

No, dlopen is handled by the dynamic loader, which follows exactly the same logic as DT_NEEDED: it'll see if the library is already loaded (e.g. by LD_PRELOAD), it'll consult LD_LIBRARY_PATH, it'll consult DT_RUNPATH of the calling process, it'll consult the ld.so.caches
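
(That search order can be observed directly with glibc's LD_DEBUG facility, which traces both DT_NEEDED resolution and dlopen calls; some-cuda-app below is a placeholder for whatever binary is being debugged:)

    # Show which directories the loader searches when the process asks for libnvidia-ml
    LD_DEBUG=libs some-cuda-app 2>&1 | grep -A 3 'find library=libnvidia-ml'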

@ereslibre (Member Author)

No, dlopen is handled by the dynamic loader, which follows exactly the same logic as DT_NEEDED: it'll see if the library is already loaded (e.g. by LD_PRELOAD), it'll consult LD_LIBRARY_PATH, it'll consult DT_RUNPATH of the calling process, it'll consult the ld.so.caches

Yes, you are right, my bad. In any case, we still need to mount this directory.

@SomeoneSerge (Contributor)

Yes, you are right, my bad. In any case, we still need to mount this directory.

Why, have we seen any failures?

@ereslibre (Member Author) commented Feb 24, 2024

Why, have we seen any failures?

As per PR description: podman run --device=nvidia.com/gpu=all -d -p 11434:11434 --name ollama ollama/ollama does not work with GPU acceleration without this PR.

@SomeoneSerge (Contributor) commented Feb 24, 2024

Very good, and sorry I missed this. Could you publish the logs so we can see what the error was?

@ereslibre (Member Author) commented Feb 24, 2024

Very good, and sorry I missed this. Could you publish the logs so we can see what the error was?

❯ podman run --rm --device=nvidia.com/gpu=all -e OLLAMA_DEBUG="1" -p 11434:11434 --name ollama ollama/ollama serve
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAaCxFGYXlp+0XG39EwoWKRXrCqlhbn4rqYwO7d9OdY8

time=2024-02-24T10:50:18.777Z level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-02-24T10:50:18.777Z level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-02-24T10:50:18.777Z level=INFO source=routes.go:1019 msg="Listening on [::]:11434 (version 0.1.27)"
time=2024-02-24T10:50:18.777Z level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-02-24T10:50:20.600Z level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11 cpu cpu_avx cpu_avx2 rocm_v6 rocm_v5]"
time=2024-02-24T10:50:20.601Z level=DEBUG source=payload_common.go:147 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-02-24T10:50:20.601Z level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-24T10:50:20.601Z level=INFO source=gpu.go:265 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-02-24T10:50:20.601Z level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/usr/local/cuda/lib64/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/lib/wsl/lib/libnvidia-ml.so* /usr/lib/wsl/drivers/*/libnvidia-ml.so* /opt/cuda/lib64/libnvidia-ml.so* /usr/lib*/libnvidia-ml.so* /usr/local/lib*/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libnvidia-ml.so* /usr/lib/aarch64-linux-gnu/libnvidia-ml.so* /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so* /usr/local/nvidia/lib/libnvidia-ml.so* /usr/local/nvidia/lib64/libnvidia-ml.so*]"
time=2024-02-24T10:50:20.601Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:50:20.601Z level=INFO source=gpu.go:265 msg="Searching for GPU management library librocm_smi64.so"
time=2024-02-24T10:50:20.601Z level=DEBUG source=gpu.go:283 msg="gpu management search paths: [/opt/rocm*/lib*/librocm_smi64.so* /usr/local/nvidia/lib/librocm_smi64.so* /usr/local/nvidia/lib64/librocm_smi64.so*]"
time=2024-02-24T10:50:20.601Z level=INFO source=gpu.go:311 msg="Discovered GPU libraries: []"
time=2024-02-24T10:50:20.601Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-24T10:50:20.601Z level=DEBUG source=amd.go:32 msg="amd driver not detected /sys/module/amdgpu"
time=2024-02-24T10:50:20.601Z level=INFO source=routes.go:1042 msg="no GPU detected"

Note the last line: "no GPU detected".

@ereslibre (Member Author)

So yeah, these search paths are pretty much hardcoded in ollama, which we could fix upstream to make it a better citizen, to be honest: https://github.com/ollama/ollama/blob/2a4b128ae3e3a18b10e8701aca2434d401eaa7ba/gpu/gpu.go#L37-L51

But honestly, I fear this is a very common practice.

@SomeoneSerge (Contributor)

Aha. I'd need to dig deeper, but looking at https://github.com/ollama/ollama/blob/2a4b128ae3e3a18b10e8701aca2434d401eaa7ba/gpu/gpu.go#L38-L50, it's more than likely they do not use dlopen with library names (as they would if they actually targeted nvidia-docker), but hard-code the absolute paths. So there's a possibility we're not even talking about institutionalizing nvidia's broken designs, but about accommodating a single project that fails to integrate with those designs.

@ereslibre (Member Author) commented Feb 24, 2024

@SomeoneSerge we basically saw the same thing at the same time and wrote the same thing hehe.

I would settle it like this: it makes some broken published software work, and the old wrapper was doing it as well; also, we don't lose much in the process.

Addendum: I also fear this is not the only instance, but I don't have more evidence at hand to back that statement.

@ereslibre (Member Author)

I have to say: thanks for pushing to find the real root cause :)

@SomeoneSerge (Contributor) commented Feb 24, 2024

Ah, you were faster:)

But honestly, I fear this is a very common practice.

What I'm thinking is (and this concerns the /usr/bin and /usr/lib paths from the original PR) that we could add options like cdi.dynamic.nvidia.mount-usr-local, mount-usr-bin, etc. I'd set them to false by default; see the sketch below.
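
(For concreteness, a sketch of the kind of per-path interface being suggested here; these option names are only a proposal and do not exist in this PR:)

    # Hypothetical interface, names taken from the suggestion above
    virtualisation.containers.cdi.dynamic.nvidia = {
      enable = true;
      mount-usr-local = false;  # opt in to the /usr/local/nvidia/lib{,64} mounts explicitly
      mount-usr-bin = false;    # opt in to the host binaries under /usr/bin explicitly
    };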

we don't lose much in the process.

Fairly hard to say. This increases pollution/complexity of the environment, i.e. makes errors and successes less deterministic

@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch 3 times, most recently from 59209f1 to 9c73be8 on March 23, 2024
@ereslibre marked this pull request as ready for review on March 31, 2024
@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from 9c73be8 to 6b80785 on April 16, 2024
@ereslibre (Member Author) commented Apr 16, 2024

Rebased this. I would like to include this in NixOS 24.05 (#303285). I think removing the cdi.static and cdi.dynamic attributes that were introduced in fd464f0 and 8ba61eb strikes a better balance than keeping the strict cdi.static/cdi.dynamic separation in NixOS.

@SomeoneSerge (Contributor)

  • I think removing cdi.static and cdi.dynamic attributes that were included in fd464f0 and 8ba61eb is a better balance

    👍🏻

  • I think a typed option for CDI specs is still warranted. Note that any custom specs in etc."cdi/..." that specify "hostPaths" will need to handle exportReferencesGraph anyway.

  • Needs an mkRenamedOptionModule from ...cdi.dynamic.nvidia to the new service, for unstable users
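
(For reference, a sketch of what such a rename shim could look like, assuming the new options land under hardware.nvidia-container-toolkit:)

    { lib, ... }:
    {
      imports = [
        # Map the old unstable option path to the new one, warning users about the rename
        (lib.mkRenamedOptionModule
          [ "virtualisation" "containers" "cdi" "dynamic" "nvidia" "enable" ]
          [ "hardware" "nvidia-container-toolkit" "enable" ])
      ];
    }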

@ereslibre (Member Author)

I think a typed option for CDI specs is still warranted. Note that any custom specs in etc."cdi/..." that specify "hostPaths" will need to handle exportReferencesGraph anyway.

I think we can implement the typed option for CDI specs. I would suggest doing so as a follow-up after this has been merged, as a way for users to statically specify their own CDI configurations with proper type checking.
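
(Until such a typed option exists, a user-supplied static spec would presumably be dropped into etc."cdi/..." by hand, roughly like this; the spec contents here are illustrative only:)

    { ... }:
    {
      # Hand-written CDI spec serialized to JSON under /etc/cdi
      environment.etc."cdi/my-vendor.json".text = builtins.toJSON {
        cdiVersion = "0.5.0";
        kind = "example.com/device";
        devices = [
          { name = "all";
            containerEdits.deviceNodes = [ { path = "/dev/example0"; } ];
          }
        ];
      };
    }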

Needs an mkRenamedOptionModule from ...cdi.dynamic.nvidia to the new service, for unstable users

Right, thanks for the reminder. Adding it later today.

@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from 26c83b0 to 7ad83fc on April 18, 2024
@ereslibre (Member Author)

Right, thanks for the reminder. Adding it later today.

Just for the record: this was already pushed that very same day.

cc/ @SomeoneSerge

@SomeoneSerge (Contributor)

The updated PR removes the previously introduced "generic" options (virtualisation.containers.cdi.{static,dynamic}) and introduces a more specific hardware.nvidia-container-toolkit. This is more modest than the previous interface, will be easier to maintain, and should be compatible with implementing a new generic interface later after the release. I think we should merge this. We should also post a warning in https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/.

I updated #290609 to reflect this new direction, and I updated the message in the "feature freeze" issue to refer to other issues with the current CDI implementation. Notably, we still need to follow up with a fix to the incomplete deployment issue (dependencies of the host drivers and tools not being mounted).

Before this PR is merged I'd like the commit history to be updated. Ideally, the PR would be split into two commits: one for the deprecation of the previous interface (the file moves, attribute renamings, etc.), and one for the new feature (the FHS bin/lib paths).

@ereslibre thanks again for driving this

@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from 7ad83fc to cf5fdb7 on April 22, 2024
@ereslibre (Member Author)

@SomeoneSerge: thank you. Split into two commits as you suggested.

I will post in https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/ when it's merged.

Notably, we still need to follow up with a fix to the incomplete deployment issue (dependencies of the host drivers and tools not being mounted).

Absolutely, we have to follow up on this. Thanks for your feedback during this process.

@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from cf5fdb7 to 6edbd3e on April 22, 2024
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/breaking-changes-announcement-for-unstable/17574/47

…s.cdi.dynamic.nvidia.enable`

Add the NixOS option `hardware.nvidia-container-toolkit-cdi-generator.enable`.

This enables the ability to expose GPUs in containers for container runtimes that support the Container Device Interface (CDI).

Remove `cdi.static` and `cdi.dynamic.nvidia.enable` attributes.
…ount-nvidia-docker-1-directories` options

- `mount-nvidia-binaries`: this option allows users to avoid mounting
nvidia binaries on the container.

- `mount-nvidia-docker-1-directories`: this option allows users to
avoid mounting `/usr/local/nvidia/lib{,64}` on containers.
@ereslibre force-pushed the cdi-add-nvidia-docker-1-directories branch from 6edbd3e to de3ce5f on April 23, 2024
@SomeoneSerge (Contributor) left a comment

If you've tested the runtime, I'm happy to merge now (I can test later)

@ereslibre (Member Author)

I just re-ran it, but let me do a more thorough check before finally merging; everything should be fine, but I will comment here.

I will also double-check the Docker issue, and if it is what I presume, I will open another PR to fix it so that it also makes it into 24.05.

@ereslibre (Member Author)

If you've tested the runtime, I'm happy to merge now (I can test later)

Confirmed that it's working fine here in all the cases I tested.

@SomeoneSerge merged commit 7035968 into NixOS:master on Apr 23, 2024
21 checks passed
@ereslibre deleted the cdi-add-nvidia-docker-1-directories branch on April 23, 2024