ublue-nvctk-cdi.service runs always #180

Open
m2Giles opened this issue Dec 14, 2023 · 3 comments

@m2Giles
Member

m2Giles commented Dec 14, 2023

In the Nvidia images, we have the ublue-nvctk-cdi.service to support containers.

The only conditions on this service are that the binary exists and is executable, and that it is ordered after local-fs.target. This is problematic because it always runs, even when the Nvidia modules are not loaded because no Nvidia card is present. For eGPUs, the Nvidia card is not present until much later in the boot process. Instead of a service, this should be handled by a udev rule, since the script depends on the necessary hardware being present. Right now, with an eGPU, you have to manually restart the service before entering any containers.

I'll try converting the service to a udev rule to test.
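
As a rough illustration of the udev approach, a rule along these lines could re-run the existing unit once the nvidia driver registers on the PCI bus. This is an untested sketch, not the project's actual rule: the file name is hypothetical, and the `DEVPATH` match assumes the driver emits an "add" uevent for `/bus/pci/drivers/nvidia` when it loads. The device may still not be fully "ready" at that moment, so the timing would need checking on real eGPU hardware.

```
# /etc/udev/rules.d/71-ublue-nvctk-cdi.rules  (hypothetical file name)
# Assumption: loading the nvidia kernel driver produces an "add" uevent
# for /bus/pci/drivers/nvidia, including for a late-attached eGPU.
# --no-block keeps udev from waiting on the service to finish.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/systemctl --no-block restart ublue-nvctk-cdi.service"
```

With a trigger like this, the unit would no longer need to succeed at boot for the eGPU case, since it would be restarted when the hardware actually shows up.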

@bsherman
Contributor

A related concern was reported in Discord ( https://discord.com/channels/1072614816579063828/1072617059265032342/1232829046036103231 ): if the nvidia GPU has been disabled (for example, a BIOS-disabled dGPU on a dual-GPU laptop), this service fails when it should not.

I should finally fix this bug.

@bsherman bsherman self-assigned this Apr 25, 2024
@bsherman bsherman added the bug Something isn't working label Apr 25, 2024
@m2Giles
Member Author

m2Giles commented Apr 25, 2024

This will also fail if the nvidia card isn't "ready". We've seen an internal A4000 also throw this error.

@Sharkitty

Hello! I'm the user mentioned by @bsherman.
The system this happened on is running a custom image based on the ublue-kinoite-nvidia image (no nvidia-related changes applied downstream of ublue, only surface stuff so far). As described, the dGPU is disabled in BIOS when this happens; there is no error in Hybrid mode. This is the systemd log of the failed service:

× ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation
     Loaded: loaded (/usr/lib/systemd/system/ublue-nvctk-cdi.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Thu 2024-04-25 16:49:45 CEST; 2h 27min ago
   Main PID: 5074 (code=exited, status=1/FAILURE)
        CPU: 28ms

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.

And here is what journalctl -xeu returns for this service:

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
░░ Subject: A start job for unit ublue-nvctk-cdi.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has begun execution.
░░
░░ The job identifier is 331.
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit ublue-nvctk-cdi.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit ublue-nvctk-cdi.service has entered the 'failed' state with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.
░░ Subject: A start job for unit ublue-nvctk-cdi.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has finished with a failure.
░░
░░ The job identifier is 331 and the job result is failed.

As I mentioned on Discord, I think disabling the dGPU shouldn't be a source of errors: it has a HUGE impact on battery life, and if I don't plan on doing anything that requires the dGPU, it's best to just disable it until I need it. In this case, I think displaying a warning at most would be ideal.
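
To make the "warning at most" idea concrete, here is a minimal sketch of a wrapper the unit could run instead of calling `nvidia-ctk` directly. It is only an illustration under assumptions: the function names, the `NVIDIA_MODULE_DIR` and `CDI_SPEC_PATH` variables, and the `/etc/cdi/nvidia.yaml` output path are hypothetical, and checking `/sys/module/nvidia` is assumed to be a good enough proxy for "the driver is loaded".

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the unit's ExecStart; not the project's actual script.

# Succeeds if the nvidia kernel module appears to be loaded.
# The sysfs path can be overridden so the check can be tried without hardware.
nvidia_driver_loaded() {
    [ -d "${NVIDIA_MODULE_DIR:-/sys/module/nvidia}" ]
}

generate_cdi_spec() {
    if ! nvidia_driver_loaded; then
        # Warn and return success, so a disabled dGPU does not leave a failed unit.
        echo "warning: NVIDIA driver not loaded, skipping CDI generation" >&2
        return 0
    fi
    # The output path is an assumption; adjust it to wherever the image keeps CDI specs.
    nvidia-ctk cdi generate --output="${CDI_SPEC_PATH:-/etc/cdi/nvidia.yaml}"
}

generate_cdi_spec
```

With the dGPU disabled, the journal would then show a single warning line and the unit would finish successfully, rather than failing with ERROR_DRIVER_NOT_LOADED.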
