
logind CanGraphical state change only after DRM driver init #32509

Open
theofficialgman opened this issue Apr 26, 2024 · 18 comments
Labels
login · RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

@theofficialgman

theofficialgman commented Apr 26, 2024

Component

systemd-logind

Is your feature request related to a problem? Please describe

Related to https://bugs.launchpad.net/linux/+bug/2063143 and all the bugs referenced within.

The problem is that, as it stands right now, DRM GPU drivers are not guaranteed to have finished initializing by the time login managers are started. This can result in unexpected behavior (e.g. a permanent black screen) if the DRM drivers have not finished initializing when login managers start up.

This issue, where DRM GPU drivers have not finished initializing, is causing a black screen on boot on Kubuntu 24.04 on two separate test systems: an AMD Framework 13 (amdgpu) and an HP Spectre x360 (i915).

Describe the solution you'd like

As proposed by @jadahl and @superm1

If logind could hook its asynchronous module-loading heuristics up to CanGraphical, no login manager would need to worry about DRM drivers not having finished initialization.

Something along the lines of the following (a rough sketch of the logind side follows the list):

  1. simpledrm exports a new sysfs file, probed, that holds 0 or 1. It defaults to 0.
  2. All DRM drivers call a new symbol when they finish probing a card, which updates that value to 1.
  3. logind introduces logic to look for nomodeset; if it's set, then CanGraphical() returns TRUE every time.
  4. logind introduces logic to look for the probed file in the DRI directory. If it's not found, the existing logic applies. If it's found, calls to CanGraphical() return TRUE only when the value is 1.
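
A minimal sketch of what the logind side could look like (C). This is hedged: the probed sysfs attribute and its path are part of the proposal above and do not exist in current kernels, and the "fall back to existing logic" branch is just a placeholder here.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Step 3: "nomodeset" on the kernel command line means no native DRM driver
 * will ever bind, so graphical readiness can be reported immediately. */
static bool kernel_has_nomodeset(void) {
        char cmdline[4096] = {};
        FILE *f = fopen("/proc/cmdline", "re");
        if (!f)
                return false;
        (void) fread(cmdline, 1, sizeof(cmdline) - 1, f);
        fclose(f);
        return strstr(cmdline, "nomodeset") != NULL;
}

/* Step 4: read the (hypothetical) "probed" attribute exported by simpledrm.
 * Returns 1 if a real driver finished probing, 0 if not, -1 if the attribute
 * does not exist (older kernel). */
static int read_probed(const char *card) {
        char path[256], value = '0';
        snprintf(path, sizeof(path), "/sys/class/drm/%s/probed", card);
        FILE *f = fopen(path, "re");
        if (!f)
                return -1;
        if (fread(&value, 1, 1, f) != 1)
                value = '0';
        fclose(f);
        return value == '1';
}

static bool can_graphical(const char *card) {
        if (kernel_has_nomodeset())
                return true;    /* step 3 */
        int probed = read_probed(card);
        if (probed < 0)
                return true;    /* step 4: attribute missing, fall back to the existing logic */
        return probed > 0;      /* step 4: ready only once a native driver has probed */
}
```

The real integration would live next to logind's existing CanGraphical handling and would still need the kernel-side pieces from steps 1 and 2.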

Describe alternatives you've considered

Implement in every login manager a way to wait for DRM drivers to finish initializing (as is currently done in GDM because of this issue: https://gitlab.gnome.org/GNOME/gdm/-/commit/895f765aa8cc5a9dd2901be65bcd638b8aa7c577)

The systemd version you checked that didn't have the feature you are asking for

255

theofficialgman added the RFE 🎁 Request for Enhancement, i.e. a feature request label Apr 26, 2024
github-actions bot added the login label Apr 26, 2024
@superm1
Contributor

superm1 commented Apr 26, 2024

My idea, as currently proposed, also requires kernel changes.

Since the original proposal I have had a different idea for how it could be done: logind can watch for /dev/dri/card0 being removed.

This happens when amdgpu takes over the framebuffer.
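
If logind went down that road, one way to notice the hand-off from userspace is to watch udev "remove" events for the simpledrm card. A rough libudev sketch, hedged: logind actually uses sd-device internally, and this assumes card0 is the simpledrm device, which is not guaranteed on every system.

```c
#include <poll.h>
#include <stdbool.h>
#include <string.h>
#include <libudev.h>

/* Sketch: block until the assumed simpledrm device (card0) is removed,
 * i.e. a native driver such as amdgpu or i915 has taken over. */
static bool wait_for_card0_removal(void) {
        struct udev *udev = udev_new();
        if (!udev)
                return false;

        struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
        if (!mon) {
                udev_unref(udev);
                return false;
        }
        udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
        udev_monitor_enable_receiving(mon);

        struct pollfd pfd = { .fd = udev_monitor_get_fd(mon), .events = POLLIN };
        bool removed = false;

        while (!removed) {
                if (poll(&pfd, 1, -1) <= 0)
                        continue;
                struct udev_device *dev = udev_monitor_receive_device(mon);
                if (!dev)
                        continue;
                const char *action = udev_device_get_action(dev);
                const char *sysname = udev_device_get_sysname(dev);
                removed = action && sysname &&
                          strcmp(action, "remove") == 0 &&
                          strcmp(sysname, "card0") == 0;
                udev_device_unref(dev);
        }

        udev_monitor_unref(mon);
        udev_unref(udev);
        return true;
}
```

A real version would also need a timeout, as discussed below, so systems that legitimately stay on simpledrm are not stuck waiting forever.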

@n3rdopolis
Contributor

My idea, as currently proposed, also requires kernel changes.

Since the original proposal I have had a different idea for how it could be done: logind can watch for /dev/dri/card0 being removed.

This happens when amdgpu takes over the framebuffer.

I assume there would have to be a timeout for that, though, so that systems like QEMU VMs using the cirrus driver, or other obscure hardware that needs simpledrm, still work, right? Otherwise (unless I am missing something) simpledrm systems will never reach CanGraphical?

@superm1
Contributor

superm1 commented Apr 26, 2024

I assume there would have to be a timeout for that, though, so that systems like QEMU VMs using the cirrus driver, or other obscure hardware that needs simpledrm, still work, right? Otherwise (unless I am missing something) simpledrm systems will never reach CanGraphical?

That's a good point on the gap you identified. Perhaps within logind the equivalent of udevadm settle for the device needs to finish before CanGraphical can run.
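
A hedged sketch of what such a wait could look like with libudev's queue API; the polling loop and the timeout_sec parameter are assumptions for illustration, and udevadm settle itself is more sophisticated than this.

```c
#include <stdbool.h>
#include <unistd.h>
#include <libudev.h>

/* Sketch: a polling approximation of "udevadm settle", waiting until the
 * udev event queue is empty or the (assumed) timeout expires. */
static bool settle(unsigned timeout_sec) {
        struct udev *udev = udev_new();
        if (!udev)
                return false;

        struct udev_queue *queue = udev_queue_new(udev);
        bool empty = false;

        /* Check every 100 ms, for at most timeout_sec seconds. */
        for (unsigned i = 0; queue && i < timeout_sec * 10; i++) {
                if (udev_queue_get_queue_is_empty(queue)) {
                        empty = true;
                        break;
                }
                usleep(100 * 1000);
        }

        udev_queue_unref(queue);
        udev_unref(udev);
        return empty;
}
```

CanGraphical would then only flip to yes once the queue has drained, at the cost of waiting out slow module loads on every boot.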

@n3rdopolis
Contributor

Yeah. simpledrm is a fallback driver. It's for hardware that doesn't have its own mode setting drivers.

It seems like it can actually be compiled as a module again (most distros compile it into the kernel), and it can in theory be loaded later if /dev/dri/card* fails to get created. The question still remains when to decide to load it: is there no /dev/dri/card0 because the drivers haven't started yet, or is there no /dev/dri/card0 because the only video device is a cirrus card? I'm not sure how possible it is to determine that from user space.

I don't know how video card driver loading works in the kernel. Is it possible to make simpledrm just load and hold the BIOS/sysfb memory without creating /dev/dri/card0, and only create its /dev/dri/card0 if all of those drivers' probes fail (or whatever happens)?

That might make more sense than simpledrm loading and then getting replaced, but maybe there is a reason why it does it in that order.

@superm1
Contributor

superm1 commented Apr 26, 2024

So part of the problem is going to be "new" hardware, for example hardware where amdgpu loads but isn't yet supported in that kernel version.

You want simpledrm to load in that case, so any logic you would build around an assumption about vendor ID, or whether a module is loaded, falls apart.

I think the best thing to do is run a settle sequence to decide when to report CanGraphical. Then you can be sure that whatever should load is loaded, and, most importantly, it keeps things simpler in logind.

@n3rdopolis
Contributor

That makes sense. I wonder, though, whether this is worth a thread on the LKML first, to see if the simpledrm developer has any insight? Or do you think it's probably not feasible to fix in the kernel?

@superm1
Contributor

superm1 commented Apr 27, 2024

I personally don't see any way to do it in the kernel. But if you want to ask, go for it.

FWIW, I am the one who did the delicate dance in amdgpu a year or so ago to make sure it smoothly handles the case of an unsupported GPU in a given kernel. Specifically, it doesn't take over the framebuffer that simpledrm is using until it is sure it has all the driver code to support all IP blocks and all the firmware that matches them.

@n3rdopolis
Contributor

n3rdopolis commented Apr 27, 2024

OK, so maybe it does have to be done in userspace then.

Thinking about it more, there is also the possibility of initrds that only contain simpledrm, with the actual drivers on the rootfs, in which case all that guessing based on available drivers would be wrong, I guess.

@superm1
Contributor

superm1 commented Apr 27, 2024

It's not only possible - that's exactly how Ubuntu works when you don't have disk encryption turned on.

@n3rdopolis
Contributor

So I have been doing some testing, since I saw the SDDM issue about this.

Last week, I will admit, I was kind of confused by this: I thought the issue was that /dev/dri/card1 was being created too soon, and that because of all the firmware loading and such it wasn't usable until a certain point.
It has become clear to me that this is not the case. I now understand the issue is that it actually takes a bit longer for /dev/dri/card1 to appear, so greeter display servers have more of a chance to accidentally start using the simpledrm device (or to see NO DRM devices at all when loading too quickly on simpledrm-less distros).

So... I can see this strategy fixing computers WITHOUT simpledrm. That could remove the possibility of a login manager or greeter display server saying "Oh hey! seat0 has no /dev/dri/card* devices! Let's fail!", and it could also make it so that greeter display servers don't start using the simpledrm device first and then have it pulled out from under them.

In my mind though, if it's possible for them to support it, the various display servers should better handle the case where the simpledrm device they are using gets replaced.

I see the original issue https://gitlab.gnome.org/GNOME/mutter/-/issues/2909 was filed against Mutter; in the end it looks like it was addressed in GDM instead, even though it looks like the gnome-shell based greeter died.

Testing on a VM with modprobe.blacklist=virtio_gpu and then later running sudo modprobe virtio-gpu manually under the user session replicates the issue (the underscore vs. hyphen inconsistency was kind of hard to remember at first). This is with the display servers running as actual user sessions, not as greeters.

Results:
1. gnome-shell/mutter: crashes sometimes, but sometimes stays running; either way it doesn't graphically recover.
2. Xorg: crashes
3. Weston: Stays running, but doesn't use the new GPU, so it results in a blank screen
4. KWin: Aborts when the primary GPU goes away (the cool recovery thing kind of takes over here, though). I tried patching that out to see what happens, and it crashes, but recovers.
5. wlroots: stays running, doesn't recover graphically. The logs say it opens the new GPU though, at least Cage does...
6. mir: stays running, doesn't recover graphically
7. plymouth: Not a display server, since it doesn't have X11/Wayland clients, but it handles the transition perfectly

I am wondering whether bug reports should be filed against the various display servers so they support the case where the simpledrm device gets replaced?

What do you think?

@superm1
Contributor

superm1 commented May 9, 2024

You know what, that's pretty similar to the contrived test I was doing, where I would let the display server start up and then load the driver later. But the problem is this is viewed as a "double hotplug" event, which isn't supported.
GDM can't handle the primary GPU going away and a new one coming in to replace it. I expect the same is true for other environments too.

@n3rdopolis
Contributor

Maybe it's at least worth asking kwin/wlroots? Their hotplug stuff could be different maybe? Or no?

Also, this theory is kind of wacky, but how much of simpledrm depends on that lower memory? Just the display part, right?
Is it possible, when a new driver gets loaded, to make it so that instead of /dev/dri/card0 going away instantly and killing the handles, it simulates just an unplug of the "Unknown-1" screen it presents? It would no longer be able to display anything, but it wouldn't kill all the display server's handles to it (and then maybe it could go away when the last handle to it closes)?
I am probably talking way out of my tree there, lol.

@superm1
Contributor

superm1 commented May 9, 2024

Maybe it's at least worth asking kwin/wlroots? Their hotplug stuff could be different maybe? Or no?

It wouldn't hurt to ask, but I would be surprised if they handle hotplug for the primary display. That's tough to support!

Also, this theory is kind of wacky

Even if this were possible, the problem you'll have is a phantom screen where the cursor isn't visible. Although it wouldn't crash the display server, it's not the best experience.

@n3rdopolis
Contributor

It wouldn't hurt to ask, but I would be surprised if they handle hotplug for the primary display. That's tough to support!

Yeah, it seems that even when sessions go inactive, they still maintain handles to the /dev/dri/card devices...

Even if this were possible, the problem you'll have is a phantom screen where the cursor isn't visible. Although it wouldn't crash the display server, it's not the best experience.

Yeah, I didn't think so, but I meant that it just starts reporting itself as a device with no attached screens...

@superm1
Contributor

superm1 commented May 9, 2024

The problem is you have no idea whether a fully functional driver is "going" to load later. If it doesn't, you want simpledrm to render.

@n3rdopolis
Contributor

Well, my thought was that the simpledrm device acts as normal when booting, but when the usual GPU driver loads and replaces it (whether it's amdgpu, i915, nouveau, or virtio_gpu), instead of disappearing, the /dev/dri/card0 device stays alive but starts reporting that there are no screens/CRTCs attached, making it useless for displaying anything since the real driver now handles them, just so that the display servers' handles don't close...

@superm1
Contributor

superm1 commented May 9, 2024

I guess you can raise this idea on dri-devel with the simpledrm maintainer for their thoughts.

@n3rdopolis
Contributor

Done, hopefully I worded that correctly
