gpu hotplug doesn't work #8149

Open
athul-krishna-kr opened this issue May 5, 2024 · 6 comments
Labels
bug Not working as intended

Comments

@athul-krishna-kr

athul-krishna-kr commented May 5, 2024

Device and Software Info:

  • Sway Version: sway version 1.9
  • Device: Asus Zephyrus G14 GA402RJ
  • BIOS: 319
  • OS: Arch Linux
  • Kernel: 6.8.9-arch1-1
  • DGPU: AMD Radeon RX 6700S(amdgpu)

dgpu_add:

  • echo 1 | sudo tee /sys/bus/pci/rescan

dgpu_remove:

  • echo "0000:03:00.0" | sudo tee /sys/bus/pci/devices/0000:03:00.0/driver/unbind && echo 1 | sudo tee /sys/bus/pci/devices/0000:03:00.0/remove

udev_trigger:

  • sudo udevadm trigger --verbose --type=devices --action=remove --subsystem-match=drm --property-match="MINOR=0"

Bug report:

lsof_before shows the output of sudo lsof /dev/dri/card* before udev_trigger. After udev_trigger, there seems to be one thread(?) using the file /dev/dri/card0 (see lsof_after).

After dgpu_remove and dgpu_add, the dGPU comes back up with a different card number (card1); an attempt is made to initialize the DRM backend, and EGL context initialization fails. After removing and adding the dGPU again, it comes back with yet another card number (card3), and EGL context initialization fails again. Both times the errors are:

  • [ERROR] [wlr] [EGL] command: eglQueryDeviceStringEXT, error: EGL_BAD_PARAMETER (0x300c), message: "eglQueryDeviceStringEXT"
  • [ERROR] [wlr] [EGL] command: eglQueryDeviceStringEXT, error: EGL_BAD_PARAMETER (0x300c), message: "eglQueryDeviceStringEXT"
  • amdgpu_device_initialize: amdgpu_get_auth (2) failed (-1)
  • amdgpu: amdgpu_device_initialize failed.

Closing the sway session and then removing and adding the dGPU from a TTY does reset the dGPU card number to 0 or 1.
Starting a new sway session with the dGPU removed, and then adding the dGPU, does work. It does show up in sudo lsof /dev/dri/card*.
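
To follow how the card number changes across remove/add cycles, it may help to print which PCI device each DRM card node currently belongs to. This is only a minimal sketch, assuming the usual sysfs layout in which /sys/class/drm/cardN/device is a symlink to the PCI device directory. The function name list_drm_cards and its optional directory argument are made up for illustration and are not part of sway or wlroots.

```shell
#!/bin/sh
# list_drm_cards [DRM_DIR]
# Prints "cardN <pci-address>" for every primary DRM node.
# DRM_DIR is a hypothetical override (handy for testing); it defaults
# to /sys/class/drm.
list_drm_cards() {
    drm_dir="${1:-/sys/class/drm}"
    for card in "$drm_dir"/card*; do
        name=$(basename "$card")
        # Skip connector entries such as card0-eDP-1.
        case "$name" in
            card*-*) continue ;;
        esac
        # Skip entries without a device link (also covers an unmatched glob).
        [ -e "$card/device" ] || continue
        # Assumption: the device symlink resolves to the PCI device
        # directory, whose basename is the PCI address (e.g. 0000:03:00.0).
        pci=$(basename "$(readlink -f "$card/device")")
        printf '%s %s\n' "$name" "$pci"
    done
}
```

Calling list_drm_cards before dgpu_remove and again after dgpu_add should make it easy to see which card number the device at 0000:03:00.0 received each time.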

Log files:

@athul-krishna-kr athul-krishna-kr added the bug Not working as intended label May 5, 2024
@bl4ckb0ne
Contributor

bl4ckb0ne commented May 6, 2024

That's a wlroots bug.

It's weird that eglQueryDeviceStringEXT fails with EGL_BAD_PARAMETER. From the spec:

On failure, NULL is returned.  An EGL_BAD_DEVICE_EXT error is
generated if <device> is not a valid EGLDeviceEXT.  An
EGL_BAD_PARAMETER error is generated if <name> is not one of the
values described above.

Would you be able to get a stacktrace?

@athul-krishna-kr
Author

athul-krishna-kr commented May 6, 2024

How should I get a stack trace? A stack trace of what?

@bl4ckb0ne
Contributor

Of the value being passed to eglQueryDeviceStringEXT.

@athul-krishna-kr
Author

Please give me instructions to get a stack trace.

@bl4ckb0ne
Contributor

See the build instructions in the repo to build in debug mode, then start sway in a gdb session from another computer over SSH.
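
As a rough sketch of what those steps could look like (not an official procedure): the commands in the comments assume a Meson build directory named build/, and the helper write_gdb_commands and the file name sway-egl.gdb are made up for illustration. The gdb commands themselves (break, commands, bt, continue, run) are standard. Whether gdb can resolve eglQueryDeviceStringEXT by name depends on the symbols available for the EGL library, so treat that breakpoint as an assumption.

```shell
#!/bin/sh
# Build with debug info first, from the sway source tree, for example:
#   meson setup build/ --buildtype=debug
#   ninja -C build/
# Then, over SSH from another machine, start sway under gdb:
#   gdb -x sway-egl.gdb ./build/sway/sway

# write_gdb_commands FILE
# Writes a gdb command file that starts the program and prints a
# backtrace every time eglQueryDeviceStringEXT is hit, then resumes.
# Assumption: the symbol can be resolved once the EGL library is loaded,
# which is why pending breakpoints are enabled.
write_gdb_commands() {
    cat > "$1" <<'EOF'
set pagination off
set breakpoint pending on
break eglQueryDeviceStringEXT
commands
  bt
  continue
end
run
EOF
}
```

After running write_gdb_commands sway-egl.gdb, reproducing the remove/add cycle in that session should print a backtrace for each call, which can then be pasted into the issue.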

@athul-krishna-kr
Author

Should I add debug flags only to wlroots?


2 participants