Nvidia GPU Passthrough: Failed moving device BAR: failed allocating new MMIO range: 0xe7f00000->0x0(0x80000) #6147
Comments
Issue is present with both
Issue is not present with roughly equivalent
|
The GPU works without any noticeable issues; it's only the HDMI controller that fails. lspci -nnkvv -s '0a:00.0'
|
Both the GPU and its HDMI controller are in their own IOMMU group:
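(For reference, the usual way to enumerate IOMMU groups is to walk `/sys/kernel/iommu_groups` on the host; this is a generic sketch, not this machine's output. A device can only be passed through cleanly if everything else in its group is also passed through or unbound.)

```shell
# List every IOMMU group and the PCI devices it contains (run on the host).
shopt -s nullglob
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    for dev in "$group"/devices/*; do
        # fall back to the raw PCI address if lspci is unavailable
        lspci -nns "${dev##*/}" 2>/dev/null || echo "  ${dev##*/}"
    done
done
```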
|
Possibly related
|
Do you have the firmware log? It could help with this issue. |
(note to self) issue reproduced with:
default log:
`-v` log:
`-vv` log:
|
cross-referencing the debug logs with lspci:
|
Issue reproduces with release |
Does passing through the same device to Linux work? The address space layout provided by Cloud Hypervisor is fixed. Passing the same device(s) to Linux can help identify whether this is an issue with Windows or Cloud Hypervisor. |
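A hedged sketch of such a test (flag names are from the cloud-hypervisor CLI; the PCI address matches this machine's GPU, but the kernel and disk paths are placeholders):

```shell
# Boot a Linux guest and pass through the GPU plus its HDMI function.
# Both devices must first be bound to vfio-pci on the host; VFIO
# passthrough requires shared guest memory.
cloud-hypervisor \
    --cpus boot=4 \
    --memory size=8G,shared=on \
    --kernel ./vmlinux \
    --cmdline "console=ttyS0 root=/dev/vda1" \
    --disk path=./nixos.img \
    --device path=/sys/bus/pci/devices/0000:0a:00.0/ \
             path=/sys/bus/pci/devices/0000:0a:00.1/
```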
Trying a NixOS live ISO it looked like it was working at first, but after a guest restart I ran into this:
|
Before that error, there are these 4 lines:
But the Linux VM still boots, albeit with a handful of errors and warnings observable in dmesg. I think the most telling one is
where |
@liuw since I've managed to reproduce that error by invoking |
Yes. I don't think Windows is the culprit here. Cloud Hypervisor cannot allocate the requested resources.
|
Should I edit the original issue description to use the minimal example from that comment? |
@thomasbarrett Can you take a look at this? |
Hey @matdibu, I will take a closer look tomorrow, but at first glance this looks like an issue closely related to the firmware. It looks like the firmware is trying to move the device MMIO regions around during early boot and cloud-hypervisor is unable to accommodate the request. Do you mind sharing which versions of OVMF and rust-hypervisor-firmware you tested with, and also posting the firmware logs? |
Rust Hypervisor Firmware does not reconfigure PCI device BARs but OVMF does. |
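For background on what a firmware does when it reconfigures BARs: it discovers a BAR's size by writing all-ones to the register and reading it back; the device hardwires the low bits to zero, so the size falls out of the two's complement. A sketch of the arithmetic (the 0xFFF00000 read-back value is hypothetical):

```shell
# Suppose firmware writes 0xFFFFFFFF to a 32-bit memory BAR and reads
# back 0xFFF00000 (after masking off the low 4 flag bits). The size is
# the two's complement of the read-back, truncated to 32 bits.
readback=0xFFF00000
size=$(( (~readback + 1) & 0xFFFFFFFF ))
printf 'BAR size: 0x%X bytes (%d MiB)\n' "$size" $(( size / 1024 / 1024 ))
```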
I have run a fresh suite of tests:
No firmware
cloud-hypervisor logs
default log:
-v log:
-vv log:
|
rust-hypervisor-firmware
guest logs
dmesg:
dmesg --level=err,warn:
lspci:
lspci -nnvv -s 00:05.0:
cloud-hypervisor logs
default log:
-v log:
-vv log, but cut at the end for brevity:
|
CLOUDHV.fd
guest logs
dmesg:
dmesg --level=err,warn:
lspci:
lspci -nnvv -s 00:05.0:
cloud-hypervisor logs
default log:
-v log:
-vv log, but cut at the end for brevity:
|
qemu-system-x86_64 + OVMF
guest logs
dmesg:
dmesg --level=err,warn:
lspci:
lspci -nnvv -s 00:03.0:
|
Didn't mean to close the issue, misclick, sorry. |
When comparing the qemu:
cloud-hypervisor:
|
Thanks for the update @matdibu. Now that the original error
|
Update. IRQ -2147483648 (0x80000000) is a special constant (IRQ_NOTCONNECTED) defined in Linux. See this
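The value itself is just the sign bit: -2147483648 stored in a signed 32-bit integer has the bit pattern 0x80000000, i.e. 1 << 31, which is how Linux defines IRQ_NOTCONNECTED:

```shell
# -2147483648 reinterpreted as an unsigned 32-bit value is the sign bit
# alone, matching IRQ_NOTCONNECTED (1 << 31):
printf '0x%08X\n' $(( -2147483648 & 0xFFFFFFFF ))
```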
Sorry @thomasbarrett - I think that line is still present in the OVMF log:
Maybe we need some better logging for why moving the BAR failed. |
Ugh. I missed that @rbradford. So:
If we fix the invalid IRQ number, then we should be able to at least get this working with rust-hypervisor-firmware. |
OVMF is moving the 32-bit MMIO bar from @matdibu, do you have the logs from OVMF itself? You should have some obnoxiously long logs that look something like this? Getting these logs is what we need to figure out why OVMF is choosing to move the BAR to that location.
edit: if you don't get these logs, you may need to be running a debug build of OVMF. I personally use this build of OVMF, which might be a quick and easy way for you to get the debug logs. |
This is also really suspicious - PCI BARs must be naturally aligned - but bit 0 is set to 1 here.
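As a refresher on the BAR register layout (per the PCI spec): for a memory BAR, bit 0 is 0; bits 2:1 encode the type (00 = 32-bit, 10 = 64-bit); bit 3 is the prefetchable flag; the base address lives in the remaining high bits. A sketch of the decode, using a hypothetical value with bit 0 set:

```shell
# Decode the low flag bits of a (hypothetical) 32-bit memory BAR value.
bar=0xe7f00001
echo "I/O-space indicator (bit 0): $(( bar & 0x1 ))"        # 1 would claim I/O space
echo "type (bits 2:1):             $(( (bar >> 1) & 0x3 ))" # 0 = 32-bit
echo "prefetchable (bit 3):        $(( (bar >> 3) & 0x1 ))"
printf 'base address:                0x%08x\n' $(( bar & ~0xF ))
```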
|
That is also weird because it looks like that ROM bar was initially allocated at |
That region 8 is just bad debug logging. While implementing parts of VFIO for cloud-hypervisor to fix a problem with NVIDIA gear, I came across this log error. The PCI spec is clear here, and the code that prints that error is, I believe, a third-party dependency. Note, it has been a while and my memory is fuzzy. The hda-intel error is not a big deal: that's the sound driver, perhaps located on the HDMI controller? |
@thomasbarrett I've used your CLOUDHV-ch-highmem-6624aa331f.fd
and the first 1000 lines of logs (with -vv) from cloud-hypervisor:
|
without more knowledge of the subject, I also did a
|
Just checked with Same output with |
Thanks for providing the firmware logs @matdibu. I mostly test on datacenter GPUs which don't have a HDMI function, but I think that I have access to a 3090 as well. I will try to reproduce. |
Would it be worth it to compare the OVMF logs in cloud-hypervisor and qemu side by side? |
I can see this with ch v39 + ovmf master:
and
|
I also don't know what this is:
|
I tried to get logs from qemu with
I wanted to compare the logs of the same firmware when run with cloud-hypervisor and qemu. I should probably ask the qemu project this, but does anyone know how to get firmware logs in qemu?
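(One commonly documented approach, assuming a DEBUG build of OVMF: its debug messages are written to the ISA debug console at I/O port 0x402, which QEMU can redirect to a file. Firmware and log paths below are placeholders.)

```shell
# Capture OVMF DEBUG output from a QEMU guest via the ISA debug console.
qemu-system-x86_64 \
    -machine q35 \
    -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=OVMF_VARS.fd \
    -debugcon file:ovmf-debug.log \
    -global isa-debugcon.iobase=0x402
```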
Would it change anything if I were to test different permutations of the motherboard's ReBAR, Above 4G Decoding, etc.? |
You can try to |
Describe the bug
GPU HDMI controller does not work within Windows 11 VM:
error code 12: The device cannot find enough free resources that it can use
To Reproduce
Steps to reproduce the behaviour:
Version
Output of cloud-hypervisor --version:
cloud-hypervisor v37.0.0
Did you build from source, if so build command line (e.g. features): default build from nixpkgs unstable-small
VM configuration
What command line did you run (or JSON config data):
Guest OS version details:
Host OS version details:
Full system config: https://codeberg.org/mateidibu/nix-hv/src/commit/98fb89b7ec7e34ce79b1cd98cabb12dab614a22a
Logs
Output of cloud-hypervisor -v from either standard error or via --log-file:

Linux kernel output:
dmesg
dmesg | grep -F '0a:00.1'
grep -C 5 -F '0a:00.1' /proc/iomem
lspci -nnkvv -s '0a:00.1'