WIP: Qubes on KVM #12

Open
nrgaway opened this issue Aug 1, 2020 · 23 comments

@nrgaway
Contributor

nrgaway commented Aug 1, 2020

@shawnanastasio

Hello and thanks for the work you have completed so far on libkvmchan and core-qubes-vchan-kvm.

I just wanted to let you know I am currently working on porting all Qubes features and app modules to work on a KVM host. Goals and progress are posted on Google Groups qubes-devel (https://groups.google.com/forum/?oldui=1#!topic/qubes-devel/dIw40asXmEI).

I have packaged your modules for Fedora, Debian and Arch Linux to allow building within the Qubes builder, using your supplied license and crediting you as author. I have not committed and pushed the repos, but plan to this week after having a chance to complete testing.

If you have any questions or comments you can post them in the thread I listed above or here. I would also be interested in any future development plans you may have in relation to these modules.

@shawnanastasio
Owner

Hi!

Very exciting to hear that someone else will be working on Qubes on KVM. In case you're not aware, there is this effort to do the same, motivated by the desire to run Qubes on POWER9 CPUs which don't have Xen support.

As for the current state of libkvmchan, I would consider it pre-alpha at the moment. The majority of the vchan API is implemented (all that's missing is vchan lifecycle management, i.e. libvchan_is_open), but aside from that there are some security and robustness additions that I would like to implement before the software is ready for deployment, such as sandboxing, privilege dropping, etc. Not to mention the complete lack of documentation, which needs to be resolved.

That said, in its current state, it is enough to support qrexec mostly unmodified (a usleep() needs to be added before the client connects since vchan creation takes longer than it does on Xen but everything else works fine). I haven't been working on it recently but I plan to resume development shortly to resolve the above concerns. After that's done, I will shift my focus to getting qubes-gui-{daemon,agent} running through libkvmchan's ivshmem backend. This will likely require upstream changes to the relevant Qubes components too, but after brief discussions with @marmarek I don't think this will be an issue.

For X86_64, we'd also want to implement architecture-specific VIOMMU support in the project's VFIO driver like I have done for ppc64. It will still work without that, but guests will need to operate in the potentially unsafe VFIO-NOIOMMU mode. See here for more information.

I would like to collaborate on your porting efforts as much as possible, so don't hesitate to reach out with any questions or concerns! I'd also be curious to hear what your experiences have been with libkvmchan thus far. From a skim of the linked Google Groups thread, it seems you have compiled it but not used it yet? Since I have done all development thus far on ppc64le, there may be some kinks that need to be worked out for X86_64 in addition to the VIOMMU thing, but it should be close to working. If you have any questions on how to set it up, let me know.

@nrgaway
Contributor Author

nrgaway commented Aug 2, 2020

Thanks for the detailed overview. I'm also happy to hear you're interested in implementing the gui modules :)

You were right, I have not used libkvmchan as of yet. I have only glanced at the code and packaged it, as well as created some systemd startup unit files for the kvmchand binary. I am going to start working on it right now, though, since I need (want) to get it working with Qubes: I am at the point where Qubes attempts to start the qrexec process when starting up the virtual machine (qvm-start).

I just want to confirm that the daemon should be launched as 'kvmchand -d' on the KVM host (I created a systemd unit file that runs it on start-up). Currently the daemon exits when the Qubes qvm-start command is run, as it attempts to execute the qubesdb-daemon. I am not really concerned about the crash itself, as I have not had a chance to debug it further; I only mention it because sometimes when the daemon exits, I am unable to manually restart it, which then requires a reboot. Just wondering if you have experienced this issue and, if so, have any advice; if not, don't worry about it. FYI, the systemd messages are kvmchand.service: Main process exited, code=exited, status=1/FAILURE and kvmchand.service: Failed with result 'exit-code'. The spawned processes of course disappear, as do the sockets created within the '/tmp/kvmchand' directory. When an attempted restart fails, the message I get in '/var/log/messages' is End of file while reading data: Input/output error.

Something else to consider (sometime in the future) is splitting the host and guest daemons, since the daemon currently depends on libvirt, which means libvirt needs to be installed into the VM template to satisfy the dependency.

Your quest to run Qubes on the POWER9 CPU sounds very interesting. I can help, as I am quite familiar with most of the Qubes components; I worked for them back in 2015 on a one-year contract. Once I finish the process of ensuring all existing Qubes modules are working on X86_64 KVM, I can attempt to cross-compile for the POWER9 CPU. I won't be much help debugging any issues though, since I only have an AMD processor.

@nrgaway
Contributor Author

nrgaway commented Aug 2, 2020

I think I figured out what was preventing manually restarting kvmchand. It seems like one of the Qubes processes must have held a reference to it. I noticed the VM was started in a paused state. When manually forcing the VM to shut down using virt-manager, I was able to restart the daemon manually.

@shawnanastasio
Owner

Glad to hear you were able to resolve it. One thing to note is that under systemd, you can omit the -d and simply specify the service's Type as simple, which will allow systemd to handle the daemonization of the process and has the added benefit that the stderr log messages the daemon prints out will be accessible via journalctl.
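
For reference, a minimal unit along those lines might look like this (the unit name, paths, and ordering dependencies here are just examples, not a packaged file):

[Unit]
Description=KVM vchan daemon
After=libvirtd.service

[Service]
# No -d flag: keep kvmchand in the foreground and let systemd manage it;
# stderr then lands in the journal
Type=simple
ExecStart=/usr/sbin/kvmchand

[Install]
WantedBy=multi-user.target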

And as for the splitting of host/guest daemons to reduce guest dependencies, this should be possible with a few makefile tweaks (and maybe some preprocessor statements in the entrypoint). I've created issue #13 to track this.

@nrgaway
Contributor Author

nrgaway commented Aug 2, 2020

FYI, it seems like kvmchand exits if it loses the libvirt connection. It seems strange, but the qubesd Python component also seems to lose its connection after each libvirt command, though it has a wrapper to automatically re-connect. I will need to check with @marmarek if that is normal behaviour.

Every time qubesd re-established a connection to libvirt, kvmchand exited. For now I just added Restart=always and disabled limits with StartLimitIntervalSec=0 within the systemd unit file.
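
In other words, the unit now contains something like the following (my understanding is that StartLimitIntervalSec belongs in [Unit] on current systemd; adjust if your version differs):

[Unit]
# Don't rate-limit restart attempts
StartLimitIntervalSec=0

[Service]
# Restart kvmchand whenever it exits, e.g. after losing the libvirtd connection
Restart=always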

@nrgaway
Contributor Author

nrgaway commented Aug 2, 2020

Glad to hear you were able to resolve it. One thing to note is that under systemd, you can omit the -d and simply specify the service's Type as simple, which will allow systemd to handle the daemonization of the process and has the added benefit that the stderr log messages the daemon prints out will be accessible via journalctl.

Thanks, much better logging :)

@shawnanastasio
Owner

FYI, it seems like kvmchand exits if it loses the libvirt connection. It seems strange, but the qubesd Python component also seems to lose its connection after each libvirt command, though it has a wrapper to automatically re-connect. I will need to check with @marmarek if that is normal behaviour.

Yeah, the libvirt part of the code currently doesn't gracefully handle loss of connection to libvirtd, since I originally didn't envision that as a likely scenario. Is it expected for libvirtd to restart frequently in a typical Qubes environment? If so I'll create an issue to track this.

I'm also not sure what exactly the behavior should be when this happens. Should the daemon simply maintain all existing vchans and continuously try reconnecting, or should it disconnect all existing vchans and essentially restart itself? More information on when exactly the libvirtd connection is expected to drop would be helpful in determining this.

@marmarek

marmarek commented Aug 2, 2020

It seems strange, but the qubesd Python component also seems to lose its connection after each libvirt command

Sounds like libvirtd crashing. Do you see any core dump? (coredumpctl, now everything goes through *ctl...)

Yeah, the libvirt part of the code currently doesn't gracefully handle loss of connection to libvirtd, since I originally didn't envision that as a likely scenario. Is it expected for libvirtd to restart frequently in a typical Qubes environment? If so I'll create an issue to track this.

It shouldn't be frequent, but it (normally) happens on installing updates.

I'm also not sure what exactly the behavior should be when this happens. Should the daemon simply maintain all existing vchans and continuously try reconnecting, or should it disconnect all existing vchans and essentially restart itself? More information on when exactly the libvirtd connection is expected to drop would be helpful in determining this.

Restarting libvirt should not interrupt existing connections.

On a general thought - if I understand correctly, libkvmchan requires the host side to orchestrate every VM-VM connection. We could use this occasion to adjust the libvchan API to ensure dom0 really approves all the connections. In the Xen case, two cooperating VMs can currently establish a vchan without dom0 approval, which is not an optimal design. This change does mean a libvchan API change, but I think the gains are worth it. If this change could also simplify (or even eliminate) kvmchand guest-host communication, that would be an additional gain.
I'm looping @pwmarcz in, as we discussed this change not long ago.

@nrgaway
Contributor Author

nrgaway commented Aug 3, 2020

...
That said, in its current state, it is enough to support qrexec mostly unmodified (a usleep() needs to be added before the client connects since vchan creation takes longer than it does on Xen but everything else works fine)...

Add usleep() in qrexec?

...
For X86_64, we'd also want to implement architecture-specific VIOMMU support in the project's VFIO driver like I have done for ppc64. It will still work without that, but guests will need to operate in the potentially unsafe VFIO-NOIOMMU mode...

Working on getting kvmchand running in the template VM. VFIO-NOIOMMU mode is not enabled for the kernel that is installed in the template, so I will need to build a custom kernel to get it running for testing purposes. I understand that running in this mode will taint the kernel, and I think it would prevent device assignment since there would be no IOMMU to provide DMA translation. Is this correct?
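
For reference, this is what I believe enabling it involves (the kernel config option and module parameter names are from memory, so treat them as assumptions):

# Kernel build config
CONFIG_VFIO_NOIOMMU=y

# Even with the option built in, the mode must be opted into at runtime
modprobe vfio enable_unsafe_noiommu_mode=1
# or: echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode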

Is this the proper way to set up the guest libvirt config? Do msi settings also need to be applied?

<shmem name='kvmchand'>
    <model type='ivshmem-doorbell'/>
    <server path='/tmp/kvmchand/ivshmem_socket'/>
    <!--  <msi vectors='32' ioeventfd='on'/>  -->
</shmem>

@shawnanastasio
Owner

Add usleep() in qrexec?

Yeah, something like this:

diff --git a/agent/qrexec-agent-data.c b/agent/qrexec-agent-data.c
index 27200c6..67c4b3d 100644
--- a/agent/qrexec-agent-data.c
+++ b/agent/qrexec-agent-data.c
@@ -201,6 +201,7 @@ static int handle_new_process_common(
         abort();
     }
     cmdline[cmdline_len-1] = 0;
+    usleep(2 * 1000 * 1000);
     data_vchan = libvchan_client_init(connect_domain, connect_port);
     if (!data_vchan) {
         LOG(ERROR, "Data vchan connection failed");
-- 
2.27.0

Working on getting kvmchand running in the template VM. VFIO-NOIOMMU mode is not enabled for the kernel that is installed in the template, so I will need to build a custom kernel to get it running for testing purposes. I understand that running in this mode will taint the kernel, and I think it would prevent device assignment since there would be no IOMMU to provide DMA translation. Is this correct?

NOIOMMU mode will not prevent device assignment or hotplugging - the kernel will still assign memory regions to the PCIe device as normal. The only difference is that the ivshmem device's view of memory will not be restricted by an IOMMU. This means a malicious ivshmem device (which means a malicious host QEMU) would be able to write to privileged guest memory. For this use case, that's obviously not an issue since a compromised QEMU would be able to do those things anyways.

Is this the proper way to set up the guest libvirt config? Do msi settings also need to be applied?

For NOIOMMU, no guest libvirt config changes are necessary. kvmchand will automatically attach the required ivshmem devices to all libvirt-managed guests at run-time. If you're curious, the relevant code is here. You'll be able to see the attachment in kvmchand's log, or by running lspci in the guest and checking the output for something like this:

0001:00:00.0 RAM memory: Red Hat, Inc. Inter-VM shared memory (rev 01)

@shawnanastasio
Owner

I'm also not sure what exactly the behavior should be when this happens. Should the daemon simply maintain all existing vchans and continuously try reconnecting, or should it disconnect all existing vchans and essentially restart itself? More information on when exactly the libvirtd connection is expected to drop would be helpful in determining this.

Restarting libvirt should not interrupt existing connections.

Gotcha. Created #14.

On a general thought - if I understand correctly, libkvmchan requires the host side to orchestrate every VM-VM connection.

Correct.

We could use this occasion to adjust the libvchan API to ensure dom0 really approves all the connections. In the Xen case, two cooperating VMs can currently establish a vchan without dom0 approval, which is not an optimal design. This change does mean a libvchan API change, but I think the gains are worth it.

This sounds perfectly reasonable to me. We discussed this previously in #1, including what the API for this might look like. Curious to hear your and @pwmarcz's thoughts.

If this change could also simplify (or even eliminate) kvmchand guest-host communication, that would be an additional gain.

I don't think it would have any significant impact on kvmchand, since this is already essentially how it's implemented today. The changes would just be adding some additional authentication logic and associated new APIs.

@nrgaway
Contributor Author

nrgaway commented Aug 8, 2020

I'm having an issue with getting qrexec-agent to start within the guest. When starting qrexec-agent, a new chardev and device appear to be created successfully, but the vfio.vfio_get_device_fd function reports Unable to obtain device fd for 0000:00:0d.0: No such device!. The device does get created, but is then removed after the error. I did notice that when the device is created it gets added to IOMMU group 1, whereas IOMMU group 0 is used when kvmchand starts via systemctl at boot. Any feedback would be greatly appreciated.

I am including the related logs below for the host and i440 guest. The q35 machine does not work, as libvirt attempts to attach the device to 'pcie.0', which is invalid since it needs to be attached to either a pci or pcie root port. I can get it to attach to a q35 guest by passing a static value in the same manner as the POWER9 code, but we can leave that for another issue.

The logs contain a few extra entries relative to master, but all line numbers match and I removed noise to keep the size to a minimum.

PASS: Host start of kvmchand via systemd unit file

host:  01:48:05.175000-0000 host  systemd[1]: Started KVM vchan daemon.
host:  01:48:05.175000-0000 host  audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=kvmchand comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'

PASS: Guest start with systemd unit file auto-starting kvmchand on boot

host:  >>> virsh start fc32-I440-UEFI --console
host:  01:50:15.618483-0000 host  kvmchand[178531]: [INFO] daemon/libvirt.c:579: Domain fc32-I440-UEFI(1, UUID: a3d8e779-d4d8-44c4-a58b-dfee60b62b8f) changed state (event: 4, detail: 0)!
host:  01:50:15.618483-0000 host  kvmchand[178531]: [WARN] daemon/libvirt.c:653: Unknown lifecycle event 10! Ignoring...
host:  01:50:15.619540-0000 host  kvmchand[178531]: [INFO] daemon/libvirt.c:579: Domain fc32-I440-UEFI(1, UUID: a3d8e779-d4d8-44c4-a58b-dfee60b62b8f) changed state (event: 2, detail: 0)!
host:  01:50:15.622069-0000 host  kvmchand[178531]: [INFO] daemon/libvirt.c:262: About to attach ivshmem device at /tmp/kvmchand/ivshmem_socket, index 0

# chardev-add charshmem0
host:  01:50:15.622+0000:   host  libvirtd[178369]:[WARNING] qemuDomainObjTaint:7163 : Domain id=1 name='fc32-I440-UEFI' uuid=a3d8e779-d4d8-44c4-a58b-dfee60b62b8f is tainted: custom-monitor
host:  01:50:15.622+0000:   host  libvirtd[178369]: qemuMonitorSend:933 : QEMU_MONITOR_SEND_MSG: mon=0x7f64a0033d30 msg={"execute":"chardev-add","arguments":{"id":"charshmem0","backend":{"type":"socket","data":{"server":false,"addr":{"type":"unix","data":{"path":"/tmp/kvmchand/ivshmem_socket"}}}}},"id":"libvirt-369"}
host:   fd=-1
host:  01:50:15.623+0000:   host  libvirtd[178569]: qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f64a0033d30 reply={"return": {}, "id": "libvirt-369"}

# device_add shmem0
host:  01:50:15.623+0000:   host  libvirtd[178370]: qemuMonitorSend:933 : QEMU_MONITOR_SEND_MSG: mon=0x7f64a0033d30 msg={"execute":"device_add","arguments":{"driver":"ivshmem-doorbell","id":"shmem0","chardev":"charshmem0","vectors":2},"id":"libvirt-370"}
host:   fd=-1
host:  01:50:15.623+0000:   host  libvirtd[178569]: qemuMonitorIOWrite:429 : QEMU_MONITOR_IO_WRITE: mon=0x7f64a0033d30 buf={"execute":"device_add","arguments":{"driver":"ivshmem-doorbell","id":"shmem0","chardev":"charshmem0","vectors":2},"id":"libvirt-370"}
host:   len=136 ret=136 errno=0
host:  01:50:15.624+0000:   host  libvirtd[178569]: qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f64a0033d30 reply={"return": {}, "id": "libvirt-370"}
host:  01:50:15.623137-0000 host  kvmchand[178535]: [INFO] daemon/ivshmem.c:534: Got connection from PID: 178565

guest: 01:50:20.704052+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:1090: Got ivshmem device: 0000:00:0c.0
guest: 01:50:20.704196+0000 guest kernel: vfio-pci 0000:00:0c.0: Adding to iommu group 0
guest: 01:50:20.704282+0000 guest kernel: vfio-pci 0000:00:0c.0: Adding kernel taint for vfio-noiommu group on device
guest: 01:50:20.704344+0000 guest kvmchand[731]: [WARN] daemon/vfio.c:1107: Some ivshmem devices aren't bound to vfio-pci. Attempting to bind...
guest: 01:50:20.706995+0000 guest kvmchand[731]: [WARN] daemon/vfio.c:1122: Successfully bound ivshmem devices.
guest: 01:50:20.706995+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:895: vfio_init.ivshmem_group->name: /dev/vfio/g (USE_VFIO_SPAPR ONLY)
guest: 01:50:20.707106+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:523: vfio_get_device: group_fd=8, device=0000:00:0c.0
guest: 01:50:20.707831+0000 guest kernel: vfio-pci 0000:00:0c.0: vfio-noiommu device opened by user (kvmchand:731)
guest: 01:50:20.707920+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:877: Successfully connected to host daemon.

FAIL: Guest start of qubes-qrexec-agent.service

>>> mkdir -p /var/run/qubes
>>> chmod 2770 /var/run/qubes
>>> chgrp qubes /var/run/qubes
>>> mkdir -p /var/log/qubes
>>> systemctl start qubes-qrexec-agent.service

host:  03:09:41.586071+0000 host  kvmchand[183489]: [INFO] daemon/connections.c:118: vchan_init called! server_dom: 2, client_dom: 0, port 512, read_min: 4096, write_min: 4096
host:  03:09:41.586844+0000 host  kvmchand[183490]: [INFO] daemon/libvirt.c:262: About to attach ivshmem device at /tmp/kvmchand/ivshmem_socket, index 2

# chardev-add charshmem1
host:  03:09:41.587078+0000 host  libvirtd[179846]: QEMU_MONITOR_SEND_MSG: mon=0x7fa8c00323d0 msg={"execute":"chardev-add","arguments":{"id":"charshmem1","backend":{"type":"socket","data":{"server":false,"addr":{"type":"unix","data":{"path":"/tmp/kvmchand/ivshmem_socket"}}}}},"id":"libvirt-371"} fd=-1
host:  03:09:41.588159+0000 host  kvmchand[183494]: [INFO] daemon/ivshmem.c:534: Got connection from PID: 183530
host:  03:09:41.588211+0000 host  libvirtd[179846]: QEMU_MONITOR_RECV_REPLY: mon=0x7fa8c00323d0 reply={"return": {}, "id": "libvirt-371"}

# Start qrexec-agent
guest: 03:09:41.583958+0000 guest systemd[1]: Starting Qubes remote exec agent...
guest: 03:09:41.586192+0000 guest qrexec-agent[1652]: qrexec-agent.c:376:init: qrexec-agent.init: ctrl_vchan call
guest: 03:09:41.586192+0000 guest qrexec-agent[1652]: [INFO] library.c:300: libkvmchan_server_init: Entered function
guest: 03:09:41.586420+0000 guest qrexec-agent[1652]: [INFO] library.c:308: libkvmchan_server_init: Initialize libkvmchan struct...
guest: 03:09:41.586420+0000 guest qrexec-agent[1652]: [INFO] library.c:316: libkvmchan_server_init: Send request to kvmchand...
guest: 03:09:41.586420+0000 guest qrexec-agent[1652]: [INFO] library.c:330: libkvmchan_server_init: Message send...
guest: 03:09:41.586420+0000 guest qrexec-agent[1652]: [INFO] library.c:332: libkvmchan_server_init: Message receive...
guest: 03:09:41.586496+0000 guest kvmchand[735]: [INFO] daemon/localhandler.c:600: Client connected! fd: 8

# device_add shmem1
host:  03:09:41.588697+0000 host  libvirtd[179846]: QEMU_MONITOR_SEND_MSG: mon=0x7fa8c00323d0 msg={"execute":"device_add","arguments":{"driver":"ivshmem-doorbell","id":"shmem1","chardev":"charshmem1","vectors":2},"id":"libvirt-372"} fd=-1
host:  03:09:41.590623+0000 host  libvirtd[179846]: QEMU_MONITOR_RECV_REPLY: mon=0x7fa8c00323d0 reply={"return": {}, "id": "libvirt-372"}

guest: 03:09:41.592567+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:1006: resp from host: err: 0, ret: 2
guest: 03:09:41.592786+0000 guest qrexec-agent[1652]: [INFO] library.c:339: libkvmchan_server_init: Dom0 check...
guest: 03:09:41.592786+0000 guest qrexec-agent[1652]: [INFO] library.c:344: libkvmchan_server_init: Dom0 is False
guest: 03:09:41.592786+0000 guest qrexec-agent[1652]: [INFO] library.c:267: get_conn_fds_deferred: usleep(4 * 1000 * 1000)...
guest: 03:09:41.593660+0000 guest kernel: vfio-pci 0000:00:0c.0: EDR: ACPI event 0x1 received
guest: 03:09:41.593862+0000 guest kernel: pci 0000:00:0d.0: [1af4:1110] type 00 class 0x050000
guest: 03:09:41.593906+0000 guest kernel: pci 0000:00:0d.0: reg 0x10: [mem 0x00000000-0x000000ff]
guest: 03:09:41.593935+0000 guest kernel: pci 0000:00:0d.0: reg 0x14: [mem 0x00000000-0x00000fff]
guest: 03:09:41.593962+0000 guest kernel: pci 0000:00:0d.0: reg 0x18: [mem 0x00000000-0x00003fff 64bit pref]
guest: 03:09:41.595219+0000 guest kernel: pci 0000:00:0d.0: BAR 2: assigned [mem 0x80011c000-0x80011ffff 64bit pref]
guest: 03:09:41.595354+0000 guest kernel: pci 0000:00:0d.0: BAR 1: assigned [mem 0xc804d000-0xc804dfff]
guest: 03:09:41.596670+0000 guest kernel: pci 0000:00:0d.0: BAR 0: assigned [mem 0xc804e000-0xc804e0ff]
guest: 03:09:41.596814+0000 guest kernel: vfio-pci 0000:00:0d.0: Adding to iommu group 1
guest: 03:09:41.596915+0000 guest kernel: vfio-pci 0000:00:0d.0: Adding kernel taint for vfio-noiommu group on device

guest: 03:09:45.592926+0000 guest qrexec-agent[1652]: [INFO] library.c:269: get_conn_fds_deferred: wakeup...
guest: 03:09:45.592926+0000 guest qrexec-agent[1652]: [INFO] library.c:276: get_conn_fds_deferred: localmsg_send...
guest: 03:09:45.592926+0000 guest qrexec-agent[1652]: [INFO] library.c:278: get_conn_fds_deferred: localmsg_recv...
guest: 03:09:45.593587+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:522: vfio_get_device: group_fd=8, device=0000:00:0d.0

# device_del shmem1
host:  03:09:45.593655+0000 host  libvirtd[179846]: QEMU_MONITOR_SEND_MSG: mon=0x7fa8c00323d0 msg={"execute":"device_del","arguments":{"id":"shmem1"},"id":"libvirt-373"} fd=-1

guest: 03:09:45.593744+0000 guest kvmchand[731]: [ERROR] daemon/vfio.c:523: Unable to obtain device fd for 0000:00:0d.0: No such device!
guest: 03:09:45.593744+0000 guest kvmchand[731]: [ERROR] daemon/vfio.c:787: Failed to rebuild vfio connection vec!
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: [INFO] library.c:280: get_conn_fds_deferred: ret.error...
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: [INFO] library.c:281: get_conn_fds_deferred: ret.error
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: [ERROR] library.c:387: libkvmchan_server_init: FAIL_MALLOC_RET
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: [INFO] library.c:390: libkvmchan_server_init: OUT
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: qrexec-agent.c:376:init: qrexec-agent.init: ctrl_vchan returned
guest: 03:09:45.593780+0000 guest qrexec-agent[1652]: qrexec-agent.c:332:handle_vchan_error: Error while vchan server_init
, exiting
guest: 03:09:45.593972+0000 guest kvmchand[735]: [INFO] daemon/localhandler.c:608: Client disconnected! fd: 8
guest: 03:09:45.594166+0000 guest systemd[1]: qubes-qrexec-agent.service: Main process exited, code=exited, status=1/FAILURE

host:  03:09:45.594245+0000 host  libvirtd[179846]: Line [{"return": {}, "id": "libvirt-373"}]
host:  03:09:45.594267+0000 host  libvirtd[179846]: QEMU_MONITOR_RECV_REPLY: mon=0x7fa8c00323d0 reply={"return": {}, "id": "libvirt-373"}

guest: 03:09:45.594288+0000 guest systemd[1]: qubes-qrexec-agent.service: Failed with result 'exit-code'.
guest: 03:09:45.594703+0000 guest systemd[1]: Failed to start Qubes remote exec agent.
guest: 03:09:45.596423+0000 guest kvmchand[731]: [INFO] daemon/vfio.c:1006: resp from host: err: 0, ret: 0
guest: 03:09:45.597644+0000 guest kernel: vfio-pci 0000:00:0d.0: EDR: ACPI event 0x3 received
guest: 03:09:45.597780+0000 guest kernel: vfio-pci 0000:00:0d.0: Removing from iommu group 1

host:  03:09:45.657558+0000 host  libvirtd[179846]: QEMU_MONITOR_RECV_EVENT: mon=0x7fa8c00323d0 event={"timestamp": {"seconds": 1596856185, "microseconds": 657381}, "event": "DEVICE_DELETED", "data": {"device": "shmem1", "path": "/machine/peripheral/shmem1"}}

# chardev-del charshmem1
host:  03:09:50.598873+0000 host  libvirtd[179846]: QEMU_MONITOR_SEND_MSG: mon=0x7fa8c00323d0 msg={"execute":"chardev-remove","arguments":{"id":"charshmem1"},"id":"libvirt-374"} fd=-1
host:  03:09:50.599670+0000 host  libvirtd[179846]: QEMU_MONITOR_RECV_REPLY: mon=0x7fa8c00323d0 reply={"return": {}, "id": "libvirt-374"}

@shawnanastasio
Owner

Interesting, thank you for the detailed logs! I believe the issue is the following:

I did notice that when the device is created it gets added to IOMMU group 1, whereas IOMMU group 0 is used when kvmchand starts via systemctl at boot.

The current VFIO code assumes that all ivshmem devices will be in the same IOMMU group. On POWER this isn't an issue, since all devices end up in the same group (potentially due to statically assigning the device to the same bridge). I think the next step would be to see if we can get that behavior on x86 by reusing the same static assignment code, though I vaguely recall a limitation with NOIOMMU mode that results in each device getting its own fake IOMMU group so this may not work.

The NOIOMMU code was added really early on in this project's life, before hotplugging of ivshmem devices was implemented, and hasn't been tested by me since the addition of POWER vIOMMU support, so that's why I haven't caught this.

At this point, instead of modifying the VFIO code to tolerate multiple VFIO groups for NOIOMMU mode, I think the effort would be best spent on implementing proper vIOMMU support for x86_64 (#16). I'll spin up an x86_64 box and start work on this. I'll give you an update when it's implemented.

@shawnanastasio
Owner

shawnanastasio commented Aug 8, 2020

Already hit a roadblock with vIOMMU support on x86_64. Hilariously, the x86_64 VM PCIe hotplug driver requires the entire PCIe bridge to be shut down for the duration of the hotplug, therefore invalidating all previously held ivshmem device handles. I may look into patching this in the kernel (likely this requirement comes from constraints of real hardware that don't necessarily apply to VMs with virtual PCIe devices), but for the meantime adding support for multiple IOMMU groups to VFIO and sticking with NOIOMMU mode may be the way to go.

EDIT: Upon further investigation, the issues go even deeper. On Q35, all hotplugged devices need their own pre-defined pcie root port (see here), so additional root port allocation code will need to be added to libvirt.c. As you mentioned, i440fx seems to allow hotplugging of devices by default without any manual root port assignment. The downside is that the vIOMMU is unavailable to i440fx guests.

In light of all of this, adding multiple IOMMU group support so that the existing NOIOMMU code can be used with i440fx guests seems like the path of least resistance for now.

@nrgaway
Contributor Author

nrgaway commented Aug 9, 2020

I hate roadblocks :) Thanks for looking into this so quickly.

Are you familiar with the pcie_acs_override kernel option or ivshmem2? I have absolutely no idea if either of these could assist in any form, as I have not done any research on them. From what I remember, pcie_acs_override allows modification of IOMMU groupings (access control lists) but may have no effect on hotplugging. In regards to ivshmem2, I saw a YouTube video the other day from Jan Kiszka of Siemens (https://www.youtube.com/watch?v=TiZrngLUFMA) who has created some server and guest components, which may not apply to this use case, but maybe some of the code can be reused. Other than watching the video I have not looked into it further.

In regards to Q35, you can add a pcie-to-pci-bridge and plug into that, attaching a pci device instead of a pcie device.
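
Something along these lines in the domain XML should be enough to create the bridge; libvirt should place it on a root port and assign the bus automatically (snippet is untested on my side):

<controller type='pci' model='pcie-to-pci-bridge'/>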

I don't mind having to rely on NOIOMMU mode in the short term, but I have concerns about relying on that mode since it requires a custom kernel to be built (and maintained), as mainstream kernels such as Fedora's do not enable the option. The other concern is how this affects security when also passing through other VFIO devices, such as a network controller and GPU, when NOIOMMU is enabled.

The ultimate goal would be to have a solution that works with the Q35 machine, as the I440 is considered legacy and Q35 adds many performance improvements and features. Even the OVMF team has stated that many BIOS features are targeted towards Q35.

Now that I am aware multiple IOMMU groups are an issue, I'll also start researching possible solutions tonight and play around with your code some more. Feel free to offload any testing or additional research to me.

@shawnanastasio
Owner

shawnanastasio commented Aug 9, 2020

Implemented support for multiple VFIO groups: 8b605c2. Everything seems to work as expected on an i440fx guest!

Are you familiar with the pcie_acs_override kernel option or ivshmem2? I have absolutely no idea if either of these could assist in any form, as I have not done any research on them. From what I remember, pcie_acs_override allows modification of IOMMU groupings (access control lists) but may have no effect on hotplugging. In regards to ivshmem2, I saw a YouTube video the other day from Jan Kiszka of Siemens (https://www.youtube.com/watch?v=TiZrngLUFMA) who has created some server and guest components, which may not apply to this use case, but maybe some of the code can be reused. Other than watching the video I have not looked into it further.

From what I remember, pcie_acs_override requires an out-of-tree kernel patch. In any case it shouldn't really benefit us here.
As for ivshmem2, this is the first I've heard of it. From watching the video it looks like it introduces a lot of nice changes that could benefit us, particularly with regards to lifecycle management. It seems the patches aren't merged into mainline QEMU yet, but I'll be keeping a close eye on this. Thanks for the heads up.

In regards to Q35, you can add a pcie-to-pci-bridge and plug into that, attaching a pci device instead of a pcie device.

This might be the perfect solution - I don't know why I didn't try that! In theory this should match the i440fx behavior (with the added requirement of specifying the bridge in the libvirt.c hotplug code), right? If that's the case then adding support should be trivial. I'll work on this next.

I don't mind having to rely on NOIOMMU mode in the short term, but I have concerns about relying on that mode since it requires a custom kernel to be built (and maintained), as mainstream kernels such as Fedora's do not enable the option. The other concern is how this affects security when also passing through other VFIO devices, such as a network controller and GPU, when NOIOMMU is enabled.

The ultimate goal would be to have a solution that works with the Q35 machine, as the I440 is considered legacy and Q35 adds many performance improvements and features. Even the OVMF team has stated that many BIOS features are targeted towards Q35.

Agreed on both counts - NOIOMMU is a stopgap solution at best. Now that I'm aware of the pcie-to-pci-bridge solution with Q35, though, the path forward should hopefully be easier, especially since the PCI hotplug kernel code seems to behave much closer to what I expect than the PCIe hotplug code.

Feel free to offload any testing or additional research to me.

If you could let me know how the multiple VFIO group commit works for you, that'd be great. I'd also like to look into a proper way for detecting which PCI(e) bridge ivshmem devices should be hotplugged into in the libvirt.c code, as the current solution of hardcoding bridge names per-platform isn't ideal. My current idea is to simply parse the guest's libvirt XML definition and pick out the correct bridge but this is a bit messy and if there was a cleaner way I'd love to know.
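
As a rough illustration of the XML-parsing idea (nothing committed; it assumes libxml2 in addition to libvirt, and the function name is made up), it could look something like this:

/* Sketch only: locate the bridge to hotplug ivshmem devices into by parsing
 * the guest's libvirt XML. Error handling kept minimal. */
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

/* Returns a malloc'd alias name of the first pcie-to-pci-bridge controller
 * in the domain, or NULL if there isn't one. Caller frees. */
static char *find_ivshmem_bridge_alias(virDomainPtr dom) {
    char *result = NULL;
    char *xml = virDomainGetXMLDesc(dom, 0);
    if (!xml)
        return NULL;

    xmlDocPtr doc = xmlReadMemory(xml, strlen(xml), "domain.xml", NULL, 0);
    xmlXPathContextPtr ctx = doc ? xmlXPathNewContext(doc) : NULL;
    if (ctx) {
        xmlXPathObjectPtr obj = xmlXPathEvalExpression(
            (const xmlChar *)"//devices/controller[@type='pci']"
                             "[@model='pcie-to-pci-bridge']/alias/@name", ctx);
        if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
            xmlChar *name = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
            if (name) {
                result = strdup((const char *)name);
                xmlFree(name);
            }
        }
        if (obj)
            xmlXPathFreeObject(obj);
        xmlXPathFreeContext(ctx);
    }

    if (doc)
        xmlFreeDoc(doc);
    free(xml);
    return result;
}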

@nrgaway
Contributor Author

nrgaway commented Aug 9, 2020

Implemented support for multiple VFIO groups: 8b605c2. Everything seems to work as expected on an i440fx guest!

Great, I will start testing it tonight!

In regards to Q35, you can add a pcie-to-pci-bridge and plug into that, attaching a pci device instead of a pcie device.

This might be the perfect solution - I don't know why I didn't try that! In theory this should match the i440fx behavior (with the added requirement of specifying the bridge in the libvirt.c hotplug code), right? If that's the case then adding support should be trivial. I'll work on this next.

I quickly hacked the attach_ivshmem_device function in libvirt.c to do this, as shown below. As the pcie-to-pci bridge is on bus 8, I set pci_bus to 8. I set pci_slot to index+1 just to be able to test adding other devices. The devices were added (and then deleted due to the different IOMMU group; I have not tested with your updated code).

Q35 CONTROLLER CONFIG
        00: pcie-root
            VIDEO               bus=00, slot=01, func=0, id=video0
            FILESYSTEM-SHARE    bus=00, slot=0b, func=0, id=fs0
            SOUND               bus=00, slot=1b, func=0, id=sound0
            SATA                bus=00, slot=1f, func=2
        01: pcie-root-port      bus=00, slot=02, func=0, id=pci.1
            NETWORK             bus=01, slot=00, func=0, id=net0
        02: pcie-root-port      bus=00, slot=02, func=1, id=pci.2
            USB                 bus=02, slot=00, func=0, id=usb
        03: pcie-root-port      bus=00, slot=02, func=2, id=pci.3
            VIRTIO-SERIAL       bus=03, slot=00, func=0, id=virtio-serial0
        04: pcie-root-port      bus=00, slot=02, func=3, id=pci.4
            DISK                bus=04, slot=00, func=0, id=virtio-disk0
        05: pcie-root-port      bus=00, slot=02, func=4, id=pci.5
            MEMBALOON           bus=05, slot=00, func=0, id=baloon0
        06: pcie-root-port      bus=00, slot=02, func=5, id=pci.6
            RANDOM              bus=06, slot=00, func=0, id=rng0
        07: pcie-root-port      bus=00, slot=02, func=6, id=pci.7
        08: pcie-pci-bridge     bus=07, slot=00, func=0, id=pci.8

attach_ivshmem_device HACK:

    // device_add now takes an explicit bus and slot so the ivshmem device
    // lands on the pcie-to-pci bridge instead of pcie.0
    const char qmp_new_ivshmem_format[] = "{\"execute\":\"device_add\", \"arguments\": {\"driver\": "
        "\"ivshmem-doorbell\", \"id\":\"shmem%1$"PRIu32"\", \"chardev\":\"charshmem%1$"PRIu32"\", \"vectors\": 2,"
        "\"bus\": \"%2$s\", \"addr\": \"%3$s\"}}";

    // Fill in arguments for chardev and ivshmem commands
    char chardev_buf[sizeof(qmp_new_chardev_format) + 255];
    char ivshmem_buf[sizeof(qmp_new_ivshmem_format) + 255];
    snprintf(chardev_buf, sizeof(chardev_buf), qmp_new_chardev_format, index, path);

    // Hardcoded for testing: bus 8 is the pcie-to-pci bridge; the slot is
    // emitted in hex to match QEMU's addr syntax
    const char pci_bus_format[] = "pci.%1$"PRIu32"";
    const char pci_slot_format[] = "0x%1$"PRIx32"";
    char pci_bus[10 /* "pci.255" */];
    char pci_slot[5 /* "0xff" */];
    snprintf(pci_bus, sizeof(pci_bus), pci_bus_format, 8);
    snprintf(pci_slot, sizeof(pci_slot), pci_slot_format, index+1);

    snprintf(ivshmem_buf, sizeof(ivshmem_buf), qmp_new_ivshmem_format, index,
             pci_bus, pci_slot);

If you could let me know how the multiple VFIO group commit works for you, that'd be great. I'd also like to look into a proper way for detecting which PCI(e) bridge ivshmem devices should be hotplugged into in the libvirt.c code, as the current solution of hardcoding bridge names per-platform isn't ideal. My current idea is to simply parse the guest's libvirt XML definition and pick out the correct bridge but this is a bit messy and if there was a cleaner way I'd love to know.

I will test it tonight. I was thinking that one pcie-to-pci-bridge controller should be dedicated to all the kvmchand ivshmem devices. This would allow up to 31 devices to be added to the controller, which I would think should be enough. Then maybe add an alias within the configuration to be able to identify the controller? With the alias set, maybe you can parse the XML with something like libxml, matching by path. The other thing I can look into is whether there is a libvirt or QEMU command that would better represent the data, like query-pci.
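
For example, if I'm reading the libvirt docs right, a user-defined alias (they need a ua- prefix) on the dedicated bridge could look like this; the name itself is just a placeholder:

<controller type='pci' model='pcie-to-pci-bridge'>
  <alias name='ua-ivshmem-bridge'/>
</controller>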

@nrgaway
Contributor Author

nrgaway commented Aug 10, 2020

Just a quick update to let you know that the multiple VFIO group support seems to be working nicely, and I was able to get the host to communicate with an I440 guest using qrexec! I will provide more details over the next few days, since I still need to work on the configurations.

@nrgaway
Contributor Author

nrgaway commented Aug 21, 2020

Just another update...

Over the last week I have done many tests and worked some more on packaging. The qubes-builder currently builds all host and template packages for Fedora 32. The template boots and communicates via qubes-db. I have not been able to get qvm-run working from host to guest for some reason, but I admit I have not spent that much time on it, since I wanted the initial packaging and build working for further testing.

I had issues with qrexec-daemon.c when using qvm-start to boot the VM, where it was getting messages as shown below. To work around this issue, I added code within qrexec-daemon.c which attempts to reconnect every second (a rough sketch follows the log line below). So far it seems to work and connects quickly.

HOST kvmchand[237779]: [WARN] daemon/connections.c:289: Tried to connect to non-existent vchan at dom 0 port 111
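
The reconnect hack is roughly the following, simplified from my local patch; the helper name is made up and the one-second interval is arbitrary:

/* Simplified sketch of the retry added to qrexec-daemon.c: the agent's
 * server-side vchan may not exist yet right after qvm-start, so retry
 * libvchan_client_init() once per second until it succeeds. */
#include <unistd.h>
#include <libvchan.h>

static libvchan_t *client_init_with_retry(int domain, int port)
{
    libvchan_t *vchan;

    while (!(vchan = libvchan_client_init(domain, port)))
        sleep(1);

    return vchan;
}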

I also had issues within the VM where the qubes-db and qrexec services would fail to start because the kvmchand daemon was not completely started, so I changed the kvmchand unit file to use Type=notify and added some code to libkvmchan to notify systemd once it was initialized (patch to follow).
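
Until I post the patch, the gist is just an sd_notify() call once initialization finishes, paired with Type=notify in the unit file (the exact call site in kvmchand's startup path is up for discussion):

/* Sketch: signal readiness to systemd once kvmchand has finished
 * initializing. Requires linking against libsystemd. */
#include <systemd/sd-daemon.h>

static void notify_systemd_ready(void)
{
    /* Harmless no-op when NOTIFY_SOCKET isn't set (i.e. not run by systemd) */
    sd_notify(0, "READY=1");
}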

@nrgaway
Contributor Author

nrgaway commented Sep 16, 2020

Another Quick Status Update

Some good news: I just finished implementing the qubes-gui-agent component within the guest VM, minus the gui-{daemon,agent}. There are a few problems which I will detail in new issues, but the only VM parts left to complete are the networking and, of course, the gui-{daemon,agent}. After that there are a few host issues to resolve, and then everything needs to be packaged up. Currently all host and VM components build, and I have built temporary custom installers for both host and VM. Some manual configuration for the host is still required but will be addressed in the packaging stage. The VM side is more automated, as it takes the built image and applies any KVM modifications not yet added to the packaging. The end result is a mostly functional Qubes image (minus GUI and network) that boots and communicates with the host.

I was also in the process of preparing resources for you highlighting the code that needs to be changed for the KVM GUI when I came across some previous work you had completed within qubes-gui-daemon, and figured you must already have an understanding of the related Qubes internals. I will post the resources in a separate issue since they may still be useful. If there is anything I can help with to get this implemented, just let me know what you need.

@jonathancross
Copy link

Hi guys, any updates for us lurkers?
Thanks for the great work!

@shawnanastasio
Owner

Hi guys, any updates for us lurkers?
Thanks for the great work!

Hi,

I've recently picked up working on libkvmchan again to fix the outstanding bugs and bring it closer to feature parity with Xen's vchan. I just pushed a fix for #20, which was one of the major outstanding issues.

At this point I'm going to look into Qubes-specific bringup work. It seems @nrgaway has already done a lot in this area, so I'll likely begin by basing it off of their work. My initial target will be ppc64le, but x86_64 shouldn't be much extra work.

@nrgaway, if you could provide an overview of your development environment (host OS and configuration, how you build your qubes VM images, etc.), that would be greatly appreciated.

@brunocek

The community is documenting pros and cons in an architectural discussion on the qubes forum here:
https://forum.qubes-os.org/t/porting-qubes-to-hypervisors-other-than-xen-abstracting-the-functionality-early-stage/23478/8

@flflover mentioned this thread there.
