Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Team work on getting blktap in recent kernel #325

Open
olivierlambert opened this issue Aug 11, 2020 · 20 comments
Open

Team work on getting blktap in recent kernel #325

olivierlambert opened this issue Aug 11, 2020 · 20 comments

Comments

@olivierlambert
Copy link

Hi there!

We started to think about various changes on blktap, especially to achieve these goals:

  1. IO_uring support for faster IOPS (we already have a PoC)
  2. Faster integration/merge in recent Linux kernels
  3. Ideally, having it upstream Linux if possible at some point

Obviously, doing this together and going into the right direction after some architecture decision might be faster for everyone, deliver better perfs for both Citrix Hypervisor and XCP-ng. What do you think? Where you'd like to start?

@edwintorok
Copy link
Contributor

edwintorok commented Aug 11, 2020

For 2/3: I think the goal is to use NBD instead of the block device, so we do not require a kernel patch anymore.
io_uring would be interesting, but it is not available in the 4.19 kernel that we use.

@olivierlambert
Copy link
Author

I'm not thinking about 4.19 but more recent kernel (cf the title of the issue). Something "future proof". So far, it's really hard to merge current blktap code in 5.x kernels.

Regarding 2 and 3, if you use NBD, what about VHD format? Is it mutually exclusive with it? So in short, you'll use NBD directly between the guest and dom0?

@MarkSymsCtx
Copy link
Contributor

No, we'll use NBD from qemu to tapdisk without going via /dev/tdX (symlinked to /dev/sm/backend/XX) which will be deprecated and removed as the patch in the kernel to provide those devices cannot be forward ported to any 5.x or later kernel. Tapdisk will still write the data to VHD files/volumes.

@MarkSymsCtx
Copy link
Contributor

We know that the current blktap patch will not be accepted into the upstream kernel, it's been asked previously, so the correct way forward is to not use it at all.

@BenSimsCitrix
Copy link
Contributor

BenSimsCitrix commented Aug 11, 2020

The nbd refit is almost complete with only a few bugs to fix in the toolstack.

Regarding the use of io_uring, If was hoping to extend the work done in ba79b73. But if you have a similar plugin architecture that's fine.

We need some thought about howto fully systemd tapdisk.

@olivierlambert
Copy link
Author

olivierlambert commented Aug 11, 2020

We'll be happy to help! Do you have any preliminary perfs results from switching from blktap to NBD? (if I'm correct with the way to make a recap of what you say here).

We can also help on the testing side of things, as long we know what branches to build together to generate RPM packages and start to do the bench work (we have some automated scripts to do nice fio graphs).

About your systemd questions, feel free to share your questions/blockers, more brains could mean faster thinking sometimes!

@Wescoeur
Copy link
Contributor

Hi everyone,
I would like to have more info concerning your message @MarkSymsCtx:

No, we'll use NBD from qemu to tapdisk without going via /dev/tdX (symlinked to /dev/sm/backend/XX) which will be deprecated and removed as the patch in the kernel to provide those devices cannot be forward ported to any 5.x or later kernel. Tapdisk will still write the data to VHD files/volumes.

I'm not sure to understand. Is this diagram up to date? https://wiki.xenproject.org/images/0/06/Blktap%24blktap_diagram_differentSymbols.png

What's the usage of qemu here? If I understand correctly the goal is to use a qemu instance in the host user space to write/read from the VM disk using blkback instead of the blktap patch or another driver? The data is then forwarded to tapdisk using NBD, isn'it? So to use a VM disk, we must have a qemu process + tapdisk to just read/write using NBD. Why not use qemu directly without tapdisk, like qemu-dp in SMAPIv3 but with VHD support?

Thank you in advance for your clarification!

@MarkSymsCtx
Copy link
Contributor

HVM guests have a qemu process which handles the "emulated" devices required at boot. This is the primary thing requiring the /dev/blktap/blktapX devices. Once the PV drivers have loaded in guest all disk IO is performed via the PV ring direct to the tapdisk user space process. The diagram above is incorrect and shows blkfrnt communicating with in-kernel blktap. This doesn't happen, and hasn't been the case since blktap3. As the qemu "device model" process is also capable of taking its boot disk parameter as an NBD url (as is used in the case of the Citrix GFS2 SR as it doesn't provide a Dom0 block device interface) and the userspace tapdisk process also has an NBD server in it we can change the consumers of the block device to use NBD and thus no longer require the unsupportable in-kernel device, at which point the kernel patch can be dropped entirely. This will need to happen before moving to a 5.x kernel.

@Wescoeur
Copy link
Contributor

Oh! I see, it's more clear now! I had trouble understanding the architecture, so you were talking about qemu-dm! 🙂 So NBD or the blktap patch is only used for HVM or PVM during the boot phase. Thank you so much!

@MarkSymsCtx
Copy link
Contributor

Boot or before PV drivers are installed in the case of Window, yes.

@jandryuk
Copy link
Contributor

I ported blktap2 to 5.4:
https://github.com/OpenXT/xenclient-oe/blob/master/recipes-kernel/linux/5.4/patches/blktap2.patch

OpenXT uses tap-ctl to extract kernels and hash disk images during pre-boot.

Having said that, I wish for it to go away. Will tapdisk work today without the blktap2 module?

@olivierlambert
Copy link
Author

@jandryuk we did it too IIRC (@Wescoeur did). But yes, we want to move forward too and make everything work without it. Maybe we should discuss that more in depth somewhere?

@jandryuk
Copy link
Contributor

@olivierlambert do you have a pointer to your blktap2 patch? I had to fix some bugs, so I'm interested to compare.

OpenXT is tightly bound to tap-ctl and blktap2, so it's not something I am looking to tackle any time soon, sorry.

@MarkSymsCtx
Copy link
Contributor

It won't currently work without as all the tap-ctl commands require the device minor id as a parameter in order to identify which disk the command refers to as tapdisk is able to handle multiple VBDs in a single tapdisk process (not that it gets used this way, certainly not in Citrix Hypervisor and I don't think in either xcp-ng or OpenXT either). This will require some careful thought, design and architectural oversight to address likely as a several phase operation, i.e. add a new parameter that can optionally (and preferentially) be used in place of the device minor id, update the control plane code and xenopsd to use this identifier and track down any places that fail or continue to use the "wrong" thing. Then finally remove the old parameter and the kernel patch.

@olivierlambert
Copy link
Author

@Wescoeur do you remember where we stored your draft work on getting recent kernel working?

@MarkSymsCtx the best course will be obviously (for XCP-ng and likely Citrix Hypervisor) to get rid entirely on this legacy stuff. As we said earlier, we'll be happy to spend some time and efforts on this 👍

@MarkSymsCtx
Copy link
Contributor

A large chunk of it is sort of done, basically replacing use of /dev/tdX in Dom0 with NBD to the NBD socket provided by the server in Tapdisk and this completes end to end testing. But the next bits are to update what goes into XenStore as backend configuration keys, hint physical-device is no longer helpful, and make sure that tapback does the right thing with that. Then the designing of the updated commandlines needs to happen to allow the control plane to interact with tap-ctl without referring to the identity of the block device that we want to remove. (Oh, and all of this without breaking Storage XenMotion of course, it being the only current usage of the NBD server).

@Wescoeur
Copy link
Contributor

@olivierlambert @jandryuk We have it in an internal gitlab. It's a quick and dirty poc but I can create a public repo.

@MarkSymsCtx Thank you for the details! 👍

@Wescoeur
Copy link
Contributor

Wescoeur commented May 5, 2021

Hi @MarkSymsCtx,

We are very interested in speeding up the removal of this patch and contributing on this repo.
Could we have information about which projects use the "physical-device" property?
I know xenopsd writes this value, tapdisk reads it, but are there other places where it’s used?

If I don't write the "physical-device" property I can start a VM using a nbd+unix URI with the socket path (I already patched my local smapi to do that), but I don’t understand where the change between emulation and PV driver occurs. What piece of code is in charge of this please? Is there a piece of code to modify in tapdisk or elsewhere to finalize this switch?
Thank you!

@Wescoeur
Copy link
Contributor

@MarkSymsCtx Do you have more info please? 🙂

@MarkSymsCtx
Copy link
Contributor

Not at this time, no I don't

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants