
KVM: replace -drive by -blockdev and make HMP obsolete #1667

Open
wants to merge 1 commit into
base: master
Conversation

@rbott (Member) commented Apr 6, 2022

This PR replaces the deprecated -drive commandline parameter of QEMU with -blockdev. It also switches hotplugging to use blockdev-add|del via QMP and hence cuts the last ties from Ganeti to QEMU's human monitor (HMP).

Please consider this PR work-in-progress!

This PR also touches several of KVM's parameters:

  • disk_type ide is now named ide-hd
  • disk_cache now only accepts writethrough/writeback (default: writeback)
  • disk_aio now also supports io_uring when QEMU 5.0+ is present (default: threads)
  • disk_discard now only accepts ignore/unmap (default: ignore)

Instead of changing parameters to reflect QEMU (e.g. ide -> ide-hd) we could also try to keep the parameters as they are and just translate them in the code. What would you prefer? Regarding the disk_cache parameter, there are really only two sensible choices (dictated by the blockdev parameter cache.direct=on|off). I think it makes it easier for the user if Ganeti does not offer four different choices when there really are only two.

What has been tested?

  • The code works fine on Debian Bullseye locally with plain disk instances and Hot-Add/Hot-Removal
  • The code works fine with the QA suite (3-Node Debian Bullseye clusters with various storage modes + Live Migration & Hotplugging)
    • DRBD
    • File
    • SharedFile
    • Plain
    • RBD (including userspace access)
    • GlusterFS (including userspace access). Note: these tests were actually carried out manually, not through the QA suite
  • hot-adding disks after cluster upgrades works (tested with 3.0 -> 3.1/HEAD, with and without instance restarts)
What is currently missing?

  • updates to the documentation (e.g. updated parameters)
  • the cdrom code needs to be refactored to use -blockdev
  • the floppy code needs to be refactored to use -blockdev
  • migration code for the disk_discard parameter (no more "default" value)
  • ExtStorage disk type is entirely untested (also see this post)

@apoikos @iustin @saschalucas it would be great if you could find some time to review so we can try to improve this PR. Once this is done, we can close #1542 and aim for a 3.1 release :-)

@apoikos apoikos self-requested a review April 6, 2022 19:01
@apoikos (Member) left a comment

Thanks for looking into this @rbott!

At a quick glance, it looks good in principle. There are obviously some bits missing, please find my comments below - I admit I wrote them before reading the whole description where you already mention that these are known issues. Still, I view them as important points that I would like to see addressed.

I'll try to find some time in the following days for a more in-depth review.

Just a small request: if there's missing infrastructure that you decide to implement as part of this work (e.g. QMP event handling), please do so in new PRs against master. This leads to easier reviews and more self-contained code.

Cheers!

@@ -2731,7 +2731,7 @@ htHvmValidVifTypes = ConstantUtils.mkSet [htHvmVifIoemu, htHvmVifVif]
-- * Disk types

htDiskIde :: String
htDiskIde = "ide"
htDiskIde = "ide-hd"
Member
This affects serialization to/from the config, and thus will require a migration for config data. Since htDiskIde is Ganeti's disk type, and it doesn't have to be the same as whatever QEMU uses for IDE, I would keep it as-is and just pass ide-hd to QEMU where needed.

Member Author

You are right, I will change that back!

Member Author
fixed.

htCacheDefault = "default"

htCacheNone :: String
htCacheNone = "none"
Member
Again, dropping these means that we need to take care of upgrading existing configs and mapping previous values to the new ones (e.g. convert none to writethrough), otherwise deserialization will fail and/or instances will fail to start e.g. because of https://github.com/ganeti/ganeti/pull/1667/files#diff-cd6e55ca6245e02cbd788a3e5a43583e83e2de3816766ce2a98552cec0841df4R203.

We should also check and see if there's any upgrade that needs to be done at the runtime files of already-running instances (I'm not sure it's needed, but haven't checked).

Member Author
Yes, this would require upgrade/downgrade code and of course lots of testing.

Member Author
fixed (disk_cache reverted to the old list of accepted values)

# TODO: implement receiving of QMP events
# We need to wait for the DEVICE_DELETED event via QMP before proceeding.
# The old implementation using HMP only worked because it is so slow.
time.sleep(1)
Member
I would really like to see even some hack for event handling here, rather than time.sleep :-).

Member Author
Yeah, I entirely agree. That was more of a workaround to get things into a working state. OTOH it pretty much reflects what happened before (although not as explicitly), and implementing event handling via QMP is probably way out of scope for this PR.
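For illustration, a polling sketch of such DEVICE_DELETED handling (the get_event client method is hypothetical; Ganeti's actual QMP client may expose a different interface):

```python
import time

def wait_for_device_deleted(qmp, device_id, timeout=30.0):
    """Block until QMP delivers DEVICE_DELETED for device_id, or the
    timeout expires. Returns True on success, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        event = qmp.get_event(wait=1.0)  # hypothetical client call
        if event is None:
            continue
        if (event.get("event") == "DEVICE_DELETED"
                and event.get("data", {}).get("device") == device_id):
            return True
    return False
```

This would replace the fixed time.sleep(1) with an upper-bounded wait that returns as soon as QEMU confirms the removal.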

@saschalucas (Member)

Thanks @rbott for your work. Without going into depth, I would prefer to keep the old values for the instance parameters disk_type (the ide case) and especially disk_cache.

I think it might not be enough to just support/translate writethrough/writeback into cache.direct. Basically writethrough means read-cache-only; can that really be expressed as cache.direct=on? We also need none for the no-cache case.

In qemu-system-x86_64(1) there is a nice translation table from -drive to -device (column 2) and -blockdev (columns 3+4) cache options.

             │ cache.writeback   cache.direct   cache.no-flush
────────────┼─────────────────────────────────────────────────
writeback    │ on                off            off
none         │ on                on             off
writethrough │ off               off            off
directsync   │ off               on             off
unsafe       │ on                off            on

The first column matches Ganeti's disk_cache. Maybe we can add unsafe for faster Ganeti-QA runs 😄.

Also, I think we can't drop the validation of allowed cache combinations with aio/disk template type, especially given rbd's feature of coherent distributed caching.

Without much investigation, my initial thought is that we need to use the table above.
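In code, that table translates into a straightforward lookup; a sketch with illustrative names, not Ganeti's actual implementation:

```python
# Map the -drive cache modes to the -blockdev era cache flags, following
# the table in qemu-system-x86_64(1). Note that -blockdev itself only
# takes cache.direct and cache.no-flush; the writeback column ends up as
# the write-cache property of the guest device (-device).
DRIVE_CACHE_TO_BLOCKDEV = {
    # mode:          (writeback, direct, no-flush)
    "writeback":    (True,  False, False),
    "none":         (True,  True,  False),
    "writethrough": (False, False, False),
    "directsync":   (False, True,  False),
    "unsafe":       (True,  False, True),
}

def blockdev_cache_options(disk_cache):
    """Return the QMP 'cache' dict plus the device write-cache value
    for a given Ganeti disk_cache setting."""
    writeback, direct, no_flush = DRIVE_CACHE_TO_BLOCKDEV[disk_cache]
    cache = {"direct": direct, "no-flush": no_flush}
    write_cache = "on" if writeback else "off"
    return cache, write_cache
```

With such a table Ganeti could keep all five historic disk_cache values while still emitting only -blockdev-compatible options.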

@rbott (Member Author) commented Apr 7, 2022

In qemu-system-x86_64(1) there is a nice translation table from -drive to -device (column 2) and -blockdev (columns 3+4) cache options.

             │ cache.writeback   cache.direct   cache.no-flush
────────────┼─────────────────────────────────────────────────
writeback    │ on                off            off
none         │ on                on             off
writethrough │ off               off            off
directsync   │ off               on             off
unsafe       │ on                off            on

The first column matches Ganeti's disk_cache. Maybe we can add unsafe for faster Ganeti-QA runs 😄.

Thanks, I actually did not see that table while looking at the documentation. I guess I will add unsafe then and update the PR accordingly. I will let you know once I have changed the code. I also need to figure out why the python tests are hanging indefinitely on GH Actions without any output (they happily pass when I run them in the same container locally).

To save yourself some time, please hold off on further reviews until I have dealt with disk_type and disk_cache :-)

@rbott rbott force-pushed the qemu_blockdev branch 2 times, most recently from 58ca757 to 3efc5ed Compare April 11, 2022 18:50
@rbott (Member Author) commented Apr 11, 2022

I have updated the PR:

  • disk_cache now accepts its original values again
  • disk_type now accepts ide instead of ide-hd again
  • actual cache settings are now derived from the table @saschalucas posted above

@rbott (Member Author) commented Apr 11, 2022

I have also opened #1669 because I can not test all Ganeti disk_types (they do not work on the stable-3.0 or master branches either). But I don't think we should solve/address that as part of this PR.

@rbott (Member Author) commented Apr 12, 2022

I have added one fix: HotAddDevice() now has the dev_type of the disk-to-add available (plain, drbd8, file etc.) to determine if the underlying storage is file- or blockdevice-based. The previous code checked the wrong variable which broke hot-adding file-based disks.

The QA suite passes for Debian Bullseye. It fails on Debian Buster because it tries to test disk_aio=io_uring (which is expected, as QEMU is too old there, and detected by the parameter verification). I need to figure out a way for the QA suite to automatically exclude io_uring in unsupported environments. It did however pass one round of instance tests with disk_aio=threads :-)
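For the QA gating, a version check along these lines would do (a sketch; it assumes "threads" and "native" as the pre-existing disk_aio values, and the 5.0 threshold follows the parameter verification mentioned above):

```python
def supported_disk_aio_values(qemu_version):
    """Return the disk_aio values the QA may exercise on a node running
    the given QEMU version, expressed as a (major, minor) tuple.
    io_uring support requires QEMU 5.0+."""
    values = ["threads", "native"]
    if qemu_version >= (5, 0):
        values.append("io_uring")
    return values
```

The QA suite could then intersect its configured disk_aio test values with this list before scheduling instance tests.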

@rbott rbott force-pushed the qemu_blockdev branch 5 times, most recently from 184b503 to 9fdcdbb Compare April 19, 2022 19:58
@rbott (Member Author) commented Apr 19, 2022

CDROM code transitioned to -blockdev

The latest push also updates the _CdromOption() method to use -blockdev + -device instead of -drive. The code has been tested for:

  • adding one or two CD drives with types scsi-cd, ide (actually ide-cd) or paravirtual (actually virtio-blk-pci)
  • using local ISOs or HTTP(S) URLs as source (the latter only works on Debian if the package qemu-block-extra is installed, but that should be no different from the old code)
  • booting from CD with all three disk types

Differences from the old code:

  • setting boot_order=cdrom used to silently set the cdrom disk type to ide; that no longer happens
  • disk types other than the three named above will raise a HypervisorError (the other types would not work anyway and would prevent QEMU from starting)

Limitations of the code (which were already present in the old code):

  • using disk_type paravirtual actually inserts the CD drives ahead of the other virtio disks (e.g. they appear as vda and vdb, followed by the first virtio disk as vdc)
  • no hotplugging (or rather, in the case of CD drives: ejecting/changing the disk/ISO)
  • the code is not covered by unit tests or the QA

@rbott rbott force-pushed the qemu_blockdev branch 3 times, most recently from 8ade311 to 594c54f Compare April 24, 2022 16:35
@rbott (Member Author) commented Apr 24, 2022

floppy code transitioned to -blockdev

The KVM hypervisor code has been extended with a _FloppyOption() method (analogous to _CdromOption()) which extends the KVM commandline with -blockdev/-device parameters to add a virtual floppy drive using a provided image file.

Design Document for -blockdev

I have also added a design document for 3.1, describing the code changes introduced with this PR.

With this, we are now only lacking upgrade/downgrade code + proper testing of cluster upgrades. Could you please take another look, @apoikos @saschalucas?

@saschalucas (Member)

floppy code transitioned to -blockdev
...
Design Document for -blockdev

With this, we are now only lacking upgrade/downgrade code + proper testing of cluster upgrades. Could you please take another look, @apoikos @saschalucas?

Thanks @rbott for your work. I'll have a look. I'm also going to make some tests, especially with rbd and access=userspace.

Hot-Removing Disks
++++++++++++++++++

Hot-adding a disk consists of two steps:
Member

Shouldn't this be Hot-removing, instead of Hot-adding?

Member Author

Yes, absolutely. Thanks for spotting this!

@atta (Contributor) commented Aug 21, 2022

I just tested ceph-rbd userspace access and got an error:

cat /var/log/ganeti/kvm/test.vm.log
qemu-system-x86_64: -blockdev driver=raw,node-name=disk-222ec87a-b27d-4756,auto-read-only=off,cache.direct=off,cache.no-flush=off,discard=ignore,file.aio=threads,file.driver=host_device,file.filename=rbd:rbd/4fdf6419-ffd5-4dff-a60f-7a4274a00566.rbd.disk0: Could not open 'rbd:rbd/4fdf6419-ffd5-4dff-a60f-7a4274a00566.rbd.disk0': No such file or directory 

@@ -0,0 +1,7 @@
=================
Ganeti 3.0 design
Member

Ganeti 3.1 here?

Member Author

Uh, copy/paste error. Sorry :-)

@rbott (Member Author) commented May 18, 2022

Thanks @rbott for your work. I'll have a look. I'm also going to make some tests, especially with rbd and access=userspace.

Did you already have the time to run some of the checks @saschalucas?

@rbott rbott force-pushed the qemu_blockdev branch 2 times, most recently from 98cd436 to a5d6fad Compare May 31, 2022 19:31
@rbott (Member Author) commented May 31, 2022

I have added migration code to take care of the obsolete default value of the disk_discard hvparam. During upgrade, all occurrences of default will now be replaced with ignore.
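The migration itself amounts to a value rewrite along these lines (a simplified sketch; the real code lives in Ganeti's config upgrade path):

```python
def upgrade_disk_discard(hvparams):
    """Map the obsolete disk_discard value 'default' to 'ignore' during
    cluster upgrade; leave all other values untouched."""
    if hvparams.get("disk_discard") == "default":
        hvparams = dict(hvparams, disk_discard="ignore")
    return hvparams
```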

The cluster upgrades cleanly from 3.0.2 to the code in this PR. Of course, it does not do anything to running instances with regard to storage devices/disks. Trying to hot-remove a disk from a running instance after the upgrade will fail until the instance gets re-created with -blockdev parameters for its drives. OTOH, re-creating an instance (stop/start or reboot) works exactly as intended.

I still have to test the following scenario (I need multiple nodes for that):

  • have a multi-node cluster with 3.0.2 running
  • create a DRBD/sharedfile instance
  • upgrade cluster to the code in this PR
  • live-migrate instance (this should re-create the instance on the target with -blockdev parameters without problems)

If anyone reading this has the time to carry out the test themselves, please proceed :-)

@saschalucas (Member)

Did you already have the time to run some of the checks @saschalucas?

Sorry, very busy ATM. I'll try in the next days.

@rbott rbott mentioned this pull request Aug 21, 2022
@rbott rbott force-pushed the qemu_blockdev branch 2 times, most recently from 60df0e1 to 3d1e964 Compare November 4, 2022 23:20
@rbott (Member Author) commented Nov 4, 2022

I have updated this PR and added code to support gluster (both kernel- and userspace). @atta this also adds the groundwork to support other userspace storage backends (e.g. rbd). I still do not have a ceph setup at hand, but if you could provide me with an example rbd URI, I could adapt the code to possibly also support rbd userspace access. QEMU seems to expect the following options: pool and image, plus a bunch of optional parameters.

For gluster, I simply extracted the required information from the gluster URL (gluster://host:port/volume/path) and filled the required qemu fields with that information.

//Edit: nevermind, @atta's comment above actually shows an rbd "URL":

rbd:rbd/4fdf6419-ffd5-4dff-a60f-7a4274a00566.rbd.disk0

Let's see what we can make out of this :-)
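The gluster URL extraction described above can be sketched like this (the function name and exact field layout are illustrative; QEMU's gluster blockdev driver takes a volume, a path and a server list):

```python
from urllib.parse import urlparse

def gluster_uri_to_blockdev_file(uri):
    """Split a gluster://host:port/volume/path URI into the 'file'
    section of a QMP blockdev-add definition (illustrative sketch)."""
    parsed = urlparse(uri)  # scheme='gluster', netloc='host:port'
    if parsed.scheme != "gluster":
        raise ValueError("not a gluster URI: %s" % uri)
    # First path component is the volume, the rest is the image path.
    volume, _, path = parsed.path.lstrip("/").partition("/")
    return {
        "driver": "gluster",
        "volume": volume,
        "path": "/" + path,
        "server": [{
            "type": "inet",
            "host": parsed.hostname,
            # 24007 is gluster's default management port.
            "port": str(parsed.port or 24007),
        }],
    }
```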

@rbott rbott force-pushed the qemu_blockdev branch 2 times, most recently from 80a838f to f93e638 Compare April 17, 2023 21:44
@rbott (Member Author) commented Apr 17, 2023

Finally I had some time to revisit this PR. I have now added/verified support for blockdev with RBD storage (both userspace/kernelspace access). There is a lot of room for improvement in the code (especially with the additions for gluster & rbd) and I would like to revise it before we move forward with this.

However, recent QA runs have also shown that on Debian Bookworm, hotplugging with the legacy QEMU storage implementation (e.g. in stable-3.0) is broken again. This PR should fix that issue. However, I suppose it is way too big a change to get included in Debian Bookworm at this point (we are already knee-deep in the freeze/release schedule).

I have improved the QA suite setup to also support RBD setups (only kernelspace access for now). This should make it easier to test.

There is only one main problem left: I have not tested ext-storage yet. According to the documentation, "userspace" access is also supported:

After successful attachment the script returns to its stdout a string, which is the full path of the block device to which the volume is mapped. e.g:: /dev/dummy1
[...]
In case the storage technology supports userspace access to volumes as well, e.g. the QEMU Hypervisor can see an RBD volume using its embedded driver for the RBD protocol, then the provider can return extra lines denoting the available userspace access URIs per hypervisor. The URI should be in the following format: <hypervisor>:<uri>. For example, a RADOS provider should return kvm:rbd:<pool>/<volume> in the second line of stdout after the local block device path (e.g. /dev/rbd1).

So, if the access disk parameter is userspace for the ext disk template, the QEMU command will end up having file=<uri> in the -drive option.

This is now somewhat of an issue. The current implementation would just pass the userspace URI over to QEMU (which will either understand it or not). With the blockdev implementation, Ganeti actually needs to parse the URI, select the corresponding driver and fill in whatever parameters might be required with the information it found in the URI. This somewhat undermines the point of the ext storage provider, unless we also change the ext-storage-provider model to require e.g. passing valid JSON configuration for QEMU instead of a URI. What would be your take on this @atta @apoikos @saschalucas?

To illustrate this a bit: this is how RBD userspace access looks in the JSON syntax for QMP (the commandline syntax is just a flattened string representation of the same):

{
   "driver":"raw",
   "node-name":"disk-234d1e63-effa-45b4",
   "discard":"ignore",
   "cache":{
      "direct":true,
      "no-flush":false
   },
   "file":{
      "driver":"rbd",
      "pool":"ganeti",
      "image":"891ef23c-068e-44f2-9636-1bafb5f48c87.rbd.disk4"
   }
}

In this case, pool and image have been extracted from the RBD userspace URL by Ganeti. The same applies to userspace access with GlusterFS. With this in mind, the file section would actually have to be provided by the ext-storage-provider.
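Following the same pattern, the rbd userspace URI could be parsed with a sketch like this (hypothetical helper; it assumes the plain rbd:<pool>/<image> form shown above, without the optional extra parameters):

```python
def rbd_uri_to_blockdev_file(uri):
    """Extract pool and image from an 'rbd:<pool>/<image>' userspace
    URI into the 'file' section of a QMP blockdev-add definition."""
    if not uri.startswith("rbd:"):
        raise ValueError("not an rbd URI: %s" % uri)
    pool, sep, image = uri[len("rbd:"):].partition("/")
    if not sep or not pool or not image:
        raise ValueError("malformed rbd URI: %s" % uri)
    return {"driver": "rbd", "pool": pool, "image": image}
```

For the ext-storage case, a provider would effectively have to return enough information to build exactly this kind of dict.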

@saschalucas (Member)

thanks @rbott for working on this important topic. I hope to find some time for review soon....

@rbott (Member Author) commented Apr 23, 2023

This PR has finally left the draft stage :-)

The latest revision contains the following changes:

  • code cleanup
  • removal of duplicate code logic in vm-startup and hotplugging disks (both parts re-implemented the same logic to find the correct QEMU parameters)
  • QEMU commandline parameters now get auto-generated from the corresponding QMP data structure for block devices
  • a more generic approach to URI parsing for userspace variants of RBD and Gluster
  • unit tests
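The auto-generation of commandline parameters from the QMP structure essentially flattens the nested dict into -blockdev's dotted key=value syntax; a simplified sketch (not the actual Ganeti code):

```python
def flatten_blockdev(params, prefix=""):
    """Flatten a nested QMP blockdev definition into the dotted
    key=value pairs accepted by QEMU's -blockdev commandline option."""
    parts = []
    for key, value in params.items():
        name = prefix + key
        if isinstance(value, dict):
            parts.extend(flatten_blockdev(value, name + "."))
        elif isinstance(value, bool):
            # QEMU's commandline syntax uses on/off for booleans.
            parts.append("%s=%s" % (name, "on" if value else "off"))
        else:
            parts.append("%s=%s" % (name, value))
    return parts
```

With this, the same data structure can drive both QMP blockdev-add (hotplug) and the -blockdev commandline (VM startup), which is how the duplicate logic mentioned above gets removed.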

I still have not tested this against an implementation of the ext-storage interface. However, in theory it should work when the storage provider returns a block device (which is the default), and also when it returns an additional userspace URI, as long as it is either RBD or Gluster. Other URIs could also work but would need to be implemented in the _ParseStorageUriToBlockdevParam() function first (which kind of defeats the purpose of the ext storage interface).

I'd say this is really good for testing now. Go for it :-)

@rbott rbott force-pushed the qemu_blockdev branch 2 times, most recently from ce590da to 9e45d94 Compare April 25, 2023 07:56
@@ -0,0 +1,175 @@
==========================
Ganeti daemons refactoring
Contributor

I think this heading needs changing (copy-pasted from design-daemons.rst ?)

Member Author

Fixed, thanks :-)

================

We have to eliminate all uses of the `-drive` parameter from the Ganeti codebase
to ensure compatibility with future versions fo QEMU. With this, we can also
Contributor
s/ fo / of /

Member Author

Fixed, thanks :-)

This commit drops support for the deprecated -drive parameter of QEMU
(in favor of -blockdev). It also switches disk hotplugging to use
blockdev-add/blockdev-del via QMP, which was the last part of Ganeti
that made use of QEMU's human monitor (HMP).

Along the way, some adjustments are made to KVM parameters:
* disk_aio now also supports io_uring on QEMU 5.0+ (default: threads)
* disk_discard now only accepts ignore/unmap (default: ignore)

Signed-off-by: Rudolph Bott <r@spam.wtf>
Successfully merging this pull request may close these issues.

  • Stop using QEMU's human monitor

5 participants