kvm_qmp_timeout hypervisor parameter #1597

Open
wants to merge 2 commits into base: master
Conversation

ksanislo

This change makes the KVM hypervisor's QMP timeout a settable hypervisor parameter. The default remains the preconfigured 5 seconds, but it can now be extended easily for edge cases that regularly trigger QMP timeouts.
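Once merged, the timeout can be tuned like any other KVM hypervisor parameter, for example (the parameter name below follows the PR title; check the final documentation for the exact name and accepted values):

gnt-cluster modify -H kvm:kvm_qmp_timeout=15
gnt-instance modify -H kvm_qmp_timeout=30 example-instance    # per-instance override; instance name is a placeholder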

@dannyman
Contributor

dannyman commented May 20, 2021

I was just about to bring this to the mailing list. I see messages of this nature when running hbal to migrate DRBD instances across nodes:

Thu May 20 12:23:50 2021 * changing into standalone mode
Thu May 20 12:23:51 2021 * changing disks into single-master mode
Thu May 20 12:23:52 2021 * wait until resync is done
Job 40182 failed: Failure: command execution error:
Could not migrate instance (name): Failed to get migration status: Timeout while receiving a QMP message: timed out

On my ganeti-3.0.1 cluster, I hit this error consistently when I try to hbal (or gnt-instance migrate). Is this PR addressing my use case? Have you any advice for a workaround pending the release of this feature?

If appropriate, please consider my comment as a vote for this feature/PR.

Thanks,
-danny

@saschalucas
Member

Thanks @ksanislo for your contribution. Can you please tell us a bit more about your "edge cases that create QMP timeouts regularly"? I remember a conversation with @apoikos about the QMP timeout, where Apollon saw no reason why QMP should be unresponsive for longer than 5s.

@dannyman: Sadly my real-world workload is still on 2.16, but I've backported commit c66db86 from 3.0 and observed a similar error during migration. Increasing the QMP timeout helped a bit, but in the end even 30s was not enough for some instances. I suspected a real bug here, something that blocks the QMP socket; i.e. parts of the code could use QMP directly without the QMP decorator and potentially fail to close the connection (like setting the spice password in 3.0beta1). Any help on this is highly appreciated.

@ksanislo
Author

@saschalucas In my environment, I'm getting seemingly random timeouts at 5 seconds with QMP on migrations, but not when collecting info for gnt-instance list. In my case, my masters are in Seattle, and my failures are almost exclusively on node groups that live in remote data centers, so I can't rule out network latency as a contributing factor.

With this change, I'm no longer seeing any jobs with QMP timeouts, despite the nearly constant 24x7 rebalancing operations of an automated tool on the cluster.

@saschalucas
Member

Thanks @ksanislo, it seems to be the same problem as @dannyman's, which is probably also what I observed in the past with my backport attempt. AFAIK @ksanislo's $far_away_node_group has nothing to do with the QMP timeout. The message flow is: master to node via noded RPC, then a local connection to the qemu socket. The QMP timeout applies only to the local unix socket communication.

Currently I'm unable to reproduce the problem in a lab environment on Debian Buster. What $Distro do you use?

Regarding my speculation about code paths using QMP directly, there seems to be only one left:

try:
  qmp = QmpConnection(self._InstanceQmpMonitor(instance_name))
  qmp.connect()
  vcpus = len(qmp.Execute("query-cpus"))
  # Will fail if ballooning is not enabled, but we can then just resort to
  # the value above.
  mem_bytes = qmp.Execute("query-balloon")[qmp.ACTUAL_KEY]
  memory = mem_bytes // 1048576
except errors.HypervisorError:
  pass

It's actively used before the migration is started. The brave may comment out the whole try/except block on the node that is the source of the migration (don't forget to restart the ganeti service after editing the code).

@ksanislo
Author

@saschalucas I'm running Debian buster on all nodes.

I've removed (commented out) the direct QMP call you specified and set my timeout back to the default 5 seconds to match, and the timeout errors have returned in full force. I don't think that section of the code is directly related to this bug.

Anything else you can think of that might be a worthwhile test I can run here?

@dannyman
Contributor

dannyman commented May 21, 2021

We have this problem at a site where we are running ganeti-3.0.1 on Ubuntu 20.04. Any attempt to migrate a DRBD VM results in a "Timeout while receiving a QMP message" and leaves the instance "up" from ganeti's point of view, but inaccessible. To get the instance fixed I have to SSH to the host node and kill the instance's PID, at which point I can tell ganeti to start it up again.
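Roughly, that recovery sequence looks like this (the pidfile path is an assumption based on Ganeti's default runtime layout; host and instance names are placeholders):

ssh node-a 'kill $(cat /var/run/ganeti/kvm-hypervisor/pid/example-vm)'
gnt-instance startup example-vm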

This would be a HUGE problem if this weren't a DR site. I'm happy to put some investigation time in to get this fixed.

It sounds like this bug may be related to c66db86 because I am seeing it in 3.0.1 and @saschalucas sees it in his older cluster when he backports the code?

Update: digging around, it looks like Failed to migrate instance: Timeout while receiving a QMP message: timed out is raised from

raise errors.HypervisorError("Timeout while receiving a QMP message: "
invoked by
def MigrateInstance(cluster_name, instance, target, live):

Is this the right context for this information or would it be helpful to file a bug report? (Or the mailing list, which I was planning to hit when I saw this come up.)

@ksanislo
Author

@dannyman You're definitely looking at the same flaw I am... bumping the timeout from 5 to 15 seconds seems to be a viable workaround for my environment, but it sounds like there's a deeper issue somewhere, and just raising the timeout value isn't the best fix since the root cause remains.

A possible workaround you can test yourself would be to edit the file /usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py in place on your machines, and change line 131 from:
_SOCKET_TIMEOUT = 5
to
_SOCKET_TIMEOUT = 15

And then restart your ganeti service on each of the changed nodes.

I would edit this on the master, then gnt-cluster copyfile to deploy it, followed by gnt-cluster command to issue the service restart command.
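Concretely, that deployment step looks something like this (service name assumed for a systemd-based Debian/Ubuntu setup):

gnt-cluster copyfile /usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py
gnt-cluster command "systemctl restart ganeti"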

Effectively, this is all I'm doing after my custom patch from above is applied and this setting becomes a gnt-cluster -H kvm: parameter. So far, increasing this timeout alone has fully resolved all QMP timeouts in my cluster over about 72 hours of testing, and hundreds of migration moves. Oddly, saschalucas reports that even 30s isn't enough for him, so you may need to try bigger values for the moment until we find a better fix for the root cause.

@saschalucas
Member

@ksanislo: Thanks for testing my assumption; it seems that was the wrong direction.

Also, it seems that there are two different behaviors: @ksanislo is "getting seemingly random timeouts at 5 seconds", which I interpret as "happens sometimes" and at different migration stages etc. (that is what I observed). @dannyman says "Any attempt to migrate a DRBD VM results in a timeout", which means no migration works at all; it does not even start.

I've set up an Ubuntu Focal lab and live migration is working as expected, so ATM I'm unable to reproduce. In @dannyman's case it would be interesting to know:

  • whether the leftover qemu process / instance is in the paused state: echo "info status" | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.monitor; if yes, send a cont via the HMP monitor socket
  • whether there are hints in the KVM log on source/destination (/var/log/ganeti/kvm/<instance>.log)
  • a trace of the QMP communication on the source node (see the restore step after this list). Run before migration:
mv /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp.orig
socat -t100 -v UNIX-LISTEN:/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp,mode=777,reuseaddr,fork UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp.orig
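Afterwards, stop socat and restore the original socket path by reversing the mv (just undoing the setup above):

mv /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp.orig /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp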

If @ksanislo is also able to get a QMP trace, that would be helpful, too.

@ksanislo
Author

ksanislo commented May 21, 2021

@saschalucas I couldn't get the problem to occur at the default 5 second timeout on a VM I was watching, but if I switch to a 3 second timeout, it happens consistently for me...

This seems to be the relevant part of the trace with a 3 second timeout, so it breaks at the delay. Let me know if you'd like more of it, or if you want something else changed and tested. I'm definitely seeing a pause of 3-5 seconds in this same area even when it works and the timeout is set longer.

`
{"execute": "query-commands"}
< 2021/05/21 15:26:55.369424 length=5408 from=148 to=5555
{"return": [{"name": "netdev_add"}, {"name": "device_add"}, {"name": "query-qmp-schema"}, {"name": "set-numa-node"}, {"name": "query-sev-capabilities"}, {"name": "query-sev-launch-measure"}, {"name": "query-sev"}, {"name": "query-vm-generation-id"}, {"name": "query-hotpluggable-cpus"}, {"name": "xen-load-devices-state"}, {"name": "rtc-reset-reinjection"}, {"name": "query-acpi-ospm-status"}, {"name": "query-memory-devices"}, {"name": "query-memdev"}, {"name": "query-command-line-options"}, {"name": "query-target"}, {"name": "query-fdsets"}, {"name": "remove-fd"}, {"name": "add-fd"}, {"name": "query-cpu-model-expansion"}, {"name": "query-cpu-definitions"}, {"name": "query-memory-size-summary"}, {"name": "query-machines"}, {"name": "closefd"}, {"name": "getfd"}, {"name": "object-del"}, {"name": "object-add"}, {"name": "query-dump-guest-memory-capability"}, {"name": "query-dump"}, {"name": "dump-guest-memory"}, {"name": "device_del"}, {"name": "xen-set-global-dirty-log"}, {"name": "qom-list-properties"}, {"name": "device-list-properties"}, {"name": "qom-list-types"}, {"name": "change"}, {"name": "qom-set"}, {"name": "qom-get"}, {"name": "qom-list"}, {"name": "human-monitor-command"}, {"name": "balloon"}, {"name": "inject-nmi"}, {"name": "system_wakeup"}, {"name": "x-exit-preconfig"}, {"name": "cont"}, {"name": "pmemsave"}, {"name": "memsave"}, {"name": "cpu-add"}, {"name": "system_powerdown"}, {"name": "system_reset"}, {"name": "stop"}, {"name": "quit"}, {"name": "query-pci"}, {"name": "query-balloon"}, {"name": "query-iothreads"}, {"name": "query-cpus-fast"}, {"name": "query-cpus"}, {"name": "query-events"}, {"name": "query-uuid"}, {"name": "query-kvm"}, {"name": "query-name"}, {"name": "add_client"}, {"name": "query-commands"}, {"name": "query-version"}, {"name": "qmp_capabilities"}, {"name": "trace-event-set-state"}, {"name": "trace-event-get-state"}, {"name": "transaction"}, {"name": "migrate-pause"}, {"name": "migrate-recover"}, {"name": "query-colo-status"}, {"name": "xen-colo-do-checkpoint"}, {"name": "query-xen-replication-status"}, {"name": "xen-set-replication"}, {"name": "xen-save-devices-state"}, {"name": "migrate-incoming"}, {"name": "migrate"}, {"name": "query-migrate-cache-size"}, {"name": "migrate-set-cache-size"}, {"name": "migrate_set_speed"}, {"name": "migrate_set_downtime"}, {"name": "migrate-continue"}, {"name": "migrate_cancel"}, {"name": "x-colo-lost-heartbeat"}, {"name": "migrate-start-postcopy"}, {"name": "client_migrate_info"}, {"name": "query-migrate-parameters"}, {"name": "migrate-set-parameters"}, {"name": "query-migrate-capabilities"}, {"name": "migrate-set-capabilities"}, {"name": "query-migrate"}, {"name": "query-display-options"}, {"name": "input-send-event"}, {"name": "send-key"}, {"name": "query-mice"}, {"name": "change-vnc-password"}, {"name": "query-vnc-servers"}, {"name": "query-vnc"}, {"name": "query-spice"}, {"name": "screendump"}, {"name": "expire_password"}, {"name": "set_password"}, {"name": "query-tpm"}, {"name": "query-tpm-types"}, {"name": "query-tpm-models"}, {"name": "query-rocker-of-dpa-groups"}, {"name": "query-rocker-of-dpa-flows"}, {"name": "query-rocker-ports"}, {"name": "query-rocker"}, {"name": "query-rx-filter"}, {"name": "netdev_del"}, {"name": "set_link"}, {"name": "chardev-send-break"}, {"name": "chardev-remove"}, {"name": "chardev-change"}, {"name": "chardev-add"}, {"name": "ringbuf-read"}, {"name": "ringbuf-write"}, {"name": "query-chardev-backends"}, {"name": "query-chardev"}, {"name": "nbd-server-stop"}, {"name": 
"x-nbd-server-add-bitmap"}, {"name": "nbd-server-remove"}, {"name": "nbd-server-add"}, {"name": "nbd-server-start"}, {"name": "eject"}, {"name": "blockdev-snapshot-delete-internal-sync"}, {"name": "blockdev-snapshot-internal-sync"}, {"name": "query-pr-managers"}, {"name": "x-blockdev-set-iothread"}, {"name": "x-blockdev-change"}, {"name": "block-set-write-threshold"}, {"name": "blockdev-change-medium"}, {"name": "blockdev-insert-medium"}, {"name": "blockdev-remove-medium"}, {"name": "blockdev-close-tray"}, {"name": "blockdev-open-tray"}, {"name": "blockdev-create"}, {"name": "blockdev-del"}, {"name": "blockdev-add"}, {"name": "block-job-finalize"}, {"name": "block-job-dismiss"}, {"name": "block-job-complete"}, {"name": "block-job-resume"}, {"name": "block-job-pause"}, {"name": "block-job-cancel"}, {"name": "block-job-set-speed"}, {"name": "block-stream"}, {"name": "block_set_io_throttle"}, {"name": "blockdev-mirror"}, {"name": "x-debug-block-dirty-bitmap-sha256"}, {"name": "x-block-dirty-bitmap-merge"}, {"name": "x-block-dirty-bitmap-disable"}, {"name": "x-block-dirty-bitmap-enable"}, {"name": "block-dirty-bitmap-clear"}, {"name": "block-dirty-bitmap-remove"}, {"name": "block-dirty-bitmap-add"}, {"name": "drive-mirror"}, {"name": "query-named-block-nodes"}, {"name": "blockdev-backup"}, {"name": "drive-backup"}, {"name": "block-commit"}, {"name": "change-backing-file"}, {"name": "blockdev-snapshot"}, {"name": "blockdev-snapshot-sync"}, {"name": "block_resize"}, {"name": "block_passwd"}, {"name": "query-block-jobs"}, {"name": "query-blockstats"}, {"name": "query-block"}, {"name": "x-block-latency-histogram-set"}, {"name": "query-jobs"}, {"name": "job-finalize"}, {"name": "job-dismiss"}, {"name": "job-complete"}, {"name": "job-cancel"}, {"name": "job-resume"}, {"name": "job-pause"}, {"name": "watchdog-set-action"}, {"name": "query-status"}]}\r

2021/05/21 15:26:55.383750 length=29 from=62 to=90
{"execute": "query-migrate"}
< 2021/05/21 15:26:55.384397 length=507 from=5556 to=6062
{"return": {"expected-downtime": 5000, "xbzrle-cache": {"bytes": 0, "cache-size": 67108864, "cache-miss-rate": 0, "pages": 0, "overflow": 0, "cache-miss": 0}, "status": "active", "setup-time": 92, "total-time": 392, "ram": {"total": 6460088320, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "page-size": 4096, "remaining": 6143713280, "mbps": 7069.504154, "transferred": 274974211, "duplicate": 10260, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 274345984, "normal": 66979}}}\r
< 2021/05/21 15:27:00.348780 length=132 from=0 to=131
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}\r
2021/05/21 15:27:00 socat[3517994] E write(6, 0x55fc41b9ea30, 132): Broken pipe
< 2021/05/21 15:27:00.652632 length=132 from=0 to=131
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}\r
2021/05/21 15:27:00.653360 length=32 from=0 to=31
{"execute": "qmp_capabilities"}
< 2021/05/21 15:27:00.655479 length=16 from=132 to=147
{"return": {}}\r
2021/05/21 15:27:00.655732 length=30 from=32 to=61
{"execute": "query-commands"}
< 2021/05/21 15:27:00.657869 length=5408 from=148 to=5555
{"return": [{"name": "netdev_add"}, {"name": "device_add"}, {"name": "query-qmp-schema"}, {"name": "set-numa-node"}, {"name": "query-sev-capabilities"}, {"name": "query-sev-launch-measure"}, {"name": "query-sev"}, {"name": "query-vm-generation-id"}, {"name": "query-hotpluggable-cpus"}, {"name": "xen-load-devices-state"}, {"name": "rtc-reset-reinjection"}, {"name": "query-acpi-ospm-status"}, {"name": "query-memory-devices"}, {"name": "query-memdev"}, {"name": "query-command-line-options"}, {"name": "query-target"}, {"name": "query-fdsets"}, {"name": "remove-fd"}, {"name": "add-fd"}, {"name": "query-cpu-model-expansion"}, {"name": "query-cpu-definitions"}, {"name": "query-memory-size-summary"}, {"name": "query-machines"}, {"name": "closefd"}, {"name": "getfd"}, {"name": "object-del"}, {"name": "object-add"}, {"name": "query-dump-guest-memory-capability"}, {"name": "query-dump"}, {"name": "dump-guest-memory"}, {"name": "device_del"}, {"name": "xen-set-global-dirty-log"}, {"name": "qom-list-properties"}, {"name": "device-list-properties"}, {"name": "qom-list-types"}, {"name": "change"}, {"name": "qom-set"}, {"name": "qom-get"}, {"name": "qom-list"}, {"name": "human-monitor-command"}, {"name": "balloon"}, {"name": "inject-nmi"}, {"name": "system_wakeup"}, {"name": "x-exit-preconfig"}, {"name": "cont"}, {"name": "pmemsave"}, {"name": "memsave"}, {"name": "cpu-add"}, {"name": "system_powerdown"}, {"name": "system_reset"}, {"name": "stop"}, {"name": "quit"}, {"name": "query-pci"}, {"name": "query-balloon"}, {"name": "query-iothreads"}, {"name": "query-cpus-fast"}, {"name": "query-cpus"}, {"name": "query-events"}, {"name": "query-uuid"}, {"name": "query-kvm"}, {"name": "query-name"}, {"name": "add_client"}, {"name": "query-commands"}, {"name": "query-version"}, {"name": "qmp_capabilities"}, {"name": "trace-event-set-state"}, {"name": "trace-event-get-state"}, {"name": "transaction"}, {"name": "migrate-pause"}, {"name": "migrate-recover"}, {"name": "query-colo-status"}, {"name": "xen-colo-do-checkpoint"}, {"name": "query-xen-replication-status"}, {"name": "xen-set-replication"}, {"name": "xen-save-devices-state"}, {"name": "migrate-incoming"}, {"name": "migrate"}, {"name": "query-migrate-cache-size"}, {"name": "migrate-set-cache-size"}, {"name": "migrate_set_speed"}, {"name": "migrate_set_downtime"}, {"name": "migrate-continue"}, {"name": "migrate_cancel"}, {"name": "x-colo-lost-heartbeat"}, {"name": "migrate-start-postcopy"}, {"name": "client_migrate_info"}, {"name": "query-migrate-parameters"}, {"name": "migrate-set-parameters"}, {"name": "query-migrate-capabilities"}, {"name": "migrate-set-capabilities"}, {"name": "query-migrate"}, {"name": "query-display-options"}, {"name": "input-send-event"}, {"name": "send-key"}, {"name": "query-mice"}, {"name": "change-vnc-password"}, {"name": "query-vnc-servers"}, {"name": "query-vnc"}, {"name": "query-spice"}, {"name": "screendump"}, {"name": "expire_password"}, {"name": "set_password"}, {"name": "query-tpm"}, {"name": "query-tpm-types"}, {"name": "query-tpm-models"}, {"name": "query-rocker-of-dpa-groups"}, {"name": "query-rocker-of-dpa-flows"}, {"name": "query-rocker-ports"}, {"name": "query-rocker"}, {"name": "query-rx-filter"}, {"name": "netdev_del"}, {"name": "set_link"}, {"name": "chardev-send-break"}, {"name": "chardev-remove"}, {"name": "chardev-change"}, {"name": "chardev-add"}, {"name": "ringbuf-read"}, {"name": "ringbuf-write"}, {"name": "query-chardev-backends"}, {"name": "query-chardev"}, {"name": "nbd-server-stop"}, {"name": 
"x-nbd-server-add-bitmap"}, {"name": "nbd-server-remove"}, {"name": "nbd-server-add"}, {"name": "nbd-server-start"}, {"name": "eject"}, {"name": "blockdev-snapshot-delete-internal-sync"}, {"name": "blockdev-snapshot-internal-sync"}, {"name": "query-pr-managers"}, {"name": "x-blockdev-set-iothread"}, {"name": "x-blockdev-change"}, {"name": "block-set-write-threshold"}, {"name": "blockdev-change-medium"}, {"name": "blockdev-insert-medium"}, {"name": "blockdev-remove-medium"}, {"name": "blockdev-close-tray"}, {"name": "blockdev-open-tray"}, {"name": "blockdev-create"}, {"name": "blockdev-del"}, {"name": "blockdev-add"}, {"name": "block-job-finalize"}, {"name": "block-job-dismiss"}, {"name": "block-job-complete"}, {"name": "block-job-resume"}, {"name": "block-job-pause"}, {"name": "block-job-cancel"}, {"name": "block-job-set-speed"}, {"name": "block-stream"}, {"name": "block_set_io_throttle"}, {"name": "blockdev-mirror"}, {"name": "x-debug-block-dirty-bitmap-sha256"}, {"name": "x-block-dirty-bitmap-merge"}, {"name": "x-block-dirty-bitmap-disable"}, {"name": "x-block-dirty-bitmap-enable"}, {"name": "block-dirty-bitmap-clear"}, {"name": "block-dirty-bitmap-remove"}, {"name": "block-dirty-bitmap-add"}, {"name": "drive-mirror"}, {"name": "query-named-block-nodes"}, {"name": "blockdev-backup"}, {"name": "drive-backup"}, {"name": "block-commit"}, {"name": "change-backing-file"}, {"name": "blockdev-snapshot"}, {"name": "blockdev-snapshot-sync"}, {"name": "block_resize"}, {"name": "block_passwd"}, {"name": "query-block-jobs"}, {"name": "query-blockstats"}, {"name": "query-block"}, {"name": "x-block-latency-histogram-set"}, {"name": "query-jobs"}, {"name": "job-finalize"}, {"name": "job-dismiss"}, {"name": "job-complete"}, {"name": "job-cancel"}, {"name": "job-resume"}, {"name": "job-pause"}, {"name": "watchdog-set-action"}, {"name": "query-status"}]}\r
`

@ksanislo
Author

@saschalucas Just for completeness, here's what it looks like when the timeout is set higher than that pause and it keeps going:

`

2021/05/21 15:18:10.186116 length=29 from=62 to=90
{"execute": "query-migrate"}
< 2021/05/21 15:18:10.186856 length=492 from=5556 to=6047
{"return": {"expected-downtime": 5000, "xbzrle-cache": {"bytes": 0, "cache-size": 67108864, "cache-miss-rate": 0, "pages": 0, "overflow": 0, "cache-miss": 0}, "status": "active", "setup-time": 34, "total-time": 78, "ram": {"total": 4312604672, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "page-size": 4096, "remaining": 4279721984, "mbps": 0, "transferred": 19200003, "duplicate": 3356, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 19132416, "normal": 4671}}}\r
< 2021/05/21 15:18:13.790988 length=132 from=0 to=131
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}\r
2021/05/21 15:18:13.791610 length=32 from=0 to=31
{"execute": "qmp_capabilities"}
< 2021/05/21 15:18:13.792152 length=16 from=132 to=147
{"return": {}}\r
2021/05/21 15:18:13.792393 length=30 from=32 to=61
{"execute": "query-commands"}
`

@saschalucas
Member

> @saschalucas Just for completeness, here's what it looks like when the timeout is set higher than that pause and it keeps going:

> 2021/05/21 15:18:10.186116  length=29 from=62 to=90
{"execute": "query-migrate"}
< 2021/05/21 15:18:10.186856  length=492 from=5556 to=6047
{"return": {"expected-downtime": 5000, "xbzrle-cache": {"bytes": 0, "cache-size": 67108864, "cache-miss-rate": 0, "pages": 0, "overflow": 0, "cache-miss": 0}, "status": "active", "setup-time": 34, "total-time": 78, "ram": {"total": 4312604672, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "page-size": 4096, "remaining": 4279721984, "mbps": 0, "transferred": 19200003, "duplicate": 3356, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 19132416, "normal": 4671}}}\r
< 2021/05/21 15:18:13.790988  length=132 from=0 to=131
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}\r

Thanks @ksanislo, I can see the delay of ~3.7s here between the last response and the next QMP greeting, while Ganeti tries to poll for the migration status every second. I can also see some of your migration parameters:

  • migration_downtime = 5s
  • migration_caps = xbzrle

Right? For completeness you may attach the complete trace as a text file (maybe compressed) here. Then I can see the initialization of the migration, too.

I would speculate that the QMP unresponsiveness is triggered when migration_downtime is near or greater than the QMP timeout, maybe in combination with the xbzrle capability? It would be great if you could do some further tests.

You may have your reason for setting downtime to 5s. I observed that most workloads are able to migrate with 5Gb/s bandwidth (625MB/s) and 1s downtime (1000ms). BTW Ganeti 3.0 has support for postcopy-ram migration_caps, for faster and guaranteed migration convergence.
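To make that concrete, the corresponding hypervisor parameters would be set roughly like this (units per the Ganeti man pages: migration_bandwidth in MiB/s, migration_downtime in ms; please verify against your version's documentation):

gnt-cluster modify -H kvm:migration_bandwidth=625,migration_downtime=1000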

Some migration_caps can be tuned or need special orchestration. xbzrle seems to work with the defaults? Correct postcopy-ram orchestration was implemented in 3.0. Therefore the simple list of migration_caps is not an optimal implementation.

@rbott
Member

rbott commented May 23, 2021

I tried to do some documentation research but could not find anything regarding expected timings / timeouts for QMP responses in the QEMU docs. I am not exactly sure where the five seconds in the Ganeti code were derived from.

I have also asked on the #qemu channel on OFTC IRC and will update this post as soon as I find any useful information on that.

OTOH I do not see any obvious connection between e.g. a 5s migration_downtime and a slow QMP response. The migrate operation is asynchronous anyway. I also checked the QEMU bugtracker to see if anything has been reported (or even fixed) in that regard, but could not find anything either.

Update:

I got a quick reply on the #qemu channel. In summary, other implementations like libvirt do not check for a timeout at all, and QEMU itself also does not track command timeouts. However, a command hanging for more than 5 seconds could be considered a QEMU bug, and they kindly ask us to file a bug with a stack dump of the qemu process while it appears to hang.

For completeness, here is the entire communication:

<rbott> Hey, are there any specific constraints / expectations to answer timing on the QMP interface? e.g. is it possible to determine when a command has timed out / failed to create a response?
<rbott> e.g. Ganeti currently assumes a timeout of 5 seconds but I am not exactly sure where that comes from. I could not find anything related to timings/timeouts in the available QMP documentation
<stefanha> rbott: QMP itself doesn't have a generic timeout. Commands that offer a timeout (there are very few, if any) would have explicit parameters to control it.
<stefanha> rbott: There are a few concepts for handling long-running or urgent QMP commands like "jobs" and "out-of-band" commands.
<stefanha> rbott: Is Ganeti's 5 second timeout related to a specific QMP command?
<rbott> stefanha problems with the timeout were discovered in connection with the "migrate" command but we have not yet figured out why or under what circumstances it might take longer than 5 seconds
<rbott> stefanha: during the investigation of that problem I just tried to understand where the general assumed timeout of 5 seconds might come from
<rbott> (and if it is a good assumption or actually not good at all)
<stefanha> rbott: QEMU itself won't time out AFAIK.
<stefanha> rbott: Was the problem that the 'migrate' command hung and no QMP reply was received by Ganeti?
<rbott> stefanha: the problem is that the QMP reply did not arrive within 5 seconds
<stefanha> rbott: Okay. Something is wrong inside QEMU. If you can collect a stack trace of all threads in the QEMU process during the hang then it may be possible to identify the bug and fix it.
<rbott> stefanha: so should we assume that it is generally OK to wait at most 5 seconds for a QMP reply?
<rbott> (I just want to make sure there are no general flaws in the way Ganeti implements QMP communication)
<rbott> stefanha: people affected by this issue simply raised the timeout to 15 seconds as a workaround and that did solve it for them
<stefanha> rbott: That's always the question with a timeout. It can prevent Ganeti getting stuck when QEMU is unresponsive, but we don't know if QEMU will hang forever or if the operation is legitimately slow.
<stefanha> rbott: I took a quick look at libvirt to see if it has a timeout from QMP commands, and it appears not (though I could be wrong).
<stefanha> rbott: It does react if the monitor connection is closed though (e.g. QEMU crashed).
<stefanha> rbott: If Ganeti wants a timeout that's okay, but QMP itself doesn't offer any timeout functionality or a mechanism for recovering from a hung monitor.
<rbott> stefanha: thanks, that already helps alot. I'll get back to the folks who reported/observed the issues originally and see if they can open a QEMU bug with all required information
<stefanha> rbott: Okay, great!

@dannyman
Contributor

dannyman commented May 26, 2021

I wanted to test the hot patch from @ksanislo, above. I was having a hard time reproducing the error:

  1. New instances do not fail to migrate.
  2. Replace the secondary, migrate, replace the secondary, so an instance is not on original disks ... no failure.
  3. Attempts to put the instance under load do not trigger migration failure.
  4. An older running instance is most likely to experience migration failure.
  5. After an older instance experiences migration failure ... no migration failure.

Running hbal -X on our ganeti-3.0 cluster on Ubuntu 20.04 triggers some instances in the move list to experience a QMP timeout error.

Applying the hot patch at #1597 (comment) allows the cluster to hbal -X without errors.

Speculation: something about an instance that has been running a long time (months, here) increases the chances of a 5-second QMP timeout failure.

Conclusion: I don't know the root cause, and I don't care if it takes more than 5 seconds for one step in the migration to complete. So, to me, it would definitely be helpful here to be able to set the timeout locally, hard-code a sufficiently high timeout, or perhaps remove any checking for the timeout entirely. "Wait 5 seconds and then fail the migration" has done me no favors and in some cases leaves instances in an up but unavailable state.

Thank you for the fix, @ksanislo!

@xals

xals commented Apr 7, 2022

Hi there.

I am facing this issue, and I locally added some debug traces in monitor.py to be able to pinpoint what is happening.

There are two requests to /instance_get_migration_status, with 1s between them; below are my traces (long lines have been cut).

The first one is handled properly, almost instantly:

2022-04-07 13:56:11,813: ganeti-noded pid=22023 server:240 DEBUG Handling request <ganeti.http.server._HttpServerRequest POST /instance_get_migration_status headers='Host: 51.15.185.125:1811\nUser-Agent: Ganeti 3.0.1\nAccept: */*\nContent-type: application/json\nContent-Length: 581
2022-04-07 13:56:11,813: ganeti-noded pid=22023 backend:3176 DEBUG Backend: GetMigrationStatus                                                                                                                                                                                            
2022-04-07 13:56:11,814: ganeti-noded pid=22023 __init__:2644 DEBUG hv_kvm.GetMigrationStatus                                                                                                                                                                                             
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:287 DEBUG QMP: connect                                                                                                                                                                                                            
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:375 DEBUG QMP: received message {"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 5}, "package": "Debian 1:5.2+dfsg-9~bpo10+1"}, "capabilities": ["oob"]}}                                                           
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:303 DEBUG QMP: got greeting: {"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 5}, "package": "Debian 1:5.2+dfsg-9~bpo10+1"}, "capabilities": ["oob"]}}                                                              
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:395 DEBUG QMP: sending message {"execute": "qmp_capabilities"}                                                                                                                                                                    
2022-04-07 13:56:11,814: ganeti-noded pid=22023 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:11,815: ganeti-noded pid=22023 monitor:375 DEBUG QMP: received message {"return": {}}                                                                                                                                                                                    
2022-04-07 13:56:11,815: ganeti-noded pid=22023 monitor:395 DEBUG QMP: sending message {"execute": "query-commands"}                                                                                                                                                                      
2022-04-07 13:56:11,815: ganeti-noded pid=22023 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:11,815: ganeti-noded pid=22023 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:11,816: ganeti-noded pid=22023 monitor:375 DEBUG QMP: received message {"return": [{"name": "object-add"}, {"name": "device_add"}, {"name": "query-qmp-schema"}, {"name": "query-pci"}, {"name": "query-acpi-ospm-status"}, {"name": "query-sev-capabilities"}, {"name": 
2022-04-07 13:56:11,816: ganeti-noded pid=22023 monitor:395 DEBUG QMP: sending message {"execute": "query-migrate"}                                                                                                                                                                       
2022-04-07 13:56:11,816: ganeti-noded pid=22023 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:11,816: ganeti-noded pid=22023 monitor:375 DEBUG QMP: received message {"return": {"expected-downtime": 30000, "status": "active", "setup-time": 24, "total-time": 34, "ram": {"total": 1091379200, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 0, "p
2022-04-07 13:56:11,817: ganeti-noded pid=22023 monitor:442 DEBUG QMP query-migrate None: {'expected-downtime': 30000, 'status': 'active', 'setup-time': 24, 'total-time': 34, 'ram': {'total': 1091379200, 'postcopy-requests': 0, 'dirty-sync-count': 1, 'multifd-bytes': 0, 'pages-per-
                                                                                                                                                                                                                                                                                          
2022-04-07 13:56:11,817: ganeti-noded pid=22023 monitor:208 DEBUG QMP: close                                                                                                                                                                                                              
2022-04-07 13:56:11,817: ganeti-noded pid=22023 server:439 INFO 51.15.185.125:35622 POST /instance_get_migration_status HTTP/1.1 200                                                                                                                                                      

The second one gets a timeout error:

2022-04-07 13:56:12,848: ganeti-noded pid=22029 server:240 DEBUG Handling request <ganeti.http.server._HttpServerRequest POST /instance_get_migration_status headers='Host: 51.15.185.125:1811\nUser-Agent: Ganeti 3.0.1\nAccept: */*\nContent-type: application/json\nContent-Length: 581
2022-04-07 13:56:12,849: ganeti-noded pid=22029 backend:3176 DEBUG Backend: GetMigrationStatus                                                                                                                                                                                            
2022-04-07 13:56:12,849: ganeti-noded pid=22029 __init__:2644 DEBUG hv_kvm.GetMigrationStatus                                                                                                                                                                                             
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:287 DEBUG QMP: connect                                                                                                                                                                                                            
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:375 DEBUG QMP: received message {"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 5}, "package": "Debian 1:5.2+dfsg-9~bpo10+1"}, "capabilities": ["oob"]}}                                                           
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:303 DEBUG QMP: got greeting: {"QMP": {"version": {"qemu": {"micro": 0, "minor": 2, "major": 5}, "package": "Debian 1:5.2+dfsg-9~bpo10+1"}, "capabilities": ["oob"]}}                                                              
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:395 DEBUG QMP: sending message {"execute": "qmp_capabilities"}                                                                                                                                                                    
2022-04-07 13:56:12,849: ganeti-noded pid=22029 monitor:367 DEBUG QMP: waiting for bytes.                                                                                                                                                                                                 
2022-04-07 13:56:17,850: ganeti-noded pid=22029 backend:185 ERROR Failed to get migration status: Timeout while receiving a QMP message: timed out                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 368, in _Recv                                                                                                                                                                                                    
    data = self.sock.recv(4096)                                                                                                                                                                                                                                                           
socket.timeout: timed out                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                          
During handling of the above exception, another exception occurred:                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                          
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/usr/share/ganeti/3.0/ganeti/backend.py", line 3179, in GetMigrationStatus                                                                                                                                                                                                        
    return hyper.GetMigrationStatus(instance)                                                                                                                                                                                                                                             
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/__init__.py", line 182, in wrapper                                                                                                                                                                                                 
    return fn(self, *args, **kwargs)                                                                                                                                                                                                                                                      
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/__init__.py", line 2645, in GetMigrationStatus                                                                                                                                                                                     
    query_migrate = self.qmp.GetMigrationStatus()                                                                                                                                                                                                                                         
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 223, in wrapper                                                                                                                                                                                                  
    mon.connect()                                                                                                                                                                                                                                                                         
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 321, in connect                                                                                                                                                                                                  
    self.Execute(self._CAPABILITIES_COMMAND)                                                                                                                                                                                                                                              
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 439, in Execute                                                                                                                                                                                                  
    ret = self._GetResponse(command)                                                                                                                                                                                                                                                      
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 459, in _GetResponse                                                                                                                                                                                             
    response = self._Recv()                                                                                                                                                                                                                                                               
  File "/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py", line 380, in _Recv                                                                                                                                                                                                    
    "%s" % (err))                                                                                                                                                                                                                                                                         
ganeti.errors.HypervisorError: Timeout while receiving a QMP message: timed out                                                                                                                                                                                                           
2022-04-07 13:56:17,852: ganeti-noded pid=22029 server:439 INFO 51.15.185.125:35624 POST /instance_get_migration_status HTTP/1.1 200                                                                                                                                                      

I have tested with Qemu version 3.1 and 5.2.

The QMP connection seems to be properly closed on the first one (not on the second one; I don't know whether that can be problematic).

As reported above, I think there is an issue in QMP connection handling on the Qemu side, but having a way to configure the socket timeout would be a nice workaround.

@saschalucas
Copy link
Member

saschalucas commented Apr 9, 2022

While looking for some method to reproduce the QMP timeout error, I can at least make Qemu block infinitely on the QMP socket (environment: Debian Buster / Qemu 3.1 / Ganeti 3.0.1).

On node02 I'm hammering the QMP socket as fast as possible. The commands should be the same as those Ganeti issues once per second during live migration:

root@node02:~# cat /tmp/cmds.txt 
{"execute": "qmp_capabilities"}
{"execute": "query-commands"}
{"execute": "query-migrate"}

root@node02:~# while :; do time cat /tmp/cmds.txt | socat -t 30 STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.qmp; done
...
...
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}

real	0m30.036s
user	0m0.002s
sys	0m0.005s

On the master node I run live migration until it's broken:

root@master:~# gnt-instance migrate test.vm 
Instance test.vm will be migrated. Note that migration might impact
the instance if anything goes wrong (e.g. due to bugs in the
hypervisor). Continue?
y/[n]/?: y
Sat Apr  9 14:03:28 2022 Migrating instance test.vm
Sat Apr  9 14:03:28 2022 * checking disk consistency between source and target
Sat Apr  9 14:03:28 2022 * closing instance disks on node node03.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test
Sat Apr  9 14:03:28 2022 * changing into standalone mode
Sat Apr  9 14:03:29 2022 * changing disks into dual-master mode
Sat Apr  9 14:03:30 2022 * wait until resync is done
Sat Apr  9 14:03:30 2022 * opening instance disks on node node02.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test in shared mode
Sat Apr  9 14:03:30 2022 * opening instance disks on node node03.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test in shared mode
Sat Apr  9 14:03:30 2022 * preparing node03.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test to accept the instance
Sat Apr  9 14:03:30 2022 * migrating instance to node03.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test
Sat Apr  9 14:03:30 2022 * starting memory transfer
Sat Apr  9 14:03:39 2022 Migration failed, aborting
Sat Apr  9 14:03:44 2022 * closing instance disks on node node03.62fac895ccce4a55e671bfc81476b684d8f0cf2c.gnt.test
Sat Apr  9 14:03:44 2022 * changing into standalone mode
Sat Apr  9 14:03:44 2022 * changing disks into single-master mode
Sat Apr  9 14:03:45 2022 * wait until resync is done
Failure: command execution error:
Could not migrate instance test.vm: Failed to get migration status: Timeout while receiving a QMP message: timed out

If that happens, the while loop slows down. The greeting appears immediately, but there is no response to qmp_capabilities until socat times out. Attached is a backtrace produced with:

root@node02:~# gdb -batch -p 31960 -ex 'set pagination off' -ex bt -ex 'bt full' -ex 'thread apply all bt full' -ex "detach" -ex "quit" > output-4.log

I'm unable to understand anything from the backtrace, probably some qemu people can? output-4.log

//EDIT As it turns out the instance is frozen. The qemu log is test.vm.log. HMP is still working:

root@node02:~# echo 'info migrate' | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.monitor 
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off 
Migration status: completed
total time: 6637 milliseconds
downtime: 26 milliseconds
setup: 9 milliseconds
transferred ram: 217468 kbytes
throughput: 269.19 mbps
remaining ram: 0 kbytes
total ram: 1065800 kbytes
duplicate: 212904 pages
skipped: 0 pages
normal: 53794 pages
normal bytes: 215176 kbytes
dirty sync count: 4
page size: 4 kbytes
multifd bytes: 0 kbytes

root@node02:~# echo 'info status' | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.monitor 
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) info status
VM status: paused (postmigrate)

root@node02:~# echo 'cont' | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.monitor 
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) cont

root@node02:~# echo 'info status' | socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.monitor 
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) info status
VM status: running

After continuing the instance, QMP is still unresponsive.

//EDIT Seems not to happen with Debian Bookworm.

@xals

xals commented Apr 11, 2022

@saschalucas I have a two-node cluster to upgrade to Bullseye. I will try with the Qemu version in backports (https://packages.debian.org/fr/bullseye-backports/qemu-system-x86), which is the same as bookworm (6.2), and let you know (but probably not today).

@saschalucas
Member

> @saschalucas I have a two-node cluster to upgrade to Bullseye. I will try with the Qemu version in backports (https://packages.debian.org/fr/bullseye-backports/qemu-system-x86), which is the same as bookworm (6.2), and let you know (but probably not today).

I asked on IRC #qemu whether my backtrace is useful. As it turns out, with qemu-3.1 it's too old. I was told:

<mjt> sasluc: fwiw, "QMP socket stuck infinitely" is obviously not the same problem as "QMP is being slow to respond"
<mjt> sasluc: so if you're providing backtraces for the former, it wont help you with the latter
<mjt> sasluc: qemu monitor being slow is an inherent prob with how qemu locking and threading model works, sometimes the monitor thread is either doing something else or is waiting for a lock from some other thread. with time these "other" tasks influence the monitor less and less as more such cases are identified and reworked
...
<sasluc> mjt, OTHA it's good to know that QMP might be slow sometimes. Ganeti should probably raise it's timeout.
<mjt> 5s is already way too much

ATM I don't have this problem with Ganeti-3.0.1 on Ubuntu-18.04/qemu-2.11.

@xals if you could produce a backtrace with qemu-5.2 (Buster), it might be worth asking in #qemu again.

@xals

xals commented Apr 13, 2022

> @xals if you could produce a backtrace with qemu-5.2 (Buster), it might be worth asking in #qemu again.

Ok, I will do it this afternoon (UTC+2).

I am planning a small cluster upgrade to bullseye this afternoon too (after 16:30 UTC+2); I hope I will be able to do some tests on qemu 6 from bullseye-backports.

@xals

xals commented Jun 13, 2022

Hello,

So, after some tests and the help of @saschalucas and @rbott on the IRC channel, I pinpointed the issue. It is related to the security_model being set to 'pool'. The commit fc9fe67 prevents the issue, and it seems to prevent the timeout too.

So I think this PR can be ignored and closed without merge.

@xals

xals commented Jun 13, 2022

> Hello,
>
> So, after some tests and the help of @saschalucas and @rbott on the IRC channel, I pinpointed the issue. It is related to the security_model being set to 'pool'. The commit fc9fe67 prevents the issue, and it seems to prevent the timeout too.
>
> So I think this PR can be ignored and closed without merge.

Actually, I'm not so sure about this. I need to do some more tests.

@xals

xals commented Jun 13, 2022

Ok, I confirm the socket timeout needs to be increased, to at most 15s. 5s is too short; maybe 10s could be enough, but that's a bit short.

@dannyman
Contributor

I am running into this bug again with ganeti-3.0.2-1ubuntu1 on Ubuntu 22.04.

I can confirm again that bumping the timeout on line 131 of /usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py and restarting the ganeti service on each node clears this issue.

I understand that there is a concern that there may be a bug in qemu or similar, but perhaps it would be helpful to change this value within Ganeti in the interim to reduce frustration for Ganeti users.

@dannyman
Contributor

I accidentally reproduced this bug again today because my patch application doesn't restart ganeti. I'll mention again that changing the timeout from 5 to 15 eliminates the QMP error messages. Perhaps we can patch the timeout into the main codebase until a more sophisticated fix is formulated.

@dannyman
Contributor

dannyman commented Nov 3, 2022

It looks like this patch has been incorporated for RHEL/AlmaLinux/CentOS/Rocky Linux/others at https://github.com/jfut/ganeti-rpm/releases/tag/v3.0.2-2 with enthusiastic feedback from John McNally on the mailing list.

@rbott
Member

rbott commented Nov 3, 2022

> Perhaps we can patch the timeout into the main codebase until a more sophisticated fix is formulated.

I do tend to agree with that view :-)

However, the PR does need some additions (README/instance parameter documentation, possibly upgrade/downgrade handling). I could incorporate @ksanislo's changes into a new PR (and attribute accordingly of course) and add the missing bits.

The commit message should probably also make clear that this is more of a workaround than a 'real' solution.
