kvm_qmp_timeout hypervisor parameter #1597
base: master
Conversation
I was just about to bring this to the mailing list. I see messages of this nature when running

On my ganeti-3.0.1 cluster, I hit this error consistently when I try to

If appropriate, please consider my comment as a vote for this feature/PR. Thanks,
Thanks @ksanislo for your contribution. Can you please tell us a bit more about your "edge cases that create QMP timeouts regularly"? I remember a conversation with @apoikos about the QMP timeout, where Apollon saw no reason why QMP should be unresponsive for longer than 5s.

@dannyman: Sadly, my real-world workload is still on 2.16, but I've backported commit c66db86 from 3.0 and observed a similar error during migration. Increasing the QMP timeout helped a bit, but in the end even 30s was not enough for some instances. I suspected a real bug here, something that's blocking the QMP socket. I.e. parts of the code could use QMP directly without the QMP decorator and potentially fail to close the connection (like setting the spice password in 3.0beta1). Any help on this is highly appreciated.
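The leak suspected here (a code path opening the QMP socket directly and never closing it) is the kind of bug a context manager rules out by construction. A minimal sketch of that pattern, with hypothetical names that do not match Ganeti's actual hv_kvm monitor code:

```python
import contextlib
import socket


class QmpConnection:
    """Illustrative QMP client socket wrapper (names are assumptions,
    not Ganeti's real monitor.py API)."""

    def __init__(self, path, timeout=5.0):
        self.path = path
        self.timeout = timeout
        self.sock = None

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.path)

    def close(self):
        # Idempotent close, so no code path can leak the socket.
        if self.sock is not None:
            self.sock.close()
            self.sock = None


@contextlib.contextmanager
def qmp_session(path, timeout=5.0):
    """Guarantee the QMP socket is released even if a command raises."""
    conn = QmpConnection(path, timeout)
    conn.connect()
    try:
        yield conn
    finally:
        conn.close()
```

If every caller went through `qmp_session(...)` instead of opening the socket directly, a forgotten close (as speculated for the spice-password path) could not block later QMP users.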
@saschalucas In my environment, I'm getting seemingly random timeouts at 5 seconds with QMP on migrations, but not when collecting info for

With this change, I'm no longer seeing any jobs with QMP timeouts, despite the nearly constant 24x7 rebalancing operations of an automated tool on the cluster.
Thanks @ksanislo, it seems to be the same problem as @dannyman's, which is probably the same as what I observed in the past with my backport attempt. AFAIK @ksanislo's $far_away_node_group has nothing to do with the QMP timeout. The message flow is: master to node via noded-RPC, and then a local connection to the qemu socket. The QMP timeout only applies to the local unix socket communication. Currently I'm unable to reproduce the problem in a lab environment on Debian Buster. What $Distro do you use?

Regarding my speculation on code paths using QMP directly, there seems to be only one left:

ganeti/lib/hypervisor/hv_kvm/__init__.py Lines 1111 to 1120 in d6484fe

It's actively used before the migration is started. The brave may comment out the whole try/except block on the node that is the source of the migration (don't forget to restart the ganeti service after editing the code).
@saschalucas I'm running Debian Buster on all nodes. I've removed (commented out) the direct QMP call you specified and set my timeout back to the default 5 seconds to match, and the timeout errors have returned in full force. I don't think that section of the code is directly related to this bug. Anything else you can think of that might be a worthwhile test I can run here?
We have this problem at a site where we are running ganeti-3.0.1 on Ubuntu 20.04. Any attempt to migrate a DRBD VM results in a "Timeout while receiving a QMP message" and leaves the instance "up" from Ganeti's point of view, but inaccessible. To get the instance fixed I have to SSH to the host node and kill the instance's PID, at which point I can tell Ganeti to start it up again. This would be a HUGE problem if this weren't a DR site. I'm happy to put some investigation time in to get this fixed. It sounds like this bug may be related to c66db86, because I am seeing it in 3.0.1 and @saschalucas sees it in his older cluster when he backports the code?

Update: digging around, it looks like

ganeti/lib/hypervisor/hv_kvm/monitor.py Line 364 in d0bc86d

Line 3120 in c42bb43

Is this the right context for this information, or would it be helpful to file a bug report? (Or the mailing list, which I was planning to hit when I saw this come up.)
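The "Timeout while receiving a QMP message" error comes from the socket read path hitting its timeout mid-conversation. A simplified sketch of such a receive loop (not Ganeti's exact code; the buffering and the error types are assumptions for illustration):

```python
import socket


def recv_qmp_message(sock, timeout=5.0):
    """Read one newline-terminated QMP JSON message, failing with the
    timeout error discussed above if qemu stays silent for too long."""
    sock.settimeout(timeout)
    buf = b""
    while not buf.endswith(b"\n"):
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            raise TimeoutError("Timeout while receiving a QMP message")
        if not chunk:
            raise ConnectionError("QMP socket closed unexpectedly")
        buf += chunk
    return buf
```

The key point for this bug: the timeout applies per message, so a single slow response from qemu (e.g. during the migration handoff) aborts the whole operation even though qemu eventually answers.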
@dannyman You're definitely looking at the same flaw I am... bumping the timeout from 5 to 15 seconds seems to be a viable workaround for my environment, but it sounds like there's a deeper issue going on somewhere, and just raising the timeout value isn't the best fix, as the root cause remains.

A possible workaround you can test yourself would be to edit the file /usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py in place on your machines, and change line 131 from:

And then restart your ganeti service on each of the changed nodes. I would edit this on the master, then

Effectively, this is all I'm doing after my custom patch from above is applied and this setting becomes a
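For readers who want to try the same hot patch, a sketch of the edit as a one-liner (the constant name and the line number are assumptions based on this thread; verify line 131 of the monitor.py shipped with your Ganeti version before running anything like this):

```shell
# Hypothetical hot patch: raise the QMP socket timeout from 5s to 15s.
# The _SOCKET_TIMEOUT pattern is an assumption -- check your monitor.py
# first, and keep the .bak copy to revert.
MONITOR=/usr/share/ganeti/3.0/ganeti/hypervisor/hv_kvm/monitor.py
sed -i.bak 's/_SOCKET_TIMEOUT = 5/_SOCKET_TIMEOUT = 15/' "$MONITOR"
systemctl restart ganeti   # repeat the edit + restart on every node
```

As noted later in the thread, forgetting the service restart leaves the old timeout active, so the error silently returns.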
@ksanislo: Thanks for testing my assumption; it seems to have been the wrong direction. Also, it seems there are two different behaviors: @ksanislo is "getting seemingly random timeouts at 5 seconds", which I interpret as "happens sometimes", at different migration progresses etc. (that's what I observed). @dannyman says "Any attempt to migrate a DRBD VM results in a timeout", which means no migration is working at all; it will not even start. I've set up an Ubuntu Focal lab and live migration is working as expected, so ATM I'm unable to reproduce. In @dannyman's case it would be interesting to know:

mv /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp /var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp.orig
socat -t100 -v UNIX-LISTEN:/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp,mode=777,reuseaddr,fork UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/<instance>.qmp.orig

If @ksanislo is also able to get a QMP trace, that would be helpful, too.
@saschalucas I couldn't get the problem to occur at the default 5-second timeout on a VM I was watching, but if I switch to a 3-second timeout, it happens consistently for me... This seems to be the relevant part of the trace with a 3-second timeout, so it breaks at the delay. Let me know if you'd like more of it, or something else changed and tested. I'm definitely seeing a pause of 3-5 seconds in this same area even when it works and the timeout is set longer, too.
@saschalucas Just for completeness, here's what it looks like when the timeout is set higher than that pause and it keeps going:
Thanks @ksanislo, I can see the delay being ~3.7s here between the last response and the next QMP greeting, while Ganeti tries to poll the migration status every second. I can also see some of your migration parameters:

Right? For completeness you may attach the complete trace as a text file (maybe compressed) here; there I can see the initialization of the migration, too. I would speculate that the QMP unresponsiveness is triggered when

You may have your reasons for setting the downtime to 5s. I observed that most workloads are able to migrate with 5Gb/s bandwidth (625MB/s) and 1s downtime (1000ms). BTW, Ganeti 3.0 has support for the postcopy-ram migration capability, for faster and guaranteed migration convergence. Some migration_caps can be tuned or need special orchestration: xbzrle seems to work with defaults, and correct postcopy-ram orchestration was implemented in 3.0. Therefore the simple list of migration_caps is not an optimal implementation.
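The tuning knobs discussed above correspond to KVM hypervisor parameters in Ganeti. An illustrative way to set them cluster-wide, using the numbers from this comment (verify the parameter names against the gnt-instance(8) documentation for your Ganeti version):

```shell
# migration_bandwidth is in MiB/s, migration_downtime in ms
# (625 MiB/s ~ 5 Gb/s, 1000 ms = 1 s downtime).
gnt-cluster modify -H kvm:migration_bandwidth=625,migration_downtime=1000

# migration_caps is a colon-separated list; e.g. enabling xbzrle,
# which reportedly works with defaults:
gnt-cluster modify -H kvm:migration_caps=xbzrle
```

The same keys can also be set per instance with `gnt-instance modify -H`, which is useful when only a few large instances fail to converge.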
I tried to do some documentation research, but I could not find anything regarding expected timings/timeouts for QMP responses in the QEMU docs. I am not exactly sure where the five seconds in the Ganeti code were derived from. I have also asked on the #qemu channel on OFTC IRC and will update this post as soon as I find any useful information. OTOH, I do not see any obvious connection between e.g. a 5s

Update: I got a quick reply on the #qemu channel. In summary, other implementations like libvirt do not check for a timeout at all, and QEMU itself also does not track timeouts of commands. However, if a command hangs for more than 5 seconds, that could be considered a QEMU bug, and they kindly ask that a bug be filed with a stack dump of the qemu process while it appears to hang. For completeness, here is the entire communication:
I wanted to test the hot patch from @ksanislo, above. I was having a hard time reproducing the error:

Running

Applying the hot patch at #1597 (comment) allows the cluster to

Speculation: something about an instance that has been running a long time (months, here) increases the chances of a 5-second QMP timeout failure.

Conclusion: I don't know the root cause, and I don't care if it takes more than 5 seconds for one step in the migration to complete. So, to me, it would definitely be helpful to be able to set the timeout locally, hard-code a sufficiently high timeout, or perhaps remove the timeout check entirely. "Wait 5 seconds and then fail the migration" has done me no favors, and in some cases leaves instances in an up-but-unavailable state. Thank you for the fix, @ksanislo!
Hi there. I am facing this issue, and I locally added some debug traces in

There are two requests to

The first one is handled properly, almost instantly:

The second one gets a

I have tested with Qemu versions 3.1 and 5.2. The QMP connection seems to be properly closed on the first one (not on the second one; I don't know if that can be problematic). As reported above, I think there is an issue in the QMP connection handling on the Qemu side, but having a way to configure the socket timeout would be a nice workaround.
While looking for some method to reproduce the QMP timeout error, at least I can make Qemu block infinitely on the QMP socket (environment: Debian Buster/Qemu-3.1/Ganeti-3.0.1). On node02:

root@node02:~# cat /tmp/cmds.txt
{"execute": "qmp_capabilities"}
{"execute": "query-commands"}
{"execute": "query-migrate"}
root@node02:~# while :; do time cat /tmp/cmds.txt | socat -t 30 STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.vm.qmp; done
...
...
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 1, "major": 3}, "package": "Debian 1:3.1+dfsg-8+deb10u8"}, "capabilities": []}}

real	0m30.036s
user	0m0.002s
sys	0m0.005s

On the master node I run live migration until it's broken:
If that happens, the while loop slows down. The greeting appears immediately, but no response to
I'm unable to understand anything from the backtrace; probably some qemu people can? output-4.log

//EDIT: As it turns out, the instance is frozen. The qemu log is test.vm.log. HMP is still working:

After continuing the instance, QMP is still unresponsive.

//EDIT: Seems not to happen with Debian Bookworm.
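To produce the stack dump the qemu developers asked for earlier in the thread, something along these lines should work while the process appears hung (the binary name is an assumption; adjust it for your architecture and package):

```shell
# Dump all thread backtraces of the hung qemu process without killing it.
# Requires gdb and, on many distros, the qemu debug symbols for a
# readable trace.
gdb -p "$(pidof qemu-system-x86_64)" -batch \
    -ex 'thread apply all bt' > /tmp/qemu-backtrace.txt
```

Attaching briefly pauses the guest, so run it only while the instance is already unresponsive anyway.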
@saschalucas I have a two-node cluster to upgrade to Bullseye; I will try with the Qemu version in backports (https://packages.debian.org/fr/bullseye-backports/qemu-system-x86), which is the same as Bookworm (6.2), and let you know (but probably not today).
I asked on IRC
ATM I don't have this problem with Ganeti-3.0.1 on Ubuntu-18.04/qemu-2.11. @xals, if you could produce a backtrace with qemu-5.2 (Buster), it might be worth asking in
Ok, I will do it this afternoon (UTC+2). I am planning a small cluster upgrade to Bullseye this afternoon too (after 16:30 UTC+2); I hope I will be able to do some tests on qemu 6 from bullseye-backports.
Hello. So, after some tests and the help of @saschalucas and @rbott on the IRC channel, I pinpointed the issue. This is related to the security_model being set to 'pool'. The commit fc9fe67 prevents the issue.
Actually, I'm not so sure about this. I need to do some more tests. |
Ok, I confirm the socket timeout needs to be increased; 15s is sufficient. 5s is too short, and while 10s might be enough, it would be a bit tight.
I am running into this bug again with

I can confirm again that bumping the timeout on line 131 of

I understand that there is a concern that there may be a bug in qemu or suchlike, but perhaps it would be helpful to change this value within Ganeti in the interim, to reduce frustration for Ganeti users.
I accidentally reproduced this bug again today, because my patch application doesn't restart ganeti. I'll mention again that changing the timeout from 5 to 15 eliminates the QMP error messages. Perhaps we can patch the timeout into the main codebase until a more sophisticated fix is formulated.
It looks like this patch has been incorporated for RHEL/AlmaLinux/CentOS/Rocky Linux/others at https://github.com/jfut/ganeti-rpm/releases/tag/v3.0.2-2 with enthusiastic feedback from John McNally on the mailing list. |
I do tend to agree with that view :-) However, the PR does need some additions (README/instance parameter documentation, possibly upgrade/downgrade handling). I could incorporate @ksanislo's changes into a new PR (and attribute accordingly of course) and add the missing bits. The commit message should probably also make clear that this is more of a workaround than a 'real' solution. |
This change makes the kvm hypervisor QMP timeout a settable hypervisor parameter. The default timeout remains the preconfigured 5 seconds, but can now be easily extended for edge cases that create QMP timeouts regularly.
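With this parameter in place, raising the timeout would presumably look something like the following (the exact invocation is an assumption based on how other KVM hypervisor parameters are set; the instance name is a placeholder):

```shell
# Cluster-wide: allow QMP 15 seconds instead of the default 5.
gnt-cluster modify -H kvm:kvm_qmp_timeout=15

# Or only for one problematic instance:
gnt-instance modify -H kvm_qmp_timeout=15 my-instance
```

This keeps the conservative 5-second default for everyone while letting affected sites raise the limit without hot-patching monitor.py on every node.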