Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to speed up qvm-ls #165

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Conversation

qubesuser
Copy link
Contributor

This is the server part of a bunch of changes that get qvm-ls to around 200ms run time.

Warning: I haven't tested them much, and they change fundamental code, so they probably need fixes

The first major change is the introduction of APIs that allow to get all properties of a VM or of Qubes and all properties of all VMs in a single call.

The second major change is storing the libvirt state in qubesd (and changing it in response to events), so that libvirtd calls are not required to get the system state.

Finally, there's some optimization and fixes.

@codecov
Copy link

codecov bot commented Nov 10, 2017

Codecov Report

Merging #165 into master will decrease coverage by 0.03%.
The diff coverage is 41.29%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #165      +/-   ##
=========================================
- Coverage   53.93%   53.9%   -0.04%     
=========================================
  Files          55      55              
  Lines        8452    8608     +156     
=========================================
+ Hits         4559    4640      +81     
- Misses       3893    3968      +75
Flag Coverage Δ
#unittests 53.9% <41.29%> (-0.04%) ⬇️
Impacted Files Coverage Δ
qubes/config.py 100% <ø> (ø) ⬆️
qubes/vm/mix/net.py 56.14% <ø> (ø) ⬆️
qubes/vm/adminvm.py 72.13% <ø> (ø) ⬆️
qubes/tools/qubesd.py 0% <0%> (ø) ⬆️
qubes/vm/qubesvm.py 38.92% <27.71%> (+2.06%) ⬆️
qubes/app.py 66.71% <37.77%> (-1.64%) ⬇️
qubes/api/admin.py 86.22% <39.13%> (-5.6%) ⬇️
qubes/vm/__init__.py 83.9% <50%> (-0.53%) ⬇️
qubes/__init__.py 92.6% <89.18%> (-0.48%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 682d950...0175ad6. Read the comment docs.

@qubesuser qubesuser force-pushed the fast_qvm_ls branch 2 times, most recently from 1422b80 to 3605c92 Compare November 10, 2017 16:31
strs = []
for vm in sorted(domains):
self._vm_line_strs(strs, vm)
return ''.join(strs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is str.append really faster than str.format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to avoid doing any kind of string processing until the final join call, which should be the fastest since it should just allocate a single string in join and copy them all in, avoiding any intermediate allocations, partial concatenations, format string parsing, etc.

It's slightly uglier, but not that much, it looks like writing to an output stream (which would be even faster, but not really worth the effort).

Copy link
Member

@marmarek marmarek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of those commits are good as-is, some requires changes. We'll think about property.GetAll, so if you think other changes are important improvements anyway, splitting it into separate PR may be a good idea.


property_def = dest.property_get_def(self.arg)
# pylint: disable=no-self-use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@staticmethod

qubes/app.py Outdated
self._conn.registerCloseCallback(self._close_callback, None)
break
except libvirt.libvirtError:
subprocess.call(["systemctl", "start", "libvirtd"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bad idea. It makes it impossible to stop libvirtd service - which admin should be allowed to do. I guess it will also lead to some corner cases during restart and system shutdown.
libvirtd service have configured automatic restart on crash, so this shouldn't be an issue in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it, although not sure which way is the best. The problem of not doing this is that qubesd will just freeze forever while trying to reconnect if libvirtd is stopped, which also doesn't seem great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some timeout (5s?) and throw exception?

@@ -519,21 +517,50 @@ def __init__(self, xml, **kwargs):
name, self.__class__.__name__))

@classmethod
def property_dict(cls, load_stage=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about @functools.lru_cache() instead?

self._stubdom_xid = int(stubdom_xid_str)

if self.is_halted() and not was_halted:
self.fire_event('domain-shutdown')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #159 for related changes. Especially see the description of race condition fixed there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents the "double shutdown event" as well. It think just taking the startup lock while firing the event could fix the synchronization issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not prevent handling domain-shutdown after VM was started again, namely this sequence:

  1. VM shutdown
  2. vm.start()
  3. vm.fire_event('domain-shutdown')

You don't have any guarantee about order of input processing (request on qubesd.sock, event on libvirt socket). And additionally, you can't take lock directly in libvirt event handler, because it isn't coroutine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this needs to be thought about and done more carefully: probably the simplest way is having non-coroutines schedule coroutines, and then pretty much serializing all methods that change the VM in any way, possibly including libvirt state updates, by taking the startup_lock (which should just be renamed to "lock") and updating libvirt state under lock if necessary.

Could also consider deleting the libvirt domain for VMs that are shutdown to prevent external tools starting it while we finalize storage and other such robustness measures against raw use of the libvirt/xen APIs.

I think neither the current code nor this patch does this correctly, and some more work is needed.

raise

assert False
state = self._libvirt_state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here are quite intrusive - especially more relies on libvirt events being really delivered (which is fragile...). What about caching libvirt state just in this function, instead of calling state() in each if statement? That should be big improvement already.

Copy link
Contributor Author

@qubesuser qubesuser Nov 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to never call libvirtd for any read-only admin API call except for CPU/memory stats (and internal inspection).

It should theoretically not be fragile because it updates the state on every connection to libvirtd, on every lifecyle event received, and if the libvirt connection dies the close callback should be called, which triggers a reconnect, which triggers a state update.

Of course it's possible that the implementation is buggy

Without this doing qvm-ls requires at least one call to libvirtd per VM, which is probably going to take hundreds of milliseconds, and also any random "is_running" and similar call within qubes is going to cause I/O, which can also cause pervasive slowdown.

Plus we kind of want to know what the VMs are doing anyway to trigger events and perform appropriate actions, and if that can be done accurately, then the state can be accurately stored as well.

elif not self.is_running():
if not autostart:
raise qubes.exc.QubesVMNotRunningError(self)
yield from self.start(start_guid=gui)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will deadlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed, this commit is quite unrelated to the rest anyway.

@@ -145,6 +145,33 @@ def vm_list(self):
self._vm_line_strs(strs, vm)
return ''.join(strs)

@qubes.api.method('admin.vm.GetAllData', no_payload=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is too much. Even if we agree for property.GetAll, getting all info of all the VMs in one call is beyond what we want in Admin API.

qubes/app.py Outdated
except: # pylint: disable=bare-except
pass
return cls._name

def _default_pool(app):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@functools.lru_cache instead?

@qubesuser qubesuser force-pushed the fast_qvm_ls branch 4 times, most recently from 1c01393 to 1db1dd1 Compare November 11, 2017 00:57
it's unused and has a netid property with name different than key
that would cause issues in the next commit
currently it takes 100ms+ to determine the default pool every time,
including subprocess spawning (!), which is unacceptable since
finding out the default value of properties should be instantaneous

instead of checking every thin pool to see if the root device is in
it, find and cache the name of the root device thin pool and see if
there is a configured pool with that name
Otherwise qubesd will die if you try to break in the debugger
- connect in a loop, starting libvirtd
- reconnect on demand in a loop
- register events on reconnection
info already contains domain state, so just use it, making a single
libvirtd call
This should prevent getting "qrexec not connected" due to trying
to start a service in the middle of VM startup
This allows MUCH faster qvm-ls and qvm-prefs

The code is optimized to be as fast as possible:
- optimized property access
- gather strings in a list and join them only at the end
@marmarek
Copy link
Member

The property.GetAll idea has been implemented in #298. Few individual commits may be worth including too, but the IMO most important part already is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants