GNOME Bugzilla – Bug 674352
Getting statistics slow down VMs a lot
Last modified: 2016-03-31 13:54:55 UTC
When using a f17 livecd on a f17 system, interactions with the VM are not really nice: scrolling down a webpage is choppy, keys sometimes get stuck while typing, ... Behaviour is much nicer when connecting to exactly the same VM with virt-viewer (NB: to reproduce this, it's probably better to have a few VMs listed in boxes). For testing, I added a return right a the beginning of LibvirtMachine::set_stats_enable to make sure the timeout callback gathering VM stats is never called, and boxes started to behave like virt-viewer (smoother animations, no keyboard keys getting stuck, ...). Looking some more into it, the timeout added by set_stats_enable calls LibvirtMachine::update_stats which does var info = domain.get_info (); which results in a call to virDomainGetInfo. This function currently has some performance issues: http://www.redhat.com/archives/libvir-list/2011-December/msg00321.html (unsolved as of today I think given that the last mail in the thread is http://www.redhat.com/archives/libvir-list/2012-April/msg00306.html ) One way to avoid this issue would be to move the stat gathering in its own thread, and propagate the stat info through an idle. Given that getting stat is costly, it's probably better to do it less often, say every 5 seconds or so
(In reply to comment #0) > One way to avoid this issue would be to move the stat gathering in its own > thread, and propagate the stat info through an idle. Given that getting stat is > costly, it's probably better to do it less often, say every 5 seconds or so Or add an async variant of GVirDomain.get_info() that makes use of g_simple_async_result_run_in_thread() like the other async methods in that class. Actually I think we should someday sit down and ensure we have an async wrapper for every libvirt call that could block or take a very long time.
I agree with Zeeshan, but is libvirt multithread-aware? Can it have multiple outstanding calls from various threads? If not, all the operations will have to be queued. It's a shame libvirt is not async, it makes me drop tears so often.
Christophe, the message you point to is perhaps not related. It would be worth to try to revert/remove the "save to XML" and verify if it changes anything. I might never have noticed this issue because I use an SSD.. Even if it that, I am still surprise that it slows down Boxes, libvirt could learn to also save/sync in background and not delay its reply...
(In reply to comment #2) > I agree with Zeeshan, but is libvirt multithread-aware? Can it have multiple > outstanding calls from various threads? Good point! We should make sure not to just queue-up async calls to virDomainGetInfo() but wait (again asynchronously) for previous call to return first. One issue that might be caused by this change is boxes getting more of these timeouts that I've seen when qemu is very busy. When that happens, connection to libvirt is lost and boxes need to be restarted. > If not, all the operations will have to > be queued. It's a shame libvirt is not async, it makes me drop tears so often. That is true. Any API that does blocking I/O and is not async is fundamentally flawed IMO (this is why i sometimes dream of writing my own libvirt that sits on top of KVM and is completely async but then i hope libvirt can be changed..any ways back to planet earth).
(In reply to comment #4) > (In reply to comment #2) > > I agree with Zeeshan, but is libvirt multithread-aware? Can it have multiple > > outstanding calls from various threads? > > Good point! We should make sure not to just queue-up async calls to > virDomainGetInfo() but wait (again asynchronously) for previous call to return > first. http://wiki.libvirt.org/page/FAQ#Is_libvirt_thread_safe.3F So that should be good enough. > That is true. Any API that does blocking I/O and is not async is fundamentally > flawed IMO (this is why i sometimes dream of writing my own libvirt that sits > on top of KVM and is completely async but then i hope libvirt can be > changed..any ways back to planet earth). It might not be necessary then. Properly wrapped in libvirt-glib, it should be good. (even if libvirt is queuing it's own request internally, which I guess it does for some function, since qemu monitor operations are synchronous, afaik)
Just FWY, A patch adding async variant of gvir_domain_get_info() sent to libvirt list..
Created attachment 212327 [details] [review] Asynchronously get information on libvirt domain When a libvirt machine is selected, information is fetched for it every second and the call to fetch the information can take from about 2.5s to 2 *minutes* 45s (currently thats the case due to a regression in libvirt).
(In reply to comment #7) > Created an attachment (id=212327) [details] [review] > Asynchronously get information on libvirt domain Here is a Boxes patch based on the libvirt-glib patch I mentioned in comment#6. It didn't break anything in my preliminary tests but I'm sure there is issues in it that hopefully Marc-Andre can spot instantly. :) There is something funny going on with the I/O stats but I see the same without this patch.
Review of attachment 212327 [details] [review]: ::: src/libvirt-machine.vala @@ +239,3 @@ + update_stats.begin (); + + return false; I don't think we should have 2 timers. Throw away the first call here without timeout, and piggy back with a timer. the function itself shouldn't know it might be wrapped in a timer.
As a datapoint (ie this is not a review nor an ACK), I tested these patches and they help
> I agree with Zeeshan, but is libvirt multithread-aware? libvirt is fully multi-threaded - you can have multiple APIs calls in progress at once. > When a libvirt machine is selected, information is fetched for it every > second and the call to fetch the information can take from about 2.5s to > 2 *minutes* 45s (currently thats the case due to a regression in libvirt) Both of those times are outrageously slow. If operating correctly, we expect virDomainGetInfo() call to take 100 ms or less - it is intended to be very low overhead
Dunno if other places using get_info should be audited too. LibvirtMachine.state is using it, and it's in turn used by is_running, take_screenshot, ... No idea whether it's worth looking at these too.
Created attachment 212371 [details] [review] Asynchronously get information on libvirt domain OK, this one has only one timeout call. You'll notice that I don't remove the timeout but rather not do anything in the timeout callback once stats are disabled. This will mean an extra wake-up at the end but IMHO its worth it to keep the code simple.
Review of attachment 212371 [details] [review]: ::: src/libvirt-machine.vala @@ +224,3 @@ + Timeout.add_seconds (1, () => { + if (stats_enabled) + update_stats.begin (); I think it's cleaner if update_stats() doesn't embed the timer. It should be external, and enabled/disable should remove it, all from set_stats_enable()
(In reply to comment #14) > Review of attachment 212371 [details] [review]: > > ::: src/libvirt-machine.vala > @@ +224,3 @@ > + Timeout.add_seconds (1, () => { > + if (stats_enabled) > + update_stats.begin (); > > I think it's cleaner if update_stats() doesn't embed the timer. It should be > external, and enabled/disable should remove it, all from set_stats_enable() Now I'm really confused, I thought you asked me to remove the Timeout.add() call from set_stats_enable().
(In reply to comment #15) No, "the function itself shouldn't know it might be wrapped in a timer."
Created attachment 212372 [details] [review] Asynchronously get information on libvirt domain How about this version?
Review of attachment 212372 [details] [review]: ack
Review of attachment 212372 [details] [review]: ::: src/libvirt-machine.vala @@ +231,2 @@ stats_id = Timeout.add_seconds (1, () => { + update_stats.begin (); although we should somehow prevent multiple outstading call. Just track this call with a local variable to see if it finished?
Created attachment 212379 [details] [review] Asynchronously get information on libvirt domain Something like this?
Review of attachment 212379 [details] [review]: ::: src/libvirt-machine.vala @@ +236,3 @@ stats_id = Timeout.add_seconds (1, () => { + if (stats_updating) + return true; // Lets wait for previous async call to finish first rather something local like: var stats_updating = true; update_stats.begin(() => { stats_updating = false; })
Created attachment 212384 [details] [review] Asynchronously get information on libvirt domain This uses local variable.
Review of attachment 212384 [details] [review]: ::: src/libvirt-machine.vala @@ +91,3 @@ construct { stats = new MachineStat[STATS_SIZE]; + stats_cancellable = new Cancellable (); Hmm, I didn't notice it first, but I think for the same reason, it would be preferable that the cancellable would be contained in set_stats_enable(). This avoid having lot of property members.
(In reply to comment #23) > Review of attachment 212384 [details] [review]: > > ::: src/libvirt-machine.vala > @@ +91,3 @@ > construct { > stats = new MachineStat[STATS_SIZE]; > + stats_cancellable = new Cancellable (); > > Hmm, I didn't notice it first, but I think for the same reason, it would be > preferable that the cancellable would be contained in set_stats_enable(). This > avoid having lot of property members. The object is the outer context of set_stats_enable() and we need to keep the same cancellable across multiple calls to this method.
(In reply to comment #24) > The object is the outer context of set_stats_enable() and we need to keep the > same cancellable across multiple calls to this method. Correct. Can you group stats_id with the rest of the members? Can you change the comment you left to warning, for the case where update_stats() is too slow, like: if (stats_updating) { warning ("update_stats() is too slow, waiting for last call to finish..."); return true; } With that 2 changes, ack.
also stats_id would probably be better named stats_update_timeout, feel free to rename it.
(In reply to comment #25) > (In reply to comment #24) > > > The object is the outer context of set_stats_enable() and we need to keep the > > same cancellable across multiple calls to this method. > > Correct. Can you group stats_id with the rest of the members? Sure! > Can you change the comment you left to warning, for the case where > update_stats() is too slow, like: > > if (stats_updating) { > warning ("update_stats() is too slow, waiting for last call to finish..."); Are you sure? that'll spam the console a lot I'm affraid and doesn't tell anything to users other than devs. How about a debug?
Created attachment 212397 [details] [review] Asynchronously get information on libvirt domain
Review of attachment 212397 [details] [review]: ::: src/libvirt-machine.vala @@ +232,3 @@ + stats_update_timeout = Timeout.add_seconds (1, () => { + if (stats_updating) { + warning ("Fetching of stats for '%s' is taking too long. Probably a libvirt bug." + name); /+ name/, name
Attachment 212397 [details] pushed as f68fe17 - Asynchronously get information on libvirt domain
Created attachment 213625 [details] [review] Remove some more sync calls to GVir.Domain.get_info() This is a blocking call so remove calls where redundant or make it async in hot (enough) paths.
(In reply to comment #31) > Created an attachment (id=213625) [details] [review] > Remove some more sync calls to GVir.Domain.get_info() > > This is a blocking call so remove calls where redundant or make it > async in hot (enough) paths. Not very sure about this since I'd rather like to listen to libvirt's state change events and keep the last known state in a prop/field but I wouldn't recommend doing that until this bug is fixed: https://bugzilla.redhat.com/show_bug.cgi?id=819617 I was actually cooking a patch that uses events but because of this issue, I totally failed to make it work. :(
(In reply to comment #32) > https://bugzilla.redhat.com/show_bug.cgi?id=819617 > > I was actually cooking a patch that uses events but because of this issue, I > totally failed to make it work. :( ack
Review of attachment 213625 [details] [review]: ack
Attachment 213625 [details] pushed as 358c2e9 - Remove some more sync calls to GVir.Domain.get_info()