GNOME Bugzilla – Bug 683042
Boxes goes out of fullscreen right after starting a box fullscreen
Last modified: 2016-03-31 14:00:30 UTC
1) create a new box with an installer ISO which cannot be auto-installed (I've used a rhel6 dvd 2) upon creation, the display is fullscreened (expected behaviour) 3) right after zooming to fullscreen, it goes back to being minimized in the collection view This does not happen always, but most of the time. When I don't hit it, deleting the box, restarting boxes and retrying makes it happen. I've reproduced this in 3.5.5 and with git master, with spice-gtk 0.12.102 and spice-gtk 0.13. This looks like a race condition, and seems to have been there for quite some time now.
I've seen this just starting a STOPPED vm from the collection view.
Yup, me too, but seems to be recent.
Hmm today I can't reproduce the issue with starting a STOPPED vm from the collection view, but I'm still hitting the initial issue.
Created attachment 223409 [details] [review] libvirt-machine: remove reconnect_display() This handler was a poor-man attempt at dealing with async domain starting events. It now races with initial display.connect_it(), since LibvirtMachine.connect_display() is now fully async. This causes spice server to disconnect the session and Boxes goes out of fullscreen display.
Review of attachment 223409 [details] [review]: i have to think a bit more about this, the patch I propose may break the "stay-on-display" behaviour, as I can't see who is reconnecting to display when it is rebooted...
Created attachment 223422 [details] [review] Fix going out of fullscreen right after starting a box When the VM is started, there is an additional handler reconnect_display() racing with connect_display(), and the spice server disconnect clients if multi-client isn't enabled. reconnect_display() is still needed for the "stay on display across reboots" functionnality. If the VM is actually resumed/started, rely on it.
Review of attachment 223422 [details] [review]: A bit worried that variations on this issue will come back to byte us later, but we'll see... ACK with the small changes in the comment folded in. ::: src/libvirt-machine.vala @@ +31,3 @@ + if (yield start ()) + // in which case, reconnect_display() is called when the + // machine is running, no need to connect again // in *this* case, reconnect_display() *will be* called...
(In reply to comment #7) > Review of attachment 223422 [details] [review]: > > A bit worried that variations on this issue will come back to byte us later, > but we'll see... ACK with the small changes in the comment folded in. > > ::: src/libvirt-machine.vala > @@ +31,3 @@ > + if (yield start ()) > + // in which case, reconnect_display() is called when the > + // machine is running, no need to connect again > > // in *this* case, reconnect_display() *will be* called... my experience with that commit is that it will rely even more on libvirt-glib events.. and that's still very much broken... Perhaps we want to disable reconnect_display() when !stay_on_display, and still continue here... I'll update thr patch
Something like this would also fix it: diff --git a/src/libvirt-machine.vala b/src/libvirt-machine.vala index e59f380..7f63d4a 100644 --- a/src/libvirt-machine.vala +++ b/src/libvirt-machine.vala @@ -27,11 +27,11 @@ public override async void connect_display () { if (_connect_display) return; - _connect_display = true; yield start (); update_display (); display.connect_it (); + _connect_display = true; } struct MachineStat {
(In reply to comment #9) > Something like this would also fix it: > > diff --git a/src/libvirt-machine.vala b/src/libvirt-machine.vala > index e59f380..7f63d4a 100644 > --- a/src/libvirt-machine.vala > +++ b/src/libvirt-machine.vala > @@ -27,11 +27,11 @@ public override async void connect_display () { > if (_connect_display) > return; > - _connect_display = true; > yield start (); > update_display (); > display.connect_it (); > + _connect_display = true; > } > struct MachineStat { Iirc, we used to have connect_display() reentered, but this is no longer the case. So we can probably move _connect_display = true at the end. Though it would no longer prevent correctly several consecutive ongoing connect_display()... It's probably fine too.
Created attachment 223551 [details] [review] Fix going out of fullscreen right after starting a box When the VM is started, there is an additional handler reconnect_display() racing with connect_display(), and the spice server disconnect clients if multi-client isn't enabled. reconnect_display() is still needed for the "stay on display across reboots" functionnality. Since the libvirt-glib events are not so reliable, we still need to call display.connect_it() right after starting the VM. However, update_display() is creating a new display widget and a new client connexion unconditionnaly. With this change, it now verify if the display is the same to avoid creating multiple racing clients.
Review of attachment 223551 [details] [review]: I'm currently testing a potential fix for this event issue, dunno if we should differ this patch for a bit longer, or if it's better to just have an additional cleanup patch after libvirt no longer has this issue. Apart from this patch looks good
Review of attachment 223551 [details] [review]: Hmm let's wait a bit on this, I applied it to test it for a bit, and got a Boxes segfault when trying to connect to a suspended VM. I'll test it more.
Review of attachment 223551 [details] [review]: Happened a second time on a suspended WinXP VM (out of 2 tests): (gdb) thread apply all bt
+ Trace 230808
Thread 1 (Thread 0x7fffefe6ba00 (LWP 7209))
I can't reproduce that crash, and never got any crash with that patch so far (restoring or not, from wizard creation etc).
running with valgrind or MALLOC_CHECK_=2 doesn't detect double-free either
I'm in the opposite situation, I cannot get it not to crash with this VM :( valgrind gives me: ==14963== Invalid read of size 1 ==14963== at 0xA734F2C: g_pointer_bit_lock (gbitlock.c:404) ==14963== by 0xA745BB4: g_datalist_lock (gdataset.c:213) ==14963== by 0xA745FFF: g_data_set_internal (gdataset.c:358) ==14963== by 0xA746618: g_datalist_id_set_data_full (gdataset.c:674) ==14963== by 0xA2D486E: g_object_real_dispose (gobject.c:1011) ==14963== by 0x77A8533: spice_channel_dispose (spice-channel.c:154) ==14963== by 0x77BA803: spice_main_channel_dispose (channel-main.c:272) ==14963== by 0xA2D994B: g_object_unref (gobject.c:2986) ==14963== by 0x77ADBF2: spice_channel_delayed_unref (spice-channel.c:2085) ==14963== by 0xA767ED0: g_idle_dispatch (gmain.c:4806) ==14963== by 0xA76588C: g_main_dispatch (gmain.c:2715) ==14963== by 0xA76643C: g_main_context_dispatch (gmain.c:3219) ==14963== Address 0x17c4c4b0 is 16 bytes inside a block of size 2,960 free'd ==14963== at 0x4A079AE: free (vg_replace_malloc.c:427) ==14963== by 0xA76D9EE: standard_free (gmem.c:98) ==14963== by 0xA76DBB1: g_free (gmem.c:252) ==14963== by 0xA7864CB: g_slice_free1 (gslice.c:1111) ==14963== by 0xA2F0115: g_type_free_instance (gtype.c:1935) ==14963== by 0xA2D9AF7: g_object_unref (gobject.c:3036) ==14963== by 0x46BE00: block33_data_unref (spice-display.c:545) ==14963== by 0xA2CBB4C: closure_invoke_notifiers (gclosure.c:249) ==14963== by 0xA2CD113: g_closure_unref (gclosure.c:599) ==14963== by 0xA2E39DD: handler_unref_R (gsignal.c:645) ==14963== by 0xA2E8621: g_signal_handlers_destroy (gsignal.c:2639) ==14963== by 0xA2D484F: g_object_real_dispose (gobject.c:1010) ==14963== ==14963== Invalid read of size 8 ==14963== at 0xA746007: g_data_set_internal (gdataset.c:360) ==14963== by 0xA746618: g_datalist_id_set_data_full (gdataset.c:674) ==14963== by 0xA2D486E: g_object_real_dispose (gobject.c:1011) ==14963== by 0x77A8533: spice_channel_dispose (spice-channel.c:154) ==14963== by 0x77BA803: spice_main_channel_dispose (channel-main.c:272) ==14963== by 0xA2D994B: g_object_unref (gobject.c:2986) ==14963== by 0x77ADBF2: spice_channel_delayed_unref (spice-channel.c:2085) ==14963== by 0xA767ED0: g_idle_dispatch (gmain.c:4806) ==14963== by 0xA76588C: g_main_dispatch (gmain.c:2715) ==14963== by 0xA76643C: g_main_context_dispatch (gmain.c:3219) ==14963== by 0xA76661F: g_main_context_iterate (gmain.c:3290) ==14963== by 0xA7666E3: g_main_context_iteration (gmain.c:3351) ==14963== Address 0x17c4c4b0 is 16 bytes inside a block of size 2,960 free'd ==14963== at 0x4A079AE: free (vg_replace_malloc.c:427) ==14963== by 0xA76D9EE: standard_free (gmem.c:98) ==14963== by 0xA76DBB1: g_free (gmem.c:252) ==14963== by 0xA7864CB: g_slice_free1 (gslice.c:1111) ==14963== by 0xA2F0115: g_type_free_instance (gtype.c:1935) ==14963== by 0xA2D9AF7: g_object_unref (gobject.c:3036) ==14963== by 0x46BE00: block33_data_unref (spice-display.c:545) ==14963== by 0xA2CBB4C: closure_invoke_notifiers (gclosure.c:249) ==14963== by 0xA2CD113: g_closure_unref (gclosure.c:599) ==14963== by 0xA2E39DD: handler_unref_R (gsignal.c:645) ==14963== by 0xA2E8621: g_signal_handlers_destroy (gsignal.c:2639) ==14963== by 0xA2D484F: g_object_real_dispose (gobject.c:1010) ==14963== ==14963== Invalid read of size 4 ==14963== at 0xA746038: g_data_set_internal (gdataset.c:367) ==14963== by 0xA746618: g_datalist_id_set_data_full (gdataset.c:674) ==14963== by 0xA2D486E: g_object_real_dispose (gobject.c:1011) ==14963== by 0x77A8533: spice_channel_dispose (spice-channel.c:154) ==14963== by 0x77BA803: spice_main_channel_dispose (channel-main.c:272) ==14963== by 0xA2D994B: g_object_unref (gobject.c:2986) ==14963== by 0x77ADBF2: spice_channel_delayed_unref (spice-channel.c:2085) ==14963== by 0xA767ED0: g_idle_dispatch (gmain.c:4806) ==14963== by 0xA76588C: g_main_dispatch (gmain.c:2715) ==14963== by 0xA76643C: g_main_context_dispatch (gmain.c:3219) ==14963== by 0xA76661F: g_main_context_iterate (gmain.c:3290) ==14963== by 0xA7666E3: g_main_context_iteration (gmain.c:3351) ==14963== Address 0xaaaaaaaaaaaaaaa8 is not stack'd, malloc'd or (recently) free'd ==14963== ==14963== ==14963== Process terminating with default action of signal 11 (SIGSEGV) ==14963== General Protection Fault ==14963== at 0xA746038: g_data_set_internal (gdataset.c:367) ==14963== by 0xA746618: g_datalist_id_set_data_full (gdataset.c:674) ==14963== by 0xA2D486E: g_object_real_dispose (gobject.c:1011) ==14963== by 0x77A8533: spice_channel_dispose (spice-channel.c:154) ==14963== by 0x77BA803: spice_main_channel_dispose (channel-main.c:272) ==14963== by 0xA2D994B: g_object_unref (gobject.c:2986) ==14963== by 0x77ADBF2: spice_channel_delayed_unref (spice-channel.c:2085) ==14963== by 0xA767ED0: g_idle_dispatch (gmain.c:4806) ==14963== by 0xA76588C: g_main_dispatch (gmain.c:2715) ==14963== by 0xA76643C: g_main_context_dispatch (gmain.c:3219) ==14963== by 0xA76661F: g_main_context_iterate (gmain.c:3290) ==14963== by 0xA7666E3: g_main_context_iteration (gmain.c:3351) Not much clue as to what to do next :(
please retest the second patch with patch from 683561
Woa, this one is iffy. The patch causes us to create a new SpiceDisplay, notice that its the same as the old one, yet we still call session.connect () but not on the new SpiceDisplay but the old one. This will cause it to first disconnect, and the connect again. spice_session_disconnect() calls spice_channel_destroy() on the old channel, but does not remove them from the session->channels ring, instead it relies on that happening in spice_session_channel_destroy which is called from spice_channel_dispose(). However, since we've connected to the main channel the spice_channel_coroutine is running (or at least scheduled), which has a ref on the channel. So, it doesn't actually dispose the channel yet, so its not removed from the session->channels list. Its also not told to stop connecting, so we end up with two main channel coroutines in the session, both connecting to the same spice server. The first succeeds, then the second, then the first gets kicked off by the second, so we get a disconnected error. This results in the SpiceSession object disconnection code running again, which correctly destroys the second channel (which is now correctly connected) and then the first channel. However, when the first channel is destroyed using spice_channel_destroy() the *second* time it unrefs the spice channel *again*, essentially stealing the ref that the coroutine for that channel owns, and then when we eventually schedule back to that corouting we crash as it has been freed. I tried to add a g_object_run_dispose() call to spice_channel_destroy() to ensure that we correctly remove the channel from session->channels, but this leads to other problems, as it NULLs out channel->session which later causes the running coroutine to crash. I think boxes shouldn't be calling connect twice on the same display anyway, which would fix this. But its also bad that we can end up with two main channels in a session.
Also, i don't see how the patch in 683561 would help things. There is still a ref from spice-channel.c:channel_connect() that will make the channel_destroy not dispose => not remove from session. So, you will still get two connecting channels. Additionally, even if the disconnect did work, it would still connect, disconnect and then connect again, which seems pretty stupid.
Created attachment 223767 [details] [review] Only reconnect after initial connect succeds Otherwise we race with connect_display() which causes us to connect twice during startup.
Attachment 223767 [details] pushed as 7fb07ef - Only reconnect after initial connect succeds
Although I thought also it would work, sadly, the last patch doesn't, we still get a 2nd reconnect when the VM is started: the event is received later on, not necessarily during yield start() time. I guess your libvirt was broken and you didn't receive all events when you tested your patch, or just some other raciness..
Also, the patch seems to have regressed password support. Machine.set_password_needed() does: if (needed) { password_entry.grab_focus (); machine.display = null; } And then on enter we machine.connect_display.begin () again, but now _connect_display is true so we ignore this :/
Created attachment 224927 [details] [review] Fix multiple connections on startup This is a better fix for bug 683042 (where we connect twice when clicking on a box). It fixes the race where we connect twice as well as fix the regression where password authentication was broken. We now handle the race by making connect_display() safe to run multiple times by: * Making update_display() only create a display if needed. This is safe as we always clear the display property before reconnecting. * The Display.connect_it() method was changed to only connect once. Additionally we make reconnect_display() only reconnect if there is already a display. This means that if we hit reconnect_display() before the initial connect_display() has started connecting we don't connect. This ensures connect_display() does the actual connection, so that it can properly report any errors.
Created attachment 224928 [details] [review] Ensure that you can only connect a Display once We currently handle reconnecting by creating a new Display object anyway. But we make Display.connect_it() callable mutlpiple times (all but first are ignored) so that we can handle some activation races better.
I got these in the wrong order, the second one should be applied first.
Review of attachment 224928 [details] [review]: The approach looks good to me, however 'connected' should be reset to false in disconnect_it() 'mutlpiple' typo in the commit log. ::: src/vnc-display.vala @@ +117,3 @@ + if (connected) + return; + connected = true; This can throw so resetting 'connected' to false when there's a exception would look better. However, if it's reset to 'false' in disconnect_it(), this might not be strictly required.
Review of attachment 224927 [details] [review]: Looks good, would be nice to get this in for 3.6 if we get freeze break approval.
> The approach looks good to me, however 'connected' should be reset to false in > disconnect_it() The whole point is that we should not really re-use a single Display instance for connecting multiple times, but rather create a new one, because of issues with reusing the Display, *including* once its disconnected. (i.e. in comment 19 above we call disconnect and then connect on the same display and it breaks shit).
(In reply to comment #31) > > The approach looks good to me, however 'connected' should be reset to false in > > disconnect_it() > > The whole point is that we should not really re-use a single Display instance > for connecting multiple times, but rather create a new one, because of issues > with reusing the Display, *including* once its disconnected. (i.e. in comment > 19 above we call disconnect and then connect on the same display and it breaks > shit). It seems we are missing error handling in some cases: LibvirtMachine::connect_display() sets display to non-null in update_display(), and then calls display.connect_it() which can fail in VncDisplay. In this case we'd end up with a non-null Display, but not being connected. I didn't see anything handling errors from connect_display()
App.select_item calls connect_display and handles errors, which is the main way this runs. However, it is correct that we fail to report errors in some cases, like when entering a password or when reconnecting. The reconnecting part will no longer race with the initial connect_display with the patch, but it may fail to report a connection error in the stay_on_display case. Neither of these report any errors currently though, so this is not really a regression.
Created attachment 225067 [details] [review] Ensure that you can only connect a Display once We currently handle reconnecting by creating a new Display object anyway. But we make Display.connect_it() callable multiple times (all but first are ignored) so that we can handle some activation races better.
Created attachment 225068 [details] [review] Fix multiple connections on startup This is a better fix for bug 683042 (where we connect twice when clicking on a box). It fixes the race where we connect twice as well as fix the regression where password authentication was broken. We now handle the race by making connect_display() safe to run multiple times by: * Making update_display() only create a display if needed. This is safe as we always clear the display property before reconnecting. * The Display.connect_it() method was changed to only connect once. Additionally we make reconnect_display() only reconnect if there is already a display. This means that if we hit reconnect_display() before the initial connect_display() has started connecting we don't connect. This ensures connect_display() does the actual connection, so that it can properly report any errors.
New patches in right order with typos fixed.
Review of attachment 225067 [details] [review]: Needs break request
Review of attachment 225068 [details] [review]: .
Attachment 225067 [details] pushed as 36589a3 - Ensure that you can only connect a Display once Attachment 225068 [details] pushed as 0d5f162 - Fix multiple connections on startup