GNOME Bugzilla – Bug 789086
Xwayland leaves no core dump when crashing
Last modified: 2017-11-17 13:49:58 UTC
Xwayland (spawned and managed by mutter) leaves no core dump when crashing This is a big deal because it seems to be the most common trigger for gnome-shell dying: https://launchpad.net/bugs/1505409 but see also bug 759538. So we need to find out exactly why Xwayland is dying so much...
Created attachment 361729 [details] [review] 0001-Allow-Xwayland-to-leave-core-dumps.patch A fix for mutter. Tested on Ubuntu 17.10.
Review of attachment 361729 [details] [review]: Have you checked that this does not cause core dumps when exiting gnome-shell? IIRC Xwayland will see that as the socket suddenly stops working.
Ugh. You appear to be right. So we need to fix Xwayland before we can enable it dumping core? :)
Or fix mutter to kill Xwayland cleanly.
Terminating Xwayland on tear-down seems like a good idea to me.
/me works on patch v2
Created attachment 361736 [details] [review] Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v2.patch This one avoids spurious Xwayland core dumps on regular clean shutdown... Most of the time. It still happens occasionally, which I think is good because those are probably some of the crashes we need to analyse and fix.
FWIW, the other way around (Xwayland crashes, mutter crashes with it) is tracked in bug 782660.
Review of attachment 361736 [details] [review]: ::: src/wayland/meta-xwayland.c @@ +554,1 @@ NULL); style nit: fix the indentation @@ +559,3 @@ } + manager->crash_handler = g_cancellable_new (); nit: We usually call "GCancellable"s something with "cancellable", to make it clear what it is. Maybe "xwayland_wait_cancellable" or something. @@ +597,3 @@ +static +gboolean on_xwayland_hung_shutdown (gpointer user_data) Coding style: static gboolean on_xwayland_hung_shutdown (gpointer user_data) Though the meaning of the name is a bit unclear; maybe "on_xwayland_shutdown_timeout" or something, to make it clear this is a timeout on shutdown? @@ +599,3 @@ +gboolean on_xwayland_hung_shutdown (gpointer user_data) +{ + GCancellable *clean_exit = G_CANCELLABLE (user_data); style nit: empty line after declarations. @@ +608,3 @@ { char path[256]; + GCancellable *clean_exit; Can just call this "cancellable", or "shutdown_cancellable", since that is what it is. @@ +609,3 @@ char path[256]; + GCancellable *clean_exit; + GError *error; You must initialize this to NULL here. @@ +614,3 @@ meta_xwayland_shutdown_selection (); + g_cancellable_cancel (manager->crash_handler); Shouldn't you also handle this case xserver_died(). It should be communicated via the GAsyncResult, and handle the case where g_subprocess_wait_finish() returns a "cancelled" error. @@ +618,3 @@ + /* Ideally try to let Xwayland terminate cleanly. It's never been allowed + * to do that till now! + */ This is what you are doing, but it'd be more informative as of *why*. I guess something about terminating the X server before it would abort due to the Wayland socket being closed to avoid false errors. @@ +623,3 @@ + g_subprocess_send_signal (manager->proc, SIGTERM); + if (!g_subprocess_wait (manager->proc, clean_exit, &error)) + { Wouldn't hurt g_warning() with the error here I think. You also need to g_clear_error() after. @@ +625,3 @@ + { + g_subprocess_force_exit (manager->proc); + g_subprocess_wait (manager->proc, clean_exit, &error); You need to g_clear_error() here too, or use g_error_free().
Yeah, serves me right for trying to start a new bug fix at the end of the day and trying to rush it through. Also, I woke up thinking maybe the timeout is a mistake. Is there any event loop to even run it? (in another thread or dispatched inside g_subprocess_wait?) If it won't get dispatched then drop the timeout and the attempt to use SIGTERM, methinks. Just go straight for the SIGKILL (which actually is still better than how it dies right now - just crashing and hiding the core).
(In reply to Daniel van Vugt from comment #10) > Yeah, serves me right for trying to start a new bug fix at the end of the > day and trying to rush it through. > > Also, I woke up thinking maybe the timeout is a mistake. Is there any event > loop to even run it? (in another thread or dispatched inside > g_subprocess_wait?) If it won't get dispatched then drop the timeout and the > attempt to use SIGTERM, methinks. Just go straight for the SIGKILL (which > actually is still better than how it dies right now - just crashing and > hiding the core). SIGKILL? That sounds a bit rough. Why don't you want Xwayland to have a chance to gracefully shut down? FWIW, the synchronous wait I'm assuming will work as documented. Only the asynchronous one requires you to deal with GMainLoop.
My rationale for not minding SIGKILL is the assumption that Xwayland probably/shouldn't have anything to save to disk. I don't see a good case for letting it shut down "gracefully" with SIGTERM when that's possible to be ignored/hanging/crashing. All the difficulty and headache here is handling those cases where SIGTERM could be ignored intentionally or unintentionally. But I don't mind. I've written code for both and frankly it's an unimportant part of the fix. Anything is better than the status quo.
(In reply to Daniel van Vugt from comment #12) > My rationale for not minding SIGKILL is the assumption that Xwayland > probably/shouldn't have anything to save to disk. I don't see a good case > for letting it shut down "gracefully" with SIGTERM when that's possible to > be ignored/hanging/crashing. All the difficulty and headache here is > handling those cases where SIGTERM could be ignored intentionally or > unintentionally. > > But I don't mind. I've written code for both and frankly it's an unimportant > part of the fix. Anything is better than the status quo. The thing that comes to mind is being able to disconnect clients the right way. I don't know how it's done today, but using SIGTERM I imagine would allow X to disconnect them gracefully, instead of the clients thinking the server crashed.
Created attachment 361792 [details] [review] Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v3.patch Patch version 3. Now simpler and seemingly more reliable.
Review of attachment 361792 [details] [review]: FYI, link [1] isn't publicly accessible, so I wouldn't mention that. I think there's something similar on fedora that is accessible and linkable.
Review of attachment 361792 [details] [review]: No point in linking to a moving target (the errors.ubuntu.com link) in the commit message. In a year it'll hopefully serve no purpose at all anymore since all crashes will be fixed by then ;P ::: src/wayland/meta-xwayland.c @@ +558,3 @@ "-listen", "5", "-displayfd", "6", NULL); style nit: You forgot to update the indentation here. FWIW, I also think it'd be clearer if we put each flag on its own line, so can fix that while at it. @@ +606,3 @@ +{ + /* Prevent ourselves from crashing when Xwayland exits. */ + g_cancellable_cancel (manager->xserver_died_cancellable); You don't handle the fact that the callback was cancelled in the callback handler. You need to finish the async call, check if it was cancelled, then handle that differently. @@ +608,3 @@ + g_cancellable_cancel (manager->xserver_died_cancellable); + + meta_xwayland_shutdown_selection (); Whats the reason you are not doing this in meta_xwayland_stop() now? Was shutting down the selection handler broken before this?
* Yeah, I will provide a more static bug link in future. * No, I didn't forget to update indentation. I was just confused as to what you meant by "style nit: fix the indentation" on something that's perfectly indented and I didn't change... Now I think I understand you're asking me to fix a pre-existing issue, and one that's not visible in your pasted explanation :) * "You need to finish the async call": I'm not sure about that. "finish" doesn't clean up anything, it only seems to return a GError. But now I'm typing these words I wonder... are you assuming the finish is required to get the GError in order to ensure it's unref'd later? If so, it's not documented, but that would make sense... https://developer.gnome.org/gio/stable/GSubprocess.html#g-subprocess-wait-async * Moving meta_xwayland_shutdown_selection out of meta_xwayland_stop was required at the time because I found that: 1. meta_screen_free in mutter caused the Xwayland process to crash. 2. mutter would catch the Xwayland crash, and so itself also crash (in a racy parallel-process way). 3. The mutter shutdown logic after meta_screen_free was (often) never executed. However that might have been a characteristic of intermediate versions of my branch, so I'll test reverting that part. Something that was in the back of my mind already...
(In reply to Daniel van Vugt from comment #17) > * Yeah, I will provide a more static bug link in future. > > * No, I didn't forget to update indentation. I was just confused as to what > you meant by "style nit: fix the indentation" on something that's perfectly > indented and I didn't change... Now I think I understand you're asking me to > fix a pre-existing issue, and one that's not visible in your pasted > explanation :) The change you introduced made the indentation of existing lines incorrect. What I asked was to fix the indentation of lines that were now wrongly indented. There was no prexisting issue as far as I can see. In other words, it looked like this foo = bar(asdf, aoeu); You changed it to foo_baar = bar(asdf, aoeu); while it should have been changed to foo_baar = bar(asdf, aoeu); > > * "You need to finish the async call": I'm not sure about that. "finish" > doesn't clean up anything, it only seems to return a GError. But now I'm > typing these words I wonder... are you assuming the finish is required to > get the GError in order to ensure it's unref'd later? If so, it's not > documented, but that would make sense... > https://developer.gnome.org/gio/stable/GSubprocess.html#g-subprocess-wait- > async What I meant was that 'g_cancellable_cancel()' doesn't mean xserver_died() wont be called, but that when you "finish" it with g_subprocess_wait_async_finish() you can detect that you were cancelled by checking the returned GError code. If I read things correctly, we might now instead get called, and we'd check whether Xwayland exited cleanly, then notice it did, then call g_error ("Spurious exit of X Wayland server"); Which we don't want to. Now, the async callback I supposed would only be called if we entered a GMainLoop again, which I suspect we won't but it is wrong none the less to depend on that.
Oh, I see. You mean: + manager->proc = g_subprocess_launcher_spawn (launcher, &error, XWAYLAND_PATH, manager->display_name, I didn't see it. But also that's because I was both keeping the diff small, and because reintroducing the alignment would take it beyond 80 columns (which on my screen is a greater sin). But yeah I see now... And yes, I had the same thought about g_error too. A proper cleanup is pointless if you assume g_error is there. It is there, but we shouldn't assume it is. Unfortunately fixing this means introducing nontrivial untested code (never reached due to g_error), which I would prefer not to. Still, easier to do as you ask than to argue further.
(In reply to Daniel van Vugt from comment #19) > Oh, I see. You mean: > > + manager->proc = g_subprocess_launcher_spawn (launcher, &error, > XWAYLAND_PATH, manager->display_name, > > I didn't see it. But also that's because I was both keeping the diff small, > and because reintroducing the alignment would take it beyond 80 columns > (which on my screen is a greater sin). But yeah I see now... Going beyond is a great sin on my screen too :P but there are other ways within the coding style scheme that can mitigate that, such as a line break after the '=', or using temporary variables that replaces lines that "stick out". > > And yes, I had the same thought about g_error too. A proper cleanup is > pointless if you assume g_error is there. It is there, but we shouldn't > assume it is. Unfortunately fixing this means introducing nontrivial > untested code (never reached due to g_error), which I would prefer not to. > Still, easier to do as you ask than to argue further. It will become tested eventually, because we will at one point support stopping Xwayland 'mid-session', and then later restarting it. For that we must properly deal with the cancelled callback actually being invoked with the cancelled-error. If you don't care at all about dealing with those things, then having any cancellable there at all is pointless.
It's no problem. I just haven't had time to finish the new patch while we deal with the flood of bug mail from last week's 17.10 release. Should calm down within a few days...
Oh I see... you need to call finish from inside the callback. That's more complicated than any other async API I've seen. Also confusing when the existing code already did it wrong, plus the relevant documentation was not in or mentioned in the GSubprocess docs. But got there in the end, thanks.
Created attachment 362163 [details] [review] Allow-Xwayland-to-leave-core-dumps-v4.patch Patch version 4.
Review of attachment 362163 [details] [review]: ::: src/wayland/meta-xwayland.c @@ +397,3 @@ + + if (!g_subprocess_wait_finish (proc, result, &error) && error) + cancelled = g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED); No need to log the error otherwise? Indeed there will be an "X Wayland crashed" error after, but maybe it could be informative to use this error here... PS: no need to check also for error validity, check it or the return value, they're guaranteed to be in sync.
No, the logging already exists. Although we don't log what the error was in detail, I feel that would be asking for an enhancement unrelated to what we're really trying to do here. Also, I would generally prefer to test error!=NULL before using it to avoid getting blamed for crashes. Although hiding such crashes could also be problematic. I'm on the fence.
(In reply to Daniel van Vugt from comment #25) > No, the logging already exists. Although we don't log what the error was in > detail, I feel that would be asking for an enhancement unrelated to what > we're really trying to do here. This is about rearranging how we deal with Xwayland tear down so I'd say proper changes to logging is in scope. The most consistent way to handle it would be something like: if (!g_subprocess_wait_finish (proc, result, &error)) { if (!g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) { g_error ("Failed to finish waiting for Xwayland: %s", error->message); } g_error_free (error); return; } ... regular path ... > > Also, I would generally prefer to test error!=NULL before using it to avoid > getting blamed for crashes. Although hiding such crashes could also be > problematic. I'm on the fence. A 'gboolean foo(..., GError **)' returning FALSE but not setting the error is invalid and should not ever be handled by the caller but be fixed in the callee.
Created attachment 362238 [details] [review] Allow-Xwayland-to-leave-core-dumps-v5.patch Patch version 5.
Review of attachment 362238 [details] [review]: I like this way better, thanks... I think we can go.
(In reply to Marco Trevisan (Treviño) from comment #29) > Review of attachment 362238 [details] [review] [review]: > > I like this way better, thanks... I think we can go. +1
The following fix has been pushed: 054c25f wayland: Allow Xwayland to leave core dumps
Created attachment 362380 [details] [review] wayland: Allow Xwayland to leave core dumps For historical Xorg-reasons, Xwayland would disable its own core dumps by default. This is a problem because Xwayland crashing is the biggest cause of gnome-shell crashes [1][2], and we still have no idea why due to there being no dumps from Xwayland. So enable core dumping from Xwayland. https://bugzilla.gnome.org/show_bug.cgi?id=789086 [1] https://bugs.launchpad.net/bugs/1505409 [2] https://bugs.launchpad.net/bugs/1556601