Bug 789086 – Xwayland leaves no core dump when crashing

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 789086 - Xwayland leaves no core dump when crashing


Summary:	Xwayland leaves no core dump when crashing


Status:	RESOLVED FIXED

Product:	mutter
Classification:	Core
Component:	wayland
Version:	3.26.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	mutter-maint
QA Contact:	mutter-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-10-17 09:40 UTC by Daniel van Vugt
Modified:	2017-11-17 13:49 UTC

See Also:	https://launchpad.net/bugs/1724185 790502
GNOME target:	---
GNOME version:	3.25/3.26

Attachments
0001-Allow-Xwayland-to-leave-core-dumps.patch (1.20 KB, patch) 2017-10-17 09:54 UTC, Daniel van Vugt	none	Details \| Review
Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v2.patch (4.17 KB, patch) 2017-10-17 11:50 UTC, Daniel van Vugt	none	Details \| Review
Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v3.patch (6.33 KB, patch) 2017-10-18 10:07 UTC, Daniel van Vugt	reviewed	Details \| Review
Allow-Xwayland-to-leave-core-dumps-v4.patch (4.71 KB, patch) 2017-10-24 09:12 UTC, Daniel van Vugt	none	Details \| Review
Allow-Xwayland-to-leave-core-dumps-v5.patch (4.79 KB, patch) 2017-10-25 07:47 UTC, Daniel van Vugt	committed	Details \| Review
wayland: Allow Xwayland to leave core dumps (4.79 KB, patch) 2017-10-27 00:55 UTC, Marco Trevisan (Treviño)	committed	Details \| Review

Description Daniel van Vugt 2017-10-17 09:40:03 UTC

Xwayland (spawned and managed by mutter) leaves no core dump when crashing

This is a big deal because it seems to be the most common trigger for gnome-shell dying:

https://launchpad.net/bugs/1505409
but see also bug 759538.

So we need to find out exactly why Xwayland is dying so much...

Comment 1 Daniel van Vugt 2017-10-17 09:54:46 UTC

Created attachment 361729 [details] [review]
0001-Allow-Xwayland-to-leave-core-dumps.patch

A fix for mutter. Tested on Ubuntu 17.10.

Comment 2 Jonas Ådahl 2017-10-17 10:04:51 UTC

Review of attachment 361729 [details] [review]:

Have you checked that this does not cause core dumps when exiting gnome-shell? IIRC Xwayland will see that as the socket suddenly stops working.

Comment 3 Daniel van Vugt 2017-10-17 10:07:49 UTC

Ugh. You appear to be right. So we need to fix Xwayland before we can enable it dumping core? :)

Comment 4 Daniel van Vugt 2017-10-17 10:08:25 UTC

Or fix mutter to kill Xwayland cleanly.

Comment 5 Jonas Ådahl 2017-10-17 10:10:09 UTC

Terminating Xwayland on tear-down seems like a good idea to me.

Comment 6 Daniel van Vugt 2017-10-17 10:13:10 UTC

/me works on patch v2

Comment 7 Daniel van Vugt 2017-10-17 11:50:18 UTC

Created attachment 361736 [details] [review]
Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v2.patch

This one avoids spurious Xwayland core dumps on regular clean shutdown... Most of the time. It still happens occasionally, which I think is good because those are probably some of the crashes we need to analyse and fix.

Comment 8 Jonas Ådahl 2017-10-17 13:48:28 UTC

FWIW, the other way around (Xwayland crashes, mutter crashes with it) is tracked in bug 782660.

Comment 9 Jonas Ådahl 2017-10-17 14:09:25 UTC

Review of attachment 361736 [details] [review]:

::: src/wayland/meta-xwayland.c
@@ +554,1 @@
                                       NULL);

style nit: fix the indentation

@@ +559,3 @@
     }
 
+  manager->crash_handler = g_cancellable_new ();

nit: We usually call "GCancellable"s something with "cancellable", to make it clear what it is. Maybe "xwayland_wait_cancellable" or something.

@@ +597,3 @@
 
+static
+gboolean on_xwayland_hung_shutdown (gpointer user_data)

Coding style:

static gboolean
on_xwayland_hung_shutdown (gpointer user_data)

Though the meaning of the name is a bit unclear; maybe "on_xwayland_shutdown_timeout" or something, to make it clear this is a timeout on shutdown?

@@ +599,3 @@
+gboolean on_xwayland_hung_shutdown (gpointer user_data)
+{
+  GCancellable *clean_exit = G_CANCELLABLE (user_data);

style nit: empty line after declarations.

@@ +608,3 @@
 {
   char path[256];
+  GCancellable *clean_exit;

Can just call this "cancellable", or "shutdown_cancellable", since that is what it is.

@@ +609,3 @@
   char path[256];
+  GCancellable *clean_exit;
+  GError *error;

You must initialize this to NULL here.

@@ +614,3 @@
   meta_xwayland_shutdown_selection ();
 
+  g_cancellable_cancel (manager->crash_handler);

Shouldn't you also handle this case xserver_died(). It should be communicated via the GAsyncResult, and handle the case where g_subprocess_wait_finish() returns a "cancelled" error.

@@ +618,3 @@
+  /* Ideally try to let Xwayland terminate cleanly. It's never been allowed
+   * to do that till now!
+   */

This is what you are doing, but it'd be more informative as of *why*. I guess something about terminating the X server before it would abort due to the Wayland socket being closed to avoid false errors.

@@ +623,3 @@
+  g_subprocess_send_signal (manager->proc, SIGTERM);
+  if (!g_subprocess_wait (manager->proc, clean_exit, &error))
+    {

Wouldn't hurt g_warning() with the error here I think. You also need to g_clear_error() after.

@@ +625,3 @@
+    {
+      g_subprocess_force_exit (manager->proc);
+      g_subprocess_wait (manager->proc, clean_exit, &error);

You need to g_clear_error() here too, or use g_error_free().

Comment 10 Daniel van Vugt 2017-10-18 01:21:14 UTC

Yeah, serves me right for trying to start a new bug fix at the end of the day and trying to rush it through.

Also, I woke up thinking maybe the timeout is a mistake. Is there any event loop to even run it? (in another thread or dispatched inside g_subprocess_wait?) If it won't get dispatched then drop the timeout and the attempt to use SIGTERM, methinks. Just go straight for the SIGKILL (which actually is still better than how it dies right now - just crashing and hiding the core).

Comment 11 Jonas Ådahl 2017-10-18 05:21:20 UTC

(In reply to Daniel van Vugt from comment #10)
> Yeah, serves me right for trying to start a new bug fix at the end of the
> day and trying to rush it through.
> 
> Also, I woke up thinking maybe the timeout is a mistake. Is there any event
> loop to even run it? (in another thread or dispatched inside
> g_subprocess_wait?) If it won't get dispatched then drop the timeout and the
> attempt to use SIGTERM, methinks. Just go straight for the SIGKILL (which
> actually is still better than how it dies right now - just crashing and
> hiding the core).

SIGKILL? That sounds a bit rough. Why don't you want Xwayland to have a chance to gracefully shut down?

FWIW, the synchronous wait I'm assuming will work as documented. Only the asynchronous one requires you to deal with GMainLoop.

Comment 12 Daniel van Vugt 2017-10-18 05:28:07 UTC

My rationale for not minding SIGKILL is the assumption that Xwayland probably/shouldn't have anything to save to disk. I don't see a good case for letting it shut down "gracefully" with SIGTERM when that's possible to be ignored/hanging/crashing. All the difficulty and headache here is handling those cases where SIGTERM could be ignored intentionally or unintentionally.

But I don't mind. I've written code for both and frankly it's an unimportant part of the fix. Anything is better than the status quo.

Comment 13 Jonas Ådahl 2017-10-18 05:35:47 UTC

(In reply to Daniel van Vugt from comment #12)
> My rationale for not minding SIGKILL is the assumption that Xwayland
> probably/shouldn't have anything to save to disk. I don't see a good case
> for letting it shut down "gracefully" with SIGTERM when that's possible to
> be ignored/hanging/crashing. All the difficulty and headache here is
> handling those cases where SIGTERM could be ignored intentionally or
> unintentionally.
> 
> But I don't mind. I've written code for both and frankly it's an unimportant
> part of the fix. Anything is better than the status quo.

The thing that comes to mind is being able to disconnect clients the right way. I don't know how it's done today, but using SIGTERM I imagine would allow X to disconnect them gracefully, instead of the clients thinking the server crashed.

Comment 14 Daniel van Vugt 2017-10-18 10:07:33 UTC

Created attachment 361792 [details] [review]
Allow-Xwayland-to-leave-core-dumps-on-real-crashes-v3.patch

Patch version 3. Now simpler and seemingly more reliable.

Comment 15 Marco Trevisan (Treviño) 2017-10-20 20:22:13 UTC

Review of attachment 361792 [details] [review]:

FYI, link [1] isn't publicly accessible, so I wouldn't mention that.

I think there's something similar on fedora that is accessible and linkable.

Comment 16 Jonas Ådahl 2017-10-21 02:08:50 UTC

Review of attachment 361792 [details] [review]:

No point in linking to a moving target (the errors.ubuntu.com link) in the commit message. In a year it'll hopefully serve no purpose at all anymore since all crashes will be fixed by then ;P

::: src/wayland/meta-xwayland.c
@@ +558,3 @@
                                       "-listen", "5",
                                       "-displayfd", "6",
                                       NULL);

style nit: You forgot to update the indentation here. FWIW, I also think it'd be clearer if we put each flag on its own line, so can fix that while at it.

@@ +606,3 @@
+{
+  /* Prevent ourselves from crashing when Xwayland exits. */
+  g_cancellable_cancel (manager->xserver_died_cancellable);

You don't handle the fact that the callback was cancelled in the callback handler. You need to finish the async call, check if it was cancelled, then handle that differently.

@@ +608,3 @@
+  g_cancellable_cancel (manager->xserver_died_cancellable);
+
+  meta_xwayland_shutdown_selection ();

Whats the reason you are not doing this in meta_xwayland_stop() now? Was shutting down the selection handler broken before this?

Comment 17 Daniel van Vugt 2017-10-23 09:18:17 UTC

* Yeah, I will provide a more static bug link in future.

* No, I didn't forget to update indentation. I was just confused as to what you meant by "style nit: fix the indentation" on something that's perfectly indented and I didn't change... Now I think I understand you're asking me to fix a pre-existing issue, and one that's not visible in your pasted explanation :)

* "You need to finish the async call": I'm not sure about that. "finish" doesn't clean up anything, it only seems to return a GError. But now I'm typing these words I wonder... are you assuming the finish is required to get the GError in order to ensure it's unref'd later? If so, it's not documented, but that would make sense...
https://developer.gnome.org/gio/stable/GSubprocess.html#g-subprocess-wait-async

* Moving meta_xwayland_shutdown_selection out of meta_xwayland_stop was required at the time because I found that:
  1. meta_screen_free in mutter caused the Xwayland process to crash.
  2. mutter would catch the Xwayland crash, and so itself also crash (in a racy parallel-process way).
  3. The mutter shutdown logic after meta_screen_free was (often) never executed.
However that might have been a characteristic of intermediate versions of my branch, so I'll test reverting that part. Something that was in the back of my mind already...

Comment 18 Jonas Ådahl 2017-10-23 09:29:50 UTC

(In reply to Daniel van Vugt from comment #17)
> * Yeah, I will provide a more static bug link in future.
> 
> * No, I didn't forget to update indentation. I was just confused as to what
> you meant by "style nit: fix the indentation" on something that's perfectly
> indented and I didn't change... Now I think I understand you're asking me to
> fix a pre-existing issue, and one that's not visible in your pasted
> explanation :)

The change you introduced made the indentation of existing lines incorrect. What I asked was to fix the indentation of lines that were now wrongly indented. There was no prexisting issue as far as I can see.

In other words, it looked like this

foo = bar(asdf,
          aoeu);

You changed it to

foo_baar = bar(asdf,
          aoeu);

while it should have been changed to

foo_baar = bar(asdf,
               aoeu);

> 
> * "You need to finish the async call": I'm not sure about that. "finish"
> doesn't clean up anything, it only seems to return a GError. But now I'm
> typing these words I wonder... are you assuming the finish is required to
> get the GError in order to ensure it's unref'd later? If so, it's not
> documented, but that would make sense...
> https://developer.gnome.org/gio/stable/GSubprocess.html#g-subprocess-wait-
> async

What I meant was that 'g_cancellable_cancel()' doesn't mean xserver_died() wont be called, but that when you "finish" it with g_subprocess_wait_async_finish() you can detect that you were cancelled by checking the returned GError code. If I read things correctly, we might now instead get called, and we'd check whether Xwayland exited cleanly, then notice it did, then call

      g_error ("Spurious exit of X Wayland server");

Which we don't want to.

Now, the async callback I supposed would only be called if we entered a GMainLoop again, which I suspect we won't but it is wrong none the less to depend on that.

Comment 19 Daniel van Vugt 2017-10-23 09:56:09 UTC

Oh, I see. You mean:

+  manager->proc = g_subprocess_launcher_spawn (launcher, &error,
                                       XWAYLAND_PATH, manager->display_name,

I didn't see it. But also that's because I was both keeping the diff small, and  because reintroducing the alignment would take it beyond 80 columns (which on my screen is a greater sin). But yeah I see now...

And yes, I had the same thought about g_error too. A proper cleanup is pointless if you assume g_error is there. It is there, but we shouldn't assume it is. Unfortunately fixing this means introducing nontrivial untested code (never reached due to g_error), which I would prefer not to. Still, easier to do as you ask than to argue further.

Comment 20 Jonas Ådahl 2017-10-23 12:35:28 UTC

(In reply to Daniel van Vugt from comment #19)
> Oh, I see. You mean:
> 
> +  manager->proc = g_subprocess_launcher_spawn (launcher, &error,
>                                        XWAYLAND_PATH, manager->display_name,
> 
> I didn't see it. But also that's because I was both keeping the diff small,
> and  because reintroducing the alignment would take it beyond 80 columns
> (which on my screen is a greater sin). But yeah I see now...

Going beyond is a great sin on my screen too :P but there are other ways within the coding style scheme that can mitigate that, such as a line break after the '=', or using temporary variables that replaces lines that "stick out".

> 
> And yes, I had the same thought about g_error too. A proper cleanup is
> pointless if you assume g_error is there. It is there, but we shouldn't
> assume it is. Unfortunately fixing this means introducing nontrivial
> untested code (never reached due to g_error), which I would prefer not to.
> Still, easier to do as you ask than to argue further.

It will become tested eventually, because we will at one point support stopping Xwayland 'mid-session', and then later restarting it. For that we must properly deal with the cancelled callback actually being invoked with the cancelled-error. If you don't care at all about dealing with those things, then having any cancellable there at all is pointless.

Comment 21 Daniel van Vugt 2017-10-24 01:45:08 UTC

It's no problem. I just haven't had time to finish the new patch while we deal with the flood of bug mail from last week's 17.10 release. Should calm down within a few days...

Comment 22 Daniel van Vugt 2017-10-24 08:52:28 UTC

Oh I see... you need to call finish from inside the callback. That's more complicated than any other async API I've seen. Also confusing when the existing code already did it wrong, plus the relevant documentation was not in or mentioned in the GSubprocess docs. But got there in the end, thanks.

Comment 23 Daniel van Vugt 2017-10-24 09:12:26 UTC

Created attachment 362163 [details] [review]
Allow-Xwayland-to-leave-core-dumps-v4.patch

Patch version 4.

Comment 24 Marco Trevisan (Treviño) 2017-10-24 09:28:52 UTC

Review of attachment 362163 [details] [review]:

::: src/wayland/meta-xwayland.c
@@ +397,3 @@
+
+  if (!g_subprocess_wait_finish (proc, result, &error) && error)
+    cancelled = g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED);

No need to log the error otherwise?

Indeed there will be an "X Wayland crashed" error after, but maybe it could be informative to use this error here...

PS: no need to check also for error validity, check it or the return value, they're guaranteed to be in sync.

Comment 25 Daniel van Vugt 2017-10-25 06:32:35 UTC

No, the logging already exists. Although we don't log what the error was in detail, I feel that would be asking for an enhancement unrelated to what we're really trying to do here.

Also, I would generally prefer to test error!=NULL before using it to avoid getting blamed for crashes. Although hiding such crashes could also be problematic. I'm on the fence.

Comment 26 Jonas Ådahl 2017-10-25 06:47:18 UTC

(In reply to Daniel van Vugt from comment #25)
> No, the logging already exists. Although we don't log what the error was in
> detail, I feel that would be asking for an enhancement unrelated to what
> we're really trying to do here.

This is about rearranging how we deal with Xwayland tear down so I'd say proper changes to logging is in scope. The most consistent way to handle it would be something like:

  if (!g_subprocess_wait_finish (proc, result, &error))
    {
      if (!g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED))
        {
          g_error ("Failed to finish waiting for Xwayland: %s",
                   error->message);
        }
      g_error_free (error);
      return;
    }

  ... regular path ...

> 
> Also, I would generally prefer to test error!=NULL before using it to avoid
> getting blamed for crashes. Although hiding such crashes could also be
> problematic. I'm on the fence.

A 'gboolean foo(..., GError **)' returning FALSE but not setting the error is invalid and should not ever be handled by the caller but be fixed in the callee.

Comment 27 Daniel van Vugt 2017-10-25 07:47:22 UTC

Created attachment 362238 [details] [review]
Allow-Xwayland-to-leave-core-dumps-v5.patch

Patch version 5.

Comment 28 Marco Trevisan (Treviño) 2017-10-25 13:29:32 UTC

Review of attachment 362238 [details] [review]:

I like this way better, thanks... I think we can go.

Comment 29 Marco Trevisan (Treviño) 2017-10-25 13:29:59 UTC

Review of attachment 362238 [details] [review]:

I like this way better, thanks... I think we can go.

Comment 30 Jonas Ådahl 2017-10-26 12:50:51 UTC

(In reply to Marco Trevisan (Treviño) from comment #29)
> Review of attachment 362238 [details] [review] [review]:
> 
> I like this way better, thanks... I think we can go.

+1

Comment 31 Marco Trevisan (Treviño) 2017-10-27 00:55:05 UTC

The following fix has been pushed:
054c25f wayland: Allow Xwayland to leave core dumps

Comment 32 Marco Trevisan (Treviño) 2017-10-27 00:55:16 UTC

Created attachment 362380 [details] [review]
wayland: Allow Xwayland to leave core dumps

For historical Xorg-reasons, Xwayland would disable its own core dumps by
default. This is a problem because Xwayland crashing is the biggest cause of
gnome-shell crashes [1][2], and we still have no idea why due to there being
no dumps from Xwayland. So enable core dumping from Xwayland.

https://bugzilla.gnome.org/show_bug.cgi?id=789086

[1] https://bugs.launchpad.net/bugs/1505409
[2] https://bugs.launchpad.net/bugs/1556601