Bug 790502 – Xwayland abort() when mutter/gnome-shell crashes

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 790502 - Xwayland abort() when mutter/gnome-shell crashes


Summary:	Xwayland abort() when mutter/gnome-shell crashes


Status:	RESOLVED NOTGNOME

Product:	mutter
Classification:	Core
Component:	wayland
Version:	3.26.x
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	mutter-maint
QA Contact:	mutter-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-11-17 13:49 UTC by Olivier Fourdan
Modified:	2017-11-30 16:04 UTC

See Also:	789086
GNOME target:	---
GNOME version:	---

Description Olivier Fourdan 2017-11-17 13:49:41 UTC

Description:

Since GNOME 3.26 I noticed an increase in bug reports downstream against Xwayland aborting.

On further investigation, is shows that many of these reports against Xwayland turned out to be caused originally by a crash of the Wayland compositor, i.e. gnome-shell/mutter.

Previously, when gnome-shell/mutter crashed, Xwayland would simply log an error and exit. Now it aborts() and save a core file, and users report that as a crash in Xwayland, while this is not.

Turns out to be cause by commit 054c25f6 from bug 789086 which added "-core" to the Xwayland command line, meaning that now, when the compositor (gnome-shell) crashes and Xwayland calls FatalError(), it will core dump...

Comment 1 Olivier Fourdan 2017-11-17 14:17:06 UTC

I think there is some confusion in bug 789086, "-core" in the Xserver is to make it generate a corefile in case of fatal error, not to be confused with a crash.

  https://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c?h=server-1.19-branch#n744

and:

  https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19-branch#n865

When Xwayland crashes for real (i.e. with a segfault or other signals, not when the compositor is dead and Xwayland calls FatalError()), it does generate a core file as expected (this is not the same as a DIX which would need "NoTrapSignal" for this):

  https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/common/xf86Init.c?h=server-1.19-branch#n300

But this is Xorg code, not Xwayland. Xwayland doesn't do any https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19-branch#n865(xf86SigWrapper).

Besides, in my experience, Xwayland is not the main source of crashes of gnome-shell on Wayland, it's rather the opposite, gnome-shell crashes and Xwayland calls FatalError() because the Wayland file descriptor is dead.

Can we consider reverting commit 054c25f6, it doesn't do what you think it does and falsely blame Xwayland for gnome-shell crashes.

Comment 2 Jonas Ådahl 2017-11-20 08:35:11 UTC

(In reply to Olivier Fourdan from comment #1)
> I think there is some confusion in bug 789086, "-core" in the Xserver is to
> make it generate a corefile in case of fatal error, not to be confused with
> a crash.
> 
>  
> https://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c?h=server-1.19-
> branch#n744
> 
> and:
> 
>  
> https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19-
> branch#n865
> 
> When Xwayland crashes for real (i.e. with a segfault or other signals, not
> when the compositor is dead and Xwayland calls FatalError()), it does
> generate a core file as expected (this is not the same as a DIX which would
> need "NoTrapSignal" for this):

Are you sure about this? I have an awkward diff (that comments out SIGSEGV and other crashy signals in the catch-signal code, because otherwise I end up only seeing the real crash go by "unnoticed" in the journal. And that is for real Xwayland crashes, not because gnome-shell crashed.

Comment 3 Olivier Fourdan 2017-11-20 09:11:38 UTC

(In reply to Jonas Ådahl from comment #2)
> Are you sure about this? I have an awkward diff (that comments out SIGSEGV

No, you're right.

> and other crashy signals in the catch-signal code, because otherwise I end
> up only seeing the real crash go by "unnoticed" in the journal. And that is
> for real Xwayland crashes, not because gnome-shell crashed.

I think we'd better partially revert commit 054c25f6 (ie just remove "-core" from the command line, the rest of the patch remains interesting) because we do not want a core fie for every fatal error (which is what "-core" is about) and change Xwayland to do just like Xquartz, i.e. keep the various signals such as SIGSEGV to remain as default.

Comment 4 Jonas Ådahl 2017-11-20 09:13:39 UTC

Is there no way to make reaching EOF on the wayland socket not a fatal error, just exit cleanly?

Comment 5 Olivier Fourdan 2017-11-20 09:50:40 UTC

We could do that of course, but there are possibly more FatalError() out there.

Ideally, we would exit with a FatalError() printing a backtrace but no core dump, but generate a core dump when we hit one of the SIGSEGV, SIGABRT, SIGILL, etc.

Comment 6 Olivier Fourdan 2017-11-20 10:15:03 UTC

Basically, something like this in Xwayland: https://patchwork.freedesktop.org/patch/189275/

Comment 7 Jonas Ådahl 2017-11-20 10:30:20 UTC

That is better that what we had before indeed. But I still don't fully grasp why we don't want core dumps on other FatalError() (except for wayland socket EOF). Don't we want to see those in coredumpctl/retrace as well?

Comment 8 Olivier Fourdan 2017-11-20 13:23:05 UTC

(In reply to Jonas Ådahl from comment #7)
> That is better that what we had before indeed. But I still don't fully grasp
> why we don't want core dumps on other FatalError() (except for wayland
> socket EOF). Don't we want to see those in coredumpctl/retrace as well?

I don't think so, looking at the Xserver source code, FatalError() seems to be used mostly to terminate the server when something “external“ to the Xserver prevents it from running, like OOM errors, or even wrong command line options (see xwl_screen_init() for example).

Comment 9 Jonas Ådahl 2017-11-21 03:52:26 UTC

(In reply to Olivier Fourdan from comment #8)
> (In reply to Jonas Ådahl from comment #7)
> > That is better that what we had before indeed. But I still don't fully grasp
> > why we don't want core dumps on other FatalError() (except for wayland
> > socket EOF). Don't we want to see those in coredumpctl/retrace as well?
> 
> I don't think so, looking at the Xserver source code, FatalError() seems to
> be used mostly to terminate the server when something “external“ to the
> Xserver prevents it from running, like OOM errors, or even wrong command
> line options (see xwl_screen_init() for example).

We won't ever see regular OOM errors, but all the others seems to be relevant to catch, like shader issues, glamour EGL issues, and other assert like fatals.

Comment 10 Olivier Fourdan 2017-11-21 08:12:52 UTC

(In reply to Jonas Ådahl from comment #9)
> We won't ever see regular OOM errors, but all the others seems to be
> relevant to catch, like shader issues, glamour EGL issues, and other assert
> like fatals.

In that case we can still keep "-core" but we'll need to amend Xwayland to not call FatalError() on wayland socket errors.

I shall also check with xorg-devel the actual intended use of FatalError() in the Xserver, it seems to be used for different things.

Comment 11 Jonas Ådahl 2017-11-21 08:24:12 UTC

(In reply to Olivier Fourdan from comment #10)
> (In reply to Jonas Ådahl from comment #9)
> > We won't ever see regular OOM errors, but all the others seems to be
> > relevant to catch, like shader issues, glamour EGL issues, and other assert
> > like fatals.
> 
> In that case we can still keep "-core" but we'll need to amend Xwayland to
> not call FatalError() on wayland socket errors.

Yea. Maybe something like:

                CloseWellKnownConnections();
                OsCleanup(1);
                exit(1);

?

That is what Xephyr seems to do in ephyrXcbProcessEvents() when it hits an error from the host X server.

Comment 12 Olivier Fourdan 2017-11-30 16:04:42 UTC

Closing as "notgnome" since patches have been submitted to the Xserver/Xwayland code (https://patchwork.freedesktop.org/series/34091/)