GNOME Bugzilla – Bug 790502
Xwayland abort() when mutter/gnome-shell crashes
Last modified: 2017-11-30 16:04:42 UTC
Description: Since GNOME 3.26 I noticed an increase in bug reports downstream against Xwayland aborting. On further investigation, is shows that many of these reports against Xwayland turned out to be caused originally by a crash of the Wayland compositor, i.e. gnome-shell/mutter. Previously, when gnome-shell/mutter crashed, Xwayland would simply log an error and exit. Now it aborts() and save a core file, and users report that as a crash in Xwayland, while this is not. Turns out to be cause by commit 054c25f6 from bug 789086 which added "-core" to the Xwayland command line, meaning that now, when the compositor (gnome-shell) crashes and Xwayland calls FatalError(), it will core dump...
I think there is some confusion in bug 789086, "-core" in the Xserver is to make it generate a corefile in case of fatal error, not to be confused with a crash. https://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c?h=server-1.19-branch#n744 and: https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19-branch#n865 When Xwayland crashes for real (i.e. with a segfault or other signals, not when the compositor is dead and Xwayland calls FatalError()), it does generate a core file as expected (this is not the same as a DIX which would need "NoTrapSignal" for this): https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/common/xf86Init.c?h=server-1.19-branch#n300 But this is Xorg code, not Xwayland. Xwayland doesn't do any https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19-branch#n865(xf86SigWrapper). Besides, in my experience, Xwayland is not the main source of crashes of gnome-shell on Wayland, it's rather the opposite, gnome-shell crashes and Xwayland calls FatalError() because the Wayland file descriptor is dead. Can we consider reverting commit 054c25f6, it doesn't do what you think it does and falsely blame Xwayland for gnome-shell crashes.
(In reply to Olivier Fourdan from comment #1) > I think there is some confusion in bug 789086, "-core" in the Xserver is to > make it generate a corefile in case of fatal error, not to be confused with > a crash. > > > https://cgit.freedesktop.org/xorg/xserver/tree/os/utils.c?h=server-1.19- > branch#n744 > > and: > > > https://cgit.freedesktop.org/xorg/xserver/tree/os/log.c?h=server-1.19- > branch#n865 > > When Xwayland crashes for real (i.e. with a segfault or other signals, not > when the compositor is dead and Xwayland calls FatalError()), it does > generate a core file as expected (this is not the same as a DIX which would > need "NoTrapSignal" for this): Are you sure about this? I have an awkward diff (that comments out SIGSEGV and other crashy signals in the catch-signal code, because otherwise I end up only seeing the real crash go by "unnoticed" in the journal. And that is for real Xwayland crashes, not because gnome-shell crashed.
(In reply to Jonas Ådahl from comment #2) > Are you sure about this? I have an awkward diff (that comments out SIGSEGV No, you're right. > and other crashy signals in the catch-signal code, because otherwise I end > up only seeing the real crash go by "unnoticed" in the journal. And that is > for real Xwayland crashes, not because gnome-shell crashed. I think we'd better partially revert commit 054c25f6 (ie just remove "-core" from the command line, the rest of the patch remains interesting) because we do not want a core fie for every fatal error (which is what "-core" is about) and change Xwayland to do just like Xquartz, i.e. keep the various signals such as SIGSEGV to remain as default.
Is there no way to make reaching EOF on the wayland socket not a fatal error, just exit cleanly?
We could do that of course, but there are possibly more FatalError() out there. Ideally, we would exit with a FatalError() printing a backtrace but no core dump, but generate a core dump when we hit one of the SIGSEGV, SIGABRT, SIGILL, etc.
Basically, something like this in Xwayland: https://patchwork.freedesktop.org/patch/189275/
That is better that what we had before indeed. But I still don't fully grasp why we don't want core dumps on other FatalError() (except for wayland socket EOF). Don't we want to see those in coredumpctl/retrace as well?
(In reply to Jonas Ådahl from comment #7) > That is better that what we had before indeed. But I still don't fully grasp > why we don't want core dumps on other FatalError() (except for wayland > socket EOF). Don't we want to see those in coredumpctl/retrace as well? I don't think so, looking at the Xserver source code, FatalError() seems to be used mostly to terminate the server when something “external“ to the Xserver prevents it from running, like OOM errors, or even wrong command line options (see xwl_screen_init() for example).
(In reply to Olivier Fourdan from comment #8) > (In reply to Jonas Ådahl from comment #7) > > That is better that what we had before indeed. But I still don't fully grasp > > why we don't want core dumps on other FatalError() (except for wayland > > socket EOF). Don't we want to see those in coredumpctl/retrace as well? > > I don't think so, looking at the Xserver source code, FatalError() seems to > be used mostly to terminate the server when something “external“ to the > Xserver prevents it from running, like OOM errors, or even wrong command > line options (see xwl_screen_init() for example). We won't ever see regular OOM errors, but all the others seems to be relevant to catch, like shader issues, glamour EGL issues, and other assert like fatals.
(In reply to Jonas Ådahl from comment #9) > We won't ever see regular OOM errors, but all the others seems to be > relevant to catch, like shader issues, glamour EGL issues, and other assert > like fatals. In that case we can still keep "-core" but we'll need to amend Xwayland to not call FatalError() on wayland socket errors. I shall also check with xorg-devel the actual intended use of FatalError() in the Xserver, it seems to be used for different things.
(In reply to Olivier Fourdan from comment #10) > (In reply to Jonas Ådahl from comment #9) > > We won't ever see regular OOM errors, but all the others seems to be > > relevant to catch, like shader issues, glamour EGL issues, and other assert > > like fatals. > > In that case we can still keep "-core" but we'll need to amend Xwayland to > not call FatalError() on wayland socket errors. Yea. Maybe something like: CloseWellKnownConnections(); OsCleanup(1); exit(1); ? That is what Xephyr seems to do in ephyrXcbProcessEvents() when it hits an error from the host X server.
Closing as "notgnome" since patches have been submitted to the Xserver/Xwayland code (https://patchwork.freedesktop.org/series/34091/)