GNOME Bugzilla – Bug 672419
shell crashed at login and ended up in a dead end session
Last modified: 2021-06-14 18:22:58 UTC
Created attachment 210131 [details] screenshot I am testing a jhbuild session and after logging in I see nothing but an empty background. It is entirely possible that I've misconfigured something but even so we should *never* end up in a state like this where there is no way out.
Created attachment 210132 [details] xsession errors trimmed of noisy warnings
I've faced this too, today. The problem in my case was a 3.3.92 shell + gcr without introspection data. So, I guess to reproduce, just remove the gcr gir
gnome-session[23923]: DEBUG(+): GsmXSMPClient: getting restart style
gnome-session[23923]: DEBUG(+): GsmManager: autorestart not set, not restarting application
Is gnome-shell telling us to not autorestart?
Likely caused by http://git.gnome.org/browse/gnome-shell/commit/?id=9bb9999b46cc2c759d4e0a5c5f7515e32eafc0f0 which was an attempt to fix bug 648384.
That a component is set to autorestart or not should certainly not affect whether we show a fail whale ?!
(In reply to comment #5)
> That a component is set to autorestart or not should certainly not affect
> whether we show a fail whale ?!
Because if we don't need autorestart, then it means we don't care if the component goes away (also, fail whale is triggered if the app crashes twice in a minute, which can't happen without autorestart).
(In reply to comment #6)
> (In reply to comment #5)
> > That a component is set to autorestart or not should certainly not affect
> > whether we show a fail whale ?!
>
> Because if we don't need autorestart, then it means we don't care if the
> component goes away (also, fail whale is triggered if the app crashes twice in
> a minute, which can't happen without autorestart).
My expectation would be that we show the fail whale right away for a required component that is not set to autorestart, and show it after the second crash for one that is set to autorestart. But then we are back in the alt-f2-r-fail-whales territory where this thing started...
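For readers following along, the expectation stated in the comment above can be sketched as a small decision function. This is only an illustration of the proposal, not gnome-session code: the function name, its parameters and the 60-second window (taken from "twice in a minute") are assumptions made for the example.

```python
from typing import Sequence


def should_show_fail_whale(required: bool,
                           autorestart: bool,
                           crash_times: Sequence[float],
                           window: float = 60.0) -> bool:
    """Hypothetical policy: decide whether to show the fail whale.

    crash_times holds the timestamps (in seconds, oldest first) at which
    the component has died so far.
    """
    if not required or not crash_times:
        # Non-required components never trigger the fail whale here.
        return False
    if not autorestart:
        # Required but not restarted: the first death is already fatal.
        return True
    # Required and autorestarted: fail only on two deaths within the window.
    latest = crash_times[-1]
    recent = [t for t in crash_times if latest - t <= window]
    return len(recent) >= 2
```

With this sketch, a required component without autorestart fails on its first death, while an autorestarted one fails only when a second death follows within a minute of the first.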
I just noticed this:
gnome-session[23923]: DEBUG(+): GsmAutostartApp: (pid:24094) done (status:1)
This actually means that gnome-shell exited properly (as in WIFEXITED), and didn't crash; that's why it's not restarted. A real crash would result in automatic autorestart of the required component.
an exit status of 1 is effectively the same as a crash for most X apps, so probably should be treated the same as a crash.
(In reply to comment #9)
> an exit status of 1 is effectively the same as a crash for most X apps, so
> probably should be treated the same as a crash.
My point is not about the exit status. It's that if this is a crash, WIFEXITED() should return false, and WIFSIGNALED() should return true. Why would this be different for most X apps?
because when X crashes the app because of BadDrawable (or whatever) it does exit(1) instead of raise(SIGABRT). Same with libdbus, when it crashes the app it does exit(1) too.
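The distinction discussed in the last few comments can be reproduced outside of gnome-session. The sketch below is only an illustration, assuming a POSIX system: it uses Python's os module, which exposes the same WIFEXITED/WIFSIGNALED checks as the C macros, and the helper names run_and_wait and describe are made up for the example.

```python
import os
import sys


def run_and_wait(code: str) -> int:
    """Run a Python one-liner in a child process and return the raw wait status."""
    pid = os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", code])
    _, status = os.waitpid(pid, 0)
    return status


def describe(status: int) -> str:
    """Classify a raw wait status the way the WIF* macros do in C."""
    if os.WIFSIGNALED(status):
        return "signaled"
    if os.WIFEXITED(status):
        return f"exited({os.WEXITSTATUS(status)})"
    return "other"


# A child that bails out with exit(1), as described above for fatal X errors.
print(describe(run_and_wait("import sys; sys.exit(1)")))
# A child that aborts, which terminates it with SIGABRT.
print(describe(run_and_wait("import os; os.abort()")))
```

The first child is reported as a normal exit with status 1 and the second as signaled, which is why a session manager that only looks at WIFSIGNALED would not count the exit(1) case as a crash.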
Ah, didn't know that. I'm a bit reluctant to still assume "exit(1) == crash", though, as it's perfectly valid exit status otherwise. Also, most crashes I've experienced during the last 10 years or so are app-specific, and not related to BadDrawable (or similar X errors), so I wouldn't think it's that much of an issue here. If people care really strongly about this, though, we can try this out -- but I don't think it's wise to change this just before a stable release :-)
Created attachment 210446 [details] [review] gsm: Properly move to next phase if an app dies on startup There is no reason to wait for the timeout if an app dies and fails to be restarted. Also, only do this if we're in a startup phase.
Created attachment 210447 [details] [review] gsm: Share code to restart an app
Created attachment 210448 [details] [review] gsm: Stop disconnecting "registered" signal for GsmApp The reason we were doing this is that the code to move to the next phase when an app is registered was not checking for the current phase. This is done now.
Created attachment 210449 [details] [review] gsm: Pass exit code in "exited" signal of GsmApp
Created attachment 210450 [details] [review] gsm: Consider that a required component that exits with 1 has crashed This way, we will attempt to restart it.
Created attachment 210451 [details] [review] gsm: Remove duplicated code
Created attachment 210452 [details] [review] gsm: Pass signal id in "died" signal of GsmApp
Created attachment 210453 [details] [review] gsm: On an app crash, only depend on autorestart for apps with a client If an app has no registered client, the autorestart behavior cannot work (since it occurs when the client gets disconnected). So if there's no registered client, just proceed with a manual restart.
This patch series is an attempt to fix this; it will only consider that exit(1) = crash for required components. It needs some testing, though -- I've barely played with it.
Comment on attachment 210453 [details] [review] gsm: On an app crash, only depend on autorestart for apps with a client This part needs to be clever: there might be a client already being started but not registered yet.
So, any opinion on pushing this for 3.4.0 (at least the patches up to attachment 210450 [details] [review])? Again, I'm a bit reluctant to push this at this point because I'd prefer to have some real testing over a longer period of time. FWIW, another thing we could do to reduce the risks would be to consider that exit(1)=crash only during the startup phases.
seems tight for 3.4.0
So I've released 3.4.0 without this to be on the safe side, and then pushed the code reorg in the patch series that shouldn't affect the behavior of gnome-session. We're left with attachment 210450 [details] [review] (comment 17). I'll likely push this to 3.5.x, but I'm not sure this will get enough testing to help decide whether we want this in 3.4.1.
*** Bug 674840 has been marked as a duplicate of this bug. ***
Comment on attachment 210450 [details] [review] gsm: Consider that a required component that exits with 1 has crashed Attachment 210450 [details] pushed as c6e23f8 - gsm: Consider that a required component that exits with 1 has crashed
*** Bug 645928 has been marked as a duplicate of this bug. ***
Today I'm getting this:
gnome-session[12056]: DEBUG(+): GsmAutostartApp: (pid:12238) done (status:127)
gnome-session[12056]: DEBUG(+): App gnome-shell.desktop exited with 127
And ending up in a dead end.
And do you know why gnome-shell exits with 127? I don't think it's reasonable to consider that all exit codes != 0 mean a crash in general, but maybe we can do that for the shell?
A mismatch between mutter and the shell. But the fact remains that we never want the shell to fail and not show the fail whale.
127 means "command not found" to the shell, so we're probably running the command through a shell and the program wasn't installed where it thought it was. I don't think it's "wrong" to consider anything but 0 as a failure fwiw. certainly the test command and the if command etc treat them all as false.
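The 127 convention mentioned above is easy to check by hand. The following is a small sketch, assuming a POSIX sh is available on the PATH; the helper name exit_status_of and the command name used in the example are invented for illustration and are not related to how gnome-session launches components.

```python
import subprocess


def exit_status_of(command: str) -> int:
    """Run a command line through sh and return the exit status sh reports."""
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# A command that does not exist anywhere on the PATH (made-up name).
print(exit_status_of("definitely-not-a-real-command-672419"))
# A command that exists and succeeds.
print(exit_status_of("true"))
```

The missing command yields 127 and the successful one yields 0, so a launcher that runs its command through sh sees 127 when the binary is not where it expected.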
"the shell" in comment 32 meant e.g. bash, yay for namespace clashes
Created attachment 218964 [details] [review] manager: treat non-0 exit status for required components as fail The only exit status that truly, definitely means 'success' is 0. Anything else is almost certainly a failure of some sort. For required components, we can be extra sure that's true, so enforce it there. This avoids cases where exec() fails in a subshell, and other cases.
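As a rough illustration of the rule this patch describes, the check could look like the sketch below. It is not the actual gnome-session code: the AppExit type, its fields and the function name are assumptions made for the example, and the handling of non-required components simply follows the earlier comments in this bug (a normal exit is not treated as a crash).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppExit:
    """Hypothetical record of how a session component ended."""
    required: bool
    exit_code: Optional[int] = None    # set when the process exited normally
    term_signal: Optional[int] = None  # set when a signal killed the process


def counts_as_failure(app_exit: AppExit) -> bool:
    """Return True if the exit should be handled like a crash."""
    if app_exit.term_signal is not None:
        # Killed by a signal: always a crash.
        return True
    if app_exit.required:
        # Required component: only exit status 0 means success.
        return app_exit.exit_code != 0
    # Non-required component that exited on its own: not a crash.
    return False
```

Under this sketch, a required gnome-shell exiting with 1 or 127 is handled like a crash, while the same exit codes from an optional component are ignored.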
Comment on attachment 218964 [details] [review] manager: treat non-0 exit status for required components as fail (pushed as e79b73a3)
All patches committed. Any specific reasons / undone work to not close this ticket?
I believe there's still one patch pending that I need to look at and finish up.
I'm just wondering if this is what I'm seeing on a Fedora 18 session. I get GDM coming up and it looks fine, then the desktop background I chose appears after logging in... then after about 30 seconds of no activity I get an "Oops, something went wrong" screen telling me to log out. Interestingly, it shows a cut-out on the back oops screen where the gnome-shell top bar should appear, but it instead shows the wallpaper through. I was in the middle of delivering a Linux training session, which makes this look doubly bad. I can't figure out what's wrong at the moment and have taken to launching nautilus and metacity --replace from a separate tty and then using ALT+SPACE and selecting 'Close' to remove the Oops screen in order to get at a nautilus window. I'm happy to post logs or try things as I really need to be able to use this for work.
Created attachment 238789 [details] var log messages I didn't get an .xsession-errors file, but I did see a lot about gnome-screensaver and shell stuff in my /var/log/messages, so I've attached that in the hope of resolving this.
might be related to: https://bugzilla.gnome.org/show_bug.cgi?id=727817
Version 3.12.2: when logging in with a user on GNOME Wayland, it saves the session for the next user. This time I got a blank screen after reboot; on autologin with a different user, still a blank screen. I had to kill the X server via a tty.
Removing the GNOME 3.4 target. Is the last patch still needed?
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version of gnome-session, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/gnome-session/-/issues/ Thank you for your understanding and your help.