GNOME Bugzilla – Bug 752388
gdm doesn't always restart cleanly
Last modified: 2015-07-14 21:02:52 UTC
Sometimes when gdm is stopped, it's not done so in an orderly manner. The main daemon process dies before all its children have been reaped. This leads to a race that makes restart sometimes fail.
Created attachment 307431 [details] [review] Make gdm-session-worker exit cleanly Calling gdm_session_stop_conversation() in gdm_launch_environment_stop() sends a SIGTERM to gdm-session-worker without waiting for it to die. The next step is calling gdm_session_close() to close the session, which stops all conversations of that session object, sending a 2nd SIGTERM to gdm-session-worker, this time waiting on its PID. On gdm-session-worker side, the first SIGTERM is caught by on_shutdown_signal(), its custom SIGTERM handler, which quits the mainloop and unrefs the worker object. Quiting the mainloop replaces the custom SIGTERM handler with the system default one (exit immediately). During the worker object class finalization gdm-session-worker may receive the 2nd SIGTERM, which leads to its immediate termination, without waiting for its children, which in turn leads to the main gdm process exit. Since systemd relies on the SIGCHLD from the main gdm process to tell when the service has stopped, this behavior breaks any unit that has a Conflicts=gdm.service entry and relies on the X server not being around when it is started. This commit removes the call to gdm_session_stop_conversation() in gdm_launch_environment_stop() and leaves it to be stopped in gdm_session_close(). [endlessm/eos-shell#4921]
Created attachment 307432 [details] [review] gdm-x-session: Do not cancel cancelable on SIGTERM <dsd> Jasper: can you explain the logic bug? just curious <Jasper> dsd, GCancellable is a way of cancelling an ongoing operation. <Jasper> dsd, so we start up a wait_async, which says "call this callback when the subprocess exits". And then when we get a SIGTERM, we fire off the cancellable, saying "no no, don't actually wait for to finish, let's just cancel that now" <Jasper> dsd, the callback is called, but wait_finish returns FALSE, meaning that the wait was cancelled. And then the callback says "oh, cool, we're done here" and clears the x_session object. <Jasper> dsd, I'd have to inspect closer to determine a proper fix should we still cancel the wait but not actually clear the object, so we wait synchronously in cleanup?) <Jasper> Depends on what else uses the cancellable. <dsd> ahh <dsd> got it <Jasper> dsd, the other thing that should fix it would be moving the g_clear_object above the "out" label here: https://git.gnome.org/browse/gdm/tree/daemon/gdm-x-session.c?h=gnome-3-16#n185 <Jasper> And doing that for all of session / bus / x subprocesses. <Jasper> Maybe that's the cleaner fix. <Jasper> Then we could also remove the main_loop_quit from the sigterm handler as well and simply relying on the cancellable. [endlessm/eos-shell#4921]
Review of attachment 307431 [details] [review]: yea seems right, thanks for the analysis and fix
Review of attachment 307432 [details] [review]: I agree with Jasper that the second fix is better, since in theory the cancellable can be used for other things. I don't think it's okay to drop the main_loop_quit since the handlers that call g_main_loop_quit may not be hooked up yet at the time SIGTERM is called
Created attachment 307435 [details] [review] gdm-x-session: kill subprocesses on sigterm <dsd> Jasper: can you explain the logic bug? just curious <Jasper> dsd, GCancellable is a way of cancelling an ongoing operation. <Jasper> dsd, so we start up a wait_async, which says "call this callback when the subprocess exits". And then when we get a SIGTERM, we fire off the cancellable, saying "no no, don't actually wait for to finish, let's just cancel that now" <Jasper> dsd, the callback is called, but wait_finish returns FALSE, meaning that the wait was cancelled. And then the callback says "oh, cool, we're done here" and clears the x_session object. <Jasper> dsd, I'd have to inspect closer to determine a proper fix should we still cancel the wait but not actually clear the object, so we wait synchronously in cleanup?) <Jasper> Depends on what else uses the cancellable. <dsd> ahh <dsd> got it <Jasper> dsd, the other thing that should fix it would be moving the g_clear_object above the "out" label here: https://git.gnome.org/browse/gdm/tree/daemon/gdm-x-session.c?h=gnome-3-16#n185 <Jasper> And doing that for all of session / bus / x subprocesses. <Jasper> Maybe that's the cleaner fix. Based on a patch by João Paulo Rechi Vita <jprvita@endlessm.com> [endlessm/eos-shell#4921]
Attachment 307431 [details] pushed as 1871802 - Make gdm-session-worker exit cleanly Attachment 307435 [details] pushed as 90a095e - gdm-x-session: kill subprocesses on sigterm
Created attachment 307436 [details] [review] gdm-wayland-session: kill subprocesses on sigterm This is like commit 90a095e1a3ef26391da835ebe155deae26eeac5f but for wayland sessions. (probably should deduplicate that code at some point, but that's a project for another day)
Attachment 307436 [details] pushed as 8f21c90 - gdm-wayland-session: kill subprocesses on sigterm