GNOME Bugzilla – Bug 419301
GDM cannot be restarted in Ubuntu by pressing ctrl-alt-backspace
Last modified: 2008-04-11 19:28:57 UTC
Please describe the problem:
Symptom: Pressing ctrl-alt-backspace once logged in to the Ubuntu desktop doesn't restart gdm.
Debugging: Stopping gdm from a terminal window with sudo /etc/init.d/gdm stop, or directly with start-stop-daemon, does not stop gdm.
Work-around: Open a terminal -> log in -> restart with sudo /etc/init.d/gdm restart, or with sudo /etc/init.d/gdm stop; sudo /etc/init.d/gdm start
Steps to reproduce:
1. Press ctrl-alt-backspace in an Ubuntu session (sometimes also in the login screen)
2. OR enter sudo /etc/init.d/gdm restart in a terminal window
Actual results: The system exits graphics mode, goes back to displaying the login messages, and stops (no crash, no oops, just sits there).
Expected results: GDM should restart.
Does this happen every time? Yes (sometimes also in the login screen).
Other information: Before anybody asks, ctrl-alt-backspace is enabled ;)
I don't know if this helps, but the following processes are still reported as running after the attempt to stop gdm:
    6372 00:00:00 gdm
    6376 00:00:08 gdm
    6434 00:00:00 gconfd-2
    6460 00:00:00 bonobo-activati
    6505 00:00:00 evolution-data-
All other GNOME processes (including Xorg and x-session-manager) are successfully killed.
I suspect this is not a GDM bug, since ctrl-alt-backspace is managed by the Xserver itself, not by GDM. Hitting ctrl-alt-backspace will kill the Xserver, which causes GDM to restart as a side effect. You might need this section in your xorg.conf file:
    Section "ServerFlags"
        Option "HandleSpecialKeys" "Always"
    EndSection
And I think there is an X option where you can specify whether this feature is on or not, so you might check the Xserver command GDM is using and verify that it isn't being disabled. Perhaps Ubuntu turned off this feature by default because it can be abused and can add security risk. Or perhaps your default xorg.conf file is just missing the needed configuration. Since the xorg.conf file may be generated on the fly when you first install the system, it might not be set as expected for certain graphics cards.
Please confirm if this is your problem. I will close the bug if I don't hear back after a while. If you think this is truly a GDM bug, let me know. Perhaps I misunderstand what you are saying?
Hi Brian, I really appreciate your superfast reply! I knew this was bound to be asked :) This is why I added the following in my report: <quote> Other information: Before anybody asks, ctrl-alt-backspace is enabled ;) <endquote> I'm not ruling out a problem with the X-server, but this doesn't explain why gdm cannot be stopped, even by start-stop-daemon and why the restart works in a console!?
BTW:
>ctrl-alt-backspace will kill the Xserver
>which causes GDM to restart as a side effect.
The Xserver is indeed killed (see my comment #1) but gdm does not restart. This might be a clue?
Yes, this does help me understand what problem you are seeing. It sounds like GDM is somehow not recognizing that the server has died, and doesn't know to remanage the display. I'd really need to take a look at your GDM debug output to get a feel for why this might be happening. Turn on Enable=true in the [debug] section of your GDM configuration file, and share with me the debug messages that get sent to your system log (normally /var/log/messages or /var/adm/messages). These are the messages that say "gdm" in them (since other programs also send messages to the system log - I don't need to see those, or the whole file).
Note - you should send me the last few dozen messages immediately after you cause the problem to happen. There probably will be some messages about GDM noticing the server died and failing to restart or something?
Well, I get gdm-related messages in daemon.log and syslog; they fill up quite fast if I'm not quick to kill gdm. Immediately after attempting the restart, I have these messages:
    Mar 18 00:08:06 desktop gdm[5066]: gdm_slave_session_stop: cesare on :0
    Mar 18 00:08:06 desktop gdm[5066]: Fatal X error detected. Ignoring same during session shut down.
    Mar 18 00:08:06 desktop gdm[5066]: gdm_slave_session_stop: back here from xioerror
The last two are repeated ad libitum until either I kill GDM or my HD is filled..... Do we have an endless loop bug!?
This is really weird. I haven't heard anybody else complain of this sort of problem. It looks like the messages you are seeing are coming from gdm_slave_ignore_xioerror_handler and from gdm_slave_session_stop. The gdm_slave_ignore_xioerror_handler function is registered with the Xserver with XSetIOErrorHandler. Right after the 2nd message, the code should try to call gdm_server_whack_clients, which unfortunately doesn't print any debug messages. I wonder if something is going wrong during the whack process.
Would you be able to add some gdm_debug messages to this function to see if it is failing here? If it isn't failing here, perhaps you could add some debug messages to gdm_slave_session_stop after the gdm_server_whack_clients call to see how far the code gets before it fails? It seems not to get far enough to call the PostSession script, so it probably isn't getting too far. Perhaps something weird is going on with the Xserver where it is sending multiple xioerror signals and causing the code to fall into a loop, so it might be useful to see if there is anything funny going on with your Xserver.
Created attachment 84876 [details] Modified gdm_server_whack_client
Dear Brian, I have followed your suggestions. In the attachment you will see where I inserted the debug messages. If I alt-ctrl-backspace from the login screen everything is OK: the 14 children of screen 0 are successfully killed and gdm restarts. If I do it from the gdm session, I get the following (as usual, rapidly filling up the log file):
    Mar 19 13:31:31 desktop gdm[6883]: gdm_slave_session_stop: cesare on :0
    Mar 19 13:31:31 desktop gdm[6883]: gdm_server_whack_clients: Entering
    Mar 19 13:31:31 desktop gdm[6883]: gdm_server_whack_clients: Processing screen 0
    Mar 19 13:31:31 desktop gdm[6883]: Fatal X error detected. Ignoring same during session shut down.
    Mar 19 13:31:31 desktop gdm[6883]: gdm_slave_session_stop: back here from xioerror
    Mar 19 13:31:31 desktop gdm[6883]: gdm_server_whack_clients: Entering
    Mar 19 13:31:31 desktop gdm[6883]: Fatal X error detected. Ignoring same during session shut down.
    Mar 19 13:31:31 desktop gdm[6883]: gdm_slave_session_stop: back here from xioerror
The last 3 get repeated forever.
>Perhaps something weird with the Xserver is going on where it is sending
>multiple xioerror signals and causing the code to fall into a loop, so it might
>be useful to see if there is anything funny going on with your Xserver.
You could be nailing it there. Any suggestion on the best way to do it? Many thanks for your support on this issue. I've seen several people in Ubuntu suffer from it.
>Perhaps something weird with the Xserver is going on where it is sending
>multiple xioerror signals and causing the code to fall into a loop
Just to have a confirmation of this, I modified gdm_slave_ignore_xioerror_handler as follows:
    gdm_slave_ignore_xioerror_handler (Display *disp)
    {
            static int count = 0;
            gdm_debug ("Fatal X error detected. Ignoring same during session shut down.");
            Longjmp (ignore_xioerror_jmp, ++count);
    }
and gdm_slave_session_stop to display it. Indeed, I get a series of:
    Mar 19 16:08:00 desktop gdm[5208]: gdm_slave_session_stop: back here from xioerror 43751
That's 43751 in 7 sec :) I really like the comment:
    /* xioerror will cause this to drop back into whack_clients, but I think that is okay because I haven't seen it do so more than once */
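The loop counted in this comment can be reproduced without an X server. What follows is a minimal sketch, not GDM's code: every name in it is hypothetical, plain setjmp/longjmp stand in for the jump used in slave.c, and a simple counter stands in for a connection that keeps failing. It assumes only what the thread reports, namely that the handler jumps back to a point just before the whack call, so each new error re-enters it.

```c
#include <setjmp.h>

/* Hypothetical stand-ins; none of these are GDM or Xlib symbols. */
static jmp_buf retry_point;
static int errors_left;   /* how many more calls into the fake server fail */

/* Plays the role of the "ignore" I/O error handler: jump back and retry. */
static void fake_ioerror_handler(void)
{
    longjmp(retry_point, 1);
}

/* Plays the role of the whack step: talks to a server that may be dead. */
static void fake_whack_clients(void)
{
    if (errors_left > 0) {
        errors_left--;
        fake_ioerror_handler();   /* does not return */
    }
}

/*
 * Returns how many times the whack step was entered.
 * `cap` exists only so that the sketch terminates; the code described
 * in the thread has no such limit.
 */
int count_whack_attempts(int failing_calls, int cap)
{
    volatile int attempts = 0;    /* volatile: must survive the longjmp */

    errors_left = failing_calls;
    setjmp(retry_point);          /* first pass and every jump land here */
    if (attempts >= cap)
        return attempts;
    attempts++;
    fake_whack_clients();
    return attempts;
}
```

With one failing call, count_whack_attempts(1, 100) returns 2: one retry, then success, which is the case the source comment expects. With a server that fails on every call, the function only stops at the cap, which matches the tens of thousands of repeats seen in the log.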
As a final check, I tried to disable the error handler (after the first error is trapped) by adding the following line in the default branch of the jump set point in gdm_slave_session_stop: XSetIOErrorHandler (ignore_xerror_handler); GDM restarts without any problem.
I'm glad to hear that you have found a workaround that gets things working for you, but I think we need more discussion before we decide if your idea of just removing XSetErrorHandler is the right way to fix this.
Shouldn't gdm_server_whack_clients kill all running Xclients (note whack_clients is called before XSetErrorHandler), so isn't it weird to get an XIOError if the clients are all dead? Maybe your clients are not exiting or something, indicating a bug in the whack_clients code, or perhaps we should call XSetErrorHandler after the sched_yield? Does moving the XSetErrorHandler down in the logic make a difference? Perhaps we should set it after calling PostSession or something? If moving this call farther down makes a difference, this might help us track down exactly what is causing the problem. Perhaps the PostSession script is calling some X command which fails (sort of assuming that X is in a reasonable state when it isn't).
Note that we call XSetHandler (ignore_xerror_handler) 49 lines farther down, so perhaps we shouldn't be caring about xioerrors here? It would be nice if we had a bit better understanding of why the Xserver is sending so many XIOErrors, and what code is triggering them. As I say, if you could experiment by moving the call around and see if it makes a change in behavior, that would be useful.
>I'm glad to hear that you have found a workaround that gets things working for
>you, but I think we need more discussion before we decide if you idea of just
>removing XSetErrorHandler is the right way to fix this.
Indeed, that was only meant to check if our hypotheses were correct. I did some more investigation:
1) By just deleting the gdm_server_whack_clients call, alt-ctrl-bs works flawlessly. Final confirmation that the multiple Xio errors are caused there. I must say, I actually don't understand the need for the gdm_server_whack_clients call. Aren't they going to be killed anyway by the kernel? In truth, I have no idea what the pam referred to by Bug 126071 is :)
2) With additional debug messages, I can pinpoint that the first time gdm_server_whack_clients is entered, the Xio error happens during the XQueryTree call. After error trapping, and every consecutive time, the Xio error happens during the XGrabServer call. I thought that was normal, as we didn't ungrab the server, but if the grabbing is disabled, the Xio error happens again during the XQueryTree calls.
3) I checked that the dsp structure which is passed to the call is correct, and this is the case.
4) If the trapping is disabled as follows (note that the error handler gdm_slave_ignore_xioerror_handler is modified with just a debug message and a return(0)):
    XSetIOErrorHandler (gdm_slave_ignore_xioerror_handler);
    gdm_server_whack_clients (d->dsp);
    XSetIOErrorHandler (gdm_slave_xioerror_handler);
then it seems that alt-ctrl-bs is working, but by looking at the log file it seems that the Xio error (again apparently caused by the XQueryTree call) is trapped by the other instance of gdm -> just after the gdm_slave_ignore_xioerror_handler message I see a message of mainloop_sig_callback: signal 17 from the other gdm instance.
Thoughts:
1) Regarding PAM. The bug mentioned is probably a problem with the PAM program assuming that X needs to be done before pam_close is called. As George pointed out, this could be fixed in the PAM module. However, it is nice that GDM tries to clean up your Xclients so that the Xclients die at a known point in time.
2) I am a bit confused about why we are seeing the problem. Note that GDM sets the XIOErrorHandler to gdm_slave_ignore_xioerror_handler so that xioerror should be ignored until after gdm_server_whack_clients is finished. We don't reset the error handler until *after* the gdm_server_whack_clients call. So during this time xioerrors should be ignored. Perhaps the Xserver isn't killing the clients immediately when we call XKillClient, but after some time, causing the XIOErrors to be received after we reset the XIOhandler function? Or perhaps on your system the mechanism used to ignore the handlers isn't working? I don't see this problem on Solaris, so this might be the case.
3) Sounds good.
4) But the code already does this. Note the case statement above. The first time a longjump is called it should have ignore_xioerror_jmp equal to 0, so it sets the IOhandler to the gdm_slave_ignore_xioerror_handler. Then this function calls longjmp with 1, so it should ignore future requests until after the whack_clients call. Any ideas why this isn't working on your platform? Perhaps something weird like the Xclients not dying right away might cause this problem, because the Xclient may generate such an error after GDM has finished calling gdm_server_whack_clients and reset the IOErrorHandler function. Or maybe the LONGJMP stuff doesn't work right on your platform? Maybe you can verify that it falls into case 0 the first time and case "default" on subsequent times?
I'm not really sure I understand what you mean by "the other instance of gdm". There is the main GDM daemon that runs as root and the GDM slave daemon that manages the display.
The code in daemon/slave.c should only be used by the slave daemon. In fact, only the slave daemon interacts with the Xserver at all. The message "mainloop_sig_callback" indicates that you are talking about the main GDM daemon that runs as root. I guess the slave daemon propagates the signal to the main daemon? Perhaps the slave should consume the signal, since it isn't really necessary for the main daemon to waste cycles thinking about this. You'll note that the main daemon only pays attention to SIGCHLD, SIGINT, SIGTERM, SIGXFSZ, SIGXCPU, SIGHUP, and SIGUSR, so if you send it other signals, it just ignores them.
By looking at the code, I think that (just my personal speculations here):
1) The call to gdm_server_whack_clients was inserted to close 126071. This is the HACK mentioned in the source's comments.
2) Further debugging showed that this wasn't completely working, as Xio errors were happening inside it, which caused the handler to trip.
3) To fix this (see HACK FIX in the source's comment) the error handler was modified so that, after the first IO error, the program was falling back and into gdm_server_whack_clients again. This works, as long as there is no more than 1 Xio error (again, see comments).
4) Unfortunately, in my case (and other people's too within the Ubuntu distribution, dunno about other distros) there are instances where we see multiple Xio errors, hence the endless loop. Why we see the multiple errors still eludes me.
>I am a bit confused about why we are seeing the problem. Note that
>GDM sets the XIOErrorHandler to gdm_slave_ignore_xioerror_handler
>so that xioerror should be ignored until after gdm_server_whack_clients
>is finished. We don't reset the error handler until *after* the
>gdm_server_whack_clients call. So during this time xioerrors should
>be ignored.
Actually not, this would be the behaviour with my hacking. In the original code gdm_slave_ignore_xioerror_handler causes a fallback just outside the whack code.
>Or perhaps on your system the mechanism used to ignore the handlers
>isn't working? I don't see this problem on Solaris, so this might be the case.
Why is it working then if I do the restart in a console!?
>Maybe you can verify that it falls into case 0 the first time
>and case "default" on subsequent times?
This is indeed the case (oops, sorry for the pun).
I noticed that 3 potential gdm children are not killed (even by a successful gdm restart): gconfd-2, bonobo-activati and evolution-data-. Could they have anything to do with the Xio errors? Even if I manually kill these processes before the restart, the restart still fails.
>I guess the slave daemon propagates the signal to the main daemon? Perhaps the
>slave should consume the signal since it isn't really necessary for the main
>daemon to waste cycles thinking about this. You'll note that the main daemon
>only pays attention to SIGCHLD, SIGINT, SIGTERM, SIGXFSZ, SIGXCPU, SIGHUP, and
>SIGUSR so if you send it other signals, it just ignores them.
This makes me think that the Xio error which causes all this has to relate to one of these. Do they all lead to the same behaviour (mainloop_sig_callback: signal 17)?
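One way to answer the "signal 17" question is to compare the number against the platform's own constants rather than guess. The helper below is a hypothetical sketch, not part of GDM: it maps a signal number to a name using only the signals listed in the quoted comment. That comment says just "SIGUSR", so the use of SIGUSR1 here is an assumption.

```c
#include <signal.h>
#include <stddef.h>

/*
 * Returns the name of `signo` if it is one of the signals the quoted
 * comment says the main daemon pays attention to, or NULL otherwise.
 * Numeric values differ between platforms, so only the symbolic
 * constants from <signal.h> are compared.
 */
const char *watched_signal_name(int signo)
{
    if (signo == SIGCHLD) return "SIGCHLD";
    if (signo == SIGINT)  return "SIGINT";
    if (signo == SIGTERM) return "SIGTERM";
    if (signo == SIGXFSZ) return "SIGXFSZ";
    if (signo == SIGXCPU) return "SIGXCPU";
    if (signo == SIGHUP)  return "SIGHUP";
    if (signo == SIGUSR1) return "SIGUSR1";  /* assumption: source says "SIGUSR" */
    return NULL;
}
```

On Linux x86 and x86_64, SIGCHLD has the value 17, so on the reporter's machine watched_signal_name(17) would name SIGCHLD, the signal a parent receives when a child process exits. That would be consistent with the slave process going away, although the thread itself does not confirm it.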
Actually, is it true that we see multiple Xio errors, or is it always the same one (lost connection with the Xserver)? If whacking the clients makes sense, why not do it before sending SIGTERM to kill the server!?
I think the reason some programs don't get killed (e.g. gconf) is that these are not X programs, but just daemons that don't connect to the Xserver, so you probably can't kill them with XKillClient.
I think in the normal case, GDM does call the function to whack clients before the Xserver dies. Perhaps the problem is that when you use control-alt-backspace, the Xserver dies in a way not controlled by GDM. This, I think, is why it works from a console, but not via control-alt-backspace.
After thinking about all this, I guess I don't understand the need to check for xioerrors at all here. Why do we care about them if we're just killing the clients before we shut down? They will be shut down anyway, even if we don't whack the clients. I'd say we could do one of a few things:
1) Just get rid of this whack_clients code. If this causes bug 126071 to reappear, then it probably should be fixed a better way. Perhaps broken PAM modules should be fixed rather than trying to fix this in GDM.
2) Fix the code to not call XKillClient, but instead kill all child processes, which would require navigating the process tree and killing them in a sane order (leaf nodes inwards) by sending them a HUP signal or something reasonable. This would also fix bug #152907.
3) Fix the code so that it ignores xioerrors completely after the whack clients call. This isn't as good as solution #2, I think, but is less work and keeps the functionality mostly the same, warts and all. It would also be nice to make the slave consume the Xioerror event so the daemon doesn't get bothered by it.
What do you think? I think I'm agreeable to any of these solutions.
Note related bug #402360 - perhaps we should try to fix this issue at the same time? Does our discussion so far help us understand what might be going on there?
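The "leaf nodes inwards" ordering in option 2 amounts to a post-order walk of the process tree. The sketch below only illustrates that ordering and is not a proposed patch: the struct proc table is a hypothetical input (for example, pid/ppid pairs collected from ps), the table is assumed to describe a real tree with no cycles, and the actual signalling is left to the caller.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical record: one process and its parent. */
struct proc {
    pid_t pid;
    pid_t ppid;
};

/*
 * Appends every descendant of `root` to `out` in post-order, so that a
 * process is always listed after all of its own children.
 * Returns the updated number of entries in `out`.
 */
static size_t visit(const struct proc *tab, size_t n, pid_t root,
                    pid_t *out, size_t count)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].ppid == root && tab[i].pid != root) {
            count = visit(tab, n, tab[i].pid, out, count);
            out[count++] = tab[i].pid;
        }
    }
    return count;
}

/*
 * Fills `out` (room for at least n entries) with the descendants of
 * `root`, leaves first. The root itself is not included.
 */
size_t leaf_first_order(const struct proc *tab, size_t n, pid_t root,
                        pid_t *out)
{
    return visit(tab, n, root, out, 0);
}
```

A caller would then walk `out` from the start and send each pid the chosen signal, for instance with kill(pid, SIGHUP). For a table where 11 and 12 are children of 10 and 13 is a child of 11, the order produced for root 10 is 13, 11, 12.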
I'm quite amenable to any of the 3 and I'm more than willing to do any testing/debugging for any option you choose. I have a sympathy for 1) but I do appreciate that with 2) you kill two birds with one stone. 3) could be tricky actually; I think that the best way to do it is with something similar to my first hack (well, not using ignore_xerror_handler but a new empty ignore_xioerror_handler). What I did in 4) of comment #14 is not working correctly (for instance post-session scripts are not called) because if the error is not trapped in slave.c it is then propagated to the daemon.
>After thinking about all this, I guess I don't understand the need to check for
>xioerrors at all here. Why do we care about them if we're just killing the
>clients before we shutdown?
The way I see it is that if you keep the whack code, you also need to have a local xioerror handler in the slave; otherwise the xioerror(s) caused by that code are propagated to the daemon and the shutdown is not done the way you would like it to be done.
I don't see a link with bug #402360. Isn't that an unexpected X server crash?
The more I think about this, the more I think that we should just remove the whack_clients code from this function. I think option #2 would be the right way to do things, so if removing the code causes someone a problem then I think we can suggest that the right fix be implemented. Could you provide a tested patch that removes this code, or if you go ahead and implement #2 that would be cool also.
I notice sometimes that if I kill GDM via control-alt-backspace or gdm-stop, my next login fails about 5% of the time. Then I have to kill the Xserver again with control-alt-backspace and log in to a failsafe session to kill gconf and other daemons left around. Then I can log in. While I know how to do this, it probably would be hard for new or non-techie users to figure this out. I'd guess users probably reboot when they run into this sort of problem. Fixing this with option #2 would eliminate such usability issues.
Although I've been aware of this for some time, I've never considered it real high priority since people don't normally exit their sessions this way. I wouldn't recommend it either, since it avoids the normal cleanup that happens when you do a real session logout (via the Start menu). If you are an advanced enough user to know to use control-alt-backspace, then you probably can figure out how to kill the daemons, I guess. :)
If you don't fix with #2, then let's leave this bug open after fixing the short-term issue (via #1) and see if someone has an interest in implementing this bit of cleanup code.
Created attachment 85106 [details] [review] Patch to slave.c
Tested successfully on my GNOME setup:
Distribution: Ubuntu 7.04 Herd 5 (with daily updates)
Gnome: 2.18
Linux kernel: 2.6.20-12-generic x86_64
Well, easy-does-it first :) I might have some time next week to see if I can propose a patch for #152907.
This patch deletes the call to gdm_server_whack_clients which was part of the gdm shutdown code, together with its associated error handling calls, functions and variables. This will close bug #419301 but most probably reopen bug #126071. The latter would have to be resolved by fixing the related broken PAM modules.
[[[
2007-03-22 Cesare Tirabassi <norsetto@alice.it>
    * daemon/slave.c: remove in the shutdown code the call to gdm_server_whack_clients, together with the associated error handling calls, functions and variables. Fix issue #419301.
]]]
Doesn't this mean that the signal will get propagated to the daemon, since the slave daemon will no longer listen for it? If so, should we at the very least add some code to the handler in mainloop_sig_callback so it doesn't print out a "Got signal" message for a signal it will just ignore? Perhaps it should check for this signal and just return TRUE if it gets it?
Thanks. This is fixed in SVN head. I will close the bug after I hear feedback from you on my last comment.
The normal Xioerror handler (gdm_slave_xioerror_handler) is still active and will react to Xio errors as nominal (restarting the display). But no Xioerror will be trapped since there are no more calls to the X server.....
Thanks. I'm closing this bug, and if you can look into fixing bug #152907, that would be really cool. Thanks! This is only fixed in 2.19. I won't backport the fix to 2.18 since I'm uncomfortable changing this in a stable release. However, I would consider applying this patch and a patch that fixes 152907 to 2.18.
Created attachment 109023 [details] [review] Check for XIO error after session exit
Hi guys, A RHEL 4 customer just hit this issue, so I investigated a bit. The problem seems to be that the slave never tries to talk to the display after the session exits, so it never gets an XIO error to tell it the X server died. A well-placed XSync call corrects the issue.
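The point made here, that a dead peer is only noticed when you actually try to talk to it, holds for any socket connection. The sketch below is an analogy, not the attached patch: it uses a local socketpair in place of the X connection and a non-blocking peek in place of the round trip that XSync forces. All function names are hypothetical, and MSG_DONTWAIT is assumed to be available, as it is on Linux.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Probes the connection without consuming data.
 * Returns 1 if the peer has closed its end, 0 if the connection still
 * looks alive, and -1 on any other error.
 */
int peer_has_gone(int fd)
{
    char byte;
    ssize_t got = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (got == 0)
        return 1;   /* end of file: the other side is gone */
    if (got < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;   /* nothing to read, but still connected */
    if (got < 0)
        return -1;
    return 0;       /* data is pending, so the peer was alive */
}

/*
 * Demonstration: the surviving end learns nothing from close() on the
 * other end until it probes the socket itself.
 * Returns 1 if the probe reported "alive" before and "gone" after.
 */
int detects_dead_peer(void)
{
    int sv[2];
    int before, after;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    before = peer_has_gone(sv[0]);
    close(sv[1]);                    /* the "server" side dies */
    after = peer_has_gone(sv[0]);
    close(sv[0]);
    return before == 0 && after == 1;
}
```

Nothing is delivered to the surviving end when the peer closes; it only finds out on its next read or write, which is the same reason a slave that never touches the display again would never see the XIO error.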
Ray, thanks for looking into this. Is this patch for GDM 2.20? If so, please feel free to commit to the branch.
committed