GNOME Bugzilla – Bug 357632
gdm hangs altogether after timeout on the gdm socket
Last modified: 2006-10-25 18:12:57 UTC
That bug has been opened on https://launchpad.net/distros/ubuntu/+source/gdm/+bug/62139

"Since the upgrade to dapper final, we experience frequent breakage of gdm on an amd64 (64-bit dapper) XDMCP server regularly serving around 40 clients.

Reproducibility: happens when e.g. many people log out at the same time (once every few days); gdm must be killed and restarted manually afterwards (killing all the existing sessions as well).

Symptoms: slave gdm processes continue to work, but the main gdm process does not spawn new slaves, it does not ping existing ones every 15s as it normally does (judging from the debug syslog), and it does not respond to TERM (it must be KILLed) - as if it were waiting for something (a race?).

Logs reveal that the only difference between the normal situation and the bug is the timeout on the gdm socket (I will attach the full log):

Sep 22 14:16:40 [gdm] Sending LOGGED_IN == 0 for slave 317
Sep 22 14:16:40 [gdm] Timeout occurred for sending message LOGGED_IN 317 0

What might be the reason? In slave.c:gdm_slave_send, up to 10 attempts are made to deliver the message (select on &rfds), but select apparently returns an error, since the timeout never expires (otherwise, at least 10s would have to pass between sending the message and the timeout).

PS. I compiled gdm with an added line for tracing the message sending and will post results if they are relevant (daemon/slave.c):

@@ -2767,6 +2766,7 @@
 		if (in_usr2_signal > 0) {
 			fd_set rfds;
 			struct timeval tv;
+			int select_retval;
 			FD_ZERO (&rfds);
 			FD_SET (d->slave_notify_fd, &rfds);
@@ -2775,9 +2775,10 @@
 			tv.tv_sec = 1;
 			tv.tv_usec = 0;
-			if (select (d->slave_notify_fd+1, &rfds, NULL, NULL, &tv) > 0) {
+			if ((select_retval = select (d->slave_notify_fd+1, &rfds, NULL, NULL, &tv)) > 0) {
 				gdm_slave_handle_usr2_message ();
 			}
+			if (select_retval < 0) gdm_debug ("TRACE (%s,%d): select returned errno %d (%s)", __FILE__, __LINE__, errno, strerror (errno));
 		} else {
 			struct timeval tv;
 			/* Wait 1 second. */
@@ -2787,6 +2788,7 @@
 			/* don't want to use sleep since we're using alarm for pinging */
 		}
+		gdm_debug ("TRACE (%s,%d): Passed gdm_slave_send cycle, i=%d, in_usr2_signal=%d, wait_for_ack=%d, gdm_got_ack=%d.", __FILE__, __LINE__, i, in_usr2_signal, wait_for_ack, gdm_got_ack);
 	}

 	if G_UNLIKELY (wait_for_ack &&

http://librarian.launchpad.net/4416825/ - a part of the gdm debug log

This is the log from where the error shows up, up to the restart (it happened Friday afternoon and was restarted Monday). Note that the pinging from the main gdm process is no longer present. The "Fatal X error detected." is present even under normal conditions and hence is not the cause of the bug.
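To illustrate why a select() error can make the retry loop finish almost instantly (instead of consuming the full 10 x 1s), here is a minimal, self-contained sketch of such a loop. It is an illustration only, not the actual gdm_slave_send code; the function name wait_for_reply is made up:

/* Illustration only -- not gdm's code.  Shows how a select()-based retry
 * loop can "fail fast": when select() returns -1 (for example EINTR because
 * a signal arrived), no time is spent waiting, so ten retries can complete
 * in far less than ten seconds. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>

static int
wait_for_reply (int fd, int attempts)
{
	int i;

	for (i = 0; i < attempts; i++) {
		fd_set rfds;
		struct timeval tv;
		int ret;

		FD_ZERO (&rfds);
		FD_SET (fd, &rfds);
		tv.tv_sec = 1;	/* one second per attempt */
		tv.tv_usec = 0;

		ret = select (fd + 1, &rfds, NULL, NULL, &tv);
		if (ret > 0)
			return 0;	/* fd became readable: reply arrived */
		if (ret < 0)
			fprintf (stderr, "select failed: %s\n", strerror (errno));
		/* ret == 0: the full one-second timeout elapsed; retry */
	}
	return -1;	/* gave up after the given number of attempts */
}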
Typically when the GDM daemon becomes this unresponsive, it is because it has crashed. Note that GDM works like this:

1. The main GDM daemon starts. This is the process whose pid is listed in /var/run/gdm.pid. Note that most programs try to connect to this process when they want to communicate; for example, gdmflexiserver looks up the daemon process pid in this file.

2. For each display, a forked daemon (aka the slave daemon) is started.

3. The forked daemon starts the GUI program (gdmlogin/gdmgreeter).

Note it can look like the daemons are still running even when the main daemon has crashed, since the forked slave daemons (#2 above) may still be running.

If the main daemon is crashing, has it left a core file somewhere? Look for GDM-related core files in /, /var/lib/gdm, /var/log/gdm, or the GDM user's $HOME directory (if any).

I suspect all of the above won't help you, since it sounds like the main daemon isn't crashing. I just mention it so you can double-check, since this is typically the problem for this sort of symptom.

GDM uses a couple of different mechanisms to communicate between the main daemon and other processes. There are the GDM_SUP_* messages that are used by various programs to query the daemon about different things. What happens when you run this command? It tests the GDM_SUP_* messages. Does it work?

gdmflexiserver --command=VERSION

The other kind of messages are the GDM_SOP_* messages. For these, the forked slave process sends the main daemon a SIGUSR1 signal and then uses a fifo protocol for passing the messages. Note the function create_connections in daemon/gdm.c and gdm_slave_send in daemon/slave.c. In your syslog it seems that messages like GDM_SOP_LOGGED_IN (aka LOGGED_IN) are failing, so this is probably where it is breaking. I'm wondering whether this is the *only* kind of message that is breaking, or whether all GDM_SOP messages are broken for you.

If I were to guess what the problem might be, it could be that GDM is executing some code in the SIGUSR1 signal handler that is not signal-handler safe. Since it works most of the time but fails in weird situations where many users log out at once, that might be triggering a normally rare event that makes the signal handler mad. Note the gdm_slave_usr2_handler function in slave.c; it might be worth looking into this function more closely.

At this point I'm just giving you some pointers to hopefully narrow down what the problem may be. Perhaps you could investigate, and if possible add some more gdm_debug messages, so we can get a better understanding of where the failure may be happening?
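To make the GDM_SOP_* mechanism more concrete, here is a rough, self-contained sketch of the general "write the message to a fifo, then signal the peer" pattern. It is not GDM's actual implementation; the fifo path and function name are made up for the example:

/* Rough sketch of the signal-plus-fifo pattern described above.  NOT gdm's
 * code: the fifo path and the function name are invented for illustration. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int
send_sop_style_message (pid_t peer, const char *msg)
{
	/* O_NONBLOCK so we fail instead of hanging if the reader never
	 * opened its end of the fifo. */
	int fd = open ("/tmp/example-daemon-fifo", O_WRONLY | O_NONBLOCK);

	if (fd < 0) {
		fprintf (stderr, "open fifo: %s\n", strerror (errno));
		return -1;
	}
	if (write (fd, msg, strlen (msg)) < 0) {
		fprintf (stderr, "write: %s\n", strerror (errno));
		close (fd);
		return -1;
	}
	close (fd);

	/* Wake the peer.  Its SIGUSR1 handler should only do async-signal-safe
	 * work (set a flag, read the fifo); doing more than that in a handler
	 * is exactly the kind of problem suspected above. */
	if (kill (peer, SIGUSR1) < 0) {
		fprintf (stderr, "kill: %s\n", strerror (errno));
		return -1;
	}
	return 0;
}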
Thank you for the detailed comment. I've forwarded it to the distribution bug and will let you know if the submitter figures something out.
Hello, I tracked the issue down to the gdm master daemon crashing after all. I installed gdm compiled with -ggdb3 for copious debugging output, but I am not able to get the core dump. It is probably a distro issue or my own stupidity - but it seems that /etc/security/limits.conf is not honored even though pam_limits is run... Will investigate and get back. Regards, VS
Yes. Some systems are configured not to save core files for certain types of processes like GDM, since you don't want core dumps to fill your file system when certain system programs crash. This can be configured, I'm sure, but how you do it tends to be very system specific. I listed the common directories where GDM will leave core files, but I imagine you already looked there and didn't find any.

What I'd recommend is running the gdm daemon in a debugger with the --nodaemon flag so it doesn't spawn as a child process. Note you should run gdm-binary directly rather than starting the "gdm" program, which is just a shell script that sets some environment variables and launches gdm-binary. You can set this environment stuff up by hand if you need it (like if you are using a non-English locale and need the l10n environment variables set, etc.), but that is probably not necessary for just debugging the crash. Then the debugger should report the line number when it crashes. Note that if you have a running session, you will probably need to switch VTs or kill your GDM slave daemon so that you can get back to the console to see the message. Hopefully this will also help us understand what might be causing the crash.

If that doesn't work, there are other ways you can probably also trap the process so that when it SEGVs it hangs rather than crashing, so you can attach a debugger to it. Keep me posted.
Hello, the problem with attaching a debugger is that the crash is not regular and occurs at intervals of a few hours; secondly, this is an application server that is rather heavily loaded (0-40 sessions at a time). Unless I get Ubuntu to set the ulimits right (still no success here), we will be guessing blindly...

Some observations, however. Every minute I run a cron script that checks for the existence of the process with pid=`cat /var/run/gdm.pid`; if the process doesn't exist, gdm is started. Apparently, existing sessions continue to work, and new logins work as well. In short, everything is (almost - see below) as it should be.

If parent_exists()==false, I have gdm_debug write it to the log; it is then clear from which point to expect breakage. (The problem with the message timeout comes from the fact that parent_exists() returns false (master daemon dead) and the delivery is not even tried: the 'Timeout' is somewhat misleading in this respect, since it fails instantly.)

HOWEVER, the dead parent in the log is _much_ (say 3x) more frequent than cron restarting the master gdm daemon. That means that parent_exists()==false doesn't always mean that the parent is dead for good; sometimes, after a few timeouts, the message passing continues to work. How is that possible? I have to investigate. What changes after the master is restarted is that it no longer pings displays (hence it no longer handles ALRM?), which may hint at some signal-handling race leading to a silent crash.

I can send the gdm debug logs, but they are rather long (around 10MB/week uncompressed). Regards, VS
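PS. For reference, the liveness test that both my cron watchdog and a parent_exists()-style helper boil down to is kill(pid, 0). A minimal sketch (the function name is mine, not gdm's):

/* Minimal "is this pid still alive?" check, the kind of test the cron
 * watchdog and a parent_exists()-style helper rely on.  Illustration only,
 * not gdm's code. */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

static int
pid_is_alive (pid_t pid)
{
	if (pid <= 1)
		return 0;	/* no pid recorded, or we were reparented to init */
	if (kill (pid, 0) == 0)
		return 1;	/* the process exists and we may signal it */
	/* EPERM: the process exists but belongs to someone else;
	 * ESRCH: no such process. */
	return errno == EPERM;
}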
There are two places in the slave.c code where parent_exists is called (gdm_slave_send and gdm_slave_start). Do you know in which case the problem occurs?

Yes, existing sessions will mostly work, since killing the master daemon doesn't cause the forked slave daemons to die, and the forked slave daemon is really what controls the session. However, it is broken in the sense that the master daemon isn't really aware of these existing sessions, and you wouldn't get the expected behavior if you tried to do things where the daemon needs to talk to the forked slave daemon (such as running gdmflexiserver --command requests that would affect the sessions or report about them - e.g. ATTACHED_SERVERS).

Note that even though you restart GDM, it doesn't know when to restart the GUI for such a display, because normally GDM expects to get a return code back from the GUI. When you restart the GDM daemon, it fails to start the GUI on displays with running sessions. It might be nice to deal with this sort of problem by adding a new gdmflexiserver --command so that the GUI tells the daemon when it is done running. Then the daemon could be notified to start a new session for that display. Though it should, of course, first check that the GDM configuration file specifies that the display *should* be managed; in other words, the display should be in the "displays" extern defined in gdmconfig.c. However, this just works around the problem of the GUI failing to be remanaged when the master daemon restarts.

It is probably best to track down exactly why the daemon is core dumping and fix that, so the problem doesn't happen. Perhaps you could investigate with your distro how to configure your system so that you can get the core file that GDM should be creating when it crashes? If you compile with "-g", it would be much easier to track down where the crash is happening.

Sometimes I find that if I do *not* launch GDM via the normal boot-time startup, but instead start GDM by hand from the console, it will generate a core file when it dies. Sometimes starting up processes via the startup scripts (e.g. /etc/rc*) turns off core dumping, since you normally do not want system tools to fill up your system with big core files if they are crashing. This might help?

Good luck, and thanks for your help making GDM better.
Hi, even though I am still unable to get a core dump of gdm (despite bug reports about /etc/security/limits.conf etc.), I got 4 crashes today while running gdm with catchsegv, but I did not notice that Debian strips the executable after compilation, so the output contains addresses only. What is positive is that all four cases have an identical backtrace. I attach one of them; if it can be of any use to you, I can get you additional material (the unstripped binary - hopefully really the one that ran, the stripped binary that was run, or the objdump thereof). I will try to get a better catchsegv output or will try running gdm in valgrind. I will get back as soon as I have something relevant here.
Created attachment 74815: catchsegv output of crashed gdm daemon
It's cool that you have figured out how to get stack traces. However, looking at it, it doesn't seem to include any information about which specific functions in GDM are causing the problem, so it doesn't help me much. Perhaps someone else reviewing these bug reports can decipher the report and give us more of an idea of what is going on?

Would it be possible for you to recompile GDM with -g so we could get more information about the problem? I think it would be much easier to debug with -g. Also, it might be useful if you actually ran GDM under the debugger, so that when it crashes we could print out variables, etc. and otherwise explore why it crashed.
I tried to attach gdb to the running gdm (compiled with -ggdb3) and sent it SEGV (with the handling of SEGV set appropriately), and gdb says: "Cannot find user-level thread for LWP [xxx]: generic error", which seems to be a bug that has been reported several times without any resolution.

I was able, however, to interpret portions of the backtrace with the great tool addr2line I found today. Applied to the addresses from the attached catchsegv output, I get:

root@gamma:~/gdm-crashes# addr2line -f -e gdm-binary_unstripped 40fb9d
gdm_handle_user_message
/root/gdm/gdm-2.16.0/daemon/gdm.c:3283
root@gamma:~/gdm-crashes# addr2line -f -e gdm-binary_unstripped 4369a4
gdm_connection_handler
/root/gdm/gdm-2.16.0/daemon/gdm-net.c:179
root@gamma:~/gdm-crashes# addr2line -f -e gdm-binary_unstripped 40886a
_start
../sysdeps/x86_64/elf/start.S:116

So the crash is definitely located in gdm_handle_user_message, at the line reading

	sysmenu = gdm_get_value_bool_per_display (disp->name, GDM_KEY_SYSTEM_MENU);

which is executed when the message is GDM_SUP_QUERY_LOGOUT_ACTION. Looking quickly at the preceding line ("disp = gdm_connection_get_display (conn);"), where there is a possible source of disp being something bizarre (like NULL), I examined the function called:

GdmDisplay *
gdm_connection_get_display (GdmConnection *conn)
{
	g_return_val_if_fail (conn != NULL, NULL);
	return conn->disp;
}

The documentation of the g_return_val_if_fail macro says: "Returns from the current function, returning the value val, if the expression is not true. If the expression evaluates to FALSE, a critical message is logged and val is returned." (http://developer.gnome.org/doc/API/glib/glib-warnings-and-assertions.html#G-RETURN-VAL-IF-FAIL)

A possible source therefore is that a NULL GdmDisplay is returned (the critical message going nowhere, most likely?) and the return value is not checked, EVEN though the developer apparently did account for the possibility of NULL in the gdm_connection_get_display function itself.

I will try 2 things: 1. discard the message if disp is NULL (sketched in the PS below); 2. put a bunch of gdm_debug()s into the relevant code to see whether this is really the cause of the crash and what conditions may yield such behavior. The results will not be available before tomorrow, since there are already 14 people logged in now and I don't want to kill their sessions.

Comments appreciated, will post back soon. Vaclav
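PS. A minimal sketch of the guard from point 1 above. It is an illustration of my proposed workaround, not a verbatim patch (and not necessarily the same as the upstream fix); the surrounding code of gdm_handle_user_message is abbreviated and assumed to return void at this point:

	disp = gdm_connection_get_display (conn);
	if (disp == NULL) {
		/* No display attached to this connection: discard the
		 * message instead of dereferencing a NULL pointer. */
		gdm_debug ("gdm_handle_user_message: no display for this "
			   "connection, ignoring message");
		return;
	}
	sysmenu = gdm_get_value_bool_per_display (disp->name, GDM_KEY_SYSTEM_MENU);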
This seems like a dupe of bug #352838 (already fixed in CVS, but not in the Dapper build whose source I was - pardon me - using). I found this out by making a patch against current CVS: it turned out that the very same fix (daemon/gdm.c:3277) had already been committed there. Even though the cause is not 100% certain yet, if it doesn't crash in the next few days, this bug can be closed. I will keep you posted. Thanks and sorry.
This sounds a lot like a problem that has already been fixed in CVS head. Note bug #352838 and the ChangeLog entry below. I think you can rebuild GDM with the patch from bug #352838 and you'll probably be happier. :)

2006-09-11  Brian Cameron  <brian.cameron@sun.com>

	* daemon/gdm.c: Move call to get GDM_KEY_SYSTEM_MENU until after
	  authentication check.  Fixes bug #352838.  Patch by Frederic
	  Crozat <fcrozat@mandriva.com>.
	* docs/C/gdm.xml: Make gdmdynamic section a bit more clear.
Hello, I confirm that there have been no more segfaults recently, therefore this bug may be closed (as a dupe of bug #352838). To clarify things for reference: the buggy version was a mistaken (and long-unnoticed) upgrade to early Ubuntu Edgy, which had the post-2.14 development version of gdm in which the bug was introduced and later fixed as mentioned above. Since I downgraded (not upgraded - to avoid many dependencies being pulled in) gdm to the stable Dapper release version (2.14), the segfault has not occurred anymore. Regards, Vaclav Smilauer
*** This bug has been marked as a duplicate of 352838 ***