GNOME Bugzilla – Bug 336549
gdm_slave_usr2_handler is not reentrant
Last modified: 2010-06-06 23:41:32 UTC
I don't know how it happened (perhaps a bug in gdmsetup?), but I managed to find gdm (the slave process) blocked in gdm_slave_handle_usr2_message:

(gdb) bt
+ Trace 67328
Obviously this is replicable by sending SIGUSR2 to the slave gdm process and not writing to its notify fd; I'm not in the habit of sending signals to random processes, so I'm certain a gdmsetup bug must have caused this. There's no immediate sign that anything's wrong; some time after the slave process gets blocked (enough time to fill up the stderr pipe that gets fed to .xsession-errors) apps start hanging whenever they write to stderr, because the slave process isn't reading from the stderr pipe. Yes, this can only get triggered by a bug in gdmsetup or playing with kill, but the effects are nasty enough that it should be guarded against anyway. A possible fix would be to make gdm_slave_handle_usr2_message() use non-blocking read on d->slave_notify_fd, or maybe select().
I'd like to have a bit better understanding of what is causing this problem. Can you replicate the problem in gdmsetup? If so, could you turn on debug and send me the log so I can see why the slave is being sent a SIGUSR2 signal without anything being written to the socket?
/me dances :D-< /me dances :D|-< /me dances :D/-<

Yes, I've found a way to replicate it. In gdmsetup, in the Accessibility tab, toggle off "Enable accessible login". It only happens occasionally even by doing that, but I have managed to replicate it at least three times; it may help to have a high system load at the time.

Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/AddGtkModules'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/GtkModulesList'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/PidFile'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'UPDATE_CONFIG daemon/GtkModulesList='
Mar 31 06:01:31 capella gdm[14638]: set config key daemon/GtkModulesList= to string <NULL>
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/PidFile'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'UPDATE_CONFIG daemon/AddGtkModules=false'
Mar 31 06:01:31 capella gdm[14638]: set config key daemon/AddGtkModules=false to boolean false
Mar 31 06:01:31 capella gdm[14638]: gdm_socket_handler: Accepting new connection fd 9

What I think is happening is a race in notify_displays_{int,string}: they write to the slave fd and then send USR2. If two keys are updated at the same time you get:

gdmsetup        | gdm slave
----------------|------------------
write key1      |
send USR2       |
                | get USR2
write key2      |
                | read key1+key2
send USR2       |
                | get USR2
                | read  <-- blocks

Does this sound plausible?
Yes, I think you have correctly diagnosed the problem. I think this bug has been fixed in CVS head and I would appreciate it if you could test it. Now GDM slaves call "CLOSE" on the connection each time they send data to the daemon, so key2 should still be pending when the 2nd USR2 signal happens. I've tested turning a11y on and off in gdmsetup and it seems to work okay in CVS head; can you verify? Thanks!
I've installed 2.14.1 and I can still reproduce the bug. Actually, I can't see anywhere in the code that CLOSE is called on the connection.
I'm sorry to hear that 2.14.1 still has the problem. Note that in the function gdmcomm_call_gdm_real in the file gui/gdmcomm.c, right at the end, it calls CLOSE if bulk_acs is FALSE. The only time bulk_acs is TRUE is if the client calls gdmcomm_comm_bulk_start/gdmcomm_comm_bulk_stop, which is done around areas where the client reads a lot of config data at once. This allows the client to use a single connection to get all the config data and is faster. However, by the time gdmsetup is done initializing and before it shows the screen, it should have called gdmcomm_comm_bulk_stop already, so the logic should be calling CLOSE after each config read.

If you turn on Enable=true in the [debug] section of the config file, then you should see debug messages go to your syslog (/var/log/messages). You should see each connection open, read a config value, and then close. It shouldn't try to open the second connection before the first one gets data back. Could you try this and share the syslog output? It would help me see what may be going wrong with the socket connection here.
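For reference, the debug stanza described above looks like this (a minimal sketch, assuming the stock gdm.conf layout; adjust the path of the configuration file to your distribution):

[debug]
Enable=true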
Here's what I get in syslog toggling accessibility on:

Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/AddGtkModules'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/GtkModulesList'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/PidFile'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'UPDATE_CONFIG daemon/GtkModulesList='
Apr 14 07:19:47 capella gdm[4680]: set config key daemon/GtkModulesList= to string <NULL>
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/PidFile'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'UPDATE_CONFIG daemon/AddGtkModules=false'
Apr 14 07:19:47 capella gdm[4680]: set config key daemon/AddGtkModules=false to boolean false
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4681]: Handling slave notify: 'GtkModulesList '
Apr 14 07:19:47 capella gdm[4681]: set config key daemon/GtkModulesList= to string <NULL>
Apr 14 07:19:47 capella gdm[4681]: Handling slave notify: 'AddGtkModules 0'
Apr 14 07:19:47 capella gdm[4681]: set config key daemon/AddGtkModules=false to boolean false

Also: I attached gdb to the slave pid (4681) and set it to report some info:

(gdb) handle SIGUSR2 nostop print pass
Signal        Stop      Print   Pass to program   Description
SIGUSR2       No        Yes     Yes               User defined signal 2
(gdb) b gdm_slave_handle_usr2_message
Breakpoint 1 at 0x805e2b0: file slave.c, line 4763.
(gdb) comm
Type commands for when breakpoint 1 is hit, one per line.
End with a line saying just "end".
>c
>end
(gdb) b slave.c:4764
Breakpoint 2 at 0x805e2db: file slave.c, line 4764.
(gdb) comm
Type commands for when breakpoint 2 is hit, one per line.
End with a line saying just "end".
>p buf
>c
>end
(gdb) c
Continuing.

Program received signal SIGUSR2, User defined signal 2.

Breakpoint 1, gdm_slave_handle_usr2_message () at slave.c:4763
4763    in slave.c

Breakpoint 2, gdm_slave_handle_usr2_message () at slave.c:4764
4764    in slave.c
$3 = "!GtkModulesList gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener\n!AddGtkModules 1\n\016\000\000\002\000\000\000\002\000\000\000\b\000\000\000\000\000\000\000\222\204þC\000\000\000\000\222\204þC\000\000\000\000ª\204þC\000\000\000\000¹\204\004\000\000\000\000\000lH²¿ßM]·"...

Program received signal SIGUSR2, User defined signal 2.

Breakpoint 1, gdm_slave_handle_usr2_message () at slave.c:4763
4763    in slave.c

Program received signal SIGINT, Interrupt.
0xffffe410 in __kernel_vsyscall ()
(gdb) bt
+ Trace 67613
The second breakpoint is just after the read call that the slave hangs in. I think the contents of the buffer after the first read demonstrate that there is a race between the parent and slave gdm processes. A useful fact: attaching gdb to the slave gdm process slows it down enough that the race is hit more than half of the time.
Okay, I understand what's going on now. The problem looks more like this:

gdmsetup             | gdm daemon              | gdm slave
---------------------|-------------------------|---------------------------------
write key1           |                         |
call UPDATE_CONFIG   |                         |
                     | receive UPDATE_CONFIG   |
                     | send USR2               |
write key2           |                         |
call UPDATE_CONFIG   |                         |
                     | receive UPDATE_CONFIG   |
                     | send USR2               |
                     |                         | receive first USR2
                     |                         | receive second USR2 & block on read

I suspect what is happening is that the slave reads both "update notifications" when it receives the first USR2, so when it gets into the 2nd USR2 there is nothing more to read from the daemon, causing the hang. Could you verify this? Perhaps add a gdm_debug() message to print out the contents of "buf" to the syslog. I'm betting that in the error condition both messages will be received together the first time USR2 is received. I think this has to be the case; otherwise the syslog you posted wouldn't be reporting that it is "handling" both of them.

I agree the appropriate fix would be to make the read nonblocking and to make sure that the usr2_handler in the slave gracefully handles this situation, probably just by doing nothing if there is nothing to read on the pipe. Would you be willing to provide a patch to fix the management of this pipe to be nonblocking? Thanks for taking the time to report this bug.
Created attachment 64297 [details] [review]
gdm-hang.patch

Right, yeah. I didn't really understand gdm's internal architecture, but your diagram makes sense.

> Perhaps add a gdm_debug() message to print out the contents of "buf" to the
> syslog. I'm betting that in the error condition both messages will be received
> together the first time USR2 is received.

That's what the gdb breakpoint command list above does.

$3 = "!GtkModulesList gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener\n!AddGtkModules 1\n\016\000\000\002\000\000\000\002\000\000\000\b\000\000\000\000\000\000\000\222\204þC\000\000\000\000\222\204þC\000\000\000\000ª\204þC\000\000\000\000¹\204\004\000\000\000\000\000lH²¿ßM]·"...

i.e. buf contains a !GtkModulesList message and a !AddGtkModules message.

This patch sets slave_notify_fd to be nonblocking. It also changes count to ssize_t (signed; read() returns -1 on error).
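For readers following along, here is a minimal sketch of the approach the patch takes (illustrative only, not the actual gdm-hang.patch; the helper names are made up):

#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

/* Mark the notify fd non-blocking so that a read with nothing pending
   returns -1 with errno == EAGAIN instead of hanging the slave. */
static void
make_notify_fd_nonblocking (int fd)
{
	int flags = fcntl (fd, F_GETFL, 0);
	if (flags >= 0)
		fcntl (fd, F_SETFL, flags | O_NONBLOCK);
}

/* Sketch of the read path in the USR2 handler. */
static void
handle_usr2_sketch (int fd)
{
	char    buf[256];
	ssize_t count;	/* signed: read() returns -1 on error */

	count = read (fd, buf, sizeof (buf) - 1);
	if (count < 0) {
		/* With O_NONBLOCK, read() returns -1 with errno == EAGAIN when
		   the pipe is empty (e.g. an earlier handler invocation already
		   drained it); just return instead of blocking. */
		return;
	}
	if (count == 0)
		return;	/* write end closed */
	buf[count] = '\0';
	/* ... split buf on '\n' and handle each message ... */
}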
I worry that this may not be the right fix. Have you tested that GDM works properly when the code enters this state of needing to open in non-blocking mode, and verified that everything works properly? If it opens non-blocking, wouldn't it be possible for one slave to get a message that was intended for another slave, or the wrong message? Basically, I worry that we may just be hiding a deeper problem with this fix. We should discuss whether it would be good to fix the code so it never enters a state of needing to open the socket in non-blocking mode. I don't think we yet have a good enough understanding of the problem to decide this.

If my diagram above is correct, then I'm confused about why the problem is happening. Perhaps the problem is that when the slave receives the first USR2, it accepts and processes both requests, and the second USR2 hangs because there is nothing more to read and it just waits. If this is the case, then perhaps the slaves need to be made smarter and only read one message from the queue each time they get a USR2 signal. Or perhaps we could do something like make the slave send a message back to the daemon when it is done processing a USR2 request (or after it has read the data from the socket), and the daemon will hold off on sending the next request until it gets confirmation that the last one was processed. Also, we could put a delay in gdmsetup so it waits a bit between sending two updates in a row. This last idea doesn't really fix the race condition, but hides it a bit.

I hope that you and I, through this bug report discussion, work through the analysis of the code so that we make the right change. I'm happy to help. Perhaps we could put some debug code in the slave so that we can get a clearer picture of what is going on. Is the code processing the two messages with one USR2 signal? If so, does hacking the code so it doesn't process more than one at a time fix the problem as well?
> I worry that this may not be the right fix. Have you tested that GDM works
> properly when the code enters this state of needing to open in non-blocking
> mode, and verified that everything works properly?

As far as I can. gdm doesn't have a test suite, and I'm unable to simulate heavy load.

> If it opens non-blocking, wouldn't it be possible for one slave to get a
> message that was intended for another slave, or the wrong message? Basically,
> I worry that we may just be hiding a deeper problem with this fix.

No; the fd is the [0] end of a pipe. The pipe is created just before fork, and the [1] end is assigned to the master_notify_fd member of the GdmDisplay * corresponding to the slave in question. See gdm_display_manage() in daemon/display.c.

> Perhaps the problem is that when the slave receives the first USR2, it accepts
> and processes both requests, and the second USR2 hangs because there is nothing
> more to read and it just waits.

Yes; that's what's happening.

> If this is the case, then perhaps the slaves
> need to be made smarter and only read one message from the queue each time they
> get a USR2 signal.

Quickly grepping the code, there don't seem to be any cases of sending more than one message at a time. However, gdm_slave_handle_usr2_message() splits on \n, so the machinery to handle multiple messages is there; messing with it might not be the best idea. Anyway, reading one message at a time would mean reading the fd one byte at a time; very inefficient.

> Or perhaps we could do something like make the slave send
> a message back to the daemon when it is done processing a USR2 request (or
> after it has read the data from the socket), and the daemon will hold off on
> sending the next request until it gets confirmation that the last one was
> processed.

Seems a bit over-engineered.

> Also, we could put a delay in gdmsetup so it waits a bit between sending
> two updates in a row. This last idea doesn't really fix the race condition,
> but hides it a bit.

Ugh. If you do that, don't let my name appear in the commit message.

To be honest, I don't want to touch the GDM code more than necessary; it's a bit of a tangled mess. Yeah, signals, I know.

I've written some debug code and I've got a log. I'll attach them.
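To illustrate the per-slave pipe arrangement described above, here is a rough sketch (illustrative only; not the actual daemon/display.c code, and the function name is made up):

#include <unistd.h>
#include <sys/types.h>

static void
manage_display_sketch (void)
{
	int   notify[2];
	pid_t pid;

	pipe (notify);	/* [0] = slave read end, [1] = daemon write end */
	pid = fork ();

	if (pid == 0) {
		/* slave: keep the read end, close the write end */
		close (notify[1]);
		/* d->slave_notify_fd = notify[0]; */
	} else {
		/* daemon: keep the write end, close the read end */
		close (notify[0]);
		/* d->master_notify_fd = notify[1]; */
	}
}

Because each slave is forked with its own pipe, a message written to one slave's master_notify_fd cannot be read by any other slave.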
Created attachment 64711 [details] [review]
gdm-debug.patch

Some debugging log statements.
Created attachment 64713 [details] [review]
gdm-debug.patch

Second try.
Let's keep thinking about this. This is a tricky bit of code, as you noticed, so we want to make sure we fix it properly. After all, we are fixing a bug that only seems to show up if users click too frequently on certain buttons in gdmsetup, which probably doesn't happen too often. Still, hanging is bad, so we should fix it... but we don't want to introduce any worse problems. Could we do something to the slave, do you think, to make it abort the read if it doesn't get a message in, say, 5 seconds? A timeout, so the slave will just abort waiting on the pipe after a while? That might fix the problem more elegantly than making the slave and daemon handshake more.
Another idea may be to hack the slave to read multiple messages from the pipe, but stick any extra messages it reads on an internal queue. Then when it gets called a second time, it would just read off the internal queue rather than trying to read over the pipe.
Created attachment 64715 [details] [review]
gdm-debug.patch

3rd time lucky...
Created attachment 64717 [details]
gdm.log

Actions to generate this log:
1. Patch and compile gdm with the above debugging patch
2. Enable debugging in gdm.conf
3. Restart gdm, log in
4. Place the gdm slave process under heavy load, or simulate this by attaching a debugger (making sure to pass signals)
5. In gdmsetup, turn on accessibility. Maybe toggle a few times.
6. Watch syslog.

End of log (relevant part):

May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: about to read from slave fd
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: read 141 bytes from slave fd
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: message is:
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 0>: !GtkModulesList gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 1>: !AddGtkModules 1
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 2>:
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: end of message
May 3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: about to read from slave fd

At this point, the slave gdm process is hung.
> Could we do something to the slave, do you think, to make it abort the read if
> it doesn't get a message in, say, 5 seconds? A timeout, so the slave will just
> abort waiting on the pipe after a while? That might fix the problem more
> elegantly than making the slave and daemon handshake more.

No point: the parent process (the main gdm daemon) sends SIGUSR2 *after* it writes to the pipe. AFAIK Unix semantics preserve the temporal ordering of inter-process communication.

> Another idea may be to hack the slave to read multiple messages from the pipe,
> but stick any extra messages it reads on an internal queue. Then when it gets
> called a second time, it would just read off the internal queue rather than
> trying to read over the pipe.

I don't like it. Currently the code allows for sending multiple messages in one go. (It's not used, but that might change.) Also, an internal queue seems excessive state for a fairly simple one-way message-passing protocol.

I do think the patch I posted is close to the correct solution. If the problem is that setting O_NONBLOCK is too potentially global in its effects, we could poll (or select() with a zero timeval) the read end of the pipe prior to calling gdm_slave_handle_usr2_message(). Note that gdm_slave_send() already select()s on the read end of the pipe before calling gdm_slave_handle_usr2_message(), so semantics would be preserved.
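To make the idea concrete, here is a rough sketch of a zero-timeout select() check (illustrative only; not the attached gdm-slave-poll.patch, and the function name is made up):

#include <sys/select.h>

/* Returns nonzero if the notify fd currently has data to read.
   A zero timeval makes select() return immediately instead of waiting. */
static int
notify_fd_readable (int fd)
{
	fd_set         rfds;
	struct timeval tv = { 0, 0 };

	FD_ZERO (&rfds);
	FD_SET (fd, &rfds);

	return select (fd + 1, &rfds, NULL, NULL, &tv) > 0;
}

/* The caller would do something like:
       if (notify_fd_readable (d->slave_notify_fd))
               gdm_slave_handle_usr2_message ();
   but, as discussed below, a signal handler can still drain the pipe
   between this check and the subsequent read(). */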
Created attachment 64718 [details] [review]
gdm-slave-poll.patch

Select() on the notify fd. I'm not entirely sure about this, though; what happens if a signal arrives between the select() call (at which point the pipe can be read without blocking) and the read() call in gdm_slave_handle_usr2_message()? (Yes, it's a very thin race. So what?) The second SIGUSR2 handler would read the lot:

(gdmsetup omitted from diagram)

Main daemon       | Slave process
------------------|--------------------------------------------------
write to pipe     |
send SIGUSR2      |
                  | receive SIGUSR2
                  | enter gdm_slave_usr2_handler()
                  | poll pipe; data available
                  | call gdm_slave_handle_usr2_message()
write to pipe     |
send SIGUSR2      |
                  | + receive SIGUSR2
                  | + enter gdm_slave_usr2_handler()
                  | + poll pipe; data available
                  | + call gdm_slave_handle_usr2_message()
                  | + read from pipe; get 2 messages
                  | + handle messages
                  | + return from gdm_slave_handle_usr2_message()
                  | + exit gdm_slave_usr2_handler()
                  | read from pipe  <-- BLOCKS
Hm. I think the use of select() on the notify fd in the current code is also vulnerable to this race. Note that gdm_slave_handle_usr2_message() is also called from gdm_slave_send(), and gdm_slave_send() can be called from within a signal handler, e.g. from within gdm_slave_child_handler() (which calls gdm_slave_send_num()).

Main daemon    | Children, etc. | Slave process
---------------|----------------|--------------------------------------------------
write to pipe  |                |
send SIGUSR2   |                |
               |                | receive SIGUSR2
               |                | enter gdm_slave_usr2_handler()
               |                | increment in_usr2_signal (now 1)
               |                | call gdm_slave_handle_usr2_message()
               | die (SIGCHLD)  |
               |                | + receive SIGCHLD
               |                | + enter gdm_slave_child_handler()
               |                | + call gdm_slave_send_num()
               |                | + call gdm_slave_send()
               |                | + if (in_usr2_signal > 0) {
               |                | +   select() on pipe: data available
               |                | +   call gdm_slave_handle_usr2_message()
write to pipe  |                |
send SIGUSR2   |                |
               |                | + + receive SIGUSR2
               |                | + + enter gdm_slave_usr2_handler()
               |                | + + increment in_usr2_signal (now 2)
               |                | + + call gdm_slave_handle_usr2_message()
               |                | + + read from pipe; get 2 messages
               |                | + + handle messages
               |                | + + return from gdm_slave_handle_usr2_message()
               |                | + + exit gdm_slave_usr2_handler()
               |                | +   read from pipe  <-- BLOCKS

The moral of this is that if you have control jumping all over the place into and out of signal handlers, you cannot assume there is data available on an fd after select()ing on it; a signal handler might pop up and grab all the data before you can read it. The only solution to this is an atomic poll-and-read; lacking that (and if Unix has atomic poll-and-read it's news to me), set the fd to O_NONBLOCK.
Thanks for being so patient with me and for being willing to test your work. You are talking me into accepting your original patch, I think. Could you explain to me, though, how it deals with the fact that there's nothing left to read on the pipe the 2nd time through? It isn't clear to me how making the read nonblocking fixes this. Does it cause the read to simply return instead of hanging? And if we are confident that this change will not cause the slave to ever get a message intended for another slave, then I feel more comfortable accepting the patch.
> Could you explain to me, though, how it deals with the fact that there's
> nothing left to read on the pipe the 2nd time through? It isn't clear to me
> how making the read nonblocking fixes this. Does it cause the read to simply
> return instead of hanging?

Yes. From read(2):

       On error, -1 is returned, and errno is set appropriately.
       ...
ERRORS
       EAGAIN  Non-blocking I/O has been selected using O_NONBLOCK and no data
               was immediately available for reading.

> And if we are confident that this change will not cause the slave to
> ever get a message intended for another slave,

Yes. The messages are sent on a pipe (not a named pipe) that is created immediately prior to the slave being forked off; each slave has its own corresponding pipe; if the master daemon is to send a message to all slaves (notifying an update of a config setting, say) it writes the message to each pipe (and signals each slave) in turn.

To be honest, the more I read the code, the more I dislike it; most of the signal handlers don't look reentrant (things as simple as not using atomic increment). But a proper fix would mean a major rewrite (have the signal handlers atomically increment flags which are then handled and atomically decremented in a GLib mainloop, I suppose); I'm just happy if this particular hang is fixed.
Yes, you are not the first to complain about the ugliness of the signal handlers. I appreciate your help, and if you want to make more specific suggestions about how to clean them up, I'd be interested to hear them. To be honest, I don't really understand the problem with not using atomic increment, for example. Why is it necessary to change size_t to ssize_t in daemon/slave.c in your patch?
Well, on some architectures and in some conditions (PowerPC, for example) incrementing a counter is a three-op sequence: read the memory location, increment, write back. If a signal handler modifies the counter after the main thread has read it, then once the signal handler exits the main thread will increment and write back the old value, so the change the signal handler made will be lost. Using atomic operations (g_atomic_int_inc()) will prevent this happening.

Hm - actually, using atomic increment isn't necessary here, because the signal handlers pair increment with decrement, so the counters are unchanged when the signal handler exits. OK, forget about that. Signal handling is a special case of concurrency, because the execution threads are nested instead of interleaved.

Here's a genuine bug: in misc.c, the paired gdm_sig{chld,term,usr2}_block_{push,pop} functions are race-prone:

void
gdm_sigusr2_block_push (void)
{
	sigusr2_blocked++;

	if (sigusr2_blocked == 1) {
		/* Set signal mask */
		sigemptyset (&sigusr2block_mask);
		sigaddset (&sigusr2block_mask, SIGUSR2);
		sigprocmask (SIG_BLOCK, &sigusr2block_mask, &sigusr2block_oldmask);
	}
}

void
gdm_sigusr2_block_pop (void)
{
	sigusr2_blocked--;

	if (sigusr2_blocked == 0) {
		/* Reset signal mask back */
		sigprocmask (SIG_SETMASK, &sigusr2block_oldmask, NULL);
	}
}

After gdm_sigusr2_block_pop decrements sigusr2_blocked to 0, if a signal handler calls gdm_sigusr2_block_push at that point then sigusr2block_oldmask will be set to the mask with SIGUSR2 blocked; when gdm_sigusr2_block_pop is called in the signal handler it will restore the oldmask with SIGUSR2 blocked, and so will the outer gdm_sigusr2_block_pop. The result is that SIGUSR2 remains blocked for the lifetime of the process.

Also, if gdm_sigusr2_block_push is called by a signal handler immediately after gdm_sigusr2_block_push in the main context increments sigusr2_blocked to 1, that signal handler will believe SIGUSR2 is blocked when in fact the main gdm_sigusr2_block_push has not yet blocked it. The result is that the signal handler executes code without SIGUSR2 blocked that should only be executed with SIGUSR2 blocked.

The code needs to satisfy the following conditions:
1. When gdm_sigusr2_block_push returns: SIGUSR2 is blocked.
2. When gdm_sigusr2_block_pop returns: the signal mask is what it was before the corresponding gdm_sigusr2_block_push call.

Environment:
1. push/pop calls come paired.
2. A paired push/pop call can occur between any two machine instructions.

Fix:

void
gdm_sigusr2_block_push (void)
{
	sigset_t oldmask;
	gboolean outermost = (sigusr2_blocked == 0);

	if (outermost) {
		/* Set signal mask, saving the previous mask */
		sigemptyset (&sigusr2block_mask);
		sigaddset (&sigusr2block_mask, SIGUSR2);
		sigprocmask (SIG_BLOCK, &sigusr2block_mask, &oldmask);
	}

	sigusr2_blocked++;

	if (outermost) {
		/* Only the outermost push records the pre-block mask */
		sigusr2block_oldmask = oldmask;
	}
}

void
gdm_sigusr2_block_pop (void)
{
	sigset_t oldmask;

	/* Snapshot the saved mask before decrementing, so a nested
	   push/pop from a signal handler cannot clobber it under us */
	oldmask = sigusr2block_oldmask;
	sigusr2_blocked--;

	if (sigusr2_blocked == 0) {
		/* Reset signal mask back */
		sigprocmask (SIG_SETMASK, &oldmask, NULL);
	}
}

Other signal handler issues: OK, gdm_slave_xfsz_handler is reentrant; gdm_slave_term_handler looks OK; gdm_slave_alrm_handler is probably OK; gdm_slave_child_handler seems OK (it reads slave_waitpids, but SIGCHLD is blocked when slave_waitpids is modified). gdm_slave_usr2_handler calls gdm_slave_handle_usr2_message, which is non-reentrant: first of all, it reads from slave_notify_fd, which as discussed above can block unless the fd is set to O_NONBLOCK.
Also, it calls g_list_append, which is non-reentrant; if a signal handler calls g_list_append, interrupting another call to g_list_append, then the signal handler's item can be lost:

	last = g_list_last (list);   <---- if a signal handler is called here, its item is lost
	last->next = new_list;
	new_list->prev = last;

Also, it allocates and frees memory (gdm_ack_response), which is non-reentrant (malloc keeps a static linked list of free memory; if it is reentered, then bad things happen). This is admitted at the top of slave.c:

/* Yay thread unsafety */
static gboolean x_error_occurred = FALSE;
static gboolean gdm_got_ack = FALSE;
static char * gdm_ack_response = NULL;
static GList *unhandled_notifies = NULL;

Changes: I would make the SIGUSR2 handler just set (increment?) a flag and let the main context handle reading from the fd.

ssize_t: POSIX read has the signature

	ssize_t read(int fd, void *buf, size_t count);

size_t is unsigned; read returns -1 on error. ssize_t is signed.
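For what it's worth, a rough sketch of that "set a flag in the handler, do the work in the main context" idea, using the classic self-pipe trick with a GLib watch (illustrative only; none of this is gdm code and all the names are made up):

#include <glib.h>
#include <unistd.h>
#include <signal.h>

static int wakeup_pipe[2];	/* [1] written from the handler, [0] watched by the main loop */

static void
usr2_handler (int sig)
{
	char byte = 'u';
	/* write() is async-signal-safe; all other work is deferred */
	write (wakeup_pipe[1], &byte, 1);
}

static gboolean
on_wakeup (GIOChannel *source, GIOCondition cond, gpointer data)
{
	char buf[16];

	read (wakeup_pipe[0], buf, sizeof (buf));	/* drain the wakeups */
	/* ... now read and dispatch notify messages outside any signal handler ... */
	return TRUE;	/* keep the watch installed */
}

static void
setup_usr2_wakeup (void)
{
	pipe (wakeup_pipe);
	g_io_add_watch (g_io_channel_unix_new (wakeup_pipe[0]),
	                G_IO_IN, on_wakeup, NULL);
	signal (SIGUSR2, usr2_handler);
}

The handler only writes a byte; g_list_append, malloc/free and the actual pipe protocol all run from the main loop, so none of the reentrancy issues above apply.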
Thanks, Ed, for helping so much. I applied your patch and the fix to the sigusr2 pop/push signal handlers. I'm changing the synopsis of the bug to reflect that the gdm_slave_usr2_handler has issues.
I think there may be a problem in USR1 handling as well, although it may be a different problem. I just found my gdm deadlocked after sending USR1 to it, with the following stack trace:
+ Trace 212039
I know that I don't have debugging symbols installed, but I don't think that the signal handler should be calling XSync(). That's not safe to do in a signal handler. About the only thing that is safe to do is to set a flag. This is on Ubuntu Hardy Heron (8.04.2) with gdm_2.20.5-0ubuntu3_i386.deb.
Thanks for taking the time to report this bug. However, you are using a version that is too old and not supported anymore. GNOME developers are no longer working on that version, so unfortunately there will not be any bug fixes for the version that you use. By upgrading to a newer version of GNOME you could receive bug fixes and new functionality. You may need to upgrade your Linux distribution to obtain a newer version of GNOME. Please feel free to reopen this bug if the problem still occurs with a newer version of GNOME.
For the record: the issues discussed in comment 23 are still apparent (as of 2.20.10); in particular gdm_slave_usr2_handler is still not reentrant.