Bug 336549 - gdm_slave_usr2_handler is non-reentrant
Status: RESOLVED OBSOLETE
Product: gdm
Classification: Core
Component: general
Version: 2.14.x
OS/Platform: Other / All
Importance: High critical
Target Milestone: ---
Assigned To: GDM maintainers
QA Contact: GDM maintainers
Depends on:
Blocks:
Reported: 2006-03-29 21:36 UTC by Ed Catmur
Modified: 2010-06-06 23:41 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gdm-hang.patch (596 bytes, patch)
2006-04-26 01:25 UTC, Ed Catmur
committed Details | Review
gdm-debug.patch (683 bytes, patch)
2006-05-03 01:08 UTC, Ed Catmur
none Details | Review
gdm-debug.patch (863 bytes, patch)
2006-05-03 01:26 UTC, Ed Catmur
none Details | Review
gdm-debug.patch (867 bytes, patch)
2006-05-03 02:31 UTC, Ed Catmur
reviewed Details | Review
gdm.log (86.73 KB, text/plain)
2006-05-03 02:36 UTC, Ed Catmur
  Details
gdm-slave-poll.patch (617 bytes, patch)
2006-05-03 03:12 UTC, Ed Catmur
reviewed Details | Review

Description Ed Catmur 2006-03-29 21:36:18 UTC
I don't know how it happened - perhaps a bug in gdmsetup? - but I managed to find gdm (the slave process) blocked in gdm_slave_handle_usr2_message:

(gdb) bt
#0  __kernel_vsyscall
#1  __write_nocancel () from /lib/libc.so.6
#2  gdm_slave_handle_usr2_message () at slave.c:4571
#3  gdm_slave_usr2_handler () at slave.c:4639
#4  <signal handler called>
#5  __kernel_vsyscall
#6  __pselect () at pselect.c:46
#7  gdm_slave_session_start () at slave.c:4114
#8  gdm_slave_run () at slave.c:1494
#9  gdm_slave_start () at slave.c:839
#10 gdm_display_manage () at display.c:320
#11 mainloop_sig_callback () at gdm.c:965

Obviously this is replicable by sending SIGUSR2 to the slave gdm process and not writing to its notify fd; I'm not in the habit of sending signals to random processes, so I'm certain a gdmsetup bug must have caused this.

There's no immediate sign that anything's wrong; some time after the slave process gets blocked (enough time to fill up the stderr pipe that gets fed to .xsession-errors) apps start hanging whenever they write to stderr, because the slave process isn't reading from the stderr pipe.

Yes, this can only be triggered by a bug in gdmsetup or by playing with kill, but the effects are nasty enough that it should be guarded against anyway.

A possible fix would be to make gdm_slave_handle_usr2_message() use non-blocking read on d->slave_notify_fd, or maybe select().
Comment 1 Brian Cameron 2006-03-31 02:56:51 UTC
I'd like to have a better understanding of what is causing this problem.  Can you replicate the problem in gdmsetup?  If so, could you turn on debug and send me the log so I can see why the slave is being sent a SIGUSR2 signal without anything being written to the socket?
Comment 2 Ed Catmur 2006-03-31 05:16:01 UTC
/me dances :D-<
/me dances :D|-<
/me dances :D/-<

Yes, I've found a way to replicate it. In gdmsetup, in the Accessibility tab, toggle off "Enable accessible login". It only happens occasionally even by doing that, but I have managed to replicate it at least three times; it may help to have a high system load at the time.

Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/AddGtkModules'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/GtkModulesList'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/PidFile'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'UPDATE_CONFIG daemon/GtkModulesList='
Mar 31 06:01:31 capella gdm[14638]: set config key daemon/GtkModulesList= to string <NULL>
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'GET_CONFIG daemon/PidFile'
Mar 31 06:01:31 capella gdm[14638]: Handling user message: 'UPDATE_CONFIG daemon/AddGtkModules=false'
Mar 31 06:01:31 capella gdm[14638]: set config key daemon/AddGtkModules=false to boolean false
Mar 31 06:01:31 capella gdm[14638]: gdm_socket_handler: Accepting new connection fd 9

What I think is happening is a race in notify_displays_{int,string}: they write to the slave fd and then send USR2. If two keys are updated at the same time you get:

gdmsetup          gdm slave
---------------|------------
write key1
send USR2
                  get USR2
write key2
                  read key1+key2
send USR2
                  get USR2
                  read           <-- blocks

Does this sound plausible?
Comment 3 Brian Cameron 2006-04-05 17:10:49 UTC
Yes, I think you have correctly diagnosed the problem.  I think this bug has been fixed in CVS head and would appreciate if you can test it.

Now GDM slaves call "CLOSE" on the connection each time they send data to the daemon, so key2 should still be pending when the 2nd USR2 signal happens.

I've tested turning on/off a11y in gdmsetup to test this and it seems to work okay in CVS head, can you verify?  Thanks!
Comment 4 Ed Catmur 2006-04-13 17:29:04 UTC
I've installed 2.14.1 and I can still reproduce the bug. Actually, I can't see anywhere in the code that CLOSE is called on the connection.
Comment 5 Brian Cameron 2006-04-13 22:05:49 UTC
I'm sorry to hear that 2.14.1 still has the problem.  Note in the function
gdmcomm_call_gdm_real in the file gui/gdmcomm.c, right at the end it calls
CLOSE if bulk_acs is FALSE.  The only time bulk_acs is TRUE is if the client
calls gdmcomm_comm_bulk_start/gdmcomm_comm_bulk_stop which is done around areas
where the client reads a lot of config data at once.  This allows the client to use a single connection to get all the config data and is faster.

However, by the time gdmsetup is done initializing and before it shows the screen, it should have called gdm_comm_bulk_stop already and so the logic should be calling close after each config read.  

If you turn on Enable=true in the [debug] section of the config file, then you should see debug messages go to your syslog (/var/log/messages).  You should see each connection open, read a config value, and then close.  It shouldn't try to open the second connection before the first one gets data back.

Could you try this and share the syslog output?  It would help me see what may be going wrong with the socket connection here.


Comment 6 Ed Catmur 2006-04-14 06:28:45 UTC
Here's what I get in syslog toggling accessibility on:

Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/AddGtkModules'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/GtkModulesList'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/PidFile'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'UPDATE_CONFIG daemon/GtkModulesList='
Apr 14 07:19:47 capella gdm[4680]: set config key daemon/GtkModulesList= to string <NULL>
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 8
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'GET_CONFIG daemon/PidFile'
Apr 14 07:19:47 capella gdm[4680]: gdm_socket_handler: Accepting new connection fd 10
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'VERSION'
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'UPDATE_CONFIG daemon/AddGtkModules=false'
Apr 14 07:19:47 capella gdm[4680]: set config key daemon/AddGtkModules=false to boolean false
Apr 14 07:19:47 capella gdm[4680]: Handling user message: 'CLOSE'
Apr 14 07:19:47 capella gdm[4681]: Handling slave notify: 'GtkModulesList '
Apr 14 07:19:47 capella gdm[4681]: set config key daemon/GtkModulesList= to string <NULL>
Apr 14 07:19:47 capella gdm[4681]: Handling slave notify: 'AddGtkModules 0'
Apr 14 07:19:47 capella gdm[4681]: set config key daemon/AddGtkModules=false to boolean false

Also: I attached gdb to the slave pid (4681) and set it to report some info:

(gdb) handle SIGUSR2 nostop print pass
Signal        Stop      Print   Pass to program Description
SIGUSR2       No        Yes     Yes             User defined signal 2
(gdb) b gdm_slave_handle_usr2_message
Breakpoint 1 at 0x805e2b0: file slave.c, line 4763.
(gdb) comm
Type commands for when breakpoint 1 is hit, one per line.
End with a line saying just "end".
>c
>end
(gdb) b slave.c:4764
Breakpoint 2 at 0x805e2db: file slave.c, line 4764.
(gdb) comm
Type commands for when breakpoint 2 is hit, one per line.
End with a line saying just "end".
>p buf
>c
>end
(gdb) c
Continuing.

Program received signal SIGUSR2, User defined signal 2.

Breakpoint 1, gdm_slave_handle_usr2_message () at slave.c:4763
4763    in slave.c

Breakpoint 2, gdm_slave_handle_usr2_message () at slave.c:4764
4764    in slave.c
$3 = "!GtkModulesList gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener\n!AddGtkModules 1\n\016\000\000\002\000\000\000\002\000\000\000\b\000\000\000\000\000\000\000\222\204þC\000\000\000\000\222\204þC\000\000\000\000ª\204þC\000\000\000\000¹\204\004\000\000\000\000\000lH²¿ßM]·"...

Program received signal SIGUSR2, User defined signal 2.

Breakpoint 1, gdm_slave_handle_usr2_message () at slave.c:4763
4763    in slave.c

Program received signal SIGINT, Interrupt.
0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  __kernel_vsyscall
#1  __read_nocancel () from /lib/libc.so.6
#2  gdm_slave_handle_usr2_message () at slave.c:4763
#3  gdm_slave_usr2_handler () at slave.c:4831
#4  <signal handler called>
#5  __kernel_vsyscall
#6  ___newselect_nocancel () from /lib/libc.so.6
#7  slave_waitpid () at slave.c:388
#8  gdm_slave_session_start () at slave.c:4306
#9  gdm_slave_run () at slave.c:1529
#10 gdm_slave_start () at slave.c:851
#11 gdm_display_manage () at display.c:320
#12 main () at gdm.c:265
#13 __libc_start_main () at libc-start.c:238
#14 _start ()

The second breakpoint is just after the read call the slave hangs in. I think the contents of the buffer after the read the first time through demonstrate that there is a race between the parent and slave gdm processes.

A useful fact: attaching to the slave gdm process in gdb slows it down sufficiently that the race succeeds over half of the time.
Comment 7 Brian Cameron 2006-04-14 21:26:47 UTC
Okay I understand what's going on now.  The problem looks more like this:

gdmsetup            gdm daemon             gdm slave
-------------------|----------------------|----------------
write key1
call UPDATE_CONFIG
                   receive UPDATE_CONFIG
                   send USR2
                                            
write key2
call UPDATE_CONFIG
                   receive UPDATE_CONFIG
                   send USR2
                                            Receive first USR2
                                            Receive second USR2 & block on read

I suspect what is happening is the slave reads both "update notifications"
when it receives the first USR2 so when it gets into the 2nd USR2 there is 
nothing more to read from the daemon, causing the hang.  Could you verify this?
Perhaps add a gdm_debug() message to print out the contents of "buf" to the
syslog.  I'm betting in the error condition, both messages will be received
together the first time USR2 is received.

I think this has to be the case, otherwise the syslog you posted wouldn't be reporting that it is "handling" both of them.  

I agree the appropriate fix would be to fix the read to be nonblocking and
to make sure that the usr2_handler in the slave gracefully handles this
situation, probably just by doing nothing if there is nothing to read on the
pipe.

Would you be willing to provide a patch to fix the management of this pipe
to be nonblocking?


Comment 8 Ed Catmur 2006-04-26 01:25:40 UTC
Created attachment 64297 [details] [review]
gdm-hang.patch

Right, yeah. I didn't really understand gdm's internal architecture, but your diagram makes sense.

> Perhaps add a gdm_debug() message to print out the contents of "buf" to the
> syslog.  I'm betting in the error condition, both messages will be received
> together the first time USR2 is received.

That's what the gdb breakpoint command list above does.

$3 = "!GtkModulesList
gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener\n!AddGtkModules
1\n\016\000\000\002\000\000\000\002\000\000\000\b\000\000\000\000\000\000\000\222\204þC\000\000\000\000\222\204þC\000\000\000\000ª\204þC\000\000\000\000¹\204\004\000\000\000\000\000lH²¿ßM]·"...

i.e. buf contains a !GtkModulesList message and a !AddGtkModules message.

This patch sets slave_notify_fd to be nonblocking. It also changes count to ssize_t (signed; read() returns -1 on error).
Comment 9 Brian Cameron 2006-05-02 05:52:48 UTC
I worry that this may not be the right fix.  Have you tested that GDM works properly when the code enters this state and needs to read in non-blocking mode, and verified that everything works?

If it opens non-blocking, wouldn't it be possible for one slave to get a message that was intended for another slave, or the wrong message?  Basically, I worry that we may just be hiding a deeper problem with this fix.

We should discuss whether it would be good to fix the code so it never enters a state of needing to open the socket in non-blocking mode.  I don't think we yet have an understanding of the problem to decide this.

If my diagram above is correct, then I'm confused about why the problem is happening.  Perhaps the problem is that when the slave receives the first USR2, it accepts and processes both requests, and the second USR2 hangs because there is nothing more to read and it just waits.  If this is the case, then perhaps the slaves need to be made smarter and only read one message from the queue each time they get a USR2 signal.  Or perhaps we could make the slave send a message back to the daemon when it is done processing a USR2 request (or after it has read the data from the socket), and have the daemon hold off on sending the next request until it gets confirmation that the last one was processed.  Also, we could put a delay in gdmsetup so it waits a bit between sending two updates in a row.  This last idea doesn't really fix the race condition but hides it a bit.

I hope that you and I, through this bug report discussion, can work through the analysis of the code so that we make the right change.  I'm happy to help.

Perhaps we could put some debug code in the slave so that we can get a clearer picture of what is going on.  Is the code processing the two messages with one USR2 signal?  If so, does hacking the code so it doesn't process more than one at a time fix the problem as well?
Comment 10 Ed Catmur 2006-05-03 01:07:52 UTC
> I worry that this may not be the right fix.  Have you tested that GDM works
> properly when the code enters this state of needing to open in non-blocking
> mode and verify that everything works properly?
As far as I can. gdm doesn't have a test suite, and I'm unable to simulate heavy load.

> If it opens non-blocking, wouldn't it be possible for one slave to get a
> message that was intended for another slave, or the wrong message?  Basically,
> I worry that we may just be hiding a deeper problem with this fix.
No; the fd is the [0] end of a pipe. The pipe is created just before fork, and the [1] end is assigned to the master_notify_fd member of the GdmDisplay * corresponding to the slave in question. See gdm_display_manage() in daemon/display.c.

> Perhaps the problem is that when the slave receives the first USR2, it accepts
> and processes both requests, and the second USR2 hangs because there is nothing
> more to read and it just waits.
Yes; that's what's happening.

> If this is the case, then perhaps the slaves
> need to be made smarter and only read one message from the queue each time it
> gets a USR2 signal.  
Quickly grepping the code, there don't seem to be any cases of more than one message being sent at a time.  However, gdm_slave_handle_usr2_message() splits on \n, so the machinery to handle multiple messages is there; messing with it might not be the best idea.  Anyway, reading one message at a time would mean reading the fd one byte at a time, which is very inefficient.

> Or perhaps we could do something like make the slave send
> a message back to the daemon when it is done processing a USR2 request (or
> after it has read the data from the socket), and the daemon will hold off on
> sending the next request until it gets the confirmation it processed the last
> one.  
Seems a bit over-engineered.

> Also we could put a delay in gdmsetup so it waits a bit between sending
> two updates in a row.  This last idea doesn't really fix the race condition 
> but hides it a bit.
Ugh. If you do that, don't let my name appear in the commit message.

To be honest, I don't want to touch the GDM code more than necessary; it's a bit of a tangled mess. Yeah, signals, I know.

I've written some debug code and I've got a log. I'll attach them.
Comment 11 Ed Catmur 2006-05-03 01:08:59 UTC
Created attachment 64711 [details] [review]
gdm-debug.patch

Some debugging log statements.
Comment 12 Ed Catmur 2006-05-03 01:26:37 UTC
Created attachment 64713 [details] [review]
gdm-debug.patch

Second try.
Comment 13 Brian Cameron 2006-05-03 01:35:41 UTC
Let's keep thinking about this.  This is a tricky bit of code, as you notice, so we want to make sure we fix it properly.  After all, we are fixing a bug that only seems to show up if users click too frequently on certain buttons in gdmsetup, which probably doesn't happen too often.  Still, hanging is bad, so we should fix it...but we don't want to introduce any worse problems.

Could we do something to the slave, do you think, to make it abort the read if it doesn't get a message in, say, 5 seconds?  A timeout, so the slave will just abort waiting on the pipe after a while?  That might fix the problem more elegantly than making the slave and daemon handshake more.
Comment 14 Brian Cameron 2006-05-03 01:37:11 UTC
Another idea may be to hack the slave to read multiple messages from the pipe, but stick any extra messages it reads on an internal queue.  Then when it gets called a second time, it would just read off the internal queue rather than trying to read over the pipe.
Comment 15 Ed Catmur 2006-05-03 02:31:57 UTC
Created attachment 64715 [details] [review]
gdm-debug.patch

3rd time lucky...
Comment 16 Ed Catmur 2006-05-03 02:36:39 UTC
Created attachment 64717 [details]
gdm.log

Action to generate this log:

1. Patch and compile gdm with above debugging patch
2. Enable debugging in gdm.conf
3. Restart gdm, log in
4. Place gdm slave process under heavy load, or simulate by attaching debugger (making sure to pass signals)
5. In gdmsetup, turn on accessibility. Maybe toggle a few times.
6. Watch syslog.

End of log (relevant part)

May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: about to read from slave fd
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: read 141 bytes from slave fd
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: message is:
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 0>: !GtkModulesList gail:atk-bridge:/usr/lib/gtk-2.0/modules/libkeymouselistener:/usr/lib/gtk-2.0/modules/libdwellmouselistener
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 1>: !AddGtkModules 1
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: <line 2>:
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: end of message
May  3 03:28:38 minerva gdm[8416]: (child 8417) gdm_slave_handle_usr2_message on display :0: about to read from slave fd

At this point, the slave gdm process is hung.
Comment 17 Ed Catmur 2006-05-03 03:00:55 UTC
> Could we do something to the slave, do you think, to make it abort the read if
> it doesn't get a message in, say, 5 seconds.  A timeout, so the slave will just
> abort waiting on the pipe after a while?  That might fix the problem more
> elegantly than making the slave and daemon handshake more. 
No point: the parent process (the main gdm daemon) sends SIGUSR2 *after* it writes to the pipe. AFAIK Unix semantics preserve temporal ordering of inter-process communication.

> Another idea may be to hack the slave to read multiple messages from the pipe,
> but stick any extra messages it reads on an internal queue.  Then when it gets
> called a second time, it would just read off the internal queue rather than
> trying to read over the pipe.
I don't like it. Currently the code allows for sending multiple messages in one go. (It's not used, but that might change.) Also, an internal queue seems excessive state for a fairly simple one-way message passing protocol.

I do think the patch I posted is close to the correct solution. If the problem is that setting O_NONBLOCK is too global in its effects, we could poll (or select() with a zero timeval on) the read end of the pipe prior to calling gdm_slave_handle_usr2_message(). Note that gdm_slave_send() select()s on the read end of the pipe before calling gdm_slave_handle_usr2_message(), so semantics would be preserved.
Comment 18 Ed Catmur 2006-05-03 03:12:02 UTC
Created attachment 64718 [details] [review]
gdm-slave-poll.patch

Select() on the notify fd.

I'm not entirely sure about this, though; what happens if a signal arrives between the select() call (at which point the pipe can be read without blocking) and the read() call in gdm_slave_handle_usr2_message()? (Yes, it's a very thin race. So what?) The second SIGUSR2 handler would read the lot: (gdmsetup omitted from diagram)

Main daemon      |Slave process
-----------------|-------------
write to pipe    
send SIGUSR2   
                  receive SIGUSR2
                  enter gdm_slave_usr2_handler()
                  poll pipe; data available
                  call gdm_slave_handle_usr2_message()
write to pipe
send SIGUSR2
                  + receive SIGUSR2
                  + enter gdm_slave_usr2_handler()
                  + poll pipe; data available
                  + call gdm_slave_handle_usr2_message()
                  + read from pipe; get 2 messages
                  + handle messages
                  + return from gdm_slave_handle_usr2_message()
                  + exit gdm_slave_usr2_handler()
                  read from pipe        <-- BLOCKS
Comment 19 Ed Catmur 2006-05-03 03:43:17 UTC
Hm. I think the use of select() on the notify fd in the current code is also vulnerable to this race. Note that gdm_slave_handle_usr2_message() is also called from gdm_slave_send(), and gdm_slave_send() can be called from within a signal handler, e.g. from gdm_slave_child_handler() (which calls gdm_slave_send_num()).

Main daemon      |Children, etc.|Slave process
-----------------|--------------|-------------
write to pipe
send SIGUSR2
                                 receive SIGUSR2
                                 enter gdm_slave_usr2_handler()
                                 increment in_usr2_signal (now 1)
                                 call gdm_slave_handle_usr2_message()
                  die (SIGCHLD)
                                 + receive SIGCHLD
                                 + enter gdm_slave_child_handler()
                                 + call gdm_slave_send_num()
                                 + call gdm_slave_send()
                                 + if (in_usr2_signal > 0) {
                                 + select() on pipe: data available
                                 + call gdm_slave_handle_usr2_message()
write to pipe
send SIGUSR2
                                 + + receive SIGUSR2
                                 + + enter gdm_slave_usr2_handler()
                                 + + increment in_usr2_signal (now 2)
                                 + + call gdm_slave_handle_usr2_message()
                                 + + read from pipe; get 2 messages
                                 + + handle messages
                                 + + return from gdm_slave_handle_usr2_message()
                                 + + exit gdm_slave_usr2_handler()
                                 + read from pipe        <-- BLOCKS

The moral of this is that if you have control jumping all over the place into and out of signal handlers, you cannot assume there is data available on a fd after select()ing on it; a signal handler might pop up and grab all the data before you can read it.

The only solution to this is an atomic poll-and-read; lacking that (and if Unix has atomic poll-and-read it's news to me), set the fd to O_NONBLOCK.
Comment 20 Brian Cameron 2006-05-03 04:15:08 UTC
Thanks for being so patient with me and for being willing to test your work.  You are talking me into accepting your original patch, I think.  Could you explain to me, though, how it deals with the fact that there's nothing left to read on the pipe the 2nd time through?  It isn't clear to me how making the read nonblocking fixes this.  Does it cause the read to simply return instead of hanging?  And if we are confident that this change will not cause the slave to ever get a message intended for another slave, then I'll feel more comfortable accepting the patch.
Comment 21 Ed Catmur 2006-05-03 13:58:42 UTC
> Could you explain to me, though, how it deals with the fact that there's 
> nothing left to read on the pipe the 2nd time through.  It isn't clear to me 
> how making the read nonblock fixes this.  Does it cause the read to simply 
> return instead of hanging?
Yes. From read(2):

       On error, -1 is returned, and errno is set appropriately.
...
ERRORS
       EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and no data
              was immediately available for reading.

> And if we are confidant that this change will not cause the slave to
> ever get a message intended for another slave,
Yes. The messages are sent on a pipe (not a named pipe) that is created immediately prior to the slave being forked off; each slave has its own corresponding pipe; if the master daemon is to send a message to all slaves (notifying update of a config setting, say) it writes the message to each pipe (and signals each slave) in turn.

To be honest, the more I read the code, the more I dislike it; most of the signal handlers don't look reentrant (things as simple as not using atomic increment). But a proper fix would mean a major rewrite (have the signal handlers atomically increment flags which then are handled and atomically decremented in a GLib mainloop, I suppose); I'm just happy if this particular hang is fixed.
Comment 22 Brian Cameron 2006-05-03 17:07:23 UTC
Yes, you are not the first to complain about the ugliness of the signal handlers.  I appreciate your help, and if you want to make more specific suggestions about how to clean them up, I'd be interested to hear them.  To be honest, I don't really understand the problem with using atomic increment, for example.

Why is it necessary to change size_t to ssize_t in daemon/slave.c in your patch?

 
Comment 23 Ed Catmur 2006-05-04 15:42:34 UTC
Well, on some architectures and in some conditions (PowerPC, for example) incrementing a counter takes three operations: read the memory location, increment, write back. If a signal handler modifies the counter after the main thread has read it, then once the signal handler exits the main thread will increment and write back the old value, so the change the signal handler made will be lost. Using atomic operations (g_atomic_int_inc()) would prevent this.

Hm - actually, using atomic increment isn't necessary, because the signal handlers pair increment with decrement, so the counters are unchanged when the signal handler exits. OK, forget about that. Signal handling is a special case of concurrency, because the execution threads are nested instead of interleaved.

Here's a genuine bug: in misc.c, the paired gdm_sig{chld,term,usr2}_block_{push,pop} functions are race-prone:

void
gdm_sigusr2_block_push (void)
{
	sigusr2_blocked++;

	if (sigusr2_blocked == 1) {
		/* Set signal mask */
		sigemptyset (&sigusr2block_mask);
		sigaddset (&sigusr2block_mask, SIGUSR2);
		sigprocmask (SIG_BLOCK, &sigusr2block_mask, &sigusr2block_oldmask);
	}
}

void
gdm_sigusr2_block_pop (void)
{
	sigusr2_blocked --;

	if (sigusr2_blocked == 0) {
		/* Reset signal mask back */
		sigprocmask (SIG_SETMASK, &sigusr2block_oldmask, NULL);
	}
}

After gdm_sigusr2_block_pop decrements sigusr2_blocked to 0, if a signal handler calls gdm_sigusr2_block_push at that point then sigusr2block_oldmask will be set to the mask with SIGUSR2 blocked; when gdm_sigusr2_block_pop is called in the signal handler it will restore the oldmask with SIGUSR2 blocked and so will the outer gdm_sigusr2_block_pop. The result will be that SIGUSR2 remains blocked for the lifetime of the process.

Also, if gdm_sigusr2_block_push is called by a signal handler immediately after gdm_sigusr2_block_push in the main context increments sigusr2_blocked to 1, that signal handler will believe SIGUSR2 is blocked when in fact the main gdm_sigusr2_block_push has not yet blocked it. The result will be that the signal handler executes code without SIGUSR2 blocked that should only be executed with SIGUSR2 blocked.

The code needs to satisfy the following conditions:
1. When gdm_sigusr2_block_push returns: SIGUSR2 is blocked.
2. When gdm_sigusr2_block_pop returns: the signal mask is what it was before the corresponding gdm_sigusr2_block_push call.

Environment:
1. push/pop calls come paired.
2. A paired push/pop call can occur between any two machine instructions.

Fix:

void
gdm_sigusr2_block_push (void)
{
	sigset_t oldmask;

	if (sigusr2_blocked == 0) {
		/* Set signal mask */
		sigemptyset (&sigusr2block_mask);
		sigaddset (&sigusr2block_mask, SIGUSR2);
		sigprocmask (SIG_BLOCK, &sigusr2block_mask, &oldmask);
		/* Save the old mask only when we actually changed it;
		   on a nested push, oldmask is uninitialized */
		sigusr2block_oldmask = oldmask;
	}

	sigusr2_blocked++;
}

void
gdm_sigusr2_block_pop (void)
{
	sigset_t oldmask;

	/* Take a local copy before decrementing, so a nested push/pop
	   pair in a signal handler cannot clobber the mask we restore */
	oldmask = sigusr2block_oldmask;

	sigusr2_blocked--;

	if (sigusr2_blocked == 0) {
		/* Reset signal mask back */
		sigprocmask (SIG_SETMASK, &oldmask, NULL);
	}
}

Other signal handler issues: gdm_slave_xfsz_handler is reentrant; gdm_slave_term_handler looks OK; gdm_slave_alrm_handler is probably OK; gdm_slave_child_handler seems OK (it reads slave_waitpids, but SIGCHLD is blocked whenever slave_waitpids is modified).

gdm_slave_usr2_handler calls gdm_slave_handle_usr2_message, which is non-reentrant. First, it reads from slave_notify_fd, which, as discussed above, can block unless the fd is set to O_NONBLOCK. Second, it calls g_list_append, which is non-reentrant: if a signal handler's call to g_list_append interrupts another call to g_list_append, the signal handler's item can be lost:

      last = g_list_last (list);
         <---- if signal handler called here, signal handler's item is lost
      last->next = new_list;
      new_list->prev = last;

Also, it allocates and frees memory (gdm_ack_response), and malloc/free are non-reentrant: malloc maintains internal free-list state, and a handler that re-enters it mid-update can corrupt the heap or deadlock.

This is admitted at the top of slave.c:

/* Yay thread unsafety */
static gboolean x_error_occurred = FALSE;
static gboolean gdm_got_ack = FALSE;
static char * gdm_ack_response = NULL;
static GList *unhandled_notifies = NULL;

Changes: I would make the SIGUSR2 handler just set (or increment) a flag and let the main context handle reading from the fd.

ssize_t: POSIX read has the signature

       ssize_t read(int fd, void *buf, size_t count);

size_t is unsigned, while read returns -1 on error; hence the signed return type ssize_t. Storing the result in a size_t (or any unsigned variable) would make the error check n < 0 never fire.
Comment 24 Brian Cameron 2006-05-04 22:53:17 UTC
Thanks, Ed, for helping so much.  I applied your patch and the fix to the sigusr2 push/pop functions.  I'm changing the synopsis of the bug to reflect that gdm_slave_usr2_handler has issues.
Comment 25 Chris Wilson 2009-01-30 18:12:27 UTC
I think there may be a problem in USR1 handling as well, although it may be a different problem. I just found my gdm deadlocked after sending USR1 to it, with the following stack trace:

  • #0 __kernel_vsyscall
  • #1 pthread_cond_wait
    from /lib/tls/i686/cmov/libpthread.so.0
  • #2 pthread_cond_wait
    from /lib/tls/i686/cmov/libc.so.6
  • #3 ??
    from /usr/lib/libxcb.so.1
  • #4 ??
    from /usr/lib/libxcb.so.1
  • #5 xcb_wait_for_reply
    from /usr/lib/libxcb.so.1
  • #6 _XReply
    from /usr/lib/libX11.so.6
  • #7 XSync
    from /usr/lib/libX11.so.6
  • #8 ??
  • #9 <signal handler called>
  • #10 __kernel_vsyscall
  • #11 select
    from /lib/tls/i686/cmov/libc.so.6
  • #12 ??
    from /usr/lib/libxcb.so.1
  • #13 xcb_wait_for_reply
    from /usr/lib/libxcb.so.1
  • #14 _XReply
    from /usr/lib/libX11.so.6
  • #15 XSync
    from /usr/lib/libX11.so.6
  • #16 ??
  • #17 ??
  • #18 ??
  • #19 ??
  • #20 ??
  • #21 ??
    from /usr/lib/libglib-2.0.so.0
  • #22 g_main_context_dispatch
    from /usr/lib/libglib-2.0.so.0
  • #23 ??
    from /usr/lib/libglib-2.0.so.0
  • #24 g_main_loop_run
    from /usr/lib/libglib-2.0.so.0
  • #25 ??
  • #26 __libc_start_main
    from /lib/tls/i686/cmov/libc.so.6
  • #27 ??

I know that I don't have debugging symbols installed, but I don't think the signal handler should be calling XSync; that is not safe to do in a signal handler. About the only thing that is safe to do is set a flag.

This is on Ubuntu Hardy Heron (8.04.2) with gdm_2.20.5-0ubuntu3_i386.deb.

Comment 26 William Jon McCann 2010-06-04 19:58:54 UTC
Thanks for taking the time to report this bug.
However, you are using a version that is too old and not supported anymore. GNOME developers are no longer working on that version, so unfortunately there will not be any bug fixes for the version that you use.

By upgrading to a newer version of GNOME you could receive bug fixes and new functionality. You may need to upgrade your Linux distribution to obtain a newer version of GNOME.
Please feel free to reopen this bug if the problem still occurs with a newer version of GNOME.
Comment 27 Ed Catmur 2010-06-06 23:41:32 UTC
For the record: the issues discussed in comment 23 are still apparent (as of 2.20.10); in particular gdm_slave_usr2_handler is still not reentrant.