GNOME Bugzilla – Bug 542391
libesd should not block if the daemon dies/hangs/whatever
Last modified: 2008-07-31 15:35:56 UTC
It seems that the esd daemon has recently been replaced on many systems with an unreliable PulseAudio daemon process that emulates esd. The problem I've been having a lot recently is that since pulseaudio likes to hang a lot, it causes my entire desktop to hang (can't logout, can't login, pidgin locks up if someone IM's me, firefox won't startup, or basically anything happens that would cause communication with the esound daemon process). I would suggest going through esdlib.c (and maybe other files?) and making sure that absolutely nothing will block for very long before timing out if it can't connect/write to the esd socket.
Created attachment 114325 [details] [review] esound-connect-timeout.patch This fixes connect() to not block indefinitely, but anything that reads or writes to the socket should really get fixed.
Created attachment 114348 [details] [review] esound-connect-and-auth-timeout.patch so I just had pulseaudio crap out on me again and tested out my patch... well, it got further than the last time I got a backtrace, but it got stuck in esd_send_auth() this time so I updated the patch to use timeouts for that too and then re-launched firefox and IT WORKED! IT BLOODY WORKED! HALLELUJAH!
Created attachment 114390 [details] [review] esound-timeout.patch this updated patch makes all of esdlib.c use read_timeout()/write_timeout() instead of read()/write() directly
committed
This change causes for me that a gnome session hangs on logout if no esd is running. gnome-panel gets into an endless loop, in libesd's write_timeout, where poll() returns 1 and write() returns EPIPE. Adding error handling like in read_timeout - see appended patch - fixes it for me. (though it didn't get a lot of testing yet)
Created attachment 115562 [details] [review] error handling for write_timeout
is this related to my problems ? http://bugzilla.gnome.org/show_bug.cgi?id=542698
> 542698 I don't think so. In my case there was no zombie involved. Just gnome-panel going into a loop. I'll try to reproduce your problem later. Are you noticing a process eating up CPU time, or blocking on an unusual wait channel?
Created attachment 115567 [details] [review] better-write-timeout.patch Matthias: I think your solution would work, but it depends on the lib catching SIGPIPE (which it probably does). I've attached what I think is a better solution. Could you test to make sure it solves the problem for you? It seems that on your system poll() is (probably) returning 1 and setting revents to POLLOUT | POLLHUP. Unlike in the read_timeout() scenario where we can read data even if POLLHUP is set, we can't actually /write/ data if POLLHUP has occurred. So... my fix is to check for the existence of POLLHUP or POLLERR /first/, if either of those is set, then abort the write(). Otherwise, try writing if POLLOUT is set.
Matthias, sometimes when I'm clicking logout/shutdown, nothing happens. I only suppose it's because of a zombie. That's quite a stupid issue and makes me very angry :/ .
> I've attached what I think is a better solution. Could you test to make sure it > solves the problem for you? Sorry, it doesn't help. poll() does only set POLLOUT. There are 2 sentences in the NetBSD manpage for poll(2) which might be of interest: Note that POLLHUP and POLLOUT should never be present in the revents bitmask at the same time. If the remote end of a socket is closed, poll() returns a POLLIN event, rather than a POLLHUP. Imo it is the most robust solution to abort the loop on every write error other than EINTR.
ok, but in your case is it even setting the POLLIN bit on revents? I'm just trying to avoid a possible SIGPIPE
> in your case is it even setting the POLLIN bit on revents? Yes, just checked. The poll() sets POLLIN and POLLOUT. (It must be put into the input "events" bitmask before.) But I wouldn't take that as sole indication that the socket was closed -- that looks just too fragile. As a general philosophy, I'd take poll/select as a hint that it is worth looking at a file descriptor, but the real error handling needs to be done after the read/write calls. > I'm just trying to avoid a possible SIGPIPE I didn't track the recent changes too closely, but as it looks the SIGPIPE was triggered before you introduced the _noblock functions, and it gets caught in the calling functions, and it was fine all the time, so why worry?
we don't want to set POLLIN on events I've already committed the write error check
The code now in svn HEAD works well for me, thanks.
awesome :-)