After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 542391 - libesd should not block if the daemon dies/hangs/whatever
libesd should not block if the daemon dies/hangs/whatever
Status: RESOLVED FIXED
Product: esound
Classification: Deprecated
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Esound Maintainers
Esound Maintainers
Depends on:
Blocks:
 
 
Reported: 2008-07-10 16:28 UTC by Jeffrey Stedfast
Modified: 2008-07-31 15:35 UTC
See Also:
GNOME target: ---
GNOME version: 2.21/2.22


Attachments
esound-connect-timeout.patch (2.70 KB, patch)
2008-07-10 16:38 UTC, Jeffrey Stedfast
none Details | Review
esound-connect-and-auth-timeout.patch (5.54 KB, patch)
2008-07-10 19:58 UTC, Jeffrey Stedfast
none Details | Review
esound-timeout.patch (20.11 KB, patch)
2008-07-11 14:20 UTC, Jeffrey Stedfast
committed Details | Review
error handling for write_timeout (383 bytes, patch)
2008-07-30 14:43 UTC, Matthias Drochner
none Details | Review
better-write-timeout.patch (812 bytes, patch)
2008-07-30 15:33 UTC, Jeffrey Stedfast
committed Details | Review

Description Jeffrey Stedfast 2008-07-10 16:28:47 UTC
It seems that the esd daemon has recently been replaced on many systems with an unreliable PulseAudio daemon process that emulates esd.

The problem I've been having a lot recently is that since pulseaudio likes to hang a lot, it causes my entire desktop to hang (can't logout, can't login, pidgin locks up if someone IM's me, firefox won't startup, or basically anything happens that would cause communication with the esound daemon process).

I would suggest going through esdlib.c (and maybe other files?) and making sure that absolutely nothing will block for very long before timing out if it can't connect/write to the esd socket.
Comment 1 Jeffrey Stedfast 2008-07-10 16:38:41 UTC
Created attachment 114325 [details] [review]
esound-connect-timeout.patch

This fixes connect() to not block indefinitely, but anything that reads or writes to the socket should really get fixed.
Comment 2 Jeffrey Stedfast 2008-07-10 19:58:39 UTC
Created attachment 114348 [details] [review]
esound-connect-and-auth-timeout.patch

so I just had pulseaudio crap out on me again and tested out my patch... well, it got further than the last time I got a backtrace, but it got stuck in esd_send_auth() this time so I updated the patch to use timeouts for that too and then re-launched firefox and IT WORKED! IT BLOODY WORKED! HALLELUJAH!
Comment 3 Jeffrey Stedfast 2008-07-11 14:20:24 UTC
Created attachment 114390 [details] [review]
esound-timeout.patch

this updated patch makes all of esdlib.c use read_timeout()/write_timeout() instead of read()/write() directly
Comment 4 Jeffrey Stedfast 2008-07-15 16:23:27 UTC
committed
Comment 5 Matthias Drochner 2008-07-30 14:41:53 UTC
This change causes for me that a gnome session hangs on logout
if no esd is running. gnome-panel gets into an endless
loop, in libesd's write_timeout, where poll() returns 1
and write() returns EPIPE.
Adding error handling like in read_timeout - see appended patch -
fixes it for me. (though it didn't get a lot of testing yet)
Comment 6 Matthias Drochner 2008-07-30 14:43:01 UTC
Created attachment 115562 [details] [review]
error handling for write_timeout
Comment 7 Jakub 'Livio' Rusinek 2008-07-30 14:48:12 UTC
is this related to my problems ?
http://bugzilla.gnome.org/show_bug.cgi?id=542698
Comment 8 Matthias Drochner 2008-07-30 14:57:29 UTC
> 542698

I don't think so. In my case there was no zombie involved. Just
gnome-panel going into a loop.
I'll try to reproduce your problem later. Are you noticing a
process eating up CPU time, or blocking on an unusual wait
channel?
Comment 9 Jeffrey Stedfast 2008-07-30 15:33:05 UTC
Created attachment 115567 [details] [review]
better-write-timeout.patch

Matthias: I think your solution would work, but it depends on the lib catching SIGPIPE (which it probably does).

I've attached what I think is a better solution. Could you test to make sure it solves the problem for you?

It seems that on your system poll() is (probably) returning 1 and setting revents to POLLOUT | POLLHUP.

Unlike in the read_timeout() scenario where we can read data even if POLLHUP is set, we can't actually /write/ data if POLLHUP has occurred. So... my fix is to check for the existence of POLLHUP or POLLERR /first/, if either of those is set, then abort the write(). Otherwise, try writing if POLLOUT is set.
Comment 10 Jakub 'Livio' Rusinek 2008-07-30 16:03:44 UTC
Matthias, sometimes when I'm clicking logout/shutdown, nothing happens. I only suppose it's because of a zombie. That's quite a stupid issue and makes me very angry :/ .
Comment 11 Matthias Drochner 2008-07-30 17:21:20 UTC
> I've attached what I think is a better solution. Could you test to make sure it
> solves the problem for you?

Sorry, it doesn't help. poll() does only set POLLOUT.
There are 2 sentences in the NetBSD manpage for poll(2) which might
be of interest:
  Note that POLLHUP and POLLOUT should never be present in
  the revents bitmask at the same time.  If the remote end
  of a socket is closed, poll() returns a POLLIN event,
  rather than a POLLHUP.
Imo it is the most robust solution to abort the loop on every write error
other than EINTR.
Comment 12 Jeffrey Stedfast 2008-07-30 19:17:44 UTC
ok, but in your case is it even setting the POLLIN bit on revents?

I'm just trying to avoid a possible SIGPIPE
Comment 13 Matthias Drochner 2008-07-30 20:54:26 UTC
> in your case is it even setting the POLLIN bit on revents?

Yes, just checked. The poll() sets POLLIN and POLLOUT. (It must
be put into the input "events" bitmask before.)
But I wouldn't take that as sole indication that the socket
was closed -- that looks just too fragile.
As a general philosophy, I'd take poll/select as a hint that
it is worth looking at a file descriptor, but the real error
handling needs to be done after the read/write calls.

> I'm just trying to avoid a possible SIGPIPE

I didn't track the recent changes too closely, but as it looks
the SIGPIPE was triggered before you introduced the _noblock
functions, and it gets caught in the calling functions, and
it was fine all the time, so why worry?
Comment 14 Jeffrey Stedfast 2008-07-30 21:10:47 UTC
we don't want to set POLLIN on events

I've already committed the write error check
Comment 15 Matthias Drochner 2008-07-31 14:04:39 UTC
The code now in svn HEAD works well for me, thanks.
Comment 16 Jeffrey Stedfast 2008-07-31 15:35:56 UTC
awesome :-)