GNOME Bugzilla – Bug 698081
Pidgin hangs in g_spawn_command_line_sync
Last modified: 2013-08-08 15:09:16 UTC
Created attachment 241587 [details] Pidgin gdb backtrace After update to GLib 2.36.0, Pidgin doesn't start anymore. After drawing some blank window frames, it hangs in g_spawn_command_line_sync("gsettings get org.gnome.system.proxy mode", ...), see stack trace. The "gsettings get org.gnome.system.proxy mode" command itself runs OK, and I even coded a one-line program that invokes the above command with g_spawn_command_line_sync -- it doesn't hang either. Before this call, Pidgin invokes g_spawn_command_line_sync("gconftool-2 ...") several times, and these calls do succeed. I have an impression that the only distinguishable condition for the problematic call is that it is called from within the main loop, while previous calls are not. That's why I'm filing the bug here, not in Pidgin's bug tracker. The gsettings process becomes a zombie, Pidgin itself can be killed only with SIGKILL. Mageia Linux 3 (Cauldron) i586, kernel 3.8.7, GLib 2.36.0, Pidgin 2.10.7 built with GTK+ 2.24.17.
I was able to run Pidgin under strace, it runs OK, the result is stable. When I was trying to produce a stack trace and executed Pidgin from GDB, once it ran OK, too. Seems like race condition -- the program runs OK only if something is slowing down its execution.
I am guessing that somehow this is a race condition involving the startup of the glib worker thread, though I can't see exactly how. a backtrace including the worker thread ("thread apply all bt") might be useful
Created attachment 241704 [details] thread apply all bt
I'm not sure what's going on here but I'm pretty sure the linked patch introduced the regression. Seeing as it was introduced in order to support a feature that we ended up rejecting anyway I think the best thing to do at this point is simply revert the patch. Before we do so, can you verify that a GLib compiled with this patch reverted will fix the issue?
Sorry: this patch https://git.gnome.org/browse/glib/commit/?id=ce0022933c255313e010b27f977f4ae02aad1e7e
(In reply to comment #4) > I'm not sure what's going on here but I'm pretty sure the linked patch > introduced the regression. Seeing as it was introduced in order to support a > feature that we ended up rejecting anyway I think the best thing to do at this > point is simply revert the patch. Besides the feature, there were two listed improvements in the commit message. As far as this issue; I'd like to at least investigate the Pidgin source.
(In reply to comment #4) > I'm not sure what's going on here but I'm pretty sure the linked patch > introduced the regression. Confirming. With the patch applied (git diff ce0022933c255313e010b27f977f4ae02aad1e7e 0bdf7fecaf1ffc7263d2bc48a87c99f4705138fc) and GLib 2.36.0 rebuilt, Pidgin runs OK like before. The result is stable. Feel free to ask me for any additional traces, dumps etc.
I suspect it has something to do with: libpurple/dns-query.c This code is calling fork() without a subsequent exec() in a GTK+ program which is a really, really bad idea. Specifically, doing this is going to kill the GLib worker thread. They should be using http://developer.gnome.org/gio/2.34/GResolver.html
The GLib worker thread appears to be alive and well in the attached backtrace.... The only improvement I'd like to keep from the patch is the g_warning() about the child already being reaped and even in that case I'm not sure if I want to fire the source or destroy it or what... Copying the EINTR checking was a mistake -- it was only in the gspawn code because it was blocking. The gmain code is not. The manpages say: EINTR WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7). ie: we won't get this with the WNOHANG we give in gmain.c.
(In reply to comment #9) > The GLib worker thread appears to be alive and well in the attached > backtrace.... True. Hmm. I think I see the race condition; suppose the main thread has already setup a child watch, so we have a SIGCHLD handler installed. Now we call g_spawn_sync(), and the child exits immediately, before we create the child watch source. The worker thread will get the SIGCHLD, find no matching unix child watch, and just drop it on the floor.
Er, nevermind, we do a waitpid() when creating the source to prevent exactly this race.
*** Bug 698606 has been marked as a duplicate of this bug. ***
We figured this out and it's pretty simple. g_spawn_sync() didn't used to rely on GLib being the SIGCHLD handler and now it does. Any program that tries to use its own SIGCHLD handler and g_spawn_sync() together is going to be impacted by this. I'm going to back it out since there's no really good reason for doing it this way (and it's actually slower, too).
Created attachment 242272 [details] [review] Partially revert "Merge waitpid() from g_spawn_sync into gmain()" This partially reverts commit ce0022933c255313e010b27f977f4ae02aad1e7e. It used to be safe to use g_spawn_sync() from processes that had their own SIGCHLD handler because it simply called wait(). When it was changed to depend on the GLib child watching infrastructure this meant that GLib had to own the SIGCHLD handler. This caused hangs in at least Pidgin. The patch contained two other improvements to the child watch code which we want to keep, so only revert the changes to gspawn itself.
Review of attachment 242272 [details] [review]: Looks fine, yeah.
Attachment 242272 [details] pushed as eb860fd - Partially revert "Merge waitpid() from g_spawn_sync into gmain()"
*** Bug 697895 has been marked as a duplicate of this bug. ***