After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 698081 - Pidgin hangs in g_spawn_command_line_sync
Pidgin hangs in g_spawn_command_line_sync
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: mainloop
2.36.x
Other Linux
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
: 697895 698606 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2013-04-15 19:41 UTC by Dimitri
Modified: 2013-08-08 15:09 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Pidgin gdb backtrace (2.37 KB, text/plain)
2013-04-15 19:41 UTC, Dimitri
  Details
thread apply all bt (4.63 KB, text/plain)
2013-04-17 01:38 UTC, Dimitri
  Details
Partially revert "Merge waitpid() from g_spawn_sync into gmain()" (4.04 KB, patch)
2013-04-23 17:32 UTC, Allison Karlitskaya (desrt)
committed Details | Review

Description Dimitri 2013-04-15 19:41:44 UTC
Created attachment 241587 [details]
Pidgin gdb backtrace

After update to GLib 2.36.0, Pidgin doesn't start anymore. After drawing some blank window frames, it hangs in g_spawn_command_line_sync("gsettings get org.gnome.system.proxy mode", ...), see stack trace. The "gsettings get org.gnome.system.proxy mode" command itself runs OK, and I even coded a one-line program that invokes the above command with g_spawn_command_line_sync -- it doesn't hang either. Before this call, Pidgin invokes g_spawn_command_line_sync("gconftool-2 ...") several times, and these calls do succeed. I have an impression that the only distinguishable condition for the problematic call is that it is called from within the main loop, while previous calls are not. That's why I'm filing the bug here, not in Pidgin's bug tracker.

The gsettings process becomes a zombie, Pidgin itself can be killed only with SIGKILL.

Mageia Linux 3 (Cauldron) i586, kernel 3.8.7, GLib 2.36.0, Pidgin 2.10.7 built with GTK+ 2.24.17.
Comment 1 Dimitri 2013-04-15 19:50:02 UTC
I was able to run Pidgin under strace, it runs OK, the result is stable. When I was trying to produce a stack trace and executed Pidgin from GDB, once it ran OK, too. Seems like race condition -- the program runs OK only if something is slowing down its execution.
Comment 2 Dan Winship 2013-04-16 14:29:39 UTC
I am guessing that somehow this is a race condition involving the startup of the glib worker thread, though I can't see exactly how.

a backtrace including the worker thread ("thread apply all bt") might be useful
Comment 3 Dimitri 2013-04-17 01:38:22 UTC
Created attachment 241704 [details]
thread apply all bt
Comment 4 Allison Karlitskaya (desrt) 2013-04-19 18:11:11 UTC
I'm not sure what's going on here but I'm pretty sure the linked patch introduced the regression.  Seeing as it was introduced in order to support a feature that we ended up rejecting anyway I think the best thing to do at this point is simply revert the patch.  Before we do so, can you verify that a GLib compiled with this patch reverted will fix the issue?
Comment 5 Allison Karlitskaya (desrt) 2013-04-19 19:23:27 UTC
Sorry: this patch https://git.gnome.org/browse/glib/commit/?id=ce0022933c255313e010b27f977f4ae02aad1e7e
Comment 6 Colin Walters 2013-04-19 19:32:00 UTC
(In reply to comment #4)
> I'm not sure what's going on here but I'm pretty sure the linked patch
> introduced the regression.  Seeing as it was introduced in order to support a
> feature that we ended up rejecting anyway I think the best thing to do at this
> point is simply revert the patch.  

Besides the feature, there were two listed improvements in the commit message.

As far as this issue; I'd like to at least investigate the Pidgin source.
Comment 7 Dimitri 2013-04-19 20:42:34 UTC
(In reply to comment #4)
> I'm not sure what's going on here but I'm pretty sure the linked patch
> introduced the regression.

Confirming. With the patch applied (git diff ce0022933c255313e010b27f977f4ae02aad1e7e 0bdf7fecaf1ffc7263d2bc48a87c99f4705138fc) and GLib 2.36.0 rebuilt, Pidgin runs OK like before. The result is stable. Feel free to ask me for any additional traces, dumps etc.
Comment 8 Colin Walters 2013-04-19 20:54:10 UTC
I suspect it has something to do with: libpurple/dns-query.c

This code is calling fork() without a subsequent exec() in a GTK+ program which is a really, really bad idea.  Specifically, doing this is going to kill the GLib worker thread.

They should be using http://developer.gnome.org/gio/2.34/GResolver.html
Comment 9 Allison Karlitskaya (desrt) 2013-04-20 15:21:37 UTC
The GLib worker thread appears to be alive and well in the attached backtrace....

The only improvement I'd like to keep from the patch is the g_warning() about the child already being reaped and even in that case I'm not sure if I want to fire the source or destroy it or what...  Copying the EINTR checking was a mistake -- it was only in the gspawn code because it was blocking.  The gmain code is not.  The manpages say:

       EINTR  WNOHANG was not set and an unblocked signal or a SIGCHLD was caught; see signal(7).


ie: we won't get this with the WNOHANG we give in gmain.c.
Comment 10 Colin Walters 2013-04-20 20:27:07 UTC
(In reply to comment #9)
> The GLib worker thread appears to be alive and well in the attached
> backtrace....

True.  Hmm.  I think I see the race condition; suppose the main thread has already setup a child watch, so we have a SIGCHLD handler installed.  Now we call g_spawn_sync(), and the child exits immediately, before we create the child watch source.  The worker thread will get the SIGCHLD, find no matching unix child watch, and just drop it on the floor.
Comment 11 Colin Walters 2013-04-20 20:30:17 UTC
Er, nevermind, we do a waitpid() when creating the source to prevent exactly this race.
Comment 12 Colin Walters 2013-04-22 19:20:22 UTC
*** Bug 698606 has been marked as a duplicate of this bug. ***
Comment 13 Allison Karlitskaya (desrt) 2013-04-23 17:25:01 UTC
We figured this out and it's pretty simple.  g_spawn_sync() didn't used to rely on GLib being the SIGCHLD handler and now it does.  Any program that tries to use its own SIGCHLD handler and g_spawn_sync() together is going to be impacted by this.  I'm going to back it out since there's no really good reason for doing it this way (and it's actually slower, too).
Comment 14 Allison Karlitskaya (desrt) 2013-04-23 17:32:24 UTC
Created attachment 242272 [details] [review]
Partially revert "Merge waitpid() from g_spawn_sync into gmain()"

This partially reverts commit ce0022933c255313e010b27f977f4ae02aad1e7e.

It used to be safe to use g_spawn_sync() from processes that had their
own SIGCHLD handler because it simply called wait().  When it was
changed to depend on the GLib child watching infrastructure this meant
that GLib had to own the SIGCHLD handler.

This caused hangs in at least Pidgin.

The patch contained two other improvements to the child watch code which
we want to keep, so only revert the changes to gspawn itself.
Comment 15 Colin Walters 2013-04-23 17:41:37 UTC
Review of attachment 242272 [details] [review]:

Looks fine, yeah.
Comment 16 Allison Karlitskaya (desrt) 2013-04-23 18:39:25 UTC
Attachment 242272 [details] pushed as eb860fd - Partially revert "Merge waitpid() from g_spawn_sync into gmain()"
Comment 17 Stef Walter 2013-08-08 15:09:16 UTC
*** Bug 697895 has been marked as a duplicate of this bug. ***