Bug 738620 – g_spawn_* functions break POSIX requirements and deadlock the child process under certain conditions

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 738620 - g_spawn_* functions break POSIX requirements and deadlock the child process under certain conditions


Summary:	g_spawn_* functions break POSIX requirements and deadlock the child process u...


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	gsubprocess
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-10-16 10:26 UTC by Aleksander Wabik
Modified:	2018-05-24 17:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Aleksander Wabik 2014-10-16 10:26:08 UTC

g_spawn_* functions call internally fork_exec_with_pipes(), which calls do_exec().

In these functions, there are called functions which are not async-signal-safe, like readdir, dirfd, and probably others (I apology for not providing a complete callstack, but I reproduced a bug once and without full debug info).

The final result is that after the fork, the child process ends up in the allocator to allocate or free some memory. If the allocator (like uClibc's malloc, or older versions of tcmalloc) does not register pthread_atfork handlers to ensure clean internal state during fork, the child process may deadlock on an internal allocator's mutex which was held by a different thread in the parent process during the fork().

This bug reproduces for example when I call gst_init() during the application's normal work, while other threads are already working. GStreamer forks to exec the gst-plugin-scanner, but due to bugs in g_spawn_async_with_pipes() it deadlocks.

I know that this bug could be fixed in the allocator, and major allocators (glib's ptmalloc, firefoxes jemalloc) do register atfork handlers, but by the POSIX spec after the fork only the async-signal-safe functions should be called in the child before it performs exec, so strictly speaking, the real bug is in glib.

Comment 1 Matthias Clasen 2014-10-16 11:01:30 UTC

Are you seeing this problem with some alternative libc ?

It should be possible to rewrite fdwalk without using library functions, using the syscalls directly.

Comment 2 Allison Karlitskaya (desrt) 2014-10-16 11:19:27 UTC

       int getdents(unsigned int fd, struct linux_dirent *dirp,
                    unsigned int count);

       Note: There is no glibc wrapper for this system call; see NOTES.

NOTES
       Glibc does not provide a wrapper for  this  system  call;  call  it
       using  syscall(2).  You will need to define the linux_dirent struc‐
       ture yourself.   However,  you  probably  want  to  use  readdir(3)
       instead.


:(

Comment 3 Colin Walters 2014-10-16 12:16:19 UTC

(In reply to comment #0)
> 
> I know that this bug could be fixed in the allocator, and major allocators
> (glib's ptmalloc, firefoxes jemalloc) do register atfork handlers, but by the
> POSIX spec after the fork only the async-signal-safe functions should be called
> in the child before it performs exec, so strictly speaking, the real bug is in
> glib.

But it's really quite easy to do, and makes using fork() significantly saner for applications.  It's not just glib you're going to hit this problem in.  Change uClibc and make the world a better place please.

Comment 4 Aleksander Wabik 2014-10-16 19:48:01 UTC

(In reply to comment #1)
> Are you seeing this problem with some alternative libc ?

Yes, tcmalloc. From my curiosity I've checked other allocators, and:
- D.L. malloc - safe, but disabled by default,
- ptmalloc - I assumed it's safe as you guys and everyone uses it all the time,
- jemalloc - safe,
- uClibc - vulnerable
- tcmalloc - older versions vulnerable, but fixed 1.5 years ago: https://code.google.com/p/gperftools/issues/detail?id=496


(In reply to comment #3)
> But it's really quite easy to do, and makes using fork() significantly saner
> for applications.  It's not just glib you're going to hit this problem in. 
> Change uClibc and make the world a better place please.

uClibc has a ton of bugs and no official releases :(

To make it clear: I'm a free-software and gnome fanboy, but no a g{lib,nome,streamer} developer, and I don't want to look like I'm "demanding" something from you. Just letting you know of a problem and sharing my opinion (if the cost of staying POSIX-compliant is not too high, then it's a good thing to stay compliant).

Comment 5 Colin Walters 2014-10-17 01:50:28 UTC

(In reply to comment #4)

> uClibc has a ton of bugs and no official releases :(

I suspect they're rather squeezed between Android and the "implement the useful bits of glibc or bust" model of systemd as far as embedded systems go.
 
> (if the cost of staying POSIX-compliant is not too high, then it's a good thing
> to stay compliant).

It's pretty high.  Last I looked for example, Python effectively requires forksafe-malloc too.

Comment 6 Allison Karlitskaya (desrt) 2014-10-20 12:22:47 UTC

We need to do an fdwalk on all POSIX systems.

On Linux we can do this with iterating /proc.  If we have a suitable libc (probably all of them) then we could also do this by making syscalls and skip the malloc() risk of using DIR*.

On other platforms we don't fdwalk by iterating /proc so this is not an issue there.  I also don't feel like building a grab-bag of malloc()-free DIR* replacements based on the syscalls of various OS kernels, so that's good.

We implement fdwalk for ourselves if it's not available.  I think we should probably always use our own implementation for two reasons:

 1) easier to find bugs in our implementation
 2) less risk that fdwalk will start calling malloc() in some new and interesting way in the future

Once we are always using our own implementation, we may as well be basing it on syscalls instead of the C library.

However: maybe there are other cases of malloc() use post-fork() that we could get tripped up by.

TL;DR: I want to fix this bug but only if it is the only one.  I worry that it's not -- particularly if you consider our typical users in addition to our own direct uses.  If there are other cases where we are using malloc() after fork() then I'd rather just declare that GLib has a dependency on a fork()-safe malloc() (in a similar way to what we did with the monotonic clock).

Comment 7 Ewen McNeill 2014-10-20 19:34:38 UTC

Potentially related, over at the MacPorts (OS X) site there is ticket #45309 for Gimp failing to launch DBUS or detect any plugins since new versions of glib/gtk/gimp were incorporated into MacPorts about 11 days ago (so around 2014-10-10 ish).  On the DBUS launch there's a "Not enough memory" error report.  It's worked fine for years prior to that.

As far as I can tell gimp detects plugins by launching something, via glib functions, to auto-discover them (and the DBUS case is definitely supposed to be a program launch).  Based on the debugging to date it looks like something in the fork() child is failing before it is able to get as far as calling exec*() -- but we have not yet determined precisely what.  Memory allocation is one plausible theory, especially given the "Not enough memory" report.

Ticket URL: https://trac.macports.org/ticket/45309

Seems to affect OS X 10.6.8, 10.7, 10.9 and 10.10 (beta) at least.  

It appears that on OS X the relevant glib functions do work in some other gtk applications (eg, eog is mentioned in the ticket), but not in gimp.  My hunch is that single threaded applications have different malloc-after-fork rules than multi-threaded applications.

Ewen

Comment 8 Allison Karlitskaya (desrt) 2014-10-20 19:55:56 UTC

Note: to my surprise it is in fact theoretically possible to write and run a Gtk program that does not create any threads.

Comment 9 Ewen McNeill 2014-10-20 20:02:07 UTC

(In reply to comment #8)
> Note: to my surprise it is in fact theoretically possible to write and run a
> Gtk program that does not create any threads.

Possibly slightly more practically, the ordering in which 
(a) external programs (eg, DBUS launch, plugin discovery) and 
(b) application thread startup 

happen, might well be relevant to what is permitted between fork() and exec*().  To the OS/libc the program that launches DBUS _then_ creates its application threads will look single threaded; the program which starts some of its application threads then launches DBUS will look multi threaded, and may be forced to play by different rules.  (Now I say that, it occurs to me that maybe the instigating event in the MacPorts/gimp case is reordering of steps in gimp's startup phase... Hmm.)

Ewen

Comment 10 Tom Jakubowski 2016-01-26 01:49:13 UTC

> - tcmalloc - older versions vulnerable, but fixed 1.5 years ago: 
> https://code.google.com/p/gperftools/issues/detail?id=496

FWIW, the tcmalloc Chromium uses is still vulnerable to this.  See https://code.google.com/p/chromium/issues/detail?id=570068

That makes use of g_spawn* incompatible with Chromium Embedded Framework.

Comment 11 Ewen McNeill 2016-01-26 04:13:55 UTC

For the record (reminded by a new comment today on this old bug), the MacPorts diagnosis for their related issue was that the atfork() handlers ended up referring to a memory address which was no longer valid once the shared object (plugin) was unloaded again.  Their resolution was to simply defeat the unloading of shared objects, since it was not possible to be sure that a module had not registered any callbacks (and thus could be safely unloaded).

For details see discussion on the MacPorts ticket, and the patch attached to the ticket:

https://trac.macports.org/ticket/45309
https://trac.macports.org/attachment/ticket/45309/glib2-45309.patch

AFAIK that patch has been present for the last year or so without new problems being reported.

Ewen

Comment 12 Colin Walters 2016-01-26 13:52:25 UTC

Substitute malloc implementations are a mess....maybe glibc could export a weak symbol like glibc_malloc_atfork ?  Basically allow 3rd party mallocs to also take over the exact spot in the atfork chain.

There's already a weak __malloc_hook which might be useful if tcmalloc doesn't already use it?

Not sure what we can really do about this on glib side as even if we changed glib itself not to malloc in its internal uses of fork(), I'm sure there are a lot of applications which do.

Comment 13 Ken Russell 2016-06-16 00:16:27 UTC

I think that refactoring the code under fork_exec_with_pipes so that it doesn't perform mallocs between the calls to fork() and exec() would improve glib's reliability significantly. fork_exec_with_pipes is treated as an atomic operation by callers, and a localized fix would prevent the sort of random hangs seen when running GNOME applications when DBUS_SESSION_BUS_ADDRESS is not set.

Chromium bug 309093 ( http://crbug.com/309093 ) continues to receive complaints from users that are attempting to run the browser in continuous integration systems. I've confirmed that this bug still exists in gspawn.c. A fix would be greatly appreciated by GNOME users.

Comment 14 Eric Rahm 2017-03-01 00:02:13 UTC

This is affecting Firefox as well, we've worked around it in two bugs [1,2] by registering our own |pthread_atfork| handler. We found a third issue [3] when running Firefox under ASAN as well.

A simple solution would be to brute force setting |FD_CLOEXEC| rather than enumerating |/proc/self/fd| [4] (and in turn allocating in the |opendir| call), ie:

> for (int i = 3; i < MAX_FDS; i++)
>   fcntl(fd, F_SETFD, FD_CLOEXEC)

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1300974
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1341621
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=1342980
[4] https://github.com/GNOME/glib/blob/d892cf6feb215b3e2349b437c02d561926ea8259/glib/gspawn.c#L1015-L1073

Comment 15 Mike 2017-04-20 20:30:18 UTC

This continues to be a problem for web application developers using Chrome and
ChromeDriver to test in continuous integration environments. Most recently, it
has been causing intermittent timeouts during patch validation for
contributions to the "Web Platform Tests" FOSS project [1]. I don't have any
new information to add, though I'd be happy to give more detail if necessary.
It seems as though the problem is already well-understood, though, so please
take this comment as further evidence of the severity of the bug.

[1] https://github.com/w3c/web-platform-tests/issues/5406

Comment 16 GNOME Infrastructure Team 2018-05-24 17:09:08 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/945.