After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 746604 - g_spawn family does not appear to be safe in multi-threaded programs
g_spawn family does not appear to be safe in multi-threaded programs
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: gsubprocess
2.42.x
Other Linux
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
Depends on:
Blocks:
 
 
Reported: 2015-03-22 13:08 UTC by Chris Vine
Modified: 2018-05-24 17:42 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Chris Vine 2015-03-22 13:08:48 UTC
The g_spawn* family of functions does not appear to be safe in multi-threaded programs, particularly on linux.  Given that gio is now multi-threaded, this means that most uses of g_spawn*, including in GSubprocess, will not be safe.

This arises because by default the g_spawn* functions (via do_exec() in glib/gspawn.c) walk an open descriptor list setting the FD_CLOEXEC flag on all file descriptors (except the descriptors for stdin, stdout and stderr) after one or more forks and before the eventual exec.  It does this either using the system supplied fdwalk (for solaris and BSDs) or by using an internal glib-supplied fdwalk function for linux and other posix-like OSes (also in glib/spawn.c).

The internal glib-supplied fdwalk for linux examines /proc/self/fd and to do so calls opendir, closedir and readdir.  opendir and closedir are definitely not async-signal-safe and therefore not safe to be called between a fork and an exec, and in my version of glibc (2.21) call malloc and free respectively and therefore have potential for deadlock.  A fallback version of fdwalk for other posix-like OSes calls getrlimit (almost certainly informally safe) and sysconf (which prior to POSIX.1-2008 was required to be async-signal-safe, and although that is no longer the case it is almost certainly informally safe).  The last two points are anyway not killers because they could equally well be called before the fork.

In solaris and the BSDs, the system-supplied fdwalk() is not guaranteed to be async-signal-safe and is therefore not guaranteed to be safe between a fork and an exec in a multi-threaded program.  I have no idea if it is informally safe - it would not be if it makes use of opendir/closedir.
Comment 1 Allison Karlitskaya (desrt) 2015-03-22 15:13:23 UTC
There is a lot of bad advice floating around on this point.

It's true that all of the functions you mention are not safe for use from signal handlers, but you are confusing "signal safe" with "can be used after fork() in a threaded program".

The libc (which is ultimately in control of forking) always cleans up this sort of stuff before a fork() occurs in order to ensure that things like malloc() and stdio will be safe on the other side.  In the case of glibc and stdio, it actually holds the stdio lock *during* the fork, to ensure nobody else can.  POSIX specifies pthread_atfork() to allow other libraries to do the same.  In that sense, if the library supports it, you can do more things after fork() in a threaded program than you could from within a single handler.

It could be that there are some other libc implementations around that get this wrong, but we're not really interested in supporting those. uClibc seems to do the same thing as glibc, btw.

So this is NOTABUG.
Comment 2 Chris Vine 2015-03-22 17:24:06 UTC
Sure glibc can relax POSIX requirements, but are you sure a malloc()
(which is what opendir() does) after a fork() really is safe?  I guess
glibc could unlock any memory locks immediately after forking (although
the man page for fork says otherwise about locks), but in that case the
glib documentation is wrong: "any mutexes that were held by other
threads in the parent process at the time of the fork() will still be
locked in the child process, and they will never be unlocked (since the
threads that held them don't exist in the child) ... In particular, it
is not safe to call any function which may call malloc(), which
includes POSIX functions such as setenv()."

I have to say I have never heard anyone else challenge the orthodoxy of
never calling malloc between a fork and an exec, but that does not
necessarily mean that you are wrong.

Is there any glibc documentation on this?
Comment 3 Chris Vine 2015-03-22 18:23:56 UTC
Interesting, on looking at it, it seems that glibc does set up atfork handlers for its memory locks.  In that event, the glib documentation for GSpawnChildSetupFunc is wrong.  Possibly also the deprecation of g_test_trap_fork() is too pessimistic (although maybe that has other problems with locks after forking other than just memory locks).

I would be interested if there is any glibc documentation on this (I have not been able to find any).
Comment 4 Tom Jakubowski 2016-01-23 03:21:55 UTC
For what it's worth, if an application links to CEF [1] it ends up bringing along Chromium's tcmalloc allocator.  Unlike glibc, this particular malloc/free implementation is *not* multi-threaded fork safe, see: https://code.google.com/p/chromium/issues/detail?id=570068.

I imagine that there are other alternative allocators which present the same problem, because it's somewhat risky for an allocator to install its own pthread_atfork handlers given that it can't control the order they're run in relative to other atfork_handlers (glibc ends around this because it implements pthread_atfork, and can run its internal atfork handlers whenever it pleases).

This all means CEF applications which use g_spawn are subject to this deadlocking issue.  I'll also note that the user documentation for g_spawn goes out of its way to warn about calling malloc/free after a multi-threaded fork:

> However, even on POSIX, you are extremely limited in what you can safely do from a GSpawnChildSetupFunc, because any mutexes that were held by other threads in the parent process at the time of the fork() will still be locked in the child process, and they will never be unlocked (since the threads that held them don't exist in the child). POSIX allows only async-signal-safe functions (see signal(7)) to be called in the child between fork() and exec(), which drastically limits the usefulness of child setup functions.
> In particular, it is not safe to call any function which may call malloc(), which includes POSIX functions such as setenv().

It strikes me a little odd that g_spawn itself ignores this warning.

[1]: https://bitbucket.org/chromiumembedded/cef
Comment 5 Tom Jakubowski 2016-01-23 03:22:58 UTC
Sorry. One of these days I'll remember that bugzilla doesn't word wrap quotes :-)

> However, even on POSIX, you are extremely limited in what you can safely
> do from a GSpawnChildSetupFunc, because any mutexes that were held by other
> threads in the parent process at the time of the fork() will still be locked
> in the child process, and they will never be unlocked (since the threads that
> held them don't exist in the child). POSIX allows only async-signal-safe
> functions (see signal(7)) to be called in the child between fork() and exec(),
> which drastically limits the usefulness of child setup functions.

> In particular, it is not safe to call any function which may call malloc(),
> which includes POSIX functions such as setenv().
Comment 6 Michael Catanzaro 2016-06-24 15:58:19 UTC
(In reply to Tom Jakubowski from comment #4)
> For what it's worth, if an application links to CEF [1] it ends up bringing
> along Chromium's tcmalloc allocator.

But surely you don't use tcmalloc in the child process before exec? That seems easy to avoid, at any rate.
Comment 7 Tom Jakubowski 2016-06-29 22:36:34 UTC
> But surely you don't use tcmalloc in the child process before exec? That seems easy
> to avoid, at any rate.

Well I could be as careful as I want to avoid calling malloc/free in the child process before exec, but *glib* still (unportably) makes those calls.  That's the reason this bug was filed to begin with, I think.
Comment 8 Colin Walters 2016-06-29 22:58:15 UTC
I'd review a patch to use raw system calls (open (O_DIRECTORY) + getdents()) for the fd walking on linux.

Then I know this doesn't help CEF on FreeBSD or whatever but...honestly the only sane solution is some hooking between glibc and custom allocators so the allocator can tell glibc about itself (and similarly for other libcs).
Comment 9 Dan Winship 2016-06-30 14:53:35 UTC
getdents() is not async-signal-safe and is thus at least theoretically not safe to call either.

We already have a non-malloc-using version of the code (for (fd = 0; fd < maxfd; fd++) ...). It just ends up making 1024 syscalls. But if we had some way of knowing whether malloc() was safe or not, we could just always use that fallback version when it wasn't.
Comment 10 Colin Walters 2016-06-30 15:23:54 UTC
getdents(64) is a system call, nothing to do with glibc, so it has to be safe.
Comment 11 Primiano Tucci 2017-04-27 14:31:30 UTC
I might be a bit late on this, but worth a try.
I've been recently being discussing this topic in:
 - libdbus bug: https://bugs.freedesktop.org/show_bug.cgi?id=100843 [1]
 - chromium bugs: https://crbug.com/715658 and https://crbug.com/695643

TL;DR libraries doing fork() + malloc() are causing chrome (and anything based on it, like CEF) to randomly hang.
I understand the discussion here is complex and not straightforward to solve. We had quite some long discussion in [1], and we are looking in workarounds regardless of what's going to be decided here (and in libdbus).

If I can express my honest viewpoint, I think that fork()-ing + malloc()+ing assuming that everybody uses glib's malloc (or any allocator with a atfork handler) is quite aggressive.

The concern in #9 seems to be about getdents(). Concretely speaking though, the issue here is not about being async signal safe, but merely not mess with userspace state like allocator locks.
The issue here (at least form my viewpoint) is not about what is compliant on paper with POSIX docs, it's what breaks things.
malloc in a fork clearly break things. getdent, as said in #10, is a syscall.
Comment 12 Matthias Clasen 2017-04-27 23:20:58 UTC
Commenting here without reopening the bug and attaching a patch that implements Colins suggestion is not going to do much good.
Comment 13 Primiano Tucci 2017-04-28 10:26:57 UTC
> Commenting here without reopening the bug

I don't seem to have the powers to reopen this bug. Do you prefer me to open a new one?

> and attaching a patch that implements Colins suggestion is not going to do much good.

I am not sure how this project works. By experience from other projects, sending out a patch without on a controversial topic without having any agreement is often a loss of time.

"Send a patch" works when there is a clear bug and it's just a matter of doing some work to fix it. This doesn't seem to be the case here, by looking at the comments 8,9,10.

If I am wrong and you feel there are no objections here in moving from opendir -> getdents, so this is just a matter of doing the work, please tell me and I will just send a patch.
Comment 14 Michael Catanzaro 2017-04-28 15:25:40 UTC
Although this doesn't affect GNOME directly, clearly this is still a valid bug impacting a major application that uses GLib, so I'll reopen it.

I recommend implementing Colin's suggestion in comment #8 if you want to solve this.
Comment 15 GNOME Infrastructure Team 2018-05-24 17:42:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1014.