GNOME Bugzilla – Bug 738620
g_spawn_* functions break POSIX requirements and deadlock the child process under certain conditions
Last modified: 2018-05-24 17:09:08 UTC
g_spawn_* functions call internally fork_exec_with_pipes(), which calls do_exec(). In these functions, there are called functions which are not async-signal-safe, like readdir, dirfd, and probably others (I apology for not providing a complete callstack, but I reproduced a bug once and without full debug info). The final result is that after the fork, the child process ends up in the allocator to allocate or free some memory. If the allocator (like uClibc's malloc, or older versions of tcmalloc) does not register pthread_atfork handlers to ensure clean internal state during fork, the child process may deadlock on an internal allocator's mutex which was held by a different thread in the parent process during the fork(). This bug reproduces for example when I call gst_init() during the application's normal work, while other threads are already working. GStreamer forks to exec the gst-plugin-scanner, but due to bugs in g_spawn_async_with_pipes() it deadlocks. I know that this bug could be fixed in the allocator, and major allocators (glib's ptmalloc, firefoxes jemalloc) do register atfork handlers, but by the POSIX spec after the fork only the async-signal-safe functions should be called in the child before it performs exec, so strictly speaking, the real bug is in glib.
Are you seeing this problem with some alternative libc ? It should be possible to rewrite fdwalk without using library functions, using the syscalls directly.
int getdents(unsigned int fd, struct linux_dirent *dirp, unsigned int count); Note: There is no glibc wrapper for this system call; see NOTES. NOTES Glibc does not provide a wrapper for this system call; call it using syscall(2). You will need to define the linux_dirent struc‐ ture yourself. However, you probably want to use readdir(3) instead. :(
(In reply to comment #0) > > I know that this bug could be fixed in the allocator, and major allocators > (glib's ptmalloc, firefoxes jemalloc) do register atfork handlers, but by the > POSIX spec after the fork only the async-signal-safe functions should be called > in the child before it performs exec, so strictly speaking, the real bug is in > glib. But it's really quite easy to do, and makes using fork() significantly saner for applications. It's not just glib you're going to hit this problem in. Change uClibc and make the world a better place please.
(In reply to comment #1) > Are you seeing this problem with some alternative libc ? Yes, tcmalloc. From my curiosity I've checked other allocators, and: - D.L. malloc - safe, but disabled by default, - ptmalloc - I assumed it's safe as you guys and everyone uses it all the time, - jemalloc - safe, - uClibc - vulnerable - tcmalloc - older versions vulnerable, but fixed 1.5 years ago: https://code.google.com/p/gperftools/issues/detail?id=496 (In reply to comment #3) > But it's really quite easy to do, and makes using fork() significantly saner > for applications. It's not just glib you're going to hit this problem in. > Change uClibc and make the world a better place please. uClibc has a ton of bugs and no official releases :( To make it clear: I'm a free-software and gnome fanboy, but no a g{lib,nome,streamer} developer, and I don't want to look like I'm "demanding" something from you. Just letting you know of a problem and sharing my opinion (if the cost of staying POSIX-compliant is not too high, then it's a good thing to stay compliant).
(In reply to comment #4) > uClibc has a ton of bugs and no official releases :( I suspect they're rather squeezed between Android and the "implement the useful bits of glibc or bust" model of systemd as far as embedded systems go. > (if the cost of staying POSIX-compliant is not too high, then it's a good thing > to stay compliant). It's pretty high. Last I looked for example, Python effectively requires forksafe-malloc too.
We need to do an fdwalk on all POSIX systems. On Linux we can do this with iterating /proc. If we have a suitable libc (probably all of them) then we could also do this by making syscalls and skip the malloc() risk of using DIR*. On other platforms we don't fdwalk by iterating /proc so this is not an issue there. I also don't feel like building a grab-bag of malloc()-free DIR* replacements based on the syscalls of various OS kernels, so that's good. We implement fdwalk for ourselves if it's not available. I think we should probably always use our own implementation for two reasons: 1) easier to find bugs in our implementation 2) less risk that fdwalk will start calling malloc() in some new and interesting way in the future Once we are always using our own implementation, we may as well be basing it on syscalls instead of the C library. However: maybe there are other cases of malloc() use post-fork() that we could get tripped up by. TL;DR: I want to fix this bug but only if it is the only one. I worry that it's not -- particularly if you consider our typical users in addition to our own direct uses. If there are other cases where we are using malloc() after fork() then I'd rather just declare that GLib has a dependency on a fork()-safe malloc() (in a similar way to what we did with the monotonic clock).
Potentially related, over at the MacPorts (OS X) site there is ticket #45309 for Gimp failing to launch DBUS or detect any plugins since new versions of glib/gtk/gimp were incorporated into MacPorts about 11 days ago (so around 2014-10-10 ish). On the DBUS launch there's a "Not enough memory" error report. It's worked fine for years prior to that. As far as I can tell gimp detects plugins by launching something, via glib functions, to auto-discover them (and the DBUS case is definitely supposed to be a program launch). Based on the debugging to date it looks like something in the fork() child is failing before it is able to get as far as calling exec*() -- but we have not yet determined precisely what. Memory allocation is one plausible theory, especially given the "Not enough memory" report. Ticket URL: https://trac.macports.org/ticket/45309 Seems to affect OS X 10.6.8, 10.7, 10.9 and 10.10 (beta) at least. It appears that on OS X the relevant glib functions do work in some other gtk applications (eg, eog is mentioned in the ticket), but not in gimp. My hunch is that single threaded applications have different malloc-after-fork rules than multi-threaded applications. Ewen
Note: to my surprise it is in fact theoretically possible to write and run a Gtk program that does not create any threads.
(In reply to comment #8) > Note: to my surprise it is in fact theoretically possible to write and run a > Gtk program that does not create any threads. Possibly slightly more practically, the ordering in which (a) external programs (eg, DBUS launch, plugin discovery) and (b) application thread startup happen, might well be relevant to what is permitted between fork() and exec*(). To the OS/libc the program that launches DBUS _then_ creates its application threads will look single threaded; the program which starts some of its application threads then launches DBUS will look multi threaded, and may be forced to play by different rules. (Now I say that, it occurs to me that maybe the instigating event in the MacPorts/gimp case is reordering of steps in gimp's startup phase... Hmm.) Ewen
> - tcmalloc - older versions vulnerable, but fixed 1.5 years ago: > https://code.google.com/p/gperftools/issues/detail?id=496 FWIW, the tcmalloc Chromium uses is still vulnerable to this. See https://code.google.com/p/chromium/issues/detail?id=570068 That makes use of g_spawn* incompatible with Chromium Embedded Framework.
For the record (reminded by a new comment today on this old bug), the MacPorts diagnosis for their related issue was that the atfork() handlers ended up referring to a memory address which was no longer valid once the shared object (plugin) was unloaded again. Their resolution was to simply defeat the unloading of shared objects, since it was not possible to be sure that a module had not registered any callbacks (and thus could be safely unloaded). For details see discussion on the MacPorts ticket, and the patch attached to the ticket: https://trac.macports.org/ticket/45309 https://trac.macports.org/attachment/ticket/45309/glib2-45309.patch AFAIK that patch has been present for the last year or so without new problems being reported. Ewen
Substitute malloc implementations are a mess....maybe glibc could export a weak symbol like glibc_malloc_atfork ? Basically allow 3rd party mallocs to also take over the exact spot in the atfork chain. There's already a weak __malloc_hook which might be useful if tcmalloc doesn't already use it? Not sure what we can really do about this on glib side as even if we changed glib itself not to malloc in its internal uses of fork(), I'm sure there are a lot of applications which do.
I think that refactoring the code under fork_exec_with_pipes so that it doesn't perform mallocs between the calls to fork() and exec() would improve glib's reliability significantly. fork_exec_with_pipes is treated as an atomic operation by callers, and a localized fix would prevent the sort of random hangs seen when running GNOME applications when DBUS_SESSION_BUS_ADDRESS is not set. Chromium bug 309093 ( http://crbug.com/309093 ) continues to receive complaints from users that are attempting to run the browser in continuous integration systems. I've confirmed that this bug still exists in gspawn.c. A fix would be greatly appreciated by GNOME users.
This is affecting Firefox as well, we've worked around it in two bugs [1,2] by registering our own |pthread_atfork| handler. We found a third issue [3] when running Firefox under ASAN as well. A simple solution would be to brute force setting |FD_CLOEXEC| rather than enumerating |/proc/self/fd| [4] (and in turn allocating in the |opendir| call), ie: > for (int i = 3; i < MAX_FDS; i++) > fcntl(fd, F_SETFD, FD_CLOEXEC) [1] https://bugzilla.mozilla.org/show_bug.cgi?id=1300974 [2] https://bugzilla.mozilla.org/show_bug.cgi?id=1341621 [3] https://bugzilla.mozilla.org/show_bug.cgi?id=1342980 [4] https://github.com/GNOME/glib/blob/d892cf6feb215b3e2349b437c02d561926ea8259/glib/gspawn.c#L1015-L1073
This continues to be a problem for web application developers using Chrome and ChromeDriver to test in continuous integration environments. Most recently, it has been causing intermittent timeouts during patch validation for contributions to the "Web Platform Tests" FOSS project [1]. I don't have any new information to add, though I'd be happy to give more detail if necessary. It seems as though the problem is already well-understood, though, so please take this comment as further evidence of the severity of the bug. [1] https://github.com/w3c/web-platform-tests/issues/5406
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/945.