GNOME Bugzilla – Bug 772354
g_spawn_async hangs if chdir to the passed working_directory hangs in the child
Last modified: 2018-05-24 19:07:47 UTC
I'm copying lots of data to/from my Android phone's internal and external storage, using USB 2.0 (I think) and the MTP protocol. The contents automatically appear (on Ubuntu) under /run/user/1000/gvfs. Filesystem operations are quite slow (I don't know why), usually it takes 1-2 seconds to copy a tiny file (copying a large file doesn't take much longer though), and gets even slower if I do concurrent operations from multiple terminals. It's very easy to get to a response time of 5-10 seconds. Every once in a while it completely hangs until I unplug the phone (kernel or gfvs or android bug? - whatever). (Other setups, e.g. an NFS mount with a dead server could also easily lead to this situation.) Most of the time this only effects what's happening inside a particular terminal tab, and that's okay. There's one exception though: If the current working directory is on such a slow/dead virtual filesystem, opening a new gnome-terminal window or tab becomes this slow/dead too. That is, the entire UI of all the g-t windows is unresponsive and unusable for the duration. I'm not sure what exactly goes on (haven't traced yet), I guess it's something with setting the CWD as per a previous OSC 7, but... any operation that accesses an arbitrary FS location specified by the user (even just simply cd'ing there or stat'ing it) should only happen after forking, that is, not effect the main g-t-s in any way so that it remains usable even if there are slow/dead filesystems around.
Is that with gnome-terminal git, or the ubuntu version? They have a patch (Provide-fallback-for-reading-current-directory-if-OS.patch) that could be causing this.
git (thx for the reminder though, it didn't occur to me they had this patch) Interestingly it didn't happen when I straced g-t-s :P Will continue investigating later on today.
"Luckily" one of the files on my phone's internal storage is unreadable; trying to read it hangs indefinitely (until I disconnect the phone). strace'ing gnome-terminal-server gives this: [...] stat("/var/lib/snapd/desktop/pixmaps", 0x7ffd7bf4fb20) = -1 ENOENT (No such file or directory) stat("/dev/random", {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 8), ...}) = 0 open("/dev/urandom", O_RDONLY|O_CLOEXEC) = 16 stat("/dev/random", {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 8), ...}) = 0 open("/dev/urandom", O_RDONLY|O_CLOEXEC) = 16 access("/bin/bash", X_OK) = 0 open("/dev/ptmx", O_RDWR|O_NOCTTY|O_NONBLOCK|O_CLOEXEC) = 16 and that's where it waits (the "open" above has completed and returned 16), the new window has appeared and is transparent, and all g-t windows are unresponsive. When I disconnect the phone, strace continues with this single line: --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=5284, si_uid=1000, si_status=1, si_utime=0, si_stime=0} --- and g-t is back to normal. Weird... Need to ltrace (which hardly ever works for me) or run in debug mode I guess...
It shouldn't affect the g-t-s really, but try stracing with -f to see what the newly created child is doing?
Could also try stracing the various gvfs processes to see what *they* are doing during the hang, and try starting g-t-s with GIO_USE_VFS=local to not use gvfs at all.
I already tried running another g-t (with different app-id) in the hope that that one would also hang while trying to talk to gvfs... but that one didn't hang :)
The strace output above was from an "strace -e file"... muscle memory... will do a proper one.
> try starting g-t-s with GIO_USE_VFS=local to not use gvfs at all. No difference, still hangs.
gvfs is irrelevant. vte hangs in g_spawn_async_with_pipes(directory, ..., vte_pty_child_setup, ...). (The first one of the two in pty.cc __vte_pty_spawn().) It neither returns, nor calls the child setup function. Quoting from https://developer.gnome.org/glib/stable/glib-Spawning-Processes.html#g-spawn-async-with-pipes: "child_setup and user_data are a function and user data. On POSIX platforms, the function is called in the child after GLib has performed all the setup it plans to perform (including creating pipes, closing file descriptors, etc.)" -- I guess this list also includes changing the working directory, fair enough -- "but before calling exec()." What I do not find documented is that the parent process waits for the completion of the setup function before continuing.
Did the strace -f reveal any more info?
I did not investigate any further, I think my previous comment describes what actually goes on. We'd need a variant of g_spawn_async_with_pipes() which does not wait in the parent process for the child to reach the exec step. I cannot think of a workaround around this method. We'd need buy-in from glib folks, or ditch this method and use our owk fork+exec (or a patched clone of g_spawn_async_with_pipes).
g_spawn_async() synchronously returns a GSpawnError about a possible failure at execv(), which obviously requires that the parent waits for the child to reach that step. (It's probably implemented using an unnamed pipe with close-on-exec, and sending the error code over this channel - or an EOF before any data means success). If the child is let loose before the exec step, there's no way to catch the execv error and report it on g-t's UI, an important feature that we wouldn't want to remove. If we'd like to keep that UI notification about failure, and address the current bug as well, we'd need a variant of g_spawn_sync that communicates failure/success of execv asynchronously via a callback (and the callback can be cancelled in case there's a pending spawn [e.g. due to the file system hang] and the user closes the tab). Shall we do it on our own, or try to get glib folks do this for us?
I've looked at g_spwan_async code again and still not sure how this could happen... Could you attach gdb to the hanging g-t-server and get a full backtrace, please ? (And maybe for the newly created child too; just to confirm it *is* hanging in chdir().)
Created attachment 340377 [details] g-t-server bt full Here's bt full for the server (parent). I guess I'd need a debug build of glib for #1-#2 to be more informative? I can't do bt on the child, gdb hangs while trying to connect (it prints the "Attaching to program ..." line and then its prompt doesn't appear). The child shows up as "g-t-server --app-id whatever" is "ps aux", that is, the same as the parent. The typical after-fork before-exec scenario.
Created attachment 340378 [details] g-t-server strace -f strace -f parent pid: 24960 child pid: 25648 Ctrl+Shift+N pressed (hang started) at 22:56:40. Phone unplugged (hang stopped) at 22:57:00. It _is_ clearly the chdir().
Just to confirm: If I remove the OSC 7 stuff from my PROMPT_COMMAND (all new tabs/windows open in my home) then this bug -- as expected -- no longer occurs, even while there's a hung "cp" going on. Trying to "ls" inside the "gvfs" directory, or stat()/chdir() to "mtp:host=blahblah" causes that shell to hang (does the MTP protocol only allow one operation a time? or is it gvfs's limitation? doesn't really matter), but g-t keeps running just fine.
The 'read' call in the bt must be the one in gspawn.c:read_ints(). Let's pass this to glib.
Created attachment 342551 [details] [review] proposal, for discussion
I found commit d4e25a5 references this bug, so I think it is the correct place to report an issue caused by the commit. GNOME Terminal leaks a pts when a tab is closed. After using it for several days, I have 4 tabs but with hundreds of pts in /dev/pts. Steps to Reproduce: 1. Start terminal with 3 tabs and type 'tty' command in each tab. My result is /dev/pts/25, /dev/pts/26, /dev/pts/27. 2. Close the first tab (/dev/pts/25) and we have 2 tabs left in this step. 3. Open a new tab. We have 3 tabs again in this step. Expected Results: The newly opened tab should reuse the recently closed pts (/dev/pts/25). Actual Results: The newly opened tab use /dev/pts/28, and /dev/pts/25 still exists. On FreeBSD, you can run 'procstat -f <pid>' and see 4 pts are opened by gnome-terminal-server. The same problem also happens on Linux, but I don't know which command can be used to show the name of opened pts. 'lsof -p <pid>' and 'ls -l /proc/<pid>/fd' only show /dev/ptmx. $ procstat -f 33650 PID COMM FD T V FLAGS REF OFFSET PRO NAME 33650 gnome-terminal-serv text v r r------- - - - /home/lantw44/gnome/devinstall/libexec/gnome-terminal-server 33650 gnome-terminal-serv cwd v d r------- - - - /home/lantw44 33650 gnome-terminal-serv root v d r------- - - - / ... 33650 gnome-terminal-serv 23 t - rw---n-- 1 0 - pts/25 33650 gnome-terminal-serv 24 t - rw---n-- 1 0 - pts/26 33650 gnome-terminal-serv 25 t - rw---n-- 1 0 - pts/27 33650 gnome-terminal-serv 26 t - rw---n-- 1 0 - pts/28
(In reply to Ting-Wei Lan from comment #19) > I found commit d4e25a5 references this bug, so I think it is the correct > place to report an issue caused by the commit. It is not; this bug here is for discussing the glib API needed to fix this in g-t. Please open a new gnome-terminal bug for your issue.
(In reply to Christian Persch from comment #20) > (In reply to Ting-Wei Lan from comment #19) > > I found commit d4e25a5 references this bug, so I think it is the correct > > place to report an issue caused by the commit. > > It is not; this bug here is for discussing the glib API needed to fix this > in g-t. Please open a new gnome-terminal bug for your issue. I opened bug 777007 for it.
Any comments from the glib developers on the approach taken in attachment 342551 [details] [review] ?
Ug. Offhand it feels like there's going to be a *lot* of other issues like this across many codebases. I don't think we're going to win at working around buggy FUSE filesystems. Ultimately the filesystem has to be correct.
One model could be some sort of babysitter process for FUSE mounts that polled them periodically with stat(/path/to/mount) or so, and if they failed to respond, killed the fuse mount.
Review of attachment 342551 [details] [review]: I'm not fond of this approach of creating a new half-baked cancellation API to work around this problem. In general, I'm also not too keen on adding additional gspawn API. Please consider GSubprocess for improvement instead. All of that said, I'd prefer to see something like one of the two following solutions (or some combination thereof). 1) We add a new flag G_SPAWN_REALLY_REALLY_ASYNC that returns immediately after fork() and reports failures to chdir() or exec() by way of non-zero exit status on the created pid (in the usual way). 2) We add way to get access to the internal pipe used for status reporting and a couple of utility functions to query it for progress and/or (and this is where it gets messy) a GSource for monitoring it for status updates (ie: either failure, or process successfully created and pipe closed via CLOEXEC). The necessary ugliness of this new API is a big part of why I'd want to go with GSubprocess here: we could *very* nicely wrap this into a g_subprocess__new_async() with nice error reporting via GError.
(In reply to Allison Lortie (desrt) from comment #25) > Review of attachment 342551 [details] [review] [review]: > > I'm not fond of this approach of creating a new half-baked cancellation API > to work around this problem. > > In general, I'm also not too keen on adding additional gspawn API. Please > consider GSubprocess for improvement instead. Btw, GSubprocess uses g_spawn_async internally. I didn't attach it here, but I had a follow-up patch to make gsubprocess use this as an example for how to use this API: https://paste.fedoraproject.org/538533/63127414/ .
GSubprocess uses gspawn internally mostly because we didn't want to duplicate all the code, and there was no benefit to doing so. If there is a strong reason to do so, then I don't mind doing either of: 1) do the duplication, trimming away the unneeded parts; or 2) exposing some glib_private API to help I wouldn't completely write off the possibility of us managing to do this with our custom child setup function, however... particularly if we take advantage of our knowledge of the internal implementation of gspawn.
I dug into this a bit more... it's not really as easy as I thought, since the child_err_report_pipe is allocated dynamically, we can't hijack it, since we don't know what number it will have in a given invocation (unless we employ an unacceptably high level of magic guessing). In any case, I don't think too much creativity will be required here to figure out what private stuff to open up in order to be able to present an extremely nice GSubprocess API here (ie: just follow the GAsyncInitable template) without exposing additional gspawn API.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1206.