GNOME Bugzilla – Bug 785391
Device to device copies/moves deadlock in common gvfs client dbus code
Last modified: 2018-09-21 18:10:02 UTC
Created attachment 356346 [details] gvfs-mtp crash report by the kernel in journald

Problem encountered using the latest stable Nautilus and gvfs-mtp (Arch Linux user):

[clement@T470-clement ~]$ pacman -Si gvfs-mtp
Repository : extra
Name : gvfs-mtp
Version : 1.32.1+3+gd852665e-1
----------------------------
[clement@T470-clement ~]$ pacman -Si nautilus
Repository : extra
Name : nautilus
Version : 3.24.2-1
----------------------------

STR: I tried to copy files from my phone's internal memory to my new SD card. I managed to copy some files one by one, but copying a directory always resulted in the same problem: the copy spinner in Nautilus stays at 0/XX files and doesn't move, the directory is created but throws errors, the MTP device can't be accessed or unmounted anymore (you can only disconnect it to force it to disappear), and of course nothing gets copied. Copies from phone internal memory to computer and from computer to phone SD card or internal memory work fine, with no problems to report. The issue reproduces 100% of the time, and the kernel emits a crash report in journald; you can find all of this attached here.
Additional information I forgot in my first comment: the attached crash report contains one line complaining about an out-of-memory condition, while this was clearly not the case; at least half of the system's memory was available on a 16 GB machine.
Are you trying to copy to the SD card inside the phone? You can't sanely do device->device copies over mtp; the files go to the computer and back again, which is extremely slow even if nothing else goes wrong. You should use a device-side file manager to do it.

If you're trying to copy from the device to the host, then it should work, but you definitely ran out of memory. The OOM killer kicked in, which doesn't happen by mistake. In the process list, the gvfs-mtp process is consuming a lot of memory - almost half of your available RAM - and you have no swap, so boom.

I have one or two suspects in mind. The filesystem exposed by an android phone (I assume this is an android phone) over mtp will have a lot of very small files in the /Android and /data directories. These are extremely inefficient to copy, so even if everything works fine, the process is really, really slow. Second, there can be some weird files in there that don't copy properly and can stall out the process. Finally, nautilus does a full enumeration of the filesystem under the point you are copying before it copies anything. If you are copying the whole thing, then it will walk the whole filesystem. Internally, gvfs has to maintain a cache of mtp file IDs to filenames which gets populated as directories are traversed. That cache is probably what is consuming all the memory, if there are a lot of files under consideration.

So, what does that mean? Maybe the file enumeration is literally filling up all your RAM and then you get OOM killed. Another possibility is that you build up a very large cache which does not initially trigger OOM, but then one of the other two problems causes the copy to stall, so it just sits around consuming a lot of RAM; then something else tries to do something, triggering memory stress, and you get OOM killed.
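The ID-to-filename cache mentioned above can be pictured with a minimal sketch. All names here are hypothetical illustrations; the real cache lives in gvfs-mtp's C code and is keyed on MTP object IDs:

```python
# Rough sketch of an MTP object-ID -> path cache that grows as
# directories are enumerated. Hypothetical names, not the gvfs code.

class MtpPathCache:
    """Maps MTP object IDs to full paths. Every file seen during
    enumeration adds an entry, so memory use scales with file count."""

    def __init__(self):
        self._paths = {}  # object_id -> full path

    def add(self, object_id, parent_id, name):
        # Unknown parents (e.g. the storage root) resolve to "".
        parent = self._paths.get(parent_id, "")
        self._paths[object_id] = f"{parent}/{name}"

    def lookup(self, object_id):
        return self._paths.get(object_id)

    def __len__(self):
        return len(self._paths)

cache = MtpPathCache()
cache.add(1, 0, "Music")
cache.add(2, 1, "track01.mp3")
print(cache.lookup(2))  # -> /Music/track01.mp3
print(len(cache))       # -> 2
```

With a full-filesystem walk over /Android and /data, a structure like this can hold an entry per file, which is one plausible reading of the memory growth described above.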
In the past, when I've needed to copy the whole filesystem to a PC, I've made a zip/tarball on the device and copied that over. It works much better.
Yes, this was a copy from internal memory to the phone's SD card. Device-side file managers are really not practical (and did strange things when I tried one: the files weren't visible on the PC). As stated in the first comment, copies from phone internal memory to PC and from PC to phone internal memory or SD card work just fine, with no issues at all. gvfs-mtp can be memory-hungry, okay, but to the point of using 8GB or even more? That seems like a lot, right? Yes, this is an Android phone. But I didn't try to copy the whole filesystem, actually. Just a couple of music albums. When I noticed it failed, I tried only one album, which is a directory containing only 5 to 30 files. I don't get why Nautilus would do a full enumeration, but I guess that happens too when doing PC-to-phone and phone-to-PC copies? If that's the case, I've got no issues when doing that. Same goes for the cache: with very few files, it shouldn't grow big.
Curious. What's the model of your phone and what version of android is it running?

The scenario you describe is still not a good one (device->device copying through the PC), but the other characteristics aren't bad. You're trying to copy a small number of files from one device location to another. In this scenario, I can't see an obvious candidate for the memory usage. Are the media files you're copying exceptionally large? Probably not - you say music and not multi-gigabyte videos. I suspect there is some bugginess somewhere related to the device->device copy. I've tested this casually at various times and seen it work, but perhaps there is a device-specific issue (for example, maybe internal storage to SD is weird).

I will again strongly recommend using a device-side file manager to do this copying. It will be much quicker and more reliable. The lack of visibility on the PC side is due to the mtp server only showing files that it knows about - it has a separate media database. If you search around, you'll see discussions about media rescanning on android and tools for triggering it. If you do a rescan after copying, the files will then be visible on the PC side. eg: https://forums.androidcentral.com/android-4-4-kitkat/344997-how-do-i-force-media-rescan.html

If you still don't want to do that, then I recommend copying to the PC and back again. Remember, you're doing this anyway (modulo not writing blocks to the PC's disk) when you do a device->device copy using mtp.
It is a Sony Z3 Compact, running Android 6.0.1 with the stock ROM. While I understand a device-to-device copy isn't the best possible combination, it's still not supposed to crash, I think; it should only be slower than other copies (I'm even wondering if it couldn't be a bit smarter: if the files are small, just read them into memory and only then write them). The files aren't that big: they're standard mp3 files, not even flac, so between 5 and 15MB I'd say. The reason to do this is pretty obvious: it's a lot easier and faster to browse your filesystems and do copies and moves with a computer UI than with a phone UI, even if it means a slower copy.
To be sure, what you're trying to do should work (albeit inefficiently), and it's a bug that it doesn't work. Right now, it's a question of what the bug is and where it is. As I said, in my testing of device->device copies, everything worked correctly, but there are a lot of different factors. I'm using a different phone (Nexus/Pixel phones over the years) and no internal vs SD card distinction. As it's a Sony, I think there's a good chance they are using a custom MTP stack, which means you can't assume it'll behave the same as the default google stack.

So I guess some things to check:

1) Can you run mtp-detect and attach the output?
2) Does the behaviour change if you copy a single file (not even a single directory)?
3) Does the behaviour change if you copy the file from internal storage to another location on internal storage (also try SD to SD)?
So, I did several tests as you requested. First of all, here is the result of mtp-detect:

[clement@T470-clement ~]$ mtp-detect
libmtp version: 1.1.13
Listing raw device(s)
Device 0 (VID=0fce and PID=51bb) is a SONY Xperia Z3 Compact MTP+ADB.
Found 1 device(s):
SONY: Xperia Z3 Compact MTP+ADB (0fce:51bb) @ bus 1, dev 78
Attempting to connect device(s)
ignoring libusb_claim_interface() = -6PTP_ERROR_IO: failed to open session, trying again after resetting USB interface
LIBMTP libusb: Attempt to reset device
ignoring libusb_claim_interface() = -6LIBMTP PANIC: failed to open session on second attempt
Unable to open raw device 0
OK.

Running mtp-detect seems to be pretty messy. While MTP is enabled on the first run, I get this result, and it resets the phone's choice to have MTP activated (the pop-up comes back asking what I want to do while plugged in over USB). Running mtp-detect while the pop-up is already there with no choice made gives the same result.

Then, copying a single file can have various behaviours:
- Very small files (a little more than 100KB) copy without giving any error.
- Bigger files like an mp3 track (5 or 10MB) crash the whole thing after copying a very small amount of data (200KB, 700KB…).
- The results are exactly the same when copying from SD card to internal memory or the other way around. Copies from SD to SD or internal to internal also run into issues, the same as described in previous comments.

We can note that cancelling the copy changes nothing: Nautilus is fully frozen, unresponsive, and can't be used anymore (a pop-up comes up regularly asking if I want to wait or force-quit) until I disconnect the MTP device; once disconnected, it comes back to a normal state. Do I need to file another bug for that?

At the same time I noticed another bug when you try to copy a file when one with the same name already exists; I'm going to file it afterwards in the gvfs-mtp component, as it seems to be MTP-related.
Copying a file as an overwrite should work, but again, that assumes the device stack is sane. You need to unmount the device in nautilus before running mtp-detect - or you get the errors you saw.
Created attachment 356459 [details] mtp-detect output

(In reply to Philip Langdale from comment #8)
> Copying a file as an overwrite should work, but again, that assumes the
> device stack is sane.
>
> You need to unmount the device in nautilus before running mtp-detect - or
> you get the errors you saw.

Okay, so I didn't think it was mounted, but I forced it to unmount and got a different result. See the attached file.

For files of the same name, I filed https://bugzilla.gnome.org/show_bug.cgi?id=785480
So, there are enough indications here that the mtp stack is custom - it declares a couple of sony-specific extensions. As such, anything could go wrong. The basic operations you are attempting do work on a stock android device, so this is going to come down to trying to understand what the Sony stack is doing wrong, and that will be hard without access to a Sony device. Likewise for your other bug - copy-over works normally.

I see the z3 compact is not going to get a nougat update, so you won't see any stack improvements without going to a custom ROM. I would not be optimistic that we could work around the bugs here. It's going to end up something like:

1) The partial object implementation in their stack is buggy, so anything that uses it will fail, including device->device copies in nautilus. The only work-around is copying to the PC and then back again.

2) The stack is buggy when trying to overwrite a file; you must delete it explicitly first.
(In reply to Philip Langdale from comment #10)
> So, there are enough indications here that the mtp stack is custom - as it
> declares a couple of sony-specific extensions. As such, it means anything
> could go wrong.
>
> 2) The stack is buggy when trying to overwrite a file; you must delete it
> explicitly first.

Mmmh, so does that mean nothing can be done to take care of these issues and we need to live with it… because, well, Sony messed it up? I don't get the need to replace such basic features and behaviour with custom ones…? Extend it if you feel it doesn't do enough, okay, but replace? Let's not even mention that it's open and they could contribute their changes upstream :/

Same goes for the other strange error message in bug 785480?
Yeah, I know this isn't a very upbeat situation. You can try to isolate the behaviour more if you're motivated. If you want to verify that the issue is related to partial object handling, you can do some tests.

When you normally copy files to and from a device, nautilus will use the whole-file copy APIs. When you do a device->device copy, it uses the partial object API to copy block by block, as I noted above. To separately test partial object handling, you can copy files to/from the device through gvfs-fuse. If you look, you'll see a mount like:

gvfsd-fuse on /run/user/1000/gvfs type fuse.gvfsd-fuse (rw,nosuid,nodev,relatime,user_id=1000,group_id=1000)

on your system. If you open a terminal and go to that location, you'll see your mtp device as a directory and can navigate the directory structure. If you use 'cp' to do the copy, it will use the partial object API. You can see if you can copy files to/from the device this way and check whether the behaviour matches what you saw in nautilus. If it fails in the same way, it's a fundamental problem.

For the overwrite problem, the error says it failed when we tried to send the file metadata to the device, which you do before copying a file over. Now, the way the gvfs code is written, if we detect a file is already there, we send an explicit delete request before sending the copied file. The fact that it is failing where it is failing implies the delete should have succeeded before the copy happens. Do you see that the old file has gone after the failed copy attempt? If so, it's likely that the bug on the device side has something to do with the deleted file's metadata still being accessible, so it gets upset when you say you're about to copy to that file, or something like that.
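The delete-before-overwrite sequence described above can be sketched with a toy in-memory model. All class and method names here are hypothetical; the actual logic is in gvfs's C backend:

```python
# Toy model of the delete-then-send overwrite sequence described above.
# FakeDevice and push_file are hypothetical illustrations, not gvfs APIs.

class FakeDevice:
    def __init__(self):
        self.files = {}   # path -> bytes
        self._ids = {}    # path -> object id
        self._next_id = 1

    def lookup(self, path):
        return self._ids.get(path)

    def _path_for(self, object_id):
        return next(p for p, i in self._ids.items() if i == object_id)

    def delete(self, object_id):
        path = self._path_for(object_id)
        del self._ids[path]
        self.files.pop(path, None)

    def send_metadata(self, path, size):
        oid = self._next_id
        self._next_id += 1
        self._ids[path] = oid
        return oid

    def send_data(self, object_id, data):
        self.files[self._path_for(object_id)] = data

def push_file(device, dest_path, data):
    existing = device.lookup(dest_path)
    if existing is not None:
        # gvfs explicitly deletes the old object first...
        device.delete(existing)
    # ...then sends metadata for the new file, then its contents.
    oid = device.send_metadata(dest_path, size=len(data))
    device.send_data(oid, data)

dev = FakeDevice()
push_file(dev, "/Music/track.mp3", b"v1")
push_file(dev, "/Music/track.mp3", b"v2")   # the overwrite path
print(dev.files["/Music/track.mp3"])        # -> b'v2'
```

The failure mode suggested above would correspond to the device rejecting send_metadata even though the delete already succeeded - e.g. because stale metadata for the deleted object is still visible on the device side.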
(In reply to Philip Langdale from comment #12)
> If you use 'cp' to do the copy, it will use the partial object API. You can
> see if you can copy files to/from the device this way and see if the
> behaviour matches what you saw in nautilus. If it fails in the same way,
> it'll be a fundamental problem.

So, copying at that location with cp works without any issue, both from SD card to internal memory and the other way around, and both for a single file and a full directory.

> The fact that it is failing where it is failing implies the delete should
> have succeeded before the copy happens. Do you see the old file has gone
> after the failed copy attempt?
> If so, it's likely that the bug on the device side has something to do with
> the deleted file's metadata still being accessible so it gets upset when you
> say you're about to copy to that file, or something like that.

So, after doing some tests, it appears the problem only occurs on device-to-device copies (whether SD to SD, internal to internal, or from one to the other). When this message appears, the copy doesn't complete at all. I'll update the other bug with this additional information; I now think those bugs are definitely linked, so I'll tag them in See Also, if not as a dependency.
*** Bug 785480 has been marked as a duplicate of this bug. ***
So, good news, bad news. The good news is I can reproduce your problem. The bad news is that means it's probably affecting a lot more people. It seems like there's been some sort of change in how Nautilus does the copying which is causing things to go south. As with your testing, I can verify that doing the copy through fuse works correctly.
Alright. Now things are fun. There's some sort of deadlock happening, where I see gvfsd-mtp issuing a block read, and then a write, and then it just stops. If I attach gdb and then continue, it will unblock and issue another read/write pair and then stop again. I can interrupt, continue, and it will do it again, and so on. I'm not sure what's getting dropped on the floor here, and I know I tested this functionality in the past and it worked.

As noted, it seems to work for a simple 'cp' through fuse, but I tried rsync, which copies to a temporary file and then tries to move it to the final file name - which maps to another copy, because you can't do moves over mtp. In this case, the first copy worked and then the second copy wedged in the same way it wedges in nautilus. This is very strange.
Philip, can you please attach debug output from gvfsd? To be honest, I'm really surprised that it works over the fuse mount point and not through Nautilus. It might be because fuse does it strictly sequentially, whereas I think Nautilus tries to do it in parallel somehow...
Created attachment 356506 [details] debug output when attempting to copy file from device back to device I'm attaching the debug output. What you'll note is that there's a handful of read and write calls made successfully before it grinds to a halt, and those operations all complete successfully. It does issue overlapping reads and writes, which will be serialised by the mutex. It's possible fuse is internally serialised so that this doesn't arise, but it shouldn't be a problem.
Created attachment 356507 [details] Backtrace when not making progress Here's the full backtrace from gdb. It's thoroughly uninteresting. I've also now got it into a state where breaking into it in gdb doesn't unwedge things. It's hard wedged now.
Thanks. I've looked at it quickly, and I am surprised that all jobs are completed. It looks like it may be hung on the client side, doesn't it? What about a backtrace from Nautilus? Can you please try another GIO-based file manager, e.g. Thunar? Helgrind may help to detect deadlocks...
Created attachment 356682 [details] Backtrace of the copy thread in nautilus

Alright, so here's the key thing. In the deadlock case, it's happening inside g_file_copy() or g_file_move(), and we see the stream_read is happening inside the stream_write. Based on the output in the mtp log, it shouldn't be deadlocking on the mtp mutex, but I know that the common daemon code that interfaces with dbus has locks too, so maybe one of those is the source of the deadlock.

This also explains the difference in behaviour when using fuse. Doing a copy through the fuse mount leads to separate reads and writes which are not nested. This is why the initial rsync copy works. Then when rsync does the rename, it maps to a g_file_move in fuse (although there is a rename primitive available, it is not used), which then deadlocks in the same way.

And what do you know? I tried doing a device->device copy on an iPad and it deadlocked in exactly the same way. weeee!
And, unsurprisingly, I can reproduce with 'gvfs-copy'. I'm trying to find out when it broke, because this used to work.
I went back to 1.24.x and it still deadlocks in the same way. I'd say that points to the regression being in gio...
Philip, thanks for testing. I can reproduce even with my Nexus, but I doubt it is something in GIO; I suppose it is rather something in the gvfs client (maybe related to the GTask port). I will run some tests and let you know...
Hmm, it works correctly with glib-2-50, but not with glib-2-52, so maybe you were right that something is wrong in GIO...
It seems that the following patch is the culprit, but it is still not clear to me what is wrong: https://git.gnome.org/browse/glib/commit/?id=3b5b5696ed121ef6ff48fd076fccf95053db33a7
Created attachment 356848 [details] backtrace full of gvfsd-mtp
Created attachment 356849 [details] backtrace full of gio copy
Created attachment 356850 [details] backtrace full of gio copy
This also affects other backends which supposedly don't implement native copy (i.e. smb)... It hangs if g_output_stream_write doesn't write the whole buffer and repeats the operation - which obviously happens more often with bigger buffers - but it seems it doesn't always fail in that case... It happens only when "copying"; writing alone doesn't seem to be a problem, even if the whole buffer is not always written... By the way, it seems it doesn't happen when running the client under valgrind. Any ideas?
Created attachment 357413 [details] [review] channel: Set sockets as nonblocking to prevent deadlocks when copying

The channel socket pair is not currently set as nonblocking, which may cause deadlocks in some cases (e.g. an in-mount copy over the read-write fallback), because g_output_stream_write_async may block. This issue appeared after increasing the max size of the buffer in the read channel: https://bugzilla.gnome.org/show_bug.cgi?id=773826

Set the channel sockets as nonblocking to be sure that _async methods don't block.
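The idea behind the patch can be shown in miniature: on a nonblocking socket, a write that cannot complete returns control immediately instead of stalling the caller. This is a standalone sketch using a raw socketpair, independent of gvfs's channel code:

```python
import socket

# Create a connected socket pair, like gvfs's channel does, and make
# the writing end nonblocking - the essence of the fix described above.
a, b = socket.socketpair()
a.setblocking(False)

sent = 0
blocked = False
try:
    while True:
        # Keep writing until the in-kernel buffer fills up.
        sent += a.send(b"x" * 65536)
except BlockingIOError:
    # A blocking socket would hang right here forever, since nobody is
    # reading from the other end; nonblocking mode returns control.
    blocked = True

print(sent > 0, blocked)  # -> True True
a.close()
b.close()
```

With blocking sockets, the same loop would be the point where the daemon stalls inside an ostensibly asynchronous write.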
Created attachment 357414 [details] [review] channel: Increase buffer sizes in order to match buffers used in GIO

Channel socket buffers are smaller than the max read channel buffer and the buffers used in GIO:
https://bugzilla.gnome.org/show_bug.cgi?id=773826
https://bugzilla.gnome.org/show_bug.cgi?id=773823

Increase those buffers as well, in order to improve performance.
Can you please test that it fixes the issue for you?
Review of attachment 357413 [details] [review]: Do you have a backtrace of the 2 threads deadlocking? Do you have a concrete test case to reproduce this problem?
> It hangs if g_output_stream_write doesn't write the whole buffer and repeats the operation - which happens more often with bigger buffers obviously - but it seems it doesn't fail always in that case... Does that mean there’s a problem with the error/retry handling on that call? Which call is this? I can’t see how changing the buffer size in GIO should cause problems in GVFS unless there’s some latent bug in GVFS already which was being masked by the buffer sizes being equal on both sides of the connection.
Review of attachment 357414 [details] [review]:

I don't understand why this is necessary. The socket buffers are internal to the sockets, so why does it matter how much data they can hold? I would expect those buffer sizes to be controlled by the kernel and/or glibc in ways that we cannot predict.
I am not sure this is correctly called a deadlock, but the whole problem is that g_output_stream_write_async completely blocks the daemon process, even though it is an asynchronous method. The default socket buffer size is definitely smaller than 256k, but the client uses a 256k buffer. The limited socket buffer size means that g_output_stream_write_async completely blocks the daemon in the read channel (see attachment 356848 [details]) until the client reads the data from the socket - but the client is doing something else: synchronously reading from the write channel in this case (attachment 356850 [details]). The daemon is blocked and can't reply to the client, so the client is blocked as well...

It is enough to set the sockets as nonblocking (attachment 357413 [details] [review]) to fix this issue, so the buffer size change (attachment 357414 [details] [review]) is not strictly needed, but I would think it can improve the transmission, because currently we can't write the whole buffer to the socket...

You can simply reproduce this issue, for example, with:

gio mount localtest:///
gio save "localtest://$PWD/TEST" <<< $(seq -s" " 0 1000000)
gio copy -p "localtest://$PWD/TEST" "localtest://$PWD/TEST2"

It doesn't necessarily happen when retrying, as I said initially in Comment 30, but it seems to be common to the MTP/SMB backends. Maybe this should also be somehow improved in GLib, because g_output_stream_write_async calls the synchronous g_unix_output_stream_write on the main thread and expects it won't block...
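The size mismatch at the heart of this can be checked directly: the in-kernel socketpair buffer is typically much smaller than the 256k channel buffer, so a single blocking write of a full channel buffer cannot complete until the peer reads. A quick standalone check, independent of gvfs:

```python
import socket

# Compare the kernel's socketpair send buffer against the 256 KiB
# channel buffer mentioned above.
a, b = socket.socketpair()
sndbuf = a.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("kernel socket send buffer:", sndbuf, "bytes")
print("gvfs read channel buffer: ", 256 * 1024, "bytes")
# If sndbuf is below 256 KiB (the usual case on default settings), a
# blocking 256 KiB write must wait for the reader - and if the reader
# is itself blocked on the daemon, neither side ever makes progress.
a.close()
b.close()
```

This is exactly why the stall only shows up with the larger client buffer: with buffers that fit inside the kernel's socket buffer, every write completes immediately and the blocking behaviour is never observed.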
Created attachment 357424 [details] minimal example

Here is a minimal example where g_output_stream_write_async blocks. It seems to work correctly with g_unix_open_pipe, so maybe we should use it instead of socketpair, but I am not really an expert in this area... what do you think?
Comment on attachment 357413 [details] [review] channel: Set sockets as nonblocking to prevent deadlocks when copying Attachment 357413 [details] pushed as 0496270 - channel: Set sockets as nonblocking to prevent deadlocks when copying Going to push in stable as well.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/gvfs/issues/309.