GNOME Bugzilla – Bug 637095
gvfsd-metadata looping causes high CPU usage
Last modified: 2015-05-21 11:35:12 UTC
I am using Ubuntu Natty and once every few days gvfsd-metadata starts using a lot of CPU cycles. Stracing it and looking in ~/.local/share/gvfs-metadata/ indicates that it is continuously renaming a journal for 'home' and creating new logs; the system calls match the function meta_builder_write(). Next time I'll check whether there is a fixed number of iterations per second, which might mean a timeout value is broken. It does not consume all of the CPU, but sits between 50% and 60%. It can be stopped by removing that directory or by killing the process.
I can confirm this issue with all versions of gvfsd. There are also additional bug reports in Fedora/RHEL, Debian and Ubuntu which describe this issue. The Fedora/RHEL bug report contains an in-depth analysis by Michael Karcher of the conditions under which this problem arises.
Debian: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=624507
Ubuntu: https://bugs.launchpad.net/fedora/+source/gvfs/+bug/720927
Fedora/RHEL: https://bugzilla.redhat.com/show_bug.cgi?id=561904
Please raise the priority to anything above "Normal", it's a serious issue! Cheers, Adrian
According to the Debian and Fedora reports, this is related to use of an NFS-shared home directory and multiple concurrent logins.
How about, at least for the short term, disabling gvfsd-metadata entirely when running on an NFS /home? I don’t think this kind of setup should be encouraged anyway.
(In reply to comment #3)
> How about, at least for the short term, disabling gvfsd-metadata entirely when
> running on an NFS /home? I don’t think this kind of setup should be encouraged
> anyway.
Sounds reasonable. In any case, there is now a non-disclosed bug report on this, as an important RHEL customer has complained, and upstream will hopefully now work on this issue with increased priority. I don't know of any updates, however. I just know that people at Red Hat are now actually working on the issue. Adrian
Created attachment 235202 [details] [review] Disable gvfsd-metadata on remote filesystems This is the patch I’ll use to work around this.
(In reply to comment #5) > Created an attachment (id=235202) [details] [review] > Disable gvfsd-metadata on remote filesystems > > This is the patch I’ll use to work around this. You could add 9p, ceph, coda, gfs, ocfs to that list. (By the way, the Linux xstat() system call, if it is ever merged, will have an explicit flag for remote files.)
Created attachment 235212 [details] [review] Disable gvfsd-metadata on remote filesystems Here you go.
(In reply to comment #4)
> Sounds reasonable. In any case, there is now a non-disclosed bug report on this,
> as an important RHEL customer has complained, and upstream will hopefully now
> work on this issue with increased priority.
>
> I don't know of any updates, however. I just know that people at Red Hat are
> now actually working on the issue.
Yes, I will be looking at the issue in the coming days to deliver a proper fix. While disabling partial functionality may be a temporary workaround, I'm not keen on pushing it to master.
So there are several problems I've observed during debugging (and the list may still be incomplete):
1. A growing number of journal files; happens if both clients are in "retrying" mode.
2. An occasional metadata daemon segfault (waiting for the next crash to get a backtrace); this looks very rare. For most of the issues I've seen (below), the metadata daemon could recover, but it takes several hours.
3. 100% CPU consumption - the daemon gets stuck in the retry loop within meta_tree_set_string(), after every successful meta_tree_flush_locked() call. This situation probably generates high traffic over disk/network.
4. A race between the main db and the journal - every metatree file contains a header carrying a (random) tag that is used for constructing the journal filename (see the sketch after this comment). However, with the other client being (too) active, e.g. being in situation #3, the journal may not exist anymore. In that case, we get stuck with no journal, leading to errors on every write attempt. The journal file doesn't get recreated or refreshed (until another detected external change?). We return an "Unable to set metadata key" error in this case. There's this comment in the sources:
/* There is a race with tree replacing, where the journal could have been deleted (and the tree replaced) inbetween opening the tree file and the journal. However we can detect this case by looking at the tree and see if its been rotated, we do this to ensure we have an uptodate tree+journal combo. */
One thing I find disturbing is that there is no protection against concurrent access to the same journal file from multiple clients. Clients have the journal file partially mmapped - the data area, but not the header (containing the number of records etc.), which is updated separately. More debugging needed.
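To make the tag/journal relationship in point 4 easier to picture, here is a minimal sketch of the naming scheme, assuming a layout that matches the "<tree>-<8 hex digits>.log" files observed later in this report; the helper name is illustrative, not the actual gvfs function.

#include <glib.h>

/* Illustrative only: the journal filename is derived from the tree file path
 * plus the random tag stored in the tree header.  If the tree is replaced
 * (getting a new random tag) while another writer still appends to the old
 * journal, the two sides stop agreeing on which journal file is current. */
static char *
journal_filename_for_tree (const char *tree_path, guint32 tree_tag)
{
  return g_strdup_printf ("%s-%08x.log", tree_path, tree_tag);
}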
Created attachment 241365 [details] [review] metadata: Be more resistive to broken journal files
This is the first patch in a series of fixes I'm going to publish during the next week. This patch fixes one of the cases (there are more) leading to a metadata daemon infinite loop. See the patch commit message for details. Ideally we should also check the upper bound, since we're constructing a new pointer from invalid data and we could jump completely out of our address space, leading to a segfault. This is handled by the while cycle itself, not going past the first or last entry. From my debugging experiments I was either getting valid data or zeroes, thus the infinite loop case. I haven't seen a segfault in the torture tests in the past weeks.
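For illustration, this is roughly the kind of defensive check the patch is about; it is a sketch with made-up names, not the gvfs code, assuming a journal entry that stores its size both at its start and in its last four bytes (as the verify_journal_entry() snippet quoted later in this report does):

#include <glib.h>
#include <string.h>

typedef struct {
  const guchar *data;  /* mmapped journal contents */
  gsize         len;   /* mapped size */
} Journal;

/* Reject an entry whose size field would take us outside the mapping, or whose
 * leading and trailing size words disagree, instead of trusting zeroed or
 * garbage data and looping (or crashing) on it. */
static gboolean
journal_entry_looks_valid (const Journal *journal, gsize offset)
{
  guint32 entry_len, entry_len_end;

  if (offset + sizeof (guint32) > journal->len)
    return FALSE;                             /* size word itself out of bounds */

  memcpy (&entry_len, journal->data + offset, sizeof entry_len);
  entry_len = GUINT32_FROM_BE (entry_len);

  if (entry_len < 2 * sizeof (guint32) ||     /* too small to hold both size words */
      entry_len > journal->len - offset)      /* upper bound: would run past the mapping */
    return FALSE;

  memcpy (&entry_len_end, journal->data + offset + entry_len - 4, sizeof entry_len_end);
  entry_len_end = GUINT32_FROM_BE (entry_len_end);

  return entry_len == entry_len_end;
}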
Standing issues:
- Simultaneous usage of one journal file by multiple clients (work in progress).
- Infinite loop between meta_journal_add_entry() - meta_tree_flush_locked() - meta_tree_refresh_locked() - meta_journal_add_entry() due to a full journal and no forced rotation. Not yet sure what is causing this case; my suspicion is that the newly generated tree file tag hits one of the existing journal files. My metadata directory contains 16157 journal files at the moment.
- Then there's the problem of leftover journal files; in some cases old files are not being deleted. Could be fixed as a side effect of future patches, though.
Created attachment 241824 [details] [review] metadata: Force tree re-read after successful flush
(In reply to comment #11)
> - Infinite loop between meta_journal_add_entry() - meta_tree_flush_locked() -
> meta_tree_refresh_locked() - meta_journal_add_entry() due to a full journal and
> no forced rotation. Not yet sure what is causing this case; my suspicion is
> that the newly generated tree file tag hits one of the existing journal files.
> My metadata directory contains 16157 journal files at the moment.
The attached patch works around this issue by forcing a re-read after a flush. This also effectively weakens the need for the rotated bit, which now acts more like a safety bit, yet is still very important.
> - Then there's the problem of leftover journal files; in some cases old files
> are not being deleted. Could be fixed as a side effect of future patches,
> though.
This could be caused by the same race in meta_builder_write() where the old journal files only get unlinked after the old tree file has been successfully opened and its header read in. That's more of a conceptual issue, since we need to read the header to be able to construct the journal filename. However, it could probably be dealt with using the current (old) data in the MetaTree header.
Created attachment 241910 [details] [review] metadata: Create new journal if it doesn't exist
Created attachment 242151 [details] [review] metadata: Put journal in $XDG_RUNTIME_DIR for shared NFS homedir case
This patch changes the location of the journal files to $XDG_RUNTIME_DIR instead of the usual location. This will allow us to overcome some NFS issues. In case of g_get_user_runtime_dir() failure, we fall back to the homedir, leaving things in the same state as before. Since we always create a new journal at the beginning, I had an idea of moving the old journal file to our private location so that all data are actually preserved. Another idea is to tweak the writeout timeout for the NFS case. Btw, we may tweak the function namespaces a little bit...
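As a rough illustration of the idea only (the helper name and the NFS flag are assumptions, not the actual patch), the journal directory selection could look like this, falling back to the usual location when no runtime directory is usable:

#include <glib.h>

static char *
choose_journal_dir (const char *default_dir, gboolean metadata_on_nfs)
{
  if (metadata_on_nfs)
    {
      const char *runtime_dir = g_get_user_runtime_dir ();

      if (runtime_dir != NULL && *runtime_dir != '\0')
        {
          char *dir = g_build_filename (runtime_dir, "gvfs-metadata", NULL);

          if (g_mkdir_with_parents (dir, 0700) == 0)
            return dir;        /* keep journals on a local, per-session filesystem */
          g_free (dir);
        }
    }

  /* Fallback: keep journals next to the trees, as before. */
  return g_strdup (default_dir);
}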
Created attachment 242220 [details] [review] metadata: Use shorter writeout timeout on NFS This changes the writeout timeout for metatrees on NFS to 15 seconds (the value is subject to discussion).
Created attachment 242246 [details] [review] metadata: Flush all scheduled writeouts on daemon exit Last piece in the series of patches that ensures NFS homedir compatibility.
Let me describe my test environment a bit. I was working with two machines with identical user setups. The user homedir was mapped to an NFS mount, with the same UID/GID.
1. Start a desktop session on both machines.
2. In your homedir, touch some file, let's call it "test".
3. Start an infinite loop constantly changing attributes:
while true; do gvfs-set-attribute test metadata::ahoj "xxxx"; sleep .1; gvfs-set-attribute test metadata::ahoj "xxxxx"; sleep .1; echo -n "."; done
This does the change every 100 ms and was sufficient to trigger the race. Often it took several minutes; the original writeout timeout was set to 60 seconds. You should see one of the daemons going crazy, eating 100% CPU.
Review of attachment 241365 [details] [review]: looks good to me. Maybe we can also verify that the size stored at the start and end of the journal entry are the same.
Review of attachment 241824 [details] [review]: Looks 100% safe and seems like a good idea.
Review of attachment 241910 [details] [review]: looks good
Review of attachment 242151 [details] [review]:
Otherwise it looks ok.
::: metadata/metabuilder.c
@@ +924,3 @@
+      if (! g_file_test (real_path, G_FILE_TEST_EXISTS))
+        {
+          if (g_mkdir (real_path, 0700) != 0)
I think this needs to be mkdir_with_parents.
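For clarity, the suggested change would look roughly like this (a sketch of the review comment, not the committed hunk):

#include <glib.h>

/* Create the directory including any missing parent directories, as suggested
 * above, instead of a single-level g_mkdir(). */
static gboolean
ensure_metadata_dir (const char *real_path)
{
  if (!g_file_test (real_path, G_FILE_TEST_EXISTS))
    return g_mkdir_with_parents (real_path, 0700) == 0;

  return TRUE;
}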
Review of attachment 242220 [details] [review]: looks good to me.
Review of attachment 242246 [details] [review]: looks good
Great, thanks for the review! Committed all the above, with corrections.
(In reply to comment #18)
> Maybe we can also verify that the size stored at the start
> and end of the journal entry are the same.
Looking at verify_journal_entry():
  entry_len = GUINT32_FROM_BE (entry->entry_size);
  entry_len_end = GUINT32_FROM_BE (*(guint32 *)(journal->data + offset + entry_len - 4));
  if (entry_len != entry_len_end)
    return NULL;
I think we do that already. Or did you mean a different kind of check?
(In reply to comment #24) > Great, thanks for the review! Committed all the above, with corrections. Awesome. Could you please still leave the bug open for some more weeks? I need more time for a thorough testing. I will definitely give feedback. I just want to make sure gvfsd-metadata runs rock-solid now. Cheers, Adrian
(In reply to comment #25) > Could you please still leave the bug open for some more weeks? I need more time > for a thorough testing. I will definitely give feedback. I just want to make > sure gvfsd-metadata runs rock-solid now. I'll forget to close it for sure, but yes, I'll keep it open. I appreciate your testing effort, very valuable.
Created attachment 246784 [details] [review] metadata: Fix short journal check
This was reading the size from the wrong place, *sizep instead of *(sizep-1), and the out-of-bounds checks were wrong as well.
Attachment 246784 [details] pushed as 5a4f9e6 - metadata: Fix short journal check
gvfs-1.20.2 is still exhibiting the random SSD thrashing problem for me. Looking at ~/.local/share/gvfs-metadata/, it just creates and deletes a couple of files in an endless loop. The daemon doesn't seem to be logging anything anywhere that I could check. My only remedy to stop it has been "mv /usr/libexec/gvfsd-metadata /usr/libexec/gvfsd-metadata.disabled".
This is on Gentoo. Some features are disabled by flags, prefixed by "- -" below.
How can I help debug this further? I could probably pause the process with gdb and inspect the output files?
$ equery u gvfs
[ Legend : U - final flag setting for installation]
[        : I - package is installed with flag     ]
[ Colors : set, unset                             ]
 * Found these USE flags for gnome-base/gvfs-1.20.2:
 - - afp : Enables support for accessing AFP (Apple Filing Protocol) network shares
 - - archive : Enables support for accessing files in archives transparently via app-arch/libarchive
 - - avahi : Add avahi/Zeroconf support
 - - bluray : Enable playback of Blu-ray filesystems using media-libs/libbluray
 + + cdda : Add Compact Disk Digital Audio (Standard Audio CD) support
 - - fuse : Enables fuse mount points in $HOME/.gvfs for legacy application access
 + + gnome-online-accounts : Enable configuration panel for net-libs/gnome-online-accounts accounts
 - - gphoto2 : Add digital camera support
 + + gtk : Add support for x11-libs/gtk+ (The GIMP Toolkit)
 + + http : Enable the HTTP/DAV backend using net-libs/libsoup-gnome
 - - ios : Enable support for Apple's iDevice with iOS operating system (iPad, iPhone, iPod, etc)
 + + libsecret : Enable app-crypt/libsecret support to store login credentials.
 - - mtp : Enable support for Media Transfer Protocol
 - - samba : Add support for SAMBA (Windows File and Printer sharing)
 + + systemd : Enable use of systemd-specific libraries and features like socket activation or session tracking
 - - test : Workaround to pull in packages needed to run with FEATURES=test. Portage-2.1.2 handles this internally, so don't set it in make.conf/package.use anymore
 + + udev : Enable udev base replacement code for cdda feature
 + + udisks : Enable storage management support (automounting, volume monitoring, etc)
(In reply to comment #29)
> gvfs-1.20.2 is still exhibiting the random SSD thrashing problem for me.
> Looking at ~/.local/share/gvfs-metadata/, it just creates and deletes a couple
> of files in an endless loop. The daemon doesn't seem to be logging anything
> anywhere that I could check. My only remedy to stop it has been "mv
> /usr/libexec/gvfsd-metadata /usr/libexec/gvfsd-metadata.disabled".
>
> This is on Gentoo. Some features are disabled by flags, prefixed by "- -"
> below.
>
> How can I help debug this further? I could probably pause the process with
> gdb and inspect the output files?
Probably most useful would be a sure-fire way of reproducing the problem.
I wish I could pinpoint how it got started. This is the merge history for the package:
$ qlop -l | grep gvfs
Wed Apr 13 19:51:15 2011 >>> gnome-base/gvfs-1.6.6-r1
Sat Apr 21 01:00:54 2012 >>> gnome-base/gvfs-1.10.1
Sun Jul 22 02:03:33 2012 >>> gnome-base/gvfs-1.10.1
Sun Jul 22 02:12:53 2012 >>> gnome-base/gvfs-1.12.3
Sun Sep 23 05:26:13 2012 >>> gnome-base/gvfs-1.12.3
Wed Jan 2 11:59:02 2013 >>> gnome-base/gvfs-1.14.2
Thu Dec 26 01:23:27 2013 >>> gnome-base/gvfs-1.16.4
Sun Mar 16 00:12:10 2014 >>> gnome-base/gvfs-1.18.3
Sat Jun 7 21:58:28 2014 >>> gnome-base/gvfs-1.18.3-r1
Sun Jul 27 16:31:54 2014 >>> gnome-base/gvfs-1.20.2
I hadn't really noticed this before, up until sometime last week. Maybe I have downloaded some graphics or other files that are broken in some way that completely confuses gvfsd during indexing (or whatever it does)?
I also notice this problem under Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=756205 I first noticed it on July 27, and it occurred again today (August 14).
Thanks for the patches. But can we get some traction here again, please?
Sorry to spam here, but this keeps happening for me regularly as well. I've tried various things, killing the process, deleting the directory IIRC, but it keeps coming back with 100%, maybe once a week or so, but it's irregular.
1. I've probably seen 100 posts on the Internet saying to delete the metadata folder, but that's not a solution: even if it does get corrupted, the program MUST NOT go crazy and into an infinite loop or hog the CPU. That said, I do NOT think that metadata file corruption is the problem, because the problem keeps coming back for me.
2. I have huge NFS drives with many files mounted. If it tries to index or capture them, then it's sure to cause problems. People above said that it tries to index Samba drives. That's massively misguided and should not happen. It should only index local drives. Network drives can be huge, and often contain data that's not important to this user, but to other employees. In my case, it's media files (TV recordings, music etc.), but still *way* too huge to be indexed on the client. Even attempting to do this is braindead (this is a technical term, not an insult).
3. Last, but most importantly, no program has the right to index all my drives without my *explicit* permission. Ask me, once. Give me a choice of which drives I want indexed, and where that index should be stored. This can be dangerous. People regularly keep private stuff on a physical USB drive, for the explicit reason of not leaving trails on the local computer - whatever reason they might have. Having metadata about files stored on a *different* drive (partition) is a privacy threat by its very nature and should never be done without explicit user approval.
This is totally misguided and should be disabled RIGHT NOW, then fixed, then it can be enabled again.
(In reply to comment #33)
> Thanks for the patches. But can we get some traction here again, please?
>
> Sorry to spam here, but this keeps happening for me regularly as well. I've
> tried various things, killing the process, deleting the directory IIRC, but it
> keeps coming back with 100%, maybe once a week or so, but it's irregular.
Well, if you can find a way of reproducing it easily, that would be helpful.
> 1. I've probably seen 100 posts on the Internet saying to delete the metadata
> folder, but that's not a solution: even if it does get corrupted, the program
> MUST NOT go crazy and into an infinite loop or hog the CPU. That said, I do
> NOT think that metadata file corruption is the problem, because the problem
> keeps coming back for me.
Yes, that's why this bug report is open.
> 2. I have huge NFS drives with many files mounted. If it tries to index or
> capture them, then it's sure to cause problems. People above said that it
> tries to index Samba drives. That's massively misguided and should not happen.
> It should only index local drives. Network drives can be huge, and often
> contain data that's not important to this user, but to other employees. In my
> case, it's media files (TV recordings, music etc.), but still *way* too huge
> to be indexed on the client. Even attempting to do this is braindead (this is
> a technical term, not an insult).
>
> 3. Last, but most importantly, no program has the right to index all my drives
> without my *explicit* permission. Ask me, once. Give me a choice of which
> drives I want indexed, and where that index should be stored. This can be
> dangerous. People regularly keep private stuff on a physical USB drive, for
> the explicit reason of not leaving trails on the local computer - whatever
> reason they might have. Having metadata about files stored on a *different*
> drive (partition) is a privacy threat by its very nature and should never be
> done without explicit user approval.
>
> This is totally misguided and should be disabled RIGHT NOW, then fixed, then
> it can be enabled again.
gvfsd-metadata does not index files, that's what tracker does.
> if you can find a way of reproducing it easily Does gvfs track files on NFS or Samba? From what I read, that is the case. If so, that's the bug and reproduction. This *will* go wrong, my stores are too big to scan them all, and I don't think they are exceptional in that sense. > gvfsd-metadata does not index files Yes, I used the term loosely, in the sense of keeping any metadata about other files. What I said still applies.
I just see that gvfs-metadata is *again* hogging the CPU and 100% of the disk. If I don't kill it, this goes on indefinitely. I had done "rm -rf ~/.local/share/gvfs-metadata/" just 2 hours ago, so that's definitely not working. And it's not a question of random storage corruption. I have plenty of free space on all drives, so that's not it either. I can't tell you a reproduction, nor would I know how to find one. I've never ever even started it. In fact, I have yet to find a way to just kill it for good. It keeps coming back, and none of the 100 posts on the net has shown a way to turn it off entirely, despite dozens of people asking specifically for it. I'm really upset at such programs that do things with all my data without my permission. Stuff like this MUST be opt-IN. Turn it off NOW, fix the bugs, add opt-in UI, then let it hit the poor people.
Given that the process is going wild right now, I've been looking at lsof|grep gvfs and ~/.local/share/gvfs-metadata/. What I see there is most scary.
First, in ~/.local/share/gvfs-metadata/, I usually see 3 files at a time with a UUID filename, of which 1-2 have a .log extension and the others do not. The files change every second (!), they get created and deleted immediately. No wonder we have a CPU and disk hog. Are you actively trying to kill my SSD and my filesystem, or is this create/delete cycle a bug?
In fact, this happens so fast that an "ls -la" throws errors at any given time, but different ones each time, because ls and the filesystem apparently can't keep up with the extremely fast creates/deletes. An example:
$ ls -la
ls: cannot access uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-3cae41a6.log: No such file or directory
total 44
drwx------ 2 ben ben 4096 Aug 25 21:43 .
drwx------ 28 ben ben 4096 Aug 25 20:34 ..
-rw------- 1 ben ben 56 Aug 25 21:43 uuid-567fed53-db21-4c1b-bf3f-129a448a56e7
?????????? ? ? ? ? ? uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-3cae41a6.log
-rw-rw-r-- 1 ben ben 32768 Aug 25 21:43 uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-fd5950d4.log
Please note the "?" in the file stats, and that it reports a total of 44, yet only 3 files show up. This is not a filesystem corruption, because the files do not stay like that. The next ls -la shows:
$ ls -la
total 40
drwx------ 2 ben ben 115 Aug 25 21:46 .
drwx------ 28 ben ben 4096 Aug 25 20:34 ..
-rw------- 1 ben ben 56 Aug 25 21:46 uuid-567fed53-db21-4c1b-bf3f-129a448a56e7
-rw-rw-r-- 1 ben ben 32768 Aug 25 21:46 uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-329ad117.log
so the ? are gone. Then, 2 ls later, again I get:
$ ls -la
ls: cannot access uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-d5f586ab.log: No such file or directory
ls: cannot access uuid-567fed53-db21-4c1b-bf3f-129a448a56e7.DK8GNX: No such file or directory
total 12
drwx------ 2 ben ben 4096 Aug 25 21:47 .
drwx------ 28 ben ben 4096 Aug 25 20:34 ..
-rw------- 1 ben ben 56 Aug 25 21:47 uuid-567fed53-db21-4c1b-bf3f-129a448a56e7
?????????? ? ? ? ? ? uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-d5f586ab.log
?????????? ? ? ? ? ? uuid-567fed53-db21-4c1b-bf3f-129a448a56e7.DK8GNX
So, this is a constantly changing situation. This is substantiated by lsof, which shows lots of "DEL".
Again, they change with each invocation:
gvfsd-htt 14751 ben 9w IPv4 510436234 0t0 TCP mymachine:48423->myinternetproxy:3128 (CLOSE_WAIT)
gvfsd-htt 14751 ben 10u unix 0x0000000000000000 0t0 510437000 socket
gvfsd-htt 14751 ben 11w unix 0x0000000000000000 0t0 510436999 socket
gvfsd-htt 14751 ben 12u IPv4 510437004 0t0 TCP mymachine:48424->myinternetproxy:3128 (CLOSE_WAIT)
thunderbi 27588 ben DEL REG 8,8 1973904 /home/me/.local/share/gvfs-metadata/root
thunderbi 27588 ben mem REG 8,6 94616 50734229 /usr/lib/x86_64-linux-gnu/gvfs/libgvfscommon.so
thunderbi 27588 ben mem REG 8,6 169000 50734535 /usr/lib/x86_64-linux-gnu/gio/modules/libgvfsdbus.so
thunderbi 27588 ben DEL REG 8,8 1970882 /home/me/.local/share/gvfs-metadata/root-861ac150.log
thunderbi 27588 ben 97r REG 8,8 541548 1973904 /home/me/.local/share/gvfs-metadata/root (deleted)
thunderbi 27588 ben 99r REG 8,8 32768 1970882 /home/me/.local/share/gvfs-metadata/root-861ac150.log (deleted)
thunderbi 27705 ben mem REG 8,6 94616 50734229 /usr/lib/x86_64-linux-gnu/gvfs/libgvfscommon.so
thunderbi 27705 ben mem REG 8,6 169000 50734535 /usr/lib/x86_64-linux-gnu/gio/modules/libgvfsdbus.so
gvfsd-met 29802 ben cwd DIR 8,6 4096 128 /
gvfsd-met 29802 ben rtd DIR 8,6 4096 128 /
gvfsd-met 29802 ben txt REG 8,6 56640 17596969 /usr/lib/gvfs/gvfsd-metadata
gvfsd-met 29802 ben mem REG 8,6 94616 50734229 /usr/lib/x86_64-linux-gnu/gvfs/libgvfscommon.so
gvfsd-met 29802 ben mem REG 8,8 32768 68484339 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-0cf40ef4.log
gvfsd-met 29802 ben DEL REG 8,8 68484330 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7.VKJ7MX
gvfsd-met 29802 ben mem REG 8,6 26258 36002336 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
gvfsd-met 29802 ben 0u CHR 1,3 0t0 1029 /dev/null
gvfsd-met 29802 ben 1u CHR 1,3 0t0 1029 /dev/null
gvfsd-met 29802 ben 2u CHR 1,3 0t0 1029 /dev/null
gvfsd-met 29802 ben 3u 0000 0,9 0 6826 anon_inode
gvfsd-met 29802 ben 4u CHR 1,3 0t0 1029 /dev/null
gvfsd-met 29802 ben 5u unix 0x0000000000000000 0t0 838838128 socket
bash 29984 ben cwd DIR 8,8 4096 68484332 /home/me/.local/share/gvfs-metadata
lsof 30025 ben cwd DIR 8,8 4096 68484332 /home/me/.local/share/gvfs-metadata
grep 30026 ben cwd DIR 8,8 4096 68484332 /home/me/.local/share/gvfs-metadata
lsof 30027 ben cwd DIR 8,8 4096 68484332 /home/me/.local/share/gvfs-metadata
firefox 32745 ben DEL REG 8,8 1973904 /home/me/.local/share/gvfs-metadata/root
firefox 32745 ben DEL REG 8,8 1970907 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7.A6L0KX
firefox 32745 ben DEL REG 8,8 1970908 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-421f706a.log
firefox 32745 ben DEL REG 8,8 1964811 /home/me/.local/share/gvfs-metadata/uuid-9fb1aaeb-0f31-4beb-a64e-e27dccc59897-6b0316d0.log
firefox 32745 ben mem REG 8,6 94616 50734229 /usr/lib/x86_64-linux-gnu/gvfs/libgvfscommon.so
firefox 32745 ben mem REG 8,6 169000 50734535 /usr/lib/x86_64-linux-gnu/gio/modules/libgvfsdbus.so
firefox 32745 ben DEL REG 8,8 1970882 /home/me/.local/share/gvfs-metadata/root-861ac150.log
firefox 32745 ben DEL REG 8,8 1964810 /home/me/.local/share/gvfs-metadata/uuid-9fb1aaeb-0f31-4beb-a64e-e27dccc59897.VE11KX
firefox 32745 ben 65r REG 8,8 5976 1964810 /home/me/.local/share/gvfs-metadata/uuid-9fb1aaeb-0f31-4beb-a64e-e27dccc59897.VE11KX (deleted)
firefox 32745 ben 66r REG 8,8 32768 1964811 /home/me/.local/share/gvfs-metadata/uuid-9fb1aaeb-0f31-4beb-a64e-e27dccc59897-6b0316d0.log (deleted)
firefox 32745 ben 72r REG 8,8 541548 1973904 /home/me/.local/share/gvfs-metadata/root (deleted)
firefox 32745 ben 73r REG 8,8 37668 1970907 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7.A6L0KX (deleted)
firefox 32745 ben 75r REG 8,8 32768 1970908 /home/me/.local/share/gvfs-metadata/uuid-567fed53-db21-4c1b-bf3f-129a448a56e7-421f706a.log (deleted)
firefox 32745 ben 77r REG 8,8 32768 1970882 /home/me/.local/share/gvfs-metadata/root-861ac150.log (deleted)
The thing that scares me most is that Firefox is the process, not gvfs. (Yet, killing gvfsd-metadata solves it - for the short term, so it's definitely related to that process.) What in the world are Firefox and Thunderbird doing here? FWIW, these are self-compiled 64-bit trunk builds of Firefox and Thunderbird, because I'm a Mozilla developer, but I have seen this bug for 1-2 years, so it's not related to any new Mozilla code change. I'm on Ubuntu 12.04 LTS.
gvfsd-htt 14751 ben 9w IPv4 510436234 0t0 TCP mymachine:48423->myinternetproxy:3128 (CLOSE_WAIT) That's probably the most scary. What is gvfsd doing on my HTTP internet proxy, talking to the world?
So, reading the Mozilla source code http://mxr.mozilla.org/comm-central/search?string=gioservice the linking with gvfs is for GIO, and the only thing it *should* be doing is default app integration, i.e. knowing which app to start for which filetype or URL. I.e. /etc/mime.types et al. Definitely nothing about generic file metadata, and nothing that should write to .local/share/gvfs-metadata/ . And MOST DEFINITELY nothing that should talk over the internet.
Confirming everything Ben is talking about. Exactly the same findings. I'm also living on Mozilla products: FF, TB, Lightning. Ben: re-quoting from above "my only remedy to stop it has been mv /usr/libexec/gvfsd-metadata /usr/libexec/gvfsd-metadata.disabled " if you want to pause on debugging this. You wrote above you found no way to stop it, so I wasn't sure whether it included this sledgehammer or not ;)
(In reply to comment #38)
> gvfsd-htt 14751 ben 9w IPv4 510436234 0t0
> TCP mymachine:48423->myinternetproxy:3128 (CLOSE_WAIT)
>
> That's probably the most scary. What is gvfsd doing on my HTTP internet proxy,
> talking to the world?
Lots of programs use gvfs to access HTTP URLs; IIRC Unity's dash does, among other programs.
(In reply to comment #40)
> Confirming everything Ben is talking about. Exactly the same findings. I'm also
> living on Mozilla products: FF, TB, Lightning.
>
> Ben: re-quoting from above "my only remedy to stop it has been
> mv /usr/libexec/gvfsd-metadata /usr/libexec/gvfsd-metadata.disabled
> " if you want to pause on debugging this. You wrote above you found no way to
> stop it, so I wasn't sure whether it included this sledgehammer or not ;)
Or chmod a-x ...
(In reply to comment #40)
Thanks for the info. Firstly, yes, it's clearly a bug. Secondly, gvfsd-metadata is a data store. It doesn't track, store or index stuff behind your back; programs (like Firefox) use it to store information, and it's no different from a privacy perspective than if they stored the data somewhere else. Thirdly, if you're on Ubuntu 12.04, you probably haven't got the fixes that were done last year to make it more resilient; some of these problems may have been fixed. Fourthly, the reason you see Firefox doing stuff with metadata files is that storing and retrieving is partially implemented in both the client side and the daemon. And yes, Firefox does actually set custom metadata: search toolkit/components/downloads/nsDownloadManager.cpp for "metadata". It's probably best for you to disable gvfsd-metadata since it's unlikely to be fixed on Ubuntu 12.04 any time soon.
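To make the "data store" point concrete: clients talk to gvfsd-metadata through GIO's metadata:: attribute namespace. A minimal sketch of how an application stores a value (the key and URL here are just examples; this is standard GIO API, not Firefox's actual code):

#include <gio/gio.h>

int
main (int argc, char *argv[])
{
  GFile *file;
  GError *error = NULL;

  if (argc < 2)
    return 1;

  file = g_file_new_for_commandline_arg (argv[1]);

  /* The write goes through the gvfs GIO module and ends up in the metadata
   * store under ~/.local/share/gvfs-metadata/. */
  if (!g_file_set_attribute_string (file, "metadata::download-uri",
                                    "http://example.com/where-it-came-from",
                                    G_FILE_QUERY_INFO_NONE, NULL, &error))
    {
      g_printerr ("Unable to set metadata: %s\n", error->message);
      g_error_free (error);
    }

  g_object_unref (file);
  return 0;
}

The gvfs-set-attribute and gvfs-info tools mentioned elsewhere in this report do the same thing from the command line.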
> disable gvfsd-metadata Well, there's no way to do that. Moving exec files doesn't count, nor does an obscure gsetting. I should control it in the UI, as opt-in, with fine-grained control. > Firefox does actually set custom metadata: search > toolkit/components/downloads/nsDownloadManager.cpp for "metadata". Yes, that's the OS integration: It merely stores the URL from where a download file came. BTW: This should be stored as extended attributes in the filesystem (where the filesystem supports it, which is most common ones today), directly in the file node, not in /home/. Failing EAs, it should be stored at least in the same filesystem. Crossing filesystems is a massive privacy bug, see scenario mentioned above. Either way, I had not done a download, so I don't think that's it. Did you notice that both Firefox and Thunderbird are in the lsof? They appear both at the same moment, and disappear when I kill gvfs. None of this explains gvfs going wild and creating millions of randomly named files.
(In reply to comment #35)
> Does gvfs track files on NFS or Samba? From what I read, that is the case. If
> so, that's the bug and reproduction. This *will* go wrong, my stores are too
> big to scan them all, and I don't think they are exceptional in that sense.
It's purely a database built exclusively for gvfs. It stores and reads metadata that somebody pushes in there. It could be Tracker or another indexer; you need to find the offender. If something is traversing your mounts, it's certainly not gvfs.
(In reply to comment #36)
> In fact, I have yet to find a way to just kill it for good. It keeps coming
> back, and none of the 100 posts on the net has shown a way to turn it off
> entirely, despite dozens of people asking specifically for it. I'm really upset
> at such programs that do things with all my data without my permission.
No need for killing it, it's a database and does nothing on its own. The CPU and I/O hog is just the result of a bug triggered by some other application storing metadata. We need to find the root cause in order to fix the bug. Disabling gvfs-metadata is not a solution, just a temporary workaround that will hurt other desktop components on the other end. Don't follow the posts over the internet, you can only do more harm to your system.
> Stuff like this MUST be opt-IN. Turn it off NOW, fix the bugs, add opt-in UI,
> then let it hit the poor people.
Again, it's just a database. It's an integral part of the ecosystem; you need to turn off the things that are using the database.
(In reply to comment #37)
> The files change every second (!), they get created and deleted immediately. No
> wonder we have a CPU and disk hog.
Right, it's the same old bug again. Now we need to find what's triggering the problem (internally, not the outside application).
> Are you actively trying to kill my SSD and my filesystem, or is this
> create/delete cycle a bug?
Sorry for that, it's a real problem, not intended.
> In fact, this happens so fast that an "ls -la" throws errors at any given time,
> but different ones each time, because ls and the filesystem apparently can't
> keep up with the extremely fast creates/deletes.
Yes, also a known behavior.
> The thing that scares me most is that Firefox is the process, not gvfs. (Yet,
> killing gvfsd-metadata solves it - for the short term, so it's definitely
> related to that process.) What in the world are Firefox and Thunderbird doing
> here?
I remember a patch storing the original URI for downloaded files. You can list the stored keys for a particular file using gvfs-info.
(In reply to comment #42)
> BTW: This should be stored as extended attributes in the filesystem (where the
> filesystem supports it, which is most common ones today), directly in the file
> node, not in /home/. Failing EAs, it should be stored at least in the same
> filesystem. Crossing filesystems is a massive privacy bug, see scenario
> mentioned above.
Usually, xattrs are turned off or there's not enough space to store long strings. Also, not every file management application handles xattrs when copying/moving files. One of the advantages of gvfs-metadata is the ability to store metadata for non-local or removable filesystems without needing xattrs. Changes to the file (move, deletion) done through GIO are automatically reflected in the gvfs metadata database. But yes, having both ways simultaneously would be nice.
> Either way, I had not done a download, so I don't think that's it. Did you
> notice that both Firefox and Thunderbird are in the lsof? They appear both at
> the same moment, and disappear when I kill gvfs.
Hmm, you could perhaps monitor the metadata flow by monitoring D-Bus traffic, but it's not an easy thing to do.
Enough of the nonsense rant. Even though I don't have time to debug the issue, there is missing information that we need to collect, and ideally we should try to reproduce the issue locally so that we can finally fix the problem. The original issue happened when two processes (metadata writers) modified the database at the same time. Don't worry about readers (clients), they just mmap() the database files and do not loop.
Questions:
1. Is your /home/me/.local/share/gvfs-metadata/ on a local filesystem?
2. Are you running just a single desktop session at a time?
3. Are the gvfs packages the distro version? No other repository or custom-made builds?
If this happens when all files in ~/.local/share/gvfs-metadata/ are deleted (and the metadata daemon is not running), there's a good chance of finding a new bug, as so far we've been focusing on the NFS case. We would need an exact list of steps done from desktop session login until the CPU hog. Or, at least, a snapshot of the databases right before the bug, plus the exact steps since the snapshot. This is a debugging nightmare, let's admit it. But without a reproducer, there's very little chance of finding the root cause.
> 1. Is your /home/me/.local/share/gvfs-metadata/ on a local filesystem?
Yes, a local SSH
> 2. Are you running just a single desktop session at a time?
Yes
> 3. Are the gvfs packages the distro version? No other repository or custom-made builds?
Correct, GNOME packages are just plain Ubuntu 12.04.
> you need to find the offender
As you said yourself, I am in no position to do that. You'll need to have the tools to inspect what's happening. I clearly don't know gvfs, I'm just reporting a very serious bug. If you give me a specific command to run (which doesn't expose private data), I can try to run it when the bug hits the next time. I don't have endless time to debug this, though; I already sank 2-3 hours into this and have a job of my own. I would say the lsof above is already a lot of data to start with.
> > ls and the filesystem apparently can't keep up with the extremely fast creates/deletes.
> Yes, also a known behavior.
At the very least, it's undeniable that gvfs (and not the app) is doing these creates/deletes of files, millions upon millions of them. That is something that's happening in your code, and you can fix it without knowing which application triggered it. That should never happen, no matter what the application does. Upon code inspection, I'm sure you can find some lines of code that could conceivably cause this, and fix them.
> Yes, a local SSH Typo: SSD
I have found a reproducer and the problem. Simply browse to http://rossl.org/junk/meta.html in Firefox, right-click and save the image to your Downloads folder. This should trigger gvfsd-metadata constant disk usage. The problem is that when an application tries to save a large key-value pair, larger than the size of the journal (32*1024), it triggers the journal to be flushed to make space for the entry and the operation is then retried, but it never fits, and the loop continues forever. As for why this is triggered by Firefox, Firefox saves the download URI in metadata::download-uri. If the download URI is a data URL (like in the above link), it can easily be > 32kb and so trigger the issue. I presume a similar thing happens with Thunderbird.
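Schematically, the failure mode is the one below; this is a deliberately simplified sketch (not the real gvfs code), with a stand-in for the journal append, just to show why the unconditional retry can never terminate once the entry alone is bigger than an empty 32 KiB journal:

#include <glib.h>

#define JOURNAL_SIZE (32 * 1024)

/* Stand-in for appending an entry; fails when the entry does not fit. */
static gboolean
journal_add_entry (gsize journal_used, gsize entry_size)
{
  return journal_used + entry_size <= JOURNAL_SIZE;
}

static void
set_key_broken (gsize entry_size)
{
  gsize used = JOURNAL_SIZE / 2;   /* some existing journal content */

 retry:
  if (!journal_add_entry (used, entry_size))
    {
      used = 0;      /* "flush": journal rotated, now empty */
      goto retry;    /* if entry_size > JOURNAL_SIZE, this loops forever */
    }
}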
(In reply to comment #47) > I have found a reproducer and the problem. Simply browse to > http://rossl.org/junk/meta.html in Firefox, right-click and save the image to > your Downloads folder. This should trigger gvfsd-metadata constant disk usage. I confirm this reproducer on my machine.
I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1060067 that Firefox should not save such data: URLs and thus not trigger the bug. They are useless as a download URI record anyway.
> The problem is that when an application tries to save a large key-value pair,
> larger than the size of the journal (32*1024), it triggers the journal to be
> flushed to make space for the entry and the operation is then retried,
> but it never fits, and the loop continues forever.
Ross, thanks a lot for finding a reproduction and esp. this analysis. It seems to me that two fixes are needed here:
1. Prevent saving metadata bigger than the supported size. Also reasonably check and sanitize input values in other dimensions.
2. Prevent that retry loop. Recursive code should have an exit condition, not just for success, but also for errors. Personally, I would not retry at all. Also, if possible, I'd add a loop counter and exit when 5 loops have been reached.
I would say fix 2 is the more important one.
Created attachment 284795 [details] [review] metatree: avoid endless looping when the entry is too large Thanks all for debugging and testing. This patch removes the endless retry loop and retries the operation only once after the flush. We know that there isn't enough space for the entry if it fails after the flush.
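In other words, the flow becomes "try, flush once, try again, then give up with an error". A minimal sketch of that shape (illustrative names, not the actual patch):

#include <glib.h>

#define JOURNAL_SIZE (32 * 1024)

static gboolean
journal_add_entry (gsize journal_used, gsize entry_size)
{
  return journal_used + entry_size <= JOURNAL_SIZE;
}

static gboolean
set_key_fixed (gsize entry_size)
{
  gsize used = JOURNAL_SIZE / 2;          /* some existing journal content */

  if (journal_add_entry (used, entry_size))
    return TRUE;

  used = 0;                               /* flush rotates the journal exactly once */

  if (journal_add_entry (used, entry_size))
    return TRUE;

  /* Still does not fit in an empty journal: the entry is simply too large,
   * so fail with "Unable to set metadata key" instead of retrying forever. */
  return FALSE;
}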
Review of attachment 284795 [details] [review]: This looks like a possible solution; it might make sense to put the logic in a separate function rather than repeating it 4 times. However, I'm sure the retry loop was there for a reason. Given that I don't know this code at all, it would be best for Tomas to review it properly.
Review of attachment 284795 [details] [review]:
(In reply to comment #47)
> I have found a reproducer and the problem. Simply browse to
> http://rossl.org/junk/meta.html in Firefox, right-click and save the image to
> your Downloads folder. This should trigger gvfsd-metadata constant disk usage.
>
> The problem is that when an application tries to save a large key-value pair,
> larger than the size of the journal (32*1024), it triggers the journal to be
> flushed to make space for the entry and the operation is then retried, but it
> never fits, and the loop continues forever.
Ha, good catch!
(In reply to comment #51)
> However, I'm sure the retry loop was there for a reason. Given that I don't
> know this code at all, it would be best for Tomas to review it properly.
I wonder about that too, since I didn't write that code either. Perhaps it was assumed that meta_tree_flush() always succeeds, as it rotates the journal if required. I'd say giving meta_tree_flush() one chance should be enough to put things in a consistent state. Marking the patch as accepted, although I'm not 100% sure either. We really need to have the metadata subsystem covered by tests...
Review of attachment 284795 [details] [review]: The loop was just meant to try again, assuming that it would fit on the second try. I never imagined metadata > 32k. I think this patch looks good, although I wouldn't mind if the meta_tree_xyz functions got a GError so that we can get some proper error reporting in these cases rather than g_warnings.
Comment on attachment 284795 [details] [review] metatree: avoid endless looping when the entry is too large Thank you all for cooperation to fix this bug. master: commit e8d04b3d286745534f0a07e57e17f5142eee6007 gnome-3-12: commit 23341281b7e7f3d92ea1f3d6dffc156a8adb03bc
Thanks a lot for fixing this!
Launchpad Bug Tracker wrote on 06.05.2015 20:01: https://bugs.launchpad.net/bugs/517021
> ** Changed in: gvfs (Ubuntu Precise)
>    Status: Fix Committed => Fix Released
Fix tested and confirmed in Ubuntu 12.04 Precise. Before, I reproducibly ran into this bug when saving a data: URL from a Firefox-based webapp to a local file using the normal Firefox download dialog. I installed the Precise update from the normal Precise update package repositories. (I deliberately waited for the fixed public packages.) I save now, and I do not see the 100% CPU usage anymore. Bug fixed. Thanks a lot to Ross Lagerwall, Tomas Bzatek, Alexander Larsson, Ondrej Holy (here), Monsta, Marc Deslauriers (Ubuntu) and everybody else involved!