GNOME Bugzilla – Bug 74371
a bad NFS mount kills nautilus at startup
Last modified: 2005-12-21 23:03:27 UTC
1015943259.494253 munmap(0x40142000, 4096) = 0
1015943259.495586 open("/etc/fstab", O_RDONLY) = 13
1015943259.495877 fstat64(13, {st_mode=S_IFREG|0644, st_size=638, ...}) = 0
1015943259.495990 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40142000
1015943259.496070 read(13, "LABEL=/ / "..., 4096) = 638
1015943259.496461 stat64("/mnt/nfs", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
1015943259.496625 stat64("/mnt/gnome", ... hangs forever ...

Could this not be done with some cunning asynchronous gnome-vfs thing? I suppose that would fill up the thread pool nicely, but ... ;-)
Can't agree with Havoc's 'NOTABUG' but probably puntable.
There are no non-blocking file ops for NFS that I know of, so this is probably hard to fix.
Can you get a stack trace so we know where it is happening in the code?
mount an nfs partition, remove the network cable and you should be away :-)
nfs doesn't work on my laptop, so I can't get a trace. cc'ing federico: can you get a trace, pretty please? There is a small chance we may be able to work around it, I think, depending on exactly why it is doing the stat.
OK, let me get a trace. I can work on this if you prefer to work on other bugs.
Created attachment 9024 [details] Stack trace
You can get the above like this:
1. Use gnome-session-properties to set the restart style of nautilus to "normal".
2. killall nautilus
3. Mount something over NFS.
4. Unplug your network cable.
5. gdb nautilus
6. r
7. killall -STOP nautilus
8. You can now get traces for the individual threads in gdb.
I think the problem is this: gnome-vfs has a thread pool, which is (sensibly) bounded. Nautilus does a load of operations, some (many) of which are redundant, and each NFS operation is going to block one thread. So pretty soon your thread pool runs dry and the app locks up. We could try doing things about this - for example pairing up the stats above so we only block one thread, and we could get some more mileage by reducing duplicate calls where possible - but ultimately I think it is a fairly tough issue to address, especially since you can't determine an fstype without doing a syscall which will block (?).
You can possibly use the getmntent() calls to step through mtab and get the fstype from there. Though to get the corresponding device I think you need to do a stat(). (See gnome-vfs/modules/fstype.c) If everything could be done with just the mount directory and the fstype we could avoid stats. I'm not sure if that is possible though.
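For reference, a minimal sketch of what the getmntent() approach might look like: walk /etc/mtab and answer "what fstype is this mount directory?" without any stat() calls. The helper name and lookup logic here are illustrative, not the actual gnome-vfs/modules/fstype.c code:

#include <stdio.h>
#include <string.h>
#include <mntent.h>

/* Look up the fstype for a given mount directory by walking /etc/mtab.
 * No stat() is needed as long as the mount point path is enough. */
static const char *
fstype_for_mount_dir (const char *dir, char *buf, size_t buflen)
{
        FILE *fp;
        struct mntent *ent;
        const char *result = NULL;

        fp = setmntent ("/etc/mtab", "r");
        if (fp == NULL)
                return NULL;

        while ((ent = getmntent (fp)) != NULL) {
                if (strcmp (ent->mnt_dir, dir) == 0) {
                        strncpy (buf, ent->mnt_type, buflen - 1);
                        buf[buflen - 1] = '\0';
                        result = buf;
                        break;
                }
        }
        endmntent (fp);
        return result;
}

The catch, as noted above, is that anything that needs the device id (rather than the mount directory and fstype) still ends up in stat().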
Can we just alarm(3) before the stat, or have a worker thread do the stats and cache the results?
Trying to do things asynchronously isn't going to help much if it still hangs forever, as we will run out of threads eventually. Using alarm() would be nice, but mixing signals and threads is pretty awkward; Linux differs from POSIX quite a bit here in which thread receives the signal (see the glibc info docs). I would say that NFS filesystems should be mounted with the 'soft' option, so that the RPC calls time out after a while rather than keep trying forever. Maybe someone could try this and see what happens.
You don't want to mount, say, /var/spool/mail with the 'soft' flag; otherwise things can get massively horked if the network dies while someone is updating the mail spool. The right solution seems to be to remove redundant operations in Nautilus so that it doesn't run out of threads, and maybe increase the size of the thread pool if it does run out of them due to all being blocked.
We should also use our knowledge from fstab parsing to farm operations on a given device off to a queue processed by a subset of the worker threads, so that only those threads can ever be blocked.
I still think blocking some threads forever is bad. We currently have a problem with a server with a few thousand NFS mounts. If your network connection goes down, that would block too many threads. (maybe more than one per NFS mount)
Created attachment 9372 [details] [review] Beta patch
The above patch is beta quality. It sets up a statalarm thread that kills any thread stuck stat()'ing a file on an NFS server for over 4 seconds. It kills the blocking thread by sending it a SIGALRM, which interrupts the stat syscall. It also protects against blocking FAM requests. This fix will only benefit NFS volumes mounted "soft" or "intr", but in those cases it should make nautilus usable after a server or connection loss. Thoughts?
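Roughly, the mechanism the patch relies on can be sketched like this (single-threaded illustration only; the separate statalarm watchdog thread, the FAM handling and the real patch details are omitted, and the 4-second figure is just the one mentioned above):

#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/stat.h>

/* Empty handler: its only purpose is to make the blocking stat()
 * return with EINTR instead of hanging forever. */
static void
alarm_handler (int sig)
{
        (void) sig;
}

static int
stat_with_timeout (const char *path, struct stat *sb, unsigned int seconds)
{
        struct sigaction sa;
        int ret;

        sa.sa_handler = alarm_handler;
        sigemptyset (&sa.sa_mask);
        sa.sa_flags = 0;           /* no SA_RESTART: we want EINTR */
        sigaction (SIGALRM, &sa, NULL);

        alarm (seconds);           /* arm the watchdog */
        ret = stat (path, sb);     /* may block on a dead NFS server */
        alarm (0);                 /* disarm if we got here in time */

        if (ret < 0 && errno == EINTR)
                fprintf (stderr, "stat(%s) timed out\n", path);
        return ret;
}

Note that on a hard mount without 'intr' the kernel will not let the signal interrupt the RPC wait, which is why this only helps for 'soft' or 'intr' mounts.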
[Search for 'luis spamming' to catch every instance of this email.] In order to better track Sun's bugs for Sun's and Ximian's internal use, I've added a temporary keyword to some bugs. I apologize for the spam, and for the use of an additional keyword, but this is the best way for Sun to track its bugs without interfering with the community's own triage and bug behavior. If you have any questions or objections, please drop me a note at louie@ximian.com or email bugmaster@gnome.org for more open discussion.
Created attachment 9619 [details] [review] finalized version
This bug is basically unfixable, except in the few cases the above patch protects against. So marking NEEDINFO until we get some feedback from vfs maintainers.
Reopening to WONT[CANT]FIX but making sure all vfs-maints are actually represented here :)
Any fix should also be tested in Sun Ray environment (with several nautilus users seeing an NFS mount go bad)
I could bring up Nautilus by avoiding a stat() in libnautilus-private/nautilus-volume-monitor.c. Attaching a patch which does the following:

In nautilus-volume-monitor.c I have avoided doing a stat() in finish_creating_volume(). We need the stat() to get the device id, but the device id is only required in nautilus_volume_monitor_get_volume_for_path(). So in nautilus_volume_monitor_get_volume_for_path() I get the volume based on the mount_path rather than the device id (sketched below).

However there are a few more issues I encountered:
1. On applying the patch, nautilus comes up fine for the first time, even with a bad NFS mount.
2. To test:
   a) Kill the Nautilus already running.
   b) Mount an NFS partition.
   c) Pull out the network cable.
   d) Start Nautilus. It comes up fine.
3. But now if nautilus is killed, some thread(s) are still blocking on something, and running Nautilus again doesn't bring it up.
4. Looking further into the code, I found that on startup, in nautilus-trash-directory.c, we do an add_volume() for each mount point present. This hangs for the bad NFS partition.

Could the addition of the volumes in nautilus-trash-directory.c be delayed until, say, the user actually tries to access the volume? Your thoughts on the above? Reopening the bug...
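For illustration only, the mount-path lookup could look something like the sketch below. The Volume structure and function name here are assumptions based on the description above, not the actual nautilus-volume-monitor.c code (which keeps far more state):

#include <string.h>
#include <glib.h>

/* Hypothetical, stripped-down volume record. */
typedef struct {
        char *mount_path;
} Volume;

static gboolean
path_is_under (const char *path, const char *mount_path)
{
        gsize len = strlen (mount_path);

        if (strcmp (mount_path, "/") == 0)
                return TRUE;
        return g_str_has_prefix (path, mount_path) &&
               (path[len] == '/' || path[len] == '\0');
}

/* Pick the volume whose mount_path is the longest prefix of the path,
 * instead of stat()'ing the path to obtain its device id. */
static Volume *
get_volume_for_path (GList *volumes, const char *path)
{
        Volume *best = NULL;
        gsize best_len = 0;
        GList *l;

        for (l = volumes; l != NULL; l = l->next) {
                Volume *v = l->data;
                gsize len = strlen (v->mount_path);

                if (path_is_under (path, v->mount_path) && len >= best_len) {
                        best = v;
                        best_len = len;
                }
        }
        return best;
}

As the later comments point out, a pure path comparison like this misses symlinks that lead into a mounted directory, which is exactly what the device-id comparison handled.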
Created attachment 11450 [details] [review] Patch ...
Shivram, I would think delaying the add_volume is a good idea if it is possible. It might also speed up nautilus startup in most circumstances, because on a machine with many NFS mount points this add_volume and create-trash-folders pass can take quite a long time.
The delay in addition of the trash volumes may not help; I would still face the problems when I try to access the bad mount point. Here are the problems I am facing with a bad NFS mount:

1. With my patch applied, Nautilus comes up for the first time. I get a message dialog which says "Nautilus is searching your disks for Trash Folders". If you click OK the dialog doesn't go away; clicking on the close icon closes it. But this should be OK: Nautilus is up and the user knows that something is wrong.
2. Kill nautilus (I do this with gnome-session-properties) and some of the nautilus processes are still running, so when I restart Nautilus it doesn't come up again.
3. The problem is that nautilus calls gnome_vfs_shutdown() on exit, but this function never returns.
4. gnome_vfs_job_get_count() never returns zero, probably due to the threads accessing the bad mount path, so this condition is never true:

void gnome_vfs_thread_backend_shutdown (void)
{
        ....
        for (count = 0;; count++) {
                /* Check if it is OK to quit. Originally we used a
                 * count of slave threads, but now we use a count of
                 * outstanding jobs instead to make sure that the job
                 * is cleanly destroyed.
                 */
                if (gnome_vfs_job_get_count () == 0) {
                        done = TRUE;
                        gnome_vfs_done_quitting = TRUE;
                }

5. Could the threads accessing the bad NFS mount path be forcibly killed on shutdown?
in nautilus_volume_monitor_get_volume_for_path, be sure to perform the expensive operation of checking all the path's parents for symlinks to a mounted directory. This patch doesn't fix the actual problem, but if you are happy with the results, good luck :)
Oh, thanks... I had completely ignored symbolic links. A call to resolvepath() should make it work.

I have just one more question :-) In gnome_vfs_thread_backend_shutdown(), the for loop continues until gnome_vfs_job_get_count() returns zero, but in this case it never is zero. Would waiting for the job count to reach zero for a certain period of time, rather than forever, be a problem? I mean replacing

for (count = 0;; count++) {
        /* Check if it is OK to quit. Originally we used a
         * count of slave threads, but now we use a count of
         * outstanding jobs instead to make sure that the job
         * is cleanly destroyed.
         */

with, say,

for (count = 0; count < 500; count++) {

This would give a delay of 10+ seconds, since we do a usleep (20000). I am assuming 10 seconds gives sufficient time for the threads to shut down in normal cases, and we would break out of the loop in cases like a thread hanging on a stat().
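Put together, the bounded wait being proposed would look roughly like this (a sketch only; the variable names follow the excerpt above, and gnome_vfs_job_get_count() is just declared here so the fragment stands alone rather than quoting the full gnome-vfs source):

#include <unistd.h>
#include <glib.h>

/* Provided by gnome-vfs; declared here only for the sketch. */
extern int gnome_vfs_job_get_count (void);

static gboolean gnome_vfs_done_quitting = FALSE;

static void
thread_backend_shutdown_bounded (void)
{
        gboolean done = FALSE;
        int count;

        /* Wait at most 500 * 20 ms = 10 seconds for outstanding jobs
         * to drain, instead of looping forever.  A job blocked on a
         * dead NFS mount is simply abandoned when we give up. */
        for (count = 0; count < 500; count++) {
                if (gnome_vfs_job_get_count () == 0) {
                        done = TRUE;
                        gnome_vfs_done_quitting = TRUE;
                }
                if (done)
                        break;
                usleep (20000);
        }
}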
Anyone got an answer to Shivram's question?
Is this still an issue with Nautilus 2.11?
Most likely - it's pretty easy to get a bad NFS mount, give it a go ;-)
You can't seriously be proposing to call resolvepath() on each call to nautilus_volume_monitor_get_volume_for_path()? That is a synchronous, quite expensive call that can go stomping all over the filesystem. Anyway, this should be slightly better in newer versions. We're now using gnome_vfs_volume_monitor_get_volume_for_path(), and it forks around each stat on startup, with timeouts. So hung NFS mounts won't stop nautilus from starting. Anyone care to test this? It can still hang if you navigate into too many bad NFS directories of course, but I don't think that can be fixed.
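For readers unfamiliar with "forking around each stat": the idea is roughly the sketch below, where the potentially hanging stat() runs in a throwaway child process that the parent can abandon after a timeout. This is an illustration of the technique, not the actual gnome-vfs implementation (which, for one thing, passes the stat result back to the parent):

#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the child managed to stat() the path within
 * timeout_secs, -1 otherwise.  The stat result itself is thrown away
 * here; a real implementation would send it back over a pipe. */
static int
stat_in_child_with_timeout (const char *path, unsigned int timeout_secs)
{
        pid_t pid;
        int status;
        unsigned int polls;

        pid = fork ();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                /* Child: this may hang forever on a dead NFS server,
                 * but only the child hangs, not the caller. */
                struct stat sb;
                _exit (stat (path, &sb) == 0 ? 0 : 1);
        }

        /* Parent: poll for the child, give up after the timeout. */
        for (polls = 0; polls < timeout_secs * 10; polls++) {
                if (waitpid (pid, &status, WNOHANG) == pid)
                        return (WIFEXITED (status) &&
                                WEXITSTATUS (status) == 0) ? 0 : -1;
                usleep (100000);   /* 100 ms */
        }

        kill (pid, SIGKILL);
        /* Don't block here reaping it: a child stuck in an
         * uninterruptible NFS wait may not die; a real implementation
         * would reap children asynchronously (e.g. on SIGCHLD). */
        waitpid (pid, &status, WNOHANG);
        return -1;
}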
Well - there's really no need to do a resolvepath IMHO - we just need to be more intelligent about how we allocate our thread pool, i.e. don't allow more than N% of the threads to be working on a common sub-path concurrently. I wrote up a number of thoughts wrt. this in bug 314491.
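One way such a per-sub-path cap could be expressed (purely illustrative; the pool size, the 25% figure and the idea that the caller already knows the mount root from mtab/fstab parsing are all assumptions, and a real pool would need a mutex around this table):

#include <glib.h>

#define POOL_SIZE    20
#define MAX_PER_ROOT (POOL_SIZE / 4)   /* at most ~25% of threads per mount */

static GHashTable *in_flight;   /* mount root -> number of busy threads */

/* Returns TRUE if a worker may be dispatched for a job under this
 * mount root, and accounts for it; FALSE means "requeue and retry". */
static gboolean
try_reserve_worker (const char *mount_root)
{
        gint busy;

        if (in_flight == NULL)
                in_flight = g_hash_table_new (g_str_hash, g_str_equal);

        busy = GPOINTER_TO_INT (g_hash_table_lookup (in_flight, mount_root));
        if (busy >= MAX_PER_ROOT)
                return FALSE;   /* this mount already ties up enough threads */

        g_hash_table_insert (in_flight, (gpointer) mount_root,
                             GINT_TO_POINTER (busy + 1));
        return TRUE;
}

static void
release_worker (const char *mount_root)
{
        gint busy = GPOINTER_TO_INT (g_hash_table_lookup (in_flight, mount_root));

        if (busy > 0)
                g_hash_table_insert (in_flight, (gpointer) mount_root,
                                     GINT_TO_POINTER (busy - 1));
}

With something like this, a dead NFS server can at worst pin MAX_PER_ROOT threads, and the rest of the pool keeps serving other volumes.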
I'm marking this one as duplicate of bug 314491, since this is just an effect of the latter, and we probably can't do much about it. I'm also setting the patch status to "rejected" for the remaining patches. Thanks for the efforts of all people involved! *** This bug has been marked as a duplicate of 314491 ***