GNOME Bugzilla – Bug 74371
a bad NFS mount kills nautilus at startup
Last modified: 2005-12-21 23:03:27 UTC
1015943259.494253 munmap(0x40142000, 4096) = 0
1015943259.495586 open("/etc/fstab", O_RDONLY) = 13
1015943259.495877 fstat64(13, {st_mode=S_IFREG|0644, st_size=638, ...}) = 0
1015943259.495990 old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40142000
1015943259.496070 read(13, "LABEL=/ / "..., 4096) = 638
1015943259.496461 stat64("/mnt/nfs", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
1015943259.496625 stat64("/mnt/gnome", ... hangs forever ...

Could this not be done with some cunning asynchronous gnome-vfs thing? I suppose that would fill up the thread pool nicely, but ... ;-)
Can't agree with Havoc's 'NOTABUG' but probably puntable.
There are no non-blocking file ops for NFS that I know of, so this is probably hard to fix.
Can you get a stack trace so we know where it is happening in the code?
mount an nfs partition, remove the network cable and you should be away :-)
nfs doesn't work on my laptop, so I can't get a trace. cc'ing federico: can you get a trace, pretty please? There is a small chance we may be able to work around it, I think, depending on exactly why it is doing the stat.
OK, let me get a trace. I can work on this if you prefer to work on other bugs.
Created attachment 9024 [details] Stack trace
You can get the above like this:
1. Use gnome-session-properties to set the restart style of nautilus to "normal".
2. killall nautilus
3. Mount something over NFS.
4. Unplug your network cable.
5. gdb nautilus
6. r
7. killall -STOP nautilus
8. You can now get traces for the individual threads in gdb.
I think the problem is this: gnome-vfs has a thread pool, which is (sensibly) bounded. Nautilus does a load of operations, some (many) of which are redundant, and each NFS operation is going to block one thread. So pretty soon your thread pool runs dry and the app locks up. We could try doing things about this - for example pairing up the stats above so we only block one thread, and we could get some more mileage by reducing duplicate calls where possible - but ultimately I think it is a fairly tough issue to address, especially since you can't determine an fstype without doing a syscall which will block (?).
You can possibly use the getmntent() calls to step through mtab and get the fstype from there. Though to get the corresponding device I think you need to do a stat(). (See gnome-vfs/modules/fstype.c) If everything could be done with just the mount directory and the fstype we could avoid stats. I'm not sure if that is possible though.
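For reference, a minimal sketch of what the getmntent() approach might look like: walk /etc/mtab and answer "what fstype is this mount directory?" without any stat() calls. The helper name and lookup logic here are illustrative, not the actual gnome-vfs/modules/fstype.c code:

#include <stdio.h>
#include <string.h>
#include <mntent.h>

/* Look up the fstype for a given mount directory by walking /etc/mtab.
 * No stat() is needed as long as the mount point path is enough. */
static const char *
fstype_for_mount_dir (const char *dir, char *buf, size_t buflen)
{
        FILE *fp;
        struct mntent *ent;
        const char *result = NULL;

        fp = setmntent ("/etc/mtab", "r");
        if (fp == NULL)
                return NULL;

        while ((ent = getmntent (fp)) != NULL) {
                if (strcmp (ent->mnt_dir, dir) == 0) {
                        strncpy (buf, ent->mnt_type, buflen - 1);
                        buf[buflen - 1] = '\0';
                        result = buf;
                        break;
                }
        }
        endmntent (fp);
        return result;
}

The catch, as noted above, is that anything that needs the device id (rather than the mount directory and fstype) still ends up in stat().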
Can we just alarm(3) before the stat, or have a worker thread do the stats and cache the results?
Trying to do things asynchronously isn't going to help much if it still hangs forever, as we will run out of threads eventually. Using alarm() would be nice, but mixing signals and threads is pretty awkward; Linux differs from POSIX quite a bit here in which thread receives the signal (see the glibc info docs). I would say that NFS filesystems should be mounted with the 'soft' option, so that the RPC calls time out after a while rather than keep trying forever. Maybe someone could try this and see what happens.
You don't want to mount, say, /var/spool/mail with the 'soft' flag; otherwise things can get massively horked if the network dies while someone is updating the mail spool. The right solution seems to be to remove redundant operations in Nautilus so that it doesn't run out of threads, and maybe increase the size of the thread pool if it does run out of them due to all being blocked.
We should also use our knowledge from fstab parsing to farm operations on a given device off to a queue processed by a subset of the worker threads, so that only those threads can ever be blocked.
I still think blocking some threads forever is bad. We currently have a problem with a server with a few thousand NFS mounts. If your network connection goes down, that would block too many threads. (maybe more than one per NFS mount)
Created attachment 9372 [details] [review] Beta patch
The above patch is beta quality. It sets up a statalarm thread that kills any thread stuck stat()'ing a file on an NFS server for over 4 seconds. It kills the blocking thread by sending it a SIGALRM, which interrupts the stat syscall. It also protects against blocking FAM requests. This fix will only benefit NFS volumes mounted "soft" or "intr", but in those cases it should make nautilus usable after a server or connection loss. Thoughts?
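Roughly, the mechanism the patch relies on can be sketched like this (single-threaded illustration only; the separate statalarm watchdog thread, the FAM handling and the real patch details are omitted, and the 4-second figure is just the one mentioned above):

#include <stdio.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/stat.h>

/* Empty handler: its only purpose is to make the blocking stat()
 * return with EINTR instead of hanging forever. */
static void
alarm_handler (int sig)
{
        (void) sig;
}

static int
stat_with_timeout (const char *path, struct stat *sb, unsigned int seconds)
{
        struct sigaction sa;
        int ret;

        sa.sa_handler = alarm_handler;
        sigemptyset (&sa.sa_mask);
        sa.sa_flags = 0;           /* no SA_RESTART: we want EINTR */
        sigaction (SIGALRM, &sa, NULL);

        alarm (seconds);           /* arm the watchdog */
        ret = stat (path, sb);     /* may block on a dead NFS server */
        alarm (0);                 /* disarm if we got here in time */

        if (ret < 0 && errno == EINTR)
                fprintf (stderr, "stat(%s) timed out\n", path);
        return ret;
}

Note that on a hard mount without 'intr' the kernel will not let the signal interrupt the RPC wait, which is why this only helps for 'soft' or 'intr' mounts.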
[Search for 'luis spamming' to catch every instance of this email.] In order to better track Sun's bugs for Sun's and Ximian's internal use, I've added a temporary keyword to some bugs. I apologize for the spam, and for the use of an additional keyword, but this is the best way for Sun to track its bugs without interfering with the community's own triage and bug behavior. If you have any questions or objections, please drop me a note at louie@ximian.com or email bugmaster@gnome.org for more open discussion.
Created attachment 9619 [details] [review] finalized version
This bug is basically unfixable, except in the few cases the above patch protects against. So marking NEEDINFO until we get some feedback from vfs maintainers.
Reopening to WONT[CANT]FIX but making sure all vfs-maints are actually represented here :)
Any fix should also be tested in Sun Ray environment (with several nautilus users seeing an NFS mount go bad)
I could bring up Nautilus by avoiding a stat() in libnautilus-private/nautilus-volume-monitor.c. Attaching a patch which does the following:

In nautilus-volume-monitor.c I have avoided doing a stat() in finish_creating_volume(). We need the stat() to get the device id, but the device id is only required in nautilus_volume_monitor_get_volume_for_path(). So in nautilus_volume_monitor_get_volume_for_path() I get the volume based on the mount_path rather than the device id (sketched below).

However there are a few more issues I encountered:
1. On applying the patch, nautilus comes up fine for the first time, even with a bad NFS mount.
2. To test:
   a) Kill the Nautilus already running.
   b) Mount an NFS partition.
   c) Pull out the network cable.
   d) Start Nautilus. It comes up fine.
3. But now if nautilus is killed, some thread(s) are still blocking on something, and running Nautilus again doesn't bring it up.
4. Looking further into the code, I found that on startup, in nautilus-trash-directory.c, we do an add_volume() for each mount point present. This hangs for the bad NFS partition.

Could the addition of the volumes in nautilus-trash-directory.c be delayed until, say, the user actually tries to access the volume? Your thoughts on the above? Reopening the bug...
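For illustration only, the mount-path lookup could look something like the sketch below. The Volume structure and function name here are assumptions based on the description above, not the actual nautilus-volume-monitor.c code (which keeps far more state):

#include <string.h>
#include <glib.h>

/* Hypothetical, stripped-down volume record. */
typedef struct {
        char *mount_path;
} Volume;

static gboolean
path_is_under (const char *path, const char *mount_path)
{
        gsize len = strlen (mount_path);

        if (strcmp (mount_path, "/") == 0)
                return TRUE;
        return g_str_has_prefix (path, mount_path) &&
               (path[len] == '/' || path[len] == '\0');
}

/* Pick the volume whose mount_path is the longest prefix of the path,
 * instead of stat()'ing the path to obtain its device id. */
static Volume *
get_volume_for_path (GList *volumes, const char *path)
{
        Volume *best = NULL;
        gsize best_len = 0;
        GList *l;

        for (l = volumes; l != NULL; l = l->next) {
                Volume *v = l->data;
                gsize len = strlen (v->mount_path);

                if (path_is_under (path, v->mount_path) && len >= best_len) {
                        best = v;
                        best_len = len;
                }
        }
        return best;
}

As the later comments point out, a pure path comparison like this misses symlinks that lead into a mounted directory, which is exactly what the device-id comparison handled.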
Created attachment 11450 [details] [review] Patch ...
Shivram, I would think delaying the add_volume is a good idea if it is possible. It might also speed up nautilus startup in most circumstances, because on a machine with many NFS mount points this add_volume and create-trash-folders pass can take quite a long time.
The delay in addition of the trash volumes may not help; I would still face the problems when I try to access the bad mount point. Here are the problems I am facing with a bad NFS mount:

1. With my patch applied, Nautilus comes up for the first time. I get a message dialog which says "Nautilus is searching your disks for Trash Folders". If you click OK the dialog doesn't go away; clicking on the close icon closes it. But this should be OK: Nautilus is up and the user knows that something is wrong.
2. Kill nautilus (I do this with gnome-session-properties) and some of the nautilus processes are still running, so when I restart Nautilus it doesn't come up again.
3. The problem is that nautilus calls gnome_vfs_shutdown() on exit, but this function never returns.
4. gnome_vfs_job_get_count() never returns zero, probably due to the threads accessing the bad mount path, so this condition is never true:

void gnome_vfs_thread_backend_shutdown (void)
{
        ....
        for (count = 0;; count++) {
                /* Check if it is OK to quit. Originally we used a
                 * count of slave threads, but now we use a count of
                 * outstanding jobs instead to make sure that the job
                 * is cleanly destroyed.
                 */
                if (gnome_vfs_job_get_count () == 0) {
                        done = TRUE;
                        gnome_vfs_done_quitting = TRUE;
                }

5. Could the threads accessing the bad NFS mount path be forcibly killed on shutdown?
in nautilus_volume_monitor_get_volume_for_path, be sure to perform the expensive operation of checking all the path's parents for symlinks to a mounted directory. This patch doesn't fix the actual problem, but if you are happy with the results, good luck :)
Oh, thanks... I had completely ignored symbolic links. A call to resolvepath() should make it work.

I have just one more question :-) In gnome_vfs_thread_backend_shutdown(), the for loop continues until gnome_vfs_job_get_count() returns zero, but in this case it never is zero. Would waiting for the job count to reach zero for a certain period of time, rather than forever, be a problem? I mean replacing

for (count = 0;; count++) {
        /* Check if it is OK to quit. Originally we used a
         * count of slave threads, but now we use a count of
         * outstanding jobs instead to make sure that the job
         * is cleanly destroyed.
         */

with, say,

for (count = 0; count < 500; count++) {

This would give a delay of 10+ seconds, since we do a usleep (20000). I am assuming 10 seconds gives sufficient time for the threads to shut down in normal cases, and we would break out of the loop in cases like a thread hanging on a stat().
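Put together, the bounded wait being proposed would look roughly like this (a sketch only; the variable names follow the excerpt above, and gnome_vfs_job_get_count() is just declared here so the fragment stands alone rather than quoting the full gnome-vfs source):

#include <unistd.h>
#include <glib.h>

/* Provided by gnome-vfs; declared here only for the sketch. */
extern int gnome_vfs_job_get_count (void);

static gboolean gnome_vfs_done_quitting = FALSE;

static void
thread_backend_shutdown_bounded (void)
{
        gboolean done = FALSE;
        int count;

        /* Wait at most 500 * 20 ms = 10 seconds for outstanding jobs
         * to drain, instead of looping forever.  A job blocked on a
         * dead NFS mount is simply abandoned when we give up. */
        for (count = 0; count < 500; count++) {
                if (gnome_vfs_job_get_count () == 0) {
                        done = TRUE;
                        gnome_vfs_done_quitting = TRUE;
                }
                if (done)
                        break;
                usleep (20000);
        }
}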
Anyone got an answer to Shivram's question?
Is this still an issue with Nautilus 2.11?
Most likely - it's pretty easy to get a bad NFS mount, give it a go ;-)
You can't seriously be proposing to call resolvepath() on each call to nautilus_volume_monitor_get_volume_for_path()? That is a synchronous, quite expensive call that can go stomping all over the filesystem. Anyway, this should be slightly better in newer versions. We're now using gnome_vfs_volume_monitor_get_volume_for_path(), and it forks around each stat on startup, with timeouts. So hung NFS mounts won't stop nautilus from starting. Anyone care to test this? It can still hang if you navigate into too many bad NFS directories of course, but I don't think that can be fixed.
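For readers unfamiliar with "forking around each stat": the idea is roughly the sketch below, where the potentially hanging stat() runs in a throwaway child process that the parent can abandon after a timeout. This is an illustration of the technique, not the actual gnome-vfs implementation (which, for one thing, passes the stat result back to the parent):

#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the child managed to stat() the path within
 * timeout_secs, -1 otherwise.  The stat result itself is thrown away
 * here; a real implementation would send it back over a pipe. */
static int
stat_in_child_with_timeout (const char *path, unsigned int timeout_secs)
{
        pid_t pid;
        int status;
        unsigned int polls;

        pid = fork ();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                /* Child: this may hang forever on a dead NFS server,
                 * but only the child hangs, not the caller. */
                struct stat sb;
                _exit (stat (path, &sb) == 0 ? 0 : 1);
        }

        /* Parent: poll for the child, give up after the timeout. */
        for (polls = 0; polls < timeout_secs * 10; polls++) {
                if (waitpid (pid, &status, WNOHANG) == pid)
                        return (WIFEXITED (status) &&
                                WEXITSTATUS (status) == 0) ? 0 : -1;
                usleep (100000);   /* 100 ms */
        }

        kill (pid, SIGKILL);
        /* Don't block here reaping it: a child stuck in an
         * uninterruptible NFS wait may not die; a real implementation
         * would reap children asynchronously (e.g. on SIGCHLD). */
        waitpid (pid, &status, WNOHANG);
        return -1;
}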
Well - there's really no need to do a resolvepath IMHO - we just need to be more intelligent about how we allocate our thread pool, i.e. don't allow more than N% of the threads to be working on a common sub-path concurrently. I wrote up a number of thoughts wrt. this in bug 314491.
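One way such a per-sub-path cap could be expressed (purely illustrative; the pool size, the 25% figure and the idea that the caller already knows the mount root from mtab/fstab parsing are all assumptions, and a real pool would need a mutex around this table):

#include <glib.h>

#define POOL_SIZE    20
#define MAX_PER_ROOT (POOL_SIZE / 4)   /* at most ~25% of threads per mount */

static GHashTable *in_flight;   /* mount root -> number of busy threads */

/* Returns TRUE if a worker may be dispatched for a job under this
 * mount root, and accounts for it; FALSE means "requeue and retry". */
static gboolean
try_reserve_worker (const char *mount_root)
{
        gint busy;

        if (in_flight == NULL)
                in_flight = g_hash_table_new (g_str_hash, g_str_equal);

        busy = GPOINTER_TO_INT (g_hash_table_lookup (in_flight, mount_root));
        if (busy >= MAX_PER_ROOT)
                return FALSE;   /* this mount already ties up enough threads */

        g_hash_table_insert (in_flight, (gpointer) mount_root,
                             GINT_TO_POINTER (busy + 1));
        return TRUE;
}

static void
release_worker (const char *mount_root)
{
        gint busy = GPOINTER_TO_INT (g_hash_table_lookup (in_flight, mount_root));

        if (busy > 0)
                g_hash_table_insert (in_flight, (gpointer) mount_root,
                                     GINT_TO_POINTER (busy - 1));
}

With something like this, a dead NFS server can at worst pin MAX_PER_ROOT threads, and the rest of the pool keeps serving other volumes.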
I'm marking this one as duplicate of bug 314491, since this is just an effect of the latter, and we probably can't do much about it. I'm also setting the patch status to "rejected" for the remaining patches. Thanks for the efforts of all people involved! *** This bug has been marked as a duplicate of 314491 ***