GNOME Bugzilla – Bug 720860
Hang when peer connection closes while enumerating
Last modified: 2018-09-21 17:37:51 UTC
The is an easily reproducible hang when searching in nautilus with the sftp backend. If the peer connection closes between the time that the enumerate job calls send_reply() and the time that the enumerate job actually finishes, then it gets stuck in peer_connection_closed() trying to cancel a job that cannot be cancelled. It cannot be cancelled because g_vfs_job_cancel() does nothing if send_reply has been called. Removing the "|| job->sent_reply" clause in g_vfs_job_cancel() fixes the problem, but I'm not sure if it is the correct fix.
Maybe it should look at job->finished instead. This would mean that the cancel signal can get into the enumerate job inbetween the time it has sent the reply and not finished the enumeration. For many operations there is no real difference, because GVfsJobDbus and GVfs channel emits it in send_reply, The only difference is in mount/unmount/enumerate that override dbus_send_reply().
Created attachment 286128 [details] [review] Allow cancelling jobs that have sent a reply but not finished A few job types (mount, unmount and enumerate) can be in a state where sent_reply is TRUE but finished is FALSE because they override send_reply. If the peer connection closes during this period, the daemon hangs in peer_connection_closed because the job still exists but cannot be cancelled. To fix this, allow cancelling jobs that have sent a reply but not yet finished. This can be reproduced fairly easily by doing a search on an sftp mount in Nautilus.
(In reply to comment #1) > Maybe it should look at job->finished instead. This would mean that the cancel > signal can get into the enumerate job inbetween the time it has sent the reply > and not finished the enumeration. For many operations there is no real > difference, because GVfsJobDbus and GVfs channel emits it in send_reply, > The only difference is in mount/unmount/enumerate that override > dbus_send_reply(). That makes sense, and the fix seems to work!
This is a slight change in semantics. We're now allowed to cancel the operation after we have *started* but not finished sending a reply. This makes sense, as the sending of a reply is an i/o operation that could block, especially in the case of enumerate as above where the reply is actually a streaming op. I worry a bit that we're not correctly handing this in every case though. For instance, say we force unmount the mount causing a gvfs_job_cancel() at the exact time that a GVfsReadChannel read call finished. That means we're after g_vfs_job_send_reply() which sents sent_reply=1 but we've only queued an async write of the reply on the channel via g_vfs_channel_send_reply(). However, atm GVfsChannel doesn't seems to look at the job cancellable for the async operation so it will never recieve it and may block forever. However, this is clearly a much tinier race than the huge race the above finishes, so i think we should apply this for now and then do further work ensuring that no job can block forever if they get a gvfs_job_cancel() between gvfs_job_send_reply() and gvfs_job_emit_finished(). Of course, as i said above, most standard job types calls emit_finished() unconditionally in send_reply, so this is a limited search.
Pushed to master as fcffe33ba57f968e2cb0073cb201007808155702. Thanks for the review. Leaving open to fix the smaller race in comment 4.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/gvfs/issues/220.