Bug 761278 - Crashes on unmount
Status: RESOLVED FIXED
Product: gvfs
Classification: Core
Component: mtp backend
Version: 1.28.x
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Philip Langdale
gvfs-maint
Depends on:
Blocks:
 
 
Reported: 2016-01-29 10:30 UTC by Ondrej Holy
Modified: 2016-04-15 15:54 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Fix unmount race 1/4 (3.68 KB, patch)
2016-04-10 21:10 UTC, Philip Langdale
none Details | Review
Fix unmount race 2/4 (2.74 KB, patch)
2016-04-10 21:10 UTC, Philip Langdale
none Details | Review
Fix unmount race 2/3 (5.58 KB, patch)
2016-04-10 21:11 UTC, Philip Langdale
none Details | Review
Fix unmount race 4/4 (5.73 KB, patch)
2016-04-10 21:11 UTC, Philip Langdale
none Details | Review
Fix unmount race 4/4 v2 (5.51 KB, patch)
2016-04-11 03:02 UTC, Philip Langdale
none Details | Review
Fix unmount race 1/3 v2 (3.95 KB, patch)
2016-04-13 02:30 UTC, Philip Langdale
none Details | Review
Fix unmount race 3/3 v3 (5.49 KB, patch)
2016-04-13 02:40 UTC, Philip Langdale
none Details | Review
Fix unmount race 1/3 v4 (3.96 KB, patch)
2016-04-13 14:38 UTC, Philip Langdale
accepted-commit_now Details | Review
Fix unmount race 2/3 v4 (5.59 KB, patch)
2016-04-13 14:38 UTC, Philip Langdale
accepted-commit_now Details | Review
Fix unmount race 3/3 v4 (5.76 KB, patch)
2016-04-13 14:39 UTC, Philip Langdale
none Details | Review
Fix unmount race 3/3 v4 (5.89 KB, patch)
2016-04-13 14:52 UTC, Philip Langdale
reviewed Details | Review

Description Ondrej Holy 2016-01-29 10:30:01 UTC
Some crashes still happen on unmount.

Truncated backtrace:
Thread no. 1 (8 frames)
 #0 sync_transfer_wait_for_completion at sync.c:48
 #1 do_sync_bulk_transfer at sync.c:180
 #2 libusb_bulk_transfer at sync.c:257
 #3 ptp_usb_event at libusb1-glue.c:1603
 #4 ptp_usb_event_wait at libusb1-glue.c:1662
 #5 LIBMTP_Read_Event at libmtp.c:2163
 #6 check_event at gvfsbackendmtp.c:627
 #7 g_thread_proxy at gthread.c:778

I suppose it is because LIBMTP_Release_Device is called from do_unmount while LIBMTP_Read_Event is running... Philip, wouldn't a solution be to call LIBMTP_Release_Device from the check_event thread, after LIBMTP_Read_Event returns?

See downstream bug reports:
https://bugzilla.redhat.com/show_bug.cgi?id=1302239
https://bugzilla.redhat.com/show_bug.cgi?id=1242180
Comment 1 Philip Langdale 2016-01-29 17:17:40 UTC
The problem is that Read_Event blocks, so you can't guarantee it would ever return in an idle situation. In the bugs you linked to, it's not clear whether it's crashing when they pull the plug (without unmounting first), but that is definitely the case I tried to address before. Sure, in that case you could imagine doing the unmount after read_event returns, but in the case where you want to unmount without unplugging, it'll just block on Read_Event forever and there's no way to kick it into returning.

I put a bunch of logic in to try and avoid this race but I don't think it can be done perfectly.
Comment 2 Ondrej Holy 2016-04-01 07:55:00 UTC
You can see "do_unmount" in the full backtrace in the linked bugs, so it is a regular unmount, not a force unmount; "do_unmount" is not called in the case of a force unmount...

I've looked at the libmtp sources briefly and have a few ideas...

Can't we do some quick operation to produce some event (or inject fake event somehow)? 
What about adding LIBMTP_Check_Event using ptp_usb_event_check and loop on it in the backend?
Can't ptp_usb_control_cancel_request be used to interrupt the wait? 

Philip, do any of these ideas make sense?
Comment 3 Philip Langdale 2016-04-02 16:13:51 UTC
I don't think injecting an event is going to be a robust solution but the others are conceivable.

I think the best solution would be a Check_Event that allowed us to control the loop. Obviously it requires propagating a change through libmtp but that should be possible now. I can look into this.
Comment 4 Philip Langdale 2016-04-02 17:05:25 UTC
So, just thinking about the mechanics:

1) We continue to check for events in a separate thread. I think this is a good idea, as we could block on the backend mutex while doing work that reacts to an event. Obviously, we could poll on the main thread and then dispatch the work to other threads, but this seems mechanically easier.

2) The check_event thread needs to poll for events now. I guess we'd run a separate main loop in this thread to do that efficiently.

3) do_unmount can now join on the event thread to correctly synchronize the shutdown process.
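The proposed mechanics can be sketched with a toy model in plain C and pthreads (all names here are hypothetical and stand in for the real gvfs/libmtp types): because the event thread polls with a bounded wait and re-checks a shutdown flag on each iteration, do_unmount can set the flag and then join the thread with bounded latency, which is impossible with an indefinitely blocking Read_Event.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Toy model: the event thread polls with a short bounded wait instead
 * of blocking forever, re-checking unmount_started each iteration, so
 * do_unmount can flag it and then join deterministically. */
static atomic_bool unmount_started = false;
static atomic_int polls_completed = 0;

static void *
check_event_thread (void *data)
{
  do
    {
      usleep (1000); /* stands in for a poll with a short timeout */
      atomic_fetch_add (&polls_completed, 1);
    }
  while (!atomic_load (&unmount_started));
  return NULL;
}

static int
run_unmount_handshake (void)
{
  pthread_t thread;
  pthread_create (&thread, NULL, check_event_thread, NULL);
  usleep (10000);              /* let the poller run a few iterations */
  atomic_store (&unmount_started, true);
  pthread_join (thread, NULL); /* do_unmount: synchronized shutdown */
  return atomic_load (&polls_completed);
}
```

The join is the key step: once it returns, no event-handling code can touch the device again, so release is safe.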
Comment 5 Ondrej Holy 2016-04-04 08:22:58 UTC
Thanks for looking into this. The suggested mechanics sound good to me. I saw the patches for libmtp. We have to decrease the timeout significantly for this to solve anything. I would personally expect a function with the word "check" in its name to return almost immediately. It seems ptp_usb_event_check is not used anywhere in libmtp, but it comes from libgphoto2 and I am not sure what is expected there...
Comment 6 Philip Langdale 2016-04-05 03:53:09 UTC
More digging.

The really really right way to solve this is to use the asynchronous I/O capabilities of libusb. This would allow us to keep an interrupt transfer pending, but still be cancellable.

However, libmtp and ptp.c don't use or expose this capability. Given the complexity involved in exposing it in a consistent way, I'm tempted to opt out and use libusb calls directly from gvfs. It's possible to pull the libusb device pointer out of ptp and then work with it - the mechanics of registering the interrupt transfer and serving it are pretty straight-forward.

There are built-in timeouts and retries within the USB spec and libusb which it does not feel good to try to wrap in the ways we've discussed so far.

The interrupt transfer is attempted periodically by libusb at an interval specified by the hardware and the hardware returns a NAK token if there is no event queued. libusb then retries the interrupt according to that interval.

The 'timeout' value exposed up to the caller reflects how long libusb should follow this attempt/retry loop before giving up.

Assuming we do this, we'd use the libusb async API and register the transfer fd with our main loop and when an event is notified we'd dispatch the event handler work to a single-thread threadpool. That seems like the most natural way to model it.
Comment 7 Philip Langdale 2016-04-09 05:38:10 UTC
Oh man, this makes my brain hurt.

So, because we're using libmtp for everything else and it uses the synchronous libusb API, that means we have a multi-threaded event situation.

So we must read this:

http://libusb.sourceforge.net/api-1.0/mtasync.html

And after reading that, I can't see how I can integrate the libusb event fds into the main loop safely. The multi-threaded logic requires doing a blocking wait if something else is processing events and you unblock when it's safe for you to process events. This would end up blocking the entire main loop.

In turn, that seems to imply we have to run a separate event loop in another thread to do the libusb polling. This is really a pain.

The only way we could do a single event loop is if libmtp was fully asynchronous.
Comment 8 Philip Langdale 2016-04-09 18:14:27 UTC
Ok, so I have something that I'm comfortable with.

1) Implement an async version of LIBMTP_Read_Event. It takes a callback that will be called when an event is received.

This uses the async libusb api internally, but does not need to expose any explicit libusb types in the API. Cute.

2) Declare a dependency on libusb in gvfs as we'll need to make explicit libusb calls to manage events.

3) Split the existing check_events into three parts.

a) The check_events thread now uses libusb_handle_events_timeout_completed() to poll for libusb events. By using this higher level call, we avoid any polling conflicts with the other blocking libmtp calls and the timeout allows us to terminate polling on shutdown with a reasonable latency (1s timeout seems fine)

b) The callback for Read_Event_Async. It checks that shutdown isn't going on, and if not, it will push the event to a thread pool for evaluation

c) The event handling logic moves to a separate function executed on the thread pool. This work cannot be inlined into b) because it does libmtp calls and these will block if not recursively inside b) which is inside a) with the libusb event lock held.

4) Now that we know the check_event thread can be terminated, we can join the thread during unmount before releasing the mtp device. (And we don't need to play games with weak refs anymore.)

5) Profit.
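The a)/b)/c) split above can be modeled with a small self-contained sketch (plain pthreads and hypothetical names like event_cb and handle_event_worker standing in for the real libusb/GLib machinery): the async callback stays cheap because it runs with the libusb event lock held, while the potentially blocking work happens on a worker thread that may take the backend mutex.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the a)/b)/c) split. The callback (b) must never block
 * (it runs with the libusb event lock held), so it only queues work;
 * the handler (c) runs on a worker thread and may take the mutex. */
static pthread_mutex_t backend_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool unmount_started = false;
static atomic_int queued = 0;
static atomic_int handled = 0;

/* b) the Read_Event_Async-style callback: cheap, never blocks */
static void
event_cb (int event)
{
  if (atomic_load (&unmount_started))
    return;                      /* shutdown in progress: drop event */
  atomic_fetch_add (&queued, 1); /* stands in for g_thread_pool_push */
}

/* c) the handler on the pool thread: may do blocking work */
static void *
handle_event_worker (void *data)
{
  while (atomic_load (&handled) < atomic_load (&queued))
    {
      pthread_mutex_lock (&backend_mutex);
      atomic_fetch_add (&handled, 1); /* real event reaction here */
      pthread_mutex_unlock (&backend_mutex);
    }
  return NULL;
}

static int
dispatch_and_handle (int n_events)
{
  pthread_t worker;
  int i;
  for (i = 0; i < n_events; i++)
    event_cb (i);
  pthread_create (&worker, NULL, handle_event_worker, NULL);
  pthread_join (worker, NULL);
  return atomic_load (&handled);
}
```

This mirrors why c) cannot be inlined into b): the blocking call must happen on a thread that does not hold the libusb event lock.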
Comment 9 Philip Langdale 2016-04-10 21:10:34 UTC
Created attachment 325685 [details] [review]
Fix unmount race 1/4
Comment 10 Philip Langdale 2016-04-10 21:10:47 UTC
Created attachment 325686 [details] [review]
Fix unmount race 2/4
Comment 11 Philip Langdale 2016-04-10 21:11:01 UTC
Created attachment 325687 [details] [review]
Fix unmount race 2/3
Comment 12 Philip Langdale 2016-04-10 21:11:40 UTC
Created attachment 325688 [details] [review]
Fix unmount race 4/4

I have attached the four-patch series that implements this fix, assuming the required libmtp changes are available.
Comment 13 Philip Langdale 2016-04-11 00:59:38 UTC
Ha!

libusb recently added a way to interrupt the poll loop. It's not in a release yet, but it'll fix the problem comprehensively when it's available.

https://github.com/libusb/libusb/commit/a6db382ad11f7662b550338e0570d5a2dfd8ce5a

and they've added some new documentation that says the poll function will return when you close the usb device, so maybe I can re-order the cleanup sequence to get the desired results too.
Comment 14 Philip Langdale 2016-04-11 03:02:05 UTC
Created attachment 325695 [details] [review]
Fix unmount race 4/4 v2

Improved patch 4 that takes advantage of knowing that release_device will interrupt the polling thread. We can now let it block for a long time, knowing it will be interrupted, and then join after calling release.

For the record, this technique still doesn't work with the old Read_Event API because the internal implementation of blocking does not react to interruption correctly; it will immediately try to poll again.

In fact, this is why the reported crashes happen: it's not safe to retry the poll after the device is released.
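The v2 ordering can be modeled with a condition variable standing in for the blocking poll (hypothetical names; not the actual gvfs code): release wakes the poller, and the poller then returns instead of retrying, since retrying against a released device is exactly the reported crash.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy model of the v2 ordering: the poller blocks until "release"
 * wakes it (as libusb does when the device is closed), then returns
 * WITHOUT retrying the poll. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool released = false;

static void *
poller (void *data)
{
  pthread_mutex_lock (&lock);
  while (!released)
    pthread_cond_wait (&cond, &lock); /* the blocking poll */
  pthread_mutex_unlock (&lock);
  return NULL;                        /* return; do NOT poll again */
}

static int
release_then_join (void)
{
  pthread_t t;
  pthread_create (&t, NULL, poller, NULL);
  pthread_mutex_lock (&lock);
  released = true;                 /* "release" interrupts the poll */
  pthread_cond_broadcast (&cond);
  pthread_mutex_unlock (&lock);
  pthread_join (t, NULL);          /* join only after release */
  return 1;
}
```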
Comment 15 Ondrej Holy 2016-04-12 11:16:54 UTC
Review of attachment 325685 [details] [review]:

Thanks for working on this...

::: daemon/gvfsbackendmtp.c
@@ +739,1 @@
     }

It seems to me you forgot to add:
else {
  return NULL;
}
Comment 16 Ondrej Holy 2016-04-12 11:17:22 UTC
Review of attachment 325686 [details] [review]:

::: daemon/gvfsbackendmtp.c
@@ +646,3 @@
+    g_mutex_lock (&backend->mutex);
+    switch (event) {
+    case LIBMTP_EVENT_STORE_ADDED:

You are changing the indentation here just for the "case" keywords and not for the code inside. I see the reasons, and maybe this indentation is more accurate than the previous one; however, it would be nice to be consistent in the indentation everywhere. So maybe it would be best to merge this patch with the previous one...
Comment 17 Ondrej Holy 2016-04-12 11:18:09 UTC
Review of attachment 325687 [details] [review]:

Looks good though I didn't test it...
Comment 18 Ondrej Holy 2016-04-12 11:18:38 UTC
Review of attachment 325695 [details] [review]:

I am not really sure that this fixes our problem. It currently crashes because LIBMTP_Release_Device is called during event handling. LIBMTP_Release_Device may still be called before the event thread has finished, so LIBMTP_Read_Event_Async and LIBMTP_Handle_Events_Timeout_Completed may still be executed after LIBMTP_Release_Device. So there is still the same race as before, isn't there? The event thread needs to be terminated before calling LIBMTP_Release_Device, doesn't it? It seems to me that the previous version of this patch did that...

::: daemon/gvfsbackendmtp.c
@@ +1009,3 @@
+#ifdef HAVE_LIBMTP_1_1_12
+  /* LIBMTP_Release_Device interrupts the polling thread. */
+  g_thread_join(op_backend->event_thread);

missing space before "("
Comment 19 Philip Langdale 2016-04-12 14:17:35 UTC
This is explicitly documented in the libusb code. Release device does a libusb close device which will interrupt the event polling and allow it to shutdown cleanly.

The crash happens because of the libusb bug which causes read event to retry after interruption instead of returning.

I'm pretty confident that it is safe to call close device at any time during the poll cycle because the only shared state with the poller is the set of fds. As long as the code around the poll is safe, you're ok. And I made sure it is.

The ideal solution, as I noted, is to use the new interrupt transfer call so that we can join the thread before calling release device but I don't think we have a correctness issue without it.
Comment 20 Philip Langdale 2016-04-13 02:30:21 UTC
Created attachment 325833 [details] [review]
Fix unmount race 1/3 v2

Merging the first two parts as requested.
Comment 21 Philip Langdale 2016-04-13 02:40:01 UTC
Created attachment 325834 [details] [review]
Fix unmount race 3/3 v3

Ok, I see two potential problems.

1) Read_Event_Async being called after release device. This might crash.

2) Handle_Events being called after release device. This might end up blocking for the full timeout period.

Obviously a very small race window, but very difficult to combat. I think you could use a separate mutex to avoid 1) but don't see a way to avoid 2).

When we're in a position to use the new interrupt_handler call, then we can deal with this properly.

I've reverted to the one second poll implementation.
Comment 22 Ondrej Holy 2016-04-13 09:56:04 UTC
Review of attachment 325833 [details] [review]:

Thanks, it looks better... just some nitpicks for consistency with the rest of the code...

(I set commit_after_freeze, because GVfs is not branched for gnome-3-20 yet, so it is not a good idea to commit it right now.)

::: daemon/gvfsbackendmtp.c
@@ +641,3 @@
+}
+
+void

static void

@@ +642,3 @@
+
+void
+handle_event (GVfsBackendMtp *backend, LIBMTP_event_t event, uint32_t param1) {

the opening curly bracket should be on a new line
Comment 23 Ondrej Holy 2016-04-13 09:56:14 UTC
Review of attachment 325687 [details] [review]:

I wanted to test it, however the patch is not currently applicable...

I realized that this is less safe than the current solution: previously you obtained a weak reference to the backend before calling g_thread_pool_push, whereas handle_event now runs without that additional reference... This is still a problem even with the next patch. Imagine the case where handle_event and do_unmount are called at the same time and do_unmount locks the mutex. handle_event continues after do_unmount finishes, so there is still some time before finalize is called and the backend is cleared, and thus still a really small chance of a segfault... I don't see any simple solution for it, so it would be good to add at least a comment in the code that this might happen...

I have added also some nitpicks for the formatting...

::: daemon/gvfsbackendmtp.c
@@ +432,3 @@
 
+#if HAVE_LIBMTP_1_1_5
+  g_thread_pool_free(backend->event_pool, TRUE, TRUE);

missing space before opening parentheses

@@ +925,3 @@
+#if HAVE_LIBMTP_1_1_5
+  /* It's no longer safe to handle events. */
+  g_thread_pool_set_max_threads(op_backend->event_pool, 0, NULL);

dtto
Comment 24 Ondrej Holy 2016-04-13 09:56:38 UTC
Review of attachment 325834 [details] [review]:

Cool. I am not sure how this will work with force_unmount, but a regular unmount should definitely be much safer... I will test it once the patches are applicable... Maybe we should add a hard dependency on the new libmtp to force distros to upgrade the libmtp library, but that's just an idea (I don't know the libmtp release cycle)...

::: daemon/gvfsbackendmtp.c
@@ +626,3 @@
+#if HAVE_LIBMTP_1_1_12
+static void
+check_event_cb(int ret, LIBMTP_event_t event, uint32_t param1, void *user_data) {

opening bracket should be on a new line

@@ +633,3 @@
+
+  if (ret != LIBMTP_HANDLER_RETURN_OK ||
+      g_atomic_int_get (&backend->unmount_started)) {

We should probably also check force_unmounted...

@@ +640,3 @@
+  data->event = event;
+  data->param1 = param1;
+  gboolean tret = g_thread_pool_push(backend->event_pool, data, NULL);

missing space before opening parentheses

@@ +661,3 @@
+        continue;
+      }
+      backend->event_completed = FALSE;

It would be better to set the flag before LIBMTP_Read_Event_Async.

@@ +666,3 @@
+     * Return from polling every second to check for unmount.
+     */
+    struct timeval tv = { 1, 0 };

It would be good to use some macro instead of hardcoded number.

@@ +668,3 @@
+    struct timeval tv = { 1, 0 };
+    DEBUG_ENUMERATE ("(I) check_event: Polling for events.\n");
+    ret = LIBMTP_Handle_Events_Timeout_Completed(&tv, &(backend->event_completed));

missing space before opening parentheses

@@ +981,3 @@
+#ifdef HAVE_LIBMTP_1_1_12
+  /* Thread will terminate after flag is set. */
+  g_thread_join(op_backend->event_thread);

missing space before opening parentheses

@@ +982,3 @@
+  /* Thread will terminate after flag is set. */
+  g_thread_join(op_backend->event_thread);
+#endif

It would be better to add this code below g_thread_pool_set_max_threads, to be sure that another handle_event is not started in the meantime.
Comment 25 Philip Langdale 2016-04-13 14:21:49 UTC
(In reply to Ondrej Holy from comment #23)
> Review of attachment 325687 [details] [review] [review]:
>
> I realized that this is less safe than the current solution, because you
> obtained weak reference for the backend before calling g_thread_pool_push,
> however handle_event is running without the additional reference... This is
> still problem even with the next patch. Imagine the case when handle_event
> and do_unmount is called in the same time and do_unmount locks the mutex.
> Handle_event continues after finishing do_unmount. So there is still some
> time before finalize is called and backend is cleared, however there is
> still a really small chance for a segfault... I don't see any simple
> solution for it, so it would be good to add at least some comment in the
> code that it might happen...

the 'backend' used by handle_event comes from the thread_pool. The thread_pool doesn't hold an extra ref to the backend because it is itself cleaned up as part of finalizing the backend, so the lifetimes are synchronized.

handle_event takes the mutex around everything it does, so either it is running or do_unmount is running. If handle_event is running, then we don't actually do the unmount until it is finished. If do_unmount is running, then handle_event will continue and immediately check for unmount_started, and then immediately return when it sees it is set.

So this ordering is safe.

do_unmount sets max threads to zero so if any more events are queued, they will not be processed, and even if they were, it would still be safe.
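This ordering argument can be sketched with a toy model (hypothetical names, plain pthreads, not the real backend code): handle_event does all of its work under the backend mutex and returns early once unmount has started, so either interleaving with do_unmount is safe.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model: handle_event holds the backend mutex for everything it
 * does and bails out first thing if unmount already started, so the
 * two functions can run in either order without touching torn-down
 * state. */
static pthread_mutex_t backend_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool unmount_started = false;

static int
handle_event (void)
{
  int did_work = 0;
  pthread_mutex_lock (&backend_mutex);
  if (!atomic_load (&unmount_started))
    did_work = 1;                /* the real event reaction goes here */
  pthread_mutex_unlock (&backend_mutex);
  return did_work;
}

static void
do_unmount (void)
{
  pthread_mutex_lock (&backend_mutex);
  atomic_store (&unmount_started, true);
  pthread_mutex_unlock (&backend_mutex);
}
```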
Comment 26 Philip Langdale 2016-04-13 14:29:14 UTC
(In reply to Ondrej Holy from comment #24)
> Review of attachment 325834 [details] [review] [review]:
> 
> Cool, I am not sure how this will work with force_unmount, but regular
> unmount should be definitely much safer... I will test it once the patches
> will be applicable... Maybe we should add hard dependency for the new libmtp
> to force distros upgrade the libmtp library, but that's just idea (don't
> know libmtp release cycle)...

I will do more testing with force_unmount. I forget which code paths are executed and which are not. I suspect the right thing to do is set 'unmount_started' as part of forced unmount but I'll confirm and produce a separate diff. There's no new problem here - we were not consistently checking the forced flag before either.

> @@ +661,3 @@
> +        continue;
> +      }
> +      backend->event_completed = FALSE;
> 
> It would be better to set the flag before LIBMTP_Read_Event_Async.

I don't set it before because of the error case. If the call fails, then we need to retry and event_completed must be TRUE to allow for the retry. Setting it to FALSE first would mean having to set it back to TRUE in the error path. There's no actual benefit to doing it first as this is a true asynchronous method and until the Handle_Events call is made there's no forward progress.
 
> @@ +982,3 @@
> +  /* Thread will terminate after flag is set. */
> +  g_thread_join(op_backend->event_thread);
> +#endif
> 
> It would be better to add this code below g_thread_pool_set_max_threads, to
> be sure that another handle_event is not started in the meantime.

I did it in this order so that we know that we're definitely not adding new items to the queue before we pause the queue. As discussed above, starting a new handle_event is fine - it'll block on the mutex and then immediately return.
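The event_completed flag ordering argued for above can be illustrated with a small model (try_submit is a hypothetical stand-in for the LIBMTP_Read_Event_Async call): a failed submission leaves the flag TRUE so the poll loop naturally retries; only a successful submission clears it.

```c
#include <stdbool.h>

/* Toy model: event_completed is cleared only after a successful
 * submission, so the error path needs no extra code to restore the
 * flag for the retry. */
static bool event_completed = true;

static int
try_submit (bool submit_ok)
{
  if (!submit_ok)
    return 0;              /* error path: flag stays TRUE, loop retries */
  event_completed = false; /* only now do we wait for a completion */
  return 1;
}
```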
Comment 27 Philip Langdale 2016-04-13 14:38:01 UTC
Until the change is committed to libmtp and a release takes place, we can't add a hard dependency. But after that point, I think it's reasonable to add such a dependency.

I'm attaching updated diffs that compile against master (I have to work against 3.16 due to my current distro)
Comment 28 Philip Langdale 2016-04-13 14:38:41 UTC
Created attachment 325865 [details] [review]
Fix unmount race 1/3 v4
Comment 29 Philip Langdale 2016-04-13 14:38:59 UTC
Created attachment 325866 [details] [review]
Fix unmount race 2/3 v4
Comment 30 Philip Langdale 2016-04-13 14:39:16 UTC
Created attachment 325867 [details] [review]
Fix unmount race 3/3 v4
Comment 31 Philip Langdale 2016-04-13 14:52:29 UTC
Created attachment 325869 [details] [review]
Fix unmount race 3/3 v4

I took a look at forced unmount and I think the reasonable thing to do is set unmount_started in that path and just check for unmount_started in the usual way. Updated diff 3/3 accordingly.
Comment 32 Ondrej Holy 2016-04-14 11:07:39 UTC
(In reply to Philip Langdale from comment #25)
> (In reply to Ondrej Holy from comment #23)
> > Review of attachment 325687 [details] [review] [review] [review]:
> >
> > I realized that this is less safe than the current solution, because you
> > obtained weak reference for the backend before calling g_thread_pool_push,
> > however handle_event is running without the additional reference... This is
> > still problem even with the next patch. Imagine the case when handle_event
> > and do_unmount is called in the same time and do_unmount locks the mutex.
> > Handle_event continues after finishing do_unmount. So there is still some
> > time before finalize is called and backend is cleared, however there is
> > still a really small chance for a segfault... I don't see any simple
> > solution for it, so it would be good to add at least some comment in the
> > code that it might happen...
> 
> the 'backend' uses by handle_event comes from the thread_pool. The
> thread_pool doesn't hold an extra ref to the backend because it itself is
> cleaned up as part of finalizing the backend, so the lifetimes are
> synchronized.
> 
> handle_event takes the mutex around everything it does, so either it is
> running or do_unmount is running. If handle_event is running, then we don't
> actually do the unmount until it is finished. If do_unmount is running, then
> handle_event will continue and immediately check for unmount_started, and
> then immediately return when it sees it is set.
> 
> So this ordering is safe.
> 
> do_unmount sets max threads to zero so if any more events are queued, they
> will not be processed, and even if they were, it would still be safe.

Thanks for the explanation, and sorry; I don't know the GThreadPool API well. I realized that g_thread_pool_free waits for the threads to terminate, so it is safe then... :-)

Thanks for the modified patches, I will look at them later...
Comment 33 Ondrej Holy 2016-04-15 15:18:04 UTC
Review of attachment 325865 [details] [review]:

Looks good, thanks! I've already branched GVfs, so feel free to push the patches...
Comment 34 Ondrej Holy 2016-04-15 15:18:25 UTC
Review of attachment 325866 [details] [review]:

Looks good!
Comment 35 Ondrej Holy 2016-04-15 15:18:51 UTC
Review of attachment 325869 [details] [review]:

Looks good and it seems it works pretty well...

::: daemon/gvfsbackendmtp.c
@@ +77,3 @@
+ ************************************************/
+
+#define ONE_SECOND { 1, 0 }

A constant named ONE_SECOND isn't very useful; it should rather be something like EVENT_POLL_RATE, in order to allow simple changes of the poll rate without further code changes (i.e. ONE_SECOND -> TWO_SECONDS) in the future...
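For illustration, the suggested rename might look like this (EVENT_POLL_RATE is the reviewer's proposed name; the value mirrors the one-second timeout from the patch):

```c
#include <sys/time.h>

/* EVENT_POLL_RATE: how long each timed poll waits before re-checking
 * for unmount. A named macro lets the interval be tuned without
 * touching the loop body. */
#define EVENT_POLL_RATE { 1, 0 }   /* { seconds, microseconds } */

static long
poll_rate_seconds (void)
{
  struct timeval tv = EVENT_POLL_RATE;
  return (long) tv.tv_sec;
}
```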
Comment 36 Philip Langdale 2016-04-15 15:54:48 UTC
I've pushed the patches with the constant renamed. Thanks for the reviews.