GNOME Bugzilla – Bug 778135
Hang when downloading flatpak data
Last modified: 2018-09-21 16:28:54 UTC
There is a corner case leading to a hang since 8e1b0f5a99b1f1236c3d1124cc718e905e5741f1. Basically the situation is:
- the SoupMessageQueueItem has async_pending set to TRUE
- its state is SOUP_MESSAGE_FINISHING
- but somehow no idle_run_queue GSource is scheduled
- the item is first in the queue

The result is that all the items in the queue with the same context as the first item are not processed, so it hangs. I'll attach patches in the followup comments.
Created attachment 344853 [details] [review] Fix a corner case with scheduling queue processing GSource
Created attachment 344854 [details] [review] Remove unused field from queue item
Created attachment 344855 [details] [review] Use G_SOURCE_REMOVE instead of FALSE in idle_run_queue
Created attachment 345023 [details] [review] Fix a corner case with scheduling queue processing GSource

I have noticed that the commit message in the first patch had unwrapped lines; here is the updated one. Otherwise no changes in the code.
Review of attachment 345023 [details] [review]: I don't understand this code well enough to know if this patch is correct, but I have a few comments. You should squash the "Remove unused field from queue item" patch into this one, since the field is not unused until this patch is applied. It is quite a shame that we are making the code significantly messier when we do not yet understand the original problem, which is that item->async_pending somehow got set to TRUE when no async operation was actually pending. It really should never happen. I understand that libsoup can be extremely difficult to debug, though, and you did a good job coming up with a patch to such complex code that fixes the problem, while clearly being careful to avoid thread safety issues. Perhaps a totally incorrect shot in the dark, but I wonder if it could be related to the special handling of CONNECT messages: /* CONNECT messages are handled specially */ if (msg->method == SOUP_METHOD_CONNECT) continue; ::: libsoup/soup-session.c @@ +144,3 @@ + + GMutex queue_datas_lock; + GHashTable *queue_datas; data is already plural, so this should be named "queue_data_lock" and "queue_data". Even if it is a list of multiple IdleRunQueueData objects. @@ +363,1 @@ G_OBJECT_CLASS (soup_session_parent_class)->finalize (object); Style nit: leave a blank line here before chaining up. @@ +2134,3 @@ + if (session) { + remove_idle_run_queue_data (session, queue_data); + g_clear_object (&session); This should use g_object_unref(), since you don't need to set session to NULL and you already know it's not NULL here. @@ +2345,3 @@ + source = g_idle_source_new (); + g_source_set_priority (source, G_PRIORITY_DEFAULT); + g_source_set_callback (source, idle_run_queue, queue_data, idle_run_queue_dnotify); Shouldn't you use soup_add_completion_reffed() for this? Does the g_source_attach() have to be done outside the mutex? I doubt it? If so, why?
Review of attachment 344854 [details] [review]: As Michael says, this should be squashed into the main patch.
Review of attachment 344855 [details] [review]: Nice cleanup.
(In reply to Michael Catanzaro from comment #5) > Review of attachment 345023 [details] [review] [review]: > > I don't understand this code well enough to know if this patch is correct, > but I have a few comments. > > You should squash the "Remove unused field from queue item" patch into this > one, since the field is not unused until this patch is applied. Alright, will do. > > It is quite a shame that we are making the code significantly messier when > we do not yet understand the original problem, which is that > item->async_pending somehow got set to TRUE when no async operation was > actually pending. It really should never happen. I understand that libsoup > can be extremely difficult to debug, though, and you did a good job coming > up with a patch to such complex code that fixes the problem, while clearly > being careful to avoid thread safety issues. > > Perhaps a totally incorrect shot in the dark, but I wonder if it could be > related to the special handling of CONNECT messages: > > /* CONNECT messages are handled specially */ > if (msg->method == SOUP_METHOD_CONNECT) > continue; > Likely not correct, the method is simply "GET". I was rather wondering if it wasn't the following snippet (at soup-session.c:1570'ish): if (item->state != SOUP_MESSAGE_RESTARTING) { item->state = SOUP_MESSAGE_FINISHING; if (item->new_api && !item->async) soup_session_process_queue_item (item->session, item, NULL, TRUE); } > ::: libsoup/soup-session.c > @@ +144,3 @@ > + > + GMutex queue_datas_lock; > + GHashTable *queue_datas; > > data is already plural, so this should be named "queue_data_lock" and > "queue_data". Even if it is a list of multiple IdleRunQueueData objects. > Will do. > @@ +363,1 @@ > G_OBJECT_CLASS (soup_session_parent_class)->finalize (object); > > Style nit: leave a blank line here before chaining up. > Ok. 
> @@ +2134,3 @@
> + if (session) {
> + remove_idle_run_queue_data (session, queue_data);
> + g_clear_object (&session);
>
> This should use g_object_unref(), since you don't need to set session to
> NULL and you already know it's not NULL here.

I kind of wanted to avoid a situation where, in the future, somebody decides to use the session variable after the if (at which point it could be a dangling pointer), risking some hard-to-reproduce bugs. Maybe I should rewrite it so the session variable goes out of scope, to limit the potential for use after free. Probably sounds a bit strained.

> @@ +2345,3 @@
> + source = g_idle_source_new ();
> + g_source_set_priority (source, G_PRIORITY_DEFAULT);
> + g_source_set_callback (source, idle_run_queue, queue_data,
> idle_run_queue_dnotify);
>
> Shouldn't you use soup_add_completion_reffed() for this? Does the
> g_source_attach() have to be done outside the mutex? I doubt it? If so, why?

I guess that one can never be too careful with code running inside a mutex. I had an uneasy feeling about the locking order here with the queue data mutex and the context lock. I was wondering for a short time if there might be a situation where one thread locks the queue data mutex and then tries to attach the source to a context (thus trying to lock the context) while another thread has the context locked and tries to take the queue data lock (somehow). Admittedly, I'm not an expert in the contents of gmain.c, so I may simply be wrong here. Anyway, for peace of mind, I decided to attach the source with no queue data lock held; that shouldn't hurt either. Thanks for the review!
Created attachment 345813 [details] [review] Fix a corner case with scheduling queue processing GSource, v2

Merged the patch removing the unused field into the big one, renamed the queue_datas{,_lock} variables to queue_data{,_lock}, separated the chain-up with an empty line and used g_object_unref instead of g_clear_object. I still attach the source outside the queue data lock.
Review of attachment 345023 [details] [review]:

Here’s my review of the existing patch. However, I have some thoughts about what could be causing this, which would result in a completely different patch. I think we should fix the underlying issue now, while we can reproduce it, rather than hiding it and having it resurface at some point in the future.

---

In order for item->async_pending to be set in the first place, idle_run_queue() must have been scheduled for the item at least once. So it must be the case that either:
1. idle_run_queue() runs and does not process the item; or
2. idle_run_queue() runs and processes the item, but does not complete it and does not re-schedule it.

#1: This could happen if either:
a) item->async changes to !item->async before async_run_queue() is called. This is not possible: it’s a construct-only field.
b) item->async_context does not match the session’s context. See below.
c) The item uses SOUP_METHOD_CONNECT, which has already been discounted.

#2: This could happen whenever the item->state is set to SOUP_MESSAGE_FINISHING without also re-queueing the item:
a) message_completed() (as Krzesimir says, this looks suspicious).
b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff it’s called with !loop. We could check this if we knew the value of item->msg->status_code when it stalls.

I think #1b might be a problem, since the GMainContext used in soup_session_real_kick_queue() is not necessarily the same as the one checked against in async_run_queue(). Patch coming. I also think #2a and #2b are worth investigating.

An easy way to investigate this would be to printf(G_STRFUNC, old_state, new_state) on each change to item->state to work out which code path is causing this.

::: libsoup/soup-session.c
@@ +144,3 @@
+
+ GMutex queue_datas_lock;
+ GHashTable *queue_datas;

(Technically the singular is ‘datum’, but nobody uses that in common usage.)
@@ +266,3 @@ + + g_mutex_init (&priv->queue_datas_lock); + priv->queue_datas = g_hash_table_new (NULL, NULL); Nitpick: Use g_hash_table_new_full() and explicitly set the free functions to NULL, to make it clear what they are (and that they haven’t been forgotten about). @@ +2134,3 @@ + if (session) { + remove_idle_run_queue_data (session, queue_data); + g_clear_object (&session); I agree with Krzesimir here: it’s a bit more future-proof to use g_clear_object(). Otherwise you do have a dangling GObject pointer (session) hanging around, which is messy even if it’s safely not used at the moment. @@ +2341,3 @@ + queue_data = g_slice_new (IdleRunQueueData); + g_weak_ref_init (&queue_data->wref, session); + queue_data->context = context; Don’t you need a strong ref on the GMainContext here? What guarantee do you have that it won’t disappear while in the queue? @@ +2343,3 @@ + queue_data->context = context; + + source = g_idle_source_new (); Call g_source_set_name() too, to make debugging easier. @@ +2345,3 @@ + source = g_idle_source_new (); + g_source_set_priority (source, G_PRIORITY_DEFAULT); + g_source_set_callback (source, idle_run_queue, queue_data, idle_run_queue_dnotify); Agreed, I can’t think of why calling the g_source_attach() inside the mutex would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t think of how any user code could be called with a GMainContext lock held. > I decided to attach the source with no queue data lock held, that shouldn't hurt either. Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove the same context from the queue, then thread 1 to erroneously attach the source for it?
(In reply to Philip Withnall from comment #10) > Review of attachment 345023 [details] [review] [review]: > > Here’s my review of the existing patch. Argh, mid-air collision. Most of my comments apply to the new patch.
(In reply to Philip Withnall from comment #10)
> #2: This could happen whenever the item->state is set to
> SOUP_MESSAGE_FINISHING without also re-queueing the item:
> a) message_completed() (as Krzesimir says, this looks suspicious).
> b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> it’s called with !loop. We could check this if we knew the value of
> item->msg->status_code when it stalls.

Also:
#2:
c) cache_stream_finished() calls soup_session_kick_queue(), but if the item already has item->async_pending set, that will do nothing.
d) cancel_cache_response() calls soup_session_kick_queue() similarly.

They should probably be calling soup_session_process_queue_item() instead? I have not followed the async cache handling code through completely to see if some calling function takes care of this — for the moment it should be enough to work out whether the async cache code touches this item.
(In reply to Philip Withnall from comment #10) > Review of attachment 345023 [details] [review] [review]: > > Here’s my review of the existing patch. However, I have some thoughts about > what could be causing this, which would result in a completely different > patch. I think we should fix the underlying issue now, while we can > reproduce it, rather than hiding it and it resurfacing at some point in the > future. > > --- > > In order for item->async_pending to be set in the first place, > idle_run_queue() must have been scheduled for the item at least once. So it > must be the case that either: > 1. idle_run_queue() runs and does not process the item; or > 2. idle_run_queue() runs and processes the item, but does not complete it > and does not re-schedule it. > > #1: This could happen if either: > a) item->async changes to !item->async before async_run_queue() is called. > This is not possible: it’s a construct-only field. > b) item->async_context does not match the session’s context. See below. 
In my case:
- item->async_context comes from soup_session_get_async_context(item->session) (soup_message_queue_append())
- item->session comes from queue->session (also soup_message_queue_append())
- queue->session comes from soup_message_queue_new(), called from soup_session_init()
- the session is created by the ostree fetcher with soup_session_async_new_with_options(), which also passes the SOUP_SESSION_USE_THREAD_CONTEXT option set to TRUE (so soup_session_get_async_context() always returns the thread-default main context) (ostree_fetcher_session_thread)
- the main context is pushed as a new thread default before the session is created and is iterated later in the function (also ostree_fetcher_session_thread)
- the main context is created by the fetcher (_ostree_fetcher_constructed)
- soup_session_send_async is always invoked indirectly with g_main_context_invoke_full on the fetcher's main context, so it's executed in the thread where the soup requests are processed, so soup_session_get_async_context() returns the expected thread-default main context (that is, the one equal to item->async_context) (session_thread_idle_add in the ostree fetcher)

So, I'm not sure if this case is applicable here.

> c) The item uses SOUP_METHOD_CONNECT, which has already been discounted.
>
> #2: This could happen whenever the item->state is set to
> SOUP_MESSAGE_FINISHING without also re-queueing the item:
> a) message_completed() (as Krzesimir says, this looks suspicious).

OTOH, message_completed calls soup_session_kick_queue for an async item.

> b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> it’s called with !loop. We could check this if we knew the value of
> item->msg->status_code when it stalls.

Hm, the stalled item in my queue had the SOUP_MESSAGE_FINISHING state.

> I think #1b might be a problem, since the GMainContext used in
> soup_session_real_kick_queue() is not necessarily the same as the one
> checked against in async_run_queue().
> Patch coming. I also think #2a and #2b
> are worth investigating.
>
> An easy way to investigate this would be to printf(G_STRFUNC, old_state,
> new_state) on each change to item->state to work out which code path is
> causing this.

I'll try that. Earlier I tried using watchpoints in gdb, but that worked somewhat strangely and not really in a helpful way.

> ::: libsoup/soup-session.c
> @@ +144,3 @@
> +
> + GMutex queue_datas_lock;
> + GHashTable *queue_datas;
>
> (Technically the singular is ‘datum’, but nobody uses that in common usage.)
>
> @@ +266,3 @@
> +
> + g_mutex_init (&priv->queue_datas_lock);
> + priv->queue_datas = g_hash_table_new (NULL, NULL);
>
> Nitpick: Use g_hash_table_new_full() and explicitly set the free functions
> to NULL, to make it clear what they are (and that they haven’t been
> forgotten about).

Ok.

> @@ +2134,3 @@
> + if (session) {
> + remove_idle_run_queue_data (session, queue_data);
> + g_clear_object (&session);
>
> I agree with Krzesimir here: it’s a bit more future-proof to use
> g_clear_object(). Otherwise you do have a dangling GObject pointer (session)
> hanging around, which is messy even if it’s safely not used at the moment.

Ah, cool, I'll revert my change then.

> @@ +2341,3 @@
> + queue_data = g_slice_new (IdleRunQueueData);
> + g_weak_ref_init (&queue_data->wref, session);
> + queue_data->context = context;
>
> Don’t you need a strong ref on the GMainContext here? What guarantee do you
> have that it won’t disappear while in the queue?

When the GMainContext goes away, all the sources attached to it are destroyed. And our GSource removes the context from the queue data in the destroy notify function.

> @@ +2343,3 @@
> + queue_data->context = context;
> +
> + source = g_idle_source_new ();
>
> Call g_source_set_name() too, to make debugging easier.

Alright. What would the name be? "Soup idle queue run for context %p"? Should it have the address of the SoupSession in the name too?
> @@ +2345,3 @@ > + source = g_idle_source_new (); > + g_source_set_priority (source, G_PRIORITY_DEFAULT); > + g_source_set_callback (source, idle_run_queue, queue_data, > idle_run_queue_dnotify); > > Agreed, I can’t think of why calling the g_source_attach() inside the mutex > would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, > finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t > think of how any user code could be called with a GMainContext lock held. > > > I decided to attach the source with no queue data lock held, that shouldn't hurt either. > > Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove > the same context from the queue, then thread 1 to erroneously attach the > source for it? At this point - no, because the context is removed only in the GSource's destroy notify function and only if the IdleRunQueueData in the hash table is the same as the one associated to the GSource being destroyed. And here we already put our own, new IdleRunQueueData which is associated with the GSource we will attach. But anyway, if you think it is safe to attach the GSource inside the queue data lock then I'll just use the soup_add_completion_reffed().
(In reply to Philip Withnall from comment #12)
> (In reply to Philip Withnall from comment #10)
> > #2: This could happen whenever the item->state is set to
> > SOUP_MESSAGE_FINISHING without also re-queueing the item:
> > a) message_completed() (as Krzesimir says, this looks suspicious).
> > b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> > it’s called with !loop. We could check this if we knew the value of
> > item->msg->status_code when it stalls.
>
> Also:
> #2:
> c) cache_stream_finished() calls soup_session_kick_queue(), but if the item
> already has item->async_pending set, that will do nothing.

cache_stream_finished() is set as a signal callback in async_return_from_cache, which is called from two places:
- conditional_get_ready_cb, which is set as a callback in async_respond_from_cache, which is called only in soup_session_send_async. But the async_respond_from_cache function always returns false for me, so this path is not taken.
- idle_return_from_cache_cb, same story as above.

> d) cancel_cache_response() calls soup_session_kick_queue() similarly.

cancel_cache_response is called in two places:
- conditional_get_ready_cb, which is set as a callback in async_respond_from_cache, in the path not taken
- idle_return_from_cache_cb - same as above.

> They should probably be calling soup_session_process_queue_item() instead? I
> have not followed the async cache handling code through completely to see if
> some calling function takes care of this — for the moment it should be
> enough to work out whether the async cache code touches this item.

No idea if they should be calling soup_session_process_queue_item. But as noted above, we are not taking these code paths.
(In reply to Krzesimir Nowak from comment #13) > (In reply to Philip Withnall from comment #10) > > Review of attachment 345023 [details] [review] [review] [review]: > > > > Here’s my review of the existing patch. However, I have some thoughts about > > what could be causing this, which would result in a completely different > > patch. I think we should fix the underlying issue now, while we can > > reproduce it, rather than hiding it and it resurfacing at some point in the > > future. > > > > --- > > > > In order for item->async_pending to be set in the first place, > > idle_run_queue() must have been scheduled for the item at least once. So it > > must be the case that either: > > 1. idle_run_queue() runs and does not process the item; or > > 2. idle_run_queue() runs and processes the item, but does not complete it > > and does not re-schedule it. > > > > #1: This could happen if either: > > a) item->async changes to !item->async before async_run_queue() is called. > > This is not possible: it’s a construct-only field. > > b) item->async_context does not match the session’s context. See below. 
> > In my case: > - item->async_context comes from > soup_session_get_async_context(item->session) (soup_message_queue_append()) > - item->session comes from queue->session (also soup_message_queue_append()) > - queue->session comes from soup_message_queue_new() called from > soup_session_init() > - the session is created by the ostree fetcher with > soup_session_async_new_with_options() which also passes > SOUP_SESSION_USE_THREAD_CONTEXT option set as TRUE (so > soup_session_get_async_context() always returns the thread default main > context) (ostree_fetcher_session_thread) > - the main context is pushed as a new thread default before the session is > created and is iterated later in the function (also > ostree_fetcher_session_thread) > - the main context is created by the fetcher (_ostree_fetcher_constructed) > - the soup_session_send_async is always invoked indirectly with > g_main_context_invoke_full on the fetcher's main context, so it's executed > in the thread where the soup requests are processed, so > soup_session_get_async_context() returns the expected thread default main > context (that is the one equal to item->async_context) > (session_thread_idle_add in ostree fetcher) > > So, I'm not sure if this case is applicable here. OK. It still looks like there’s the potential for bugs here in other cases. > > c) The item uses SOUP_METHOD_CONNECT, which has already been discounted. > > > > #2: This could happen whenever the item->state is set to > > SOUP_MESSAGE_FINISHING without also re-queueing the item: > > a) message_completed() (as Krzesimir says, this looks suspicious). > > OTOH, the message_completed calls soup_session_kick_queue for an async item. Ah, yes. I discounted that because it’s before the item’s state is changed to SOUP_MESSAGE_FINISHING — but of course that doesn’t matter, because it’s queuing idle sources, not processing synchronously. > > b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff > > it’s called with !loop. 
> > We could check this if we knew the value of
> > item->msg->status_code when it stalls.
>
> Hm, the stalled item in my queue had the SOUP_MESSAGE_FINISHING state.

Yes, the code which handles SOUP_MESSAGE_READY (around line 1980) potentially sets the item’s state to SOUP_MESSAGE_FINISHING and then might do nothing else:

if (item->msg->status_code == SOUP_STATUS_TRY_AGAIN) {
        soup_message_cleanup_response (item->msg);
        item->state = SOUP_MESSAGE_STARTING;
} else
        item->state = SOUP_MESSAGE_FINISHING;
break;

So I think there’s the potential for a bug here? It depends on whether soup_session_process_queue_item() is called with !loop.

> > I think #1b might be a problem, since the GMainContext used in
> > soup_session_real_kick_queue() is not necessarily the same as the one
> > checked against in async_run_queue(). Patch coming. I also think #2a and #2b
> > are worth investigating.
> >
> > An easy way to investigate this would be to printf(G_STRFUNC, old_state,
> > new_state) on each change to item->state to work out which code path is
> > causing this.
>
> I'll try that. Earlier I tried using watchpoints in gdb, but that worked
> somewhat strange and not really in a helpful way.

You’d have to find the right SoupMessageQueueItem then set up the watchpoint, which would be fiddly. I think it’s a lot easier to log all state changes for all items, then filter the logs afterwards for the item you’re interested in.

> > @@ +2341,3 @@
> > + queue_data = g_slice_new (IdleRunQueueData);
> > + g_weak_ref_init (&queue_data->wref, session);
> > + queue_data->context = context;
> >
> > Don’t you need a strong ref on the GMainContext here? What guarantee do you
> > have that it won’t disappear while in the queue?
>
> When GMainContext goes away, all the sources attached to it are destroyed.
> And our GSource removes the context from the queue data in the destroy
> notify function.

Of course, sorry.
> > @@ +2343,3 @@ > > + queue_data->context = context; > > + > > + source = g_idle_source_new (); > > > > Call g_source_set_name() too, to make debugging easier. > > > > Alright. What the name would it be? "Soup idle queue run for context %p"? > Should it have an address of the SoupSession in the name too? I don’t think it’s worth a g_strdup_printf() call. “Soup idle queue runner” would be fine, I think. The context can always be figured out from which thread it’s running in. > > @@ +2345,3 @@ > > + source = g_idle_source_new (); > > + g_source_set_priority (source, G_PRIORITY_DEFAULT); > > + g_source_set_callback (source, idle_run_queue, queue_data, > > idle_run_queue_dnotify); > > > > Agreed, I can’t think of why calling the g_source_attach() inside the mutex > > would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, > > finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t > > think of how any user code could be called with a GMainContext lock held. > > > > > I decided to attach the source with no queue data lock held, that shouldn't hurt either. > > > > Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove > > the same context from the queue, then thread 1 to erroneously attach the > > source for it? > > At this point - no, because the context is removed only in the GSource's > destroy notify function and only if the IdleRunQueueData in the hash table > is the same as the one associated to the GSource being destroyed. And here > we already put our own, new IdleRunQueueData which is associated with the > GSource we will attach. Right. > But anyway, if you think it is safe to attach the > GSource inside the queue data lock then I'll just use the > soup_add_completion_reffed().
Created attachment 345879 [details] [review] Alternative patch probably fixing the same issue, tons simpler

I did logging like Philip suggested and got an interesting nugget at the end:

===============
function: get_connection line: 1945 old: SOUP_MESSAGE_STARTING new: SOUP_MESSAGE_CONNECTING async pending: false
function: connect_complete line: 1750 old: SOUP_MESSAGE_CONNECTING new: SOUP_MESSAGE_CONNECTED async pending: false
function: soup_session_process_queue_item line: 1985 old: SOUP_MESSAGE_CONNECTED new: SOUP_MESSAGE_READY async pending: false
function: soup_session_process_queue_item line: 2003 old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false
function: soup_session_real_requeue_message line: 2209 old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_RESTARTING async pending: false
function: soup_session_process_queue_item line: 2029 old: SOUP_MESSAGE_RESTARTING new: SOUP_MESSAGE_STARTING async pending: false
function: get_connection line: 1939 old: SOUP_MESSAGE_STARTING new: SOUP_MESSAGE_READY async pending: false
function: soup_session_process_queue_item line: 2003 old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false
function: message_completed line: 1577 old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_FINISHING async pending: true
===============

In the last entry async_pending is set to true. Looking at the code in the message_completed function makes this obvious - we call soup_session_kick_queue(), which sets async_pending to TRUE for all the async items in the queue. So I experimentally moved the queue kick to after the item's state is changed, and that apparently seemed to work. I'm attaching a patch with the change. It will hopefully obsolete the complex one. It's quite late now and I haven't yet understood why this started working. Tomorrow I will look at it again, test it more and write the proper commit message.
I’m not sure how this can make things work. If I’m understanding the patch correctly, it essentially changes:

soup_session_kick_queue (item->session);
item->state = SOUP_MESSAGE_FINISHING;

into

item->state = SOUP_MESSAGE_FINISHING;
soup_session_kick_queue (item->session);

(after eliminating all paths which we know aren’t taken). And since soup_session_real_kick_queue() doesn’t query the item->state, and shouldn’t cause idle_run_queue() to be called until after message_completed() returns anyway, I don’t see how this can make a difference.

Given that item->async_pending is changing from FALSE to TRUE on this call to soup_session_kick_queue(), that means that either idle_run_queue() is newly scheduled, or another item in the same context already has item->async_pending set.

Can you add some debug printf()s to idle_run_queue() and async_run_queue() to see when they’re both called, and which items are seen and have soup_session_process_queue_item() called on them by async_run_queue()?
I did some debugging and found a culprit. Basically a race condition here, but funnily 100% reproducible? Will post the details together with the patch.
Hmph, looks like the fix is not very obvious. So here is the log of what happens, together with my comments:

function: soup_session_process_queue_item line: 2005 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false

Here we changed the state to RUNNING.

function: async_run_queue line: 2102 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state after processing: SOUP_MESSAGE_RUNNING

function: soup_message_io_finished line: 183 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock completion: complete

After a while we can see that downloading has finished. Note that this happens in a different thread. This function calls a callback which, in this case, is the message_completed function.

function: soup_session_real_kick_queue line: 2352 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock scheduled the gsource

The first thing we do in the message_completed function is kick the queue. The idle source for running the queue processing has just been scheduled.

function: async_run_queue line: 2086 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock method is connect: false item async: true same context: true item context: 0x562b6de5a3f0 session context: 0x562b6de5a3f0

And bang, the previous thread got preempted (?) and we started processing the queue. Dunno about the preemption - it always happens. Probably it tries to lock something that's already locked?
function: async_run_queue line: 2096 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to FALSE

We set async pending to FALSE (it was FALSE anyway…).

function: async_run_queue line: 2099 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state before processing: SOUP_MESSAGE_RUNNING

function: async_run_queue line: 2102 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state after processing: SOUP_MESSAGE_RUNNING

As we can see, absolutely nothing happens - we are still in the RUNNING state. The processing is short-circuited by:

if (item->async)
        return;

function: soup_session_real_kick_queue line: 2362 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to TRUE

And we are back in the thread that was preempted; at this point the idle GSource callback has already been called, and we set async pending to TRUE.

function: message_completed line: 1579 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_FINISHING async pending: true

And here we change the state to FINISHING. It's too late.

function: soup_message_io_finished line: 188 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock completion done

The following calls to soup_session_real_kick_queue will show that async_pending is set to TRUE, despite no idle GSource being scheduled.
function: soup_session_real_kick_queue line: 2357 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock already pending

function: soup_session_real_kick_queue line: 2362 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to TRUE

So, my idea to fix it was basically to schedule an idle gsource that executes the same code as in the message_completed function if item->async is TRUE. But now I'm getting:

error: While pulling app/com.google.AndroidStudio/x86_64/eos3 from remote eos-apps: Server returned status 0: Unknown Error

Meh.
Created attachment 345959 [details] [review] Another alternative patch, this time with idle GSource.
Drive-by thoughts:

We’ve got calls to soup_session_real_kick_queue() from both threads, which doesn’t look right. SoupSession doesn’t use any locking for message processing (just connection handling), so which thread is *supposed* to be calling soup_session_real_kick_queue()? Is the code missing another soup_add_completion_reffed() call somewhere?

Was this trace recorded with attachment #345959 [details] applied, or with git master?
Sorry for not being clear with the logs. The logs come from the current master branch (plus the patch that adds the logs), without the fix. The patch I added here causes soup_session_real_kick_queue() to be invoked in the thread where item->context is iterated. But now soup_session_send_async seems to return a reply with an HTTP status of 0.

Here is the repo with my fork of libsoup I'm working on now - https://github.com/kinvolk/libsoup. Currently, the tracing and the not-fully-working fix are on the krnowak/tracing-msg-state-changes branch.
Some more findings (I don't even know where to begin):

- So far, the problems I had were with downloading the superblock. With the patch that basically moves the message_completed code into the idle GSource, the superblock downloads just fine and nothing hangs. The "error: While pulling app/com.google.AndroidStudio/x86_64/eos3 from remote eos-apps: Server returned status 0: Unknown Error" error happens later, for a different file (just some .filez).

- My "fix" patch breaks libsoup's "make check".

- Currently in my "fix" patch, the message_completed() function will defer kicking the queue and changing the message state if the item is async. Otherwise it will do both immediately. I added a hack that defers the actions only when the item is async AND the current thread is different from the thread where the session's async main context is iterated (well, actually different from the thread where the session was created, but I conflated those two for now; it's a hack anyway). Things seem to work again and libsoup's "make check" started to pass.

- Apparently there is an assumption that message_completed should sometimes run synchronously. Where this assumption is, I haven't yet pinpointed. Also, I haven't figured out why the message's status is still zero, but I have a feeling that it's some race condition.

- I was wondering what that second thread (the 0x562b6db29150 one) was that kept showing up in the logs and messing things up. I started flatpak again in gdb and set two breakpoints inside the message_completed function - one for when the item is async and the threads differ, and one for when the item is also async but the threads are the same. The backtrace for the same-threads case (so kicking the queue and changing the status was not deferred):
+ Trace 237155
The other thread is apparently the main thread. The message_completed function is called as a result of splicing the input stream returned by soup_session_send_finish to the memory output stream. The splicing closes the input stream, which in turn calls the message_completed function. So the input stream (SoupClientInputStream) is not thread-safe when my fix is not applied.

- Not sure what the proper fix for this case is. My first idea was to add another flag to SoupMessageQueueItem like "guint no_defer : 1" and set it to TRUE before calling g_input_stream_close in the send_async_spliced function. Then change the condition in message_completed to "if (item->async && !item->no_defer) { setup_gsource(); return; }". I haven't written it and haven't tested it, so it's all a wild guess. I'll continue the research on Monday.
Ugh, two backtraces were merged into one. Please view the raw version too, then.
message_completed() is being called as the completion_cb from soup-message-io.c, which also stores a `GMainContext *async_context`. The fact that soup_message_io_finished() can be called from anywhere (it’s non-static) and will directly call completion_cb() looks a bit fishy to me. Perhaps a better fix would be to add the g_main_context_invoke() call there (in soup_message_io_finished()) rather than in message_completed()? Looks to me like that would fix a more general problem. See also: get_headers_cb and parse_headers_cb in SoupMessageIOData.

Have you tried running libsoup with tsan or under Valgrind’s drd or helgrind? That might be interesting. (Running it under tsan would be faster than Valgrind, but means recompiling the whole stack from libsoup up to flatpak to use tsan.)
(In reply to Krzesimir Nowak from comment #19)
> Hmph, looks like the fix is not very obvious. So here is the log of what
> happens together with my comments:
> ...
> And we are back to the thread that was preempted, at this point the idle
> gsource callback was already called and we set the async pending to TRUE.

Right. That seems like the entire problem to me; you need to make sure that the idle_run_queue() calls queued up by soup_session_real_kick_queue() don't run until after soup_session_real_kick_queue() is done. Or at least, that the body of async_run_queue() doesn't run until soup_session_real_kick_queue() is done. So we need a mutex to protect against that.

(Actually, while you need to keep async_run_queue() and soup_session_real_kick_queue() from running at the same time, there's no problem with running async_run_queue() in two different threads simultaneously, since they'll operate on non-overlapping queue items. So you could use a GRWLock instead of a GMutex, with async_run_queue() getting a read lock, and soup_session_real_kick_queue() getting a write lock. I think.)
I didn't expect putting a GRWLock in to work, and indeed it deadlocked - see the backtrace (the reader lock was taken in async_run_queue, frame 18):
+ Trace 237162
I quickly tried to work around it by deferring the kicking with the idle GSource, but I got a lot of failures in make check, so I didn't try further.

I also tried running flatpak under valgrind's helgrind and drd, but I got something like 500k lines in the log files. Looking at some of the reports, it seems that GLib's locking utilities are somehow transparent(?) to helgrind. I tried to add suppressions, but after a while I concluded that there was just too much noise, so I gave up - it takes too much time.

Here, I am attaching a patch that seems to fix the issue for me - it basically defers the execution of the message_completed code when the current thread is not the owner of the item's async main context. Initially I did as Philip suggested - using g_main_context_invoke on the io->completion_cb - but I could only use io->async_context, as io->item is not always available. That patch didn't help - I was still getting failures in make check.

There is also a question about the thread-safety of the GInputStream object returned by soup_session_send_finish. The situation I have is that flatpak has two threads - the main thread and the fetcher thread. In the fetcher thread soup_session_send_async is called, so we eventually get the response body stream, which is then moved to the main thread and spliced to the memory buffer. Splicing closes the input stream, which results in the mess described in comment 19 (it eventually calls the soup_message_io_finished function). Is this a valid use of soup's input stream? If so, we have a bug in libsoup. Otherwise we have a bug in ostree and possibly a documentation bug in libsoup. I talked to Philip on IRC and he leaned toward the bug being in ostree (as I understood it). What do you think?
Created attachment 346412 [details] [review] Execute message_complete in the proper thread I actually forgot to add the patch. Also, for the code with GRWLock, please see https://github.com/kinvolk/libsoup/commit/65d90622ed9c9704b9f76abf26c39546150d47bd
So, I have a feeling that if the real culprit is that async_run_queue and soup_session_real_kick_queue shouldn't be running in parallel, then my first patch (https://bugzilla.gnome.org/attachment.cgi?id=345813) actually solves the issue properly, and does it without locking, because it gets rid of the async_pending flag, which is the thing subject to the race conditions. I guess that the second patch should be applied too, to avoid changes of the item state in a thread unrelated to the item's async context. Still, the question about the thread-safety of libsoup's G*Stream implementations needs to be answered and clarified in the documentation.

Also, a note about reproducibility - I could reproduce it with flatpak using Endless's version of ostree, which seems to be based on some older version of upstream ostree. Upstream ostree already consumes libsoup's streams in the same thread, so it's very likely not affected. For Endless's ostree I filed a PR to work around the bug in libsoup (https://github.com/endlessm/ostree/pull/59).
Review of attachment 346412 [details] [review]:

This looks generally good to me, and passes libsoup’s `make check`.

::: libsoup/soup-session.c
@@ +1607,3 @@
+ data = g_slice_new (MessageCompletedData);
+ data->completion = completion;
+ data->item = item;

Nitpick: I’d move the `soup_message_queue_item_ref()` here (i.e. `data->item = soup_message_queue_item_ref (item);`) to make it more obvious that ownership of the ref is being transferred to the `MessageCompletedData` instance.

@@ +1611,3 @@
+ message_completed_invoke_cb,
+ data);
+ return;

Nitpick: I’d use an `else` and no early `return`, to simplify the control flow slightly.
(In reply to Krzesimir Nowak from comment #27)
> Here, I am attaching a patch that seems to fix the issue for me - it
> basically defers the execution of the code of the message_completed current
> thread is not an owner of the item's async main context. Initially I did as
> Philip suggested - by using g_main_context_invoke on the io->completion_cb,
> but I only could use io->async_context, as io->item is not always available.
> The patch didn't help - I was still getting failures in make check.

If you were to use g_main_context_invoke() on the io->completion_cb, I think the conventional approach would be to use the GMainContext from g_main_context_get_thread_default() called in new_iostate(). The conventional assumption when passing a callback to a function (across a library boundary) is that the callback will be invoked in the GMainContext which is the thread default at the time of initially passing the callback to the library. (I hope that makes sense.) In this case, io->async_context probably is a different GMainContext instance — the call to soup_message_io_client() in soup_message_send_request() (which is how the client-side code enters soup-message-io.c) can pass one of a couple of GMainContexts in.

I just tried to put a patch together to use g_main_context_invoke() with io->completion_cb, and it’s a complete pain to do properly (for get_headers_cb() and parse_headers_cb() too) because the latter two expect to be able to return synchronously. I think your approach is better — we just need to make sure it’s extended to all code which passes a completion_cb into soup-message-io.c.
Krzesimir, is this still relevant? If so, would you mind addressing Philip's review notes?
Ah, I'm not sure, really. I'm not sure what and where the issue was exactly, but it was something wrong on the ostree side, where it made a request in one thread and handled the reply in another (or something like that) using the same object, which led to those issues with race conditions and whatnot. Clearly a scenario libsoup does not support and probably does not intend to support. This issue was fixed in ostree a long time ago, but I tripped over it because the Endless fork didn't have that fix at the time. So, not sure. All that steam for a NOTABUG? :)
(In reply to Krzesimir Nowak from comment #33)
> Ah, I'm not sure, really. I'm not sure what and where the issue was exactly,
> but it was something wrong on the ostree side, where it made a request in
> one thread and handled some reply in other one (or something like that)
> using the same object, which lead to those issues with race conditions and
> whatnot. Clearly a scenario libsoup does not support and probably does not
> intend to support.

But your patch still addresses a potential race in the libsoup code, right? Are you saying that the analysis for that patch is flawed?
(In reply to Philip Withnall from comment #34)
> (In reply to Krzesimir Nowak from comment #33)
> > Ah, I'm not sure, really. I'm not sure what and where the issue was exactly,
> > but it was something wrong on the ostree side, where it made a request in
> > one thread and handled some reply in other one (or something like that)
> > using the same object, which lead to those issues with race conditions and
> > whatnot. Clearly a scenario libsoup does not support and probably does not
> > intend to support.
>
> But your patch still addresses a potential race in the libsoup code, right?
> Are you saying that the analysis for that patch is flawed?

Whether the analysis was flawed or not depends on the answer to the question of whether a GInputStream - or rather libsoup's implementation of GInputStream - can be safely used in a different thread from the one it was created in. Because, IIRC, all this was caused by moving the input stream I received from libsoup to another thread and splicing it to a memory stream there, which in turn closed the input stream, which invoked some libsoup code in a racy way.

But well, if the patch closes some potential race and does not break anything, then it sounds like a win anyway (maybe at the cost of slightly complicating already complicated code). I'll try to allocate some time for the patch soon. Will let you know. I still need to recall whether both patches should be applied or only the last one.
(In reply to Krzesimir Nowak from comment #35)
> Whether the analysis was flawed or not depends on an answer to question
> whether GInputStream can be safely used in a different thread from the one
> it was created in? Or libsoup's implementation of GInputStream. Because,
> IIRC, all this was caused by moving the input stream I received from libsoup
> to another thread and splicing it to a memory stream there, which in turn
> closed the input stream, which invoked some libsoup code in a racy way.

Yes, it is an API guarantee of GInputStream that it is fully thread-safe and can be used on any thread, so long as only one input operation is in progress at a time. There is a note in the GInputStream documentation: "See the documentation for GIOStream for details of thread safety of streaming APIs." And it's explained there:

"""
Operations on GIOStreams cannot be started while another operation on the GIOStream or its substreams is in progress. Specifically, an application can read from the GInputStream and write to the GOutputStream simultaneously (either in separate threads, or as asynchronous operations in the same thread), but an application cannot start any GIOStream operation while there is a GIOStream, GInputStream or GOutputStream operation in progress, and an application can’t start any GInputStream or GOutputStream operation while there is a GIOStream operation in progress.

This is a product of individual stream operations being associated with a given GMainContext (the thread-default context at the time the operation was started), rather than entire streams being associated with a single GMainContext.

GIO may run operations on GIOStreams from other (worker) threads, and this may be exposed to application code in the behaviour of wrapper streams, such as GBufferedInputStream or GTlsConnection. With such wrapper APIs, application code may only run operations on the base (wrapped) stream when the wrapper stream is idle.
Note that the semantics of such operations may not be well-defined due to the state the wrapper stream leaves the base stream in (though they are guaranteed not to crash). """ So creating a GInputStream on thread A and then reading from it on thread B is required to be safe.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/libsoup/issues/100.