GNOME Bugzilla – Bug 778135
Hang when downloading flatpak data
Last modified: 2018-09-21 16:28:54 UTC
There is a corner case leading to a hang since 8e1b0f5a99b1f1236c3d1124cc718e905e5741f1. Basically the situation is:
- the SoupMessageQueueItem has async_pending set to TRUE
- its state is SOUP_MESSAGE_FINISHING
- but somehow no idle_run_queue GSource is scheduled
- the item is first in the queue

The result is that all the items in the queue with the same context as the first item are not processed, so it hangs. I'll attach patches in the followup comments.
Created attachment 344853 [details] [review] Fix a corner case with scheduling queue processing GSource
Created attachment 344854 [details] [review] Remove unused field from queue item
Created attachment 344855 [details] [review] Use G_SOURCE_REMOVE instead of FALSE in idle_run_queue
Created attachment 345023 [details] [review] Fix a corner case with scheduling queue processing GSource

I have noticed that the commit message in the first patch had unwrapped lines; here is the updated one. Otherwise no changes in the code.
Review of attachment 345023 [details] [review]: I don't understand this code well enough to know if this patch is correct, but I have a few comments. You should squash the "Remove unused field from queue item" patch into this one, since the field is not unused until this patch is applied. It is quite a shame that we are making the code significantly messier when we do not yet understand the original problem, which is that item->async_pending somehow got set to TRUE when no async operation was actually pending. It really should never happen. I understand that libsoup can be extremely difficult to debug, though, and you did a good job coming up with a patch to such complex code that fixes the problem, while clearly being careful to avoid thread safety issues. Perhaps a totally incorrect shot in the dark, but I wonder if it could be related to the special handling of CONNECT messages: /* CONNECT messages are handled specially */ if (msg->method == SOUP_METHOD_CONNECT) continue; ::: libsoup/soup-session.c @@ +144,3 @@ + + GMutex queue_datas_lock; + GHashTable *queue_datas; data is already plural, so this should be named "queue_data_lock" and "queue_data". Even if it is a list of multiple IdleRunQueueData objects. @@ +363,1 @@ G_OBJECT_CLASS (soup_session_parent_class)->finalize (object); Style nit: leave a blank line here before chaining up. @@ +2134,3 @@ + if (session) { + remove_idle_run_queue_data (session, queue_data); + g_clear_object (&session); This should use g_object_unref(), since you don't need to set session to NULL and you already know it's not NULL here. @@ +2345,3 @@ + source = g_idle_source_new (); + g_source_set_priority (source, G_PRIORITY_DEFAULT); + g_source_set_callback (source, idle_run_queue, queue_data, idle_run_queue_dnotify); Shouldn't you use soup_add_completion_reffed() for this? Does the g_source_attach() have to be done outside the mutex? I doubt it? If so, why?
Review of attachment 344854 [details] [review]: As Michael says, this should be squashed into the main patch.
Review of attachment 344855 [details] [review]: Nice cleanup.
(In reply to Michael Catanzaro from comment #5) > Review of attachment 345023 [details] [review] [review]: > > I don't understand this code well enough to know if this patch is correct, > but I have a few comments. > > You should squash the "Remove unused field from queue item" patch into this > one, since the field is not unused until this patch is applied. Alright, will do. > > It is quite a shame that we are making the code significantly messier when > we do not yet understand the original problem, which is that > item->async_pending somehow got set to TRUE when no async operation was > actually pending. It really should never happen. I understand that libsoup > can be extremely difficult to debug, though, and you did a good job coming > up with a patch to such complex code that fixes the problem, while clearly > being careful to avoid thread safety issues. > > Perhaps a totally incorrect shot in the dark, but I wonder if it could be > related to the special handling of CONNECT messages: > > /* CONNECT messages are handled specially */ > if (msg->method == SOUP_METHOD_CONNECT) > continue; > Likely not correct, the method is simply "GET". I was rather wondering if it wasn't the following snippet (at soup-session.c:1570'ish): if (item->state != SOUP_MESSAGE_RESTARTING) { item->state = SOUP_MESSAGE_FINISHING; if (item->new_api && !item->async) soup_session_process_queue_item (item->session, item, NULL, TRUE); } > ::: libsoup/soup-session.c > @@ +144,3 @@ > + > + GMutex queue_datas_lock; > + GHashTable *queue_datas; > > data is already plural, so this should be named "queue_data_lock" and > "queue_data". Even if it is a list of multiple IdleRunQueueData objects. > Will do. > @@ +363,1 @@ > G_OBJECT_CLASS (soup_session_parent_class)->finalize (object); > > Style nit: leave a blank line here before chaining up. > Ok. 
> @@ +2134,3 @@
> + if (session) {
> + remove_idle_run_queue_data (session, queue_data);
> + g_clear_object (&session);
>
> This should use g_object_unref(), since you don't need to set session to
> NULL and you already know it's not NULL here.

I kind of wanted to avoid a situation where, in the future, somebody decides to use the session variable after the if (at which point it could be a dangling pointer), risking some hard-to-reproduce bugs. Maybe I should rewrite it so the session variable goes out of scope, to limit the potential for use after free. Probably sounds a bit strained.

> @@ +2345,3 @@
> + source = g_idle_source_new ();
> + g_source_set_priority (source, G_PRIORITY_DEFAULT);
> + g_source_set_callback (source, idle_run_queue, queue_data,
> idle_run_queue_dnotify);
>
> Shouldn't you use soup_add_completion_reffed() for this? Does the
> g_source_attach() have to be done outside the mutex? I doubt it? If so, why?

I guess that one can never be too careful with code running inside a mutex. I had an uneasy feeling about the locking order here with the queue data mutex and the context lock. I was wondering for a short time if there might be a situation where one thread locks the queue data mutex and then tries to attach the source to a context (thus trying to lock the context) while another thread has the context locked and tries to take the queue data lock (somehow). Admittedly, I'm not an expert in the contents of gmain.c, so I may simply be wrong here. Anyway, for peace of mind, I decided to attach the source with no queue data lock held; that shouldn't hurt either. Thanks for the review!
Created attachment 345813 [details] [review] Fix a corner case with scheduling queue processing GSource, v2

Merged the patch removing the unused field into the big one, renamed the queue_datas{,_lock} variables to queue_data{,_lock}, separated the chain-up with an empty line and used g_object_unref instead of g_clear_object. I still attach the source outside the queue data lock.
Review of attachment 345023 [details] [review]:

Here’s my review of the existing patch. However, I have some thoughts about what could be causing this, which would result in a completely different patch. I think we should fix the underlying issue now, while we can reproduce it, rather than hiding it and having it resurface at some point in the future.

---

In order for item->async_pending to be set in the first place, idle_run_queue() must have been scheduled for the item at least once. So it must be the case that either:
1. idle_run_queue() runs and does not process the item; or
2. idle_run_queue() runs and processes the item, but does not complete it and does not re-schedule it.

#1: This could happen if either:
a) item->async changes to !item->async before async_run_queue() is called. This is not possible: it’s a construct-only field.
b) item->async_context does not match the session’s context. See below.
c) The item uses SOUP_METHOD_CONNECT, which has already been discounted.

#2: This could happen whenever the item->state is set to SOUP_MESSAGE_FINISHING without also re-queueing the item:
a) message_completed() (as Krzesimir says, this looks suspicious).
b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff it’s called with !loop. We could check this if we knew the value of item->msg->status_code when it stalls.

I think #1b might be a problem, since the GMainContext used in soup_session_real_kick_queue() is not necessarily the same as the one checked against in async_run_queue(). Patch coming. I also think #2a and #2b are worth investigating.

An easy way to investigate this would be to printf(G_STRFUNC, old_state, new_state) on each change to item->state to work out which code path is causing this.

::: libsoup/soup-session.c
@@ +144,3 @@
+
+ GMutex queue_datas_lock;
+ GHashTable *queue_datas;

(Technically the singular is ‘datum’, but nobody uses that in common usage.)
@@ +266,3 @@ + + g_mutex_init (&priv->queue_datas_lock); + priv->queue_datas = g_hash_table_new (NULL, NULL); Nitpick: Use g_hash_table_new_full() and explicitly set the free functions to NULL, to make it clear what they are (and that they haven’t been forgotten about). @@ +2134,3 @@ + if (session) { + remove_idle_run_queue_data (session, queue_data); + g_clear_object (&session); I agree with Krzesimir here: it’s a bit more future-proof to use g_clear_object(). Otherwise you do have a dangling GObject pointer (session) hanging around, which is messy even if it’s safely not used at the moment. @@ +2341,3 @@ + queue_data = g_slice_new (IdleRunQueueData); + g_weak_ref_init (&queue_data->wref, session); + queue_data->context = context; Don’t you need a strong ref on the GMainContext here? What guarantee do you have that it won’t disappear while in the queue? @@ +2343,3 @@ + queue_data->context = context; + + source = g_idle_source_new (); Call g_source_set_name() too, to make debugging easier. @@ +2345,3 @@ + source = g_idle_source_new (); + g_source_set_priority (source, G_PRIORITY_DEFAULT); + g_source_set_callback (source, idle_run_queue, queue_data, idle_run_queue_dnotify); Agreed, I can’t think of why calling the g_source_attach() inside the mutex would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t think of how any user code could be called with a GMainContext lock held. > I decided to attach the source with no queue data lock held, that shouldn't hurt either. Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove the same context from the queue, then thread 1 to erroneously attach the source for it?
(In reply to Philip Withnall from comment #10) > Review of attachment 345023 [details] [review] [review]: > > Here’s my review of the existing patch. Argh, mid-air collision. Most of my comments apply to the new patch.
(In reply to Philip Withnall from comment #10)
> #2: This could happen whenever the item->state is set to
> SOUP_MESSAGE_FINISHING without also re-queueing the item:
> a) message_completed() (as Krzesimir says, this looks suspicious).
> b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> it’s called with !loop. We could check this if we knew the value of
> item->msg->status_code when it stalls.

Also:
#2:
c) cache_stream_finished() calls soup_session_kick_queue(), but if the item already has item->async_pending set, that will do nothing.
d) cancel_cache_response() calls soup_session_kick_queue() similarly.

They should probably be calling soup_session_process_queue_item() instead? I have not followed the async cache handling code through completely to see if some calling function takes care of this — for the moment it should be enough to work out whether the async cache code touches this item.
(In reply to Philip Withnall from comment #10) > Review of attachment 345023 [details] [review] [review]: > > Here’s my review of the existing patch. However, I have some thoughts about > what could be causing this, which would result in a completely different > patch. I think we should fix the underlying issue now, while we can > reproduce it, rather than hiding it and it resurfacing at some point in the > future. > > --- > > In order for item->async_pending to be set in the first place, > idle_run_queue() must have been scheduled for the item at least once. So it > must be the case that either: > 1. idle_run_queue() runs and does not process the item; or > 2. idle_run_queue() runs and processes the item, but does not complete it > and does not re-schedule it. > > #1: This could happen if either: > a) item->async changes to !item->async before async_run_queue() is called. > This is not possible: it’s a construct-only field. > b) item->async_context does not match the session’s context. See below. 
In my case:
- item->async_context comes from soup_session_get_async_context(item->session) (soup_message_queue_append())
- item->session comes from queue->session (also soup_message_queue_append())
- queue->session comes from soup_message_queue_new(), called from soup_session_init()
- the session is created by the ostree fetcher with soup_session_async_new_with_options(), which also passes the SOUP_SESSION_USE_THREAD_CONTEXT option set to TRUE (so soup_session_get_async_context() always returns the thread-default main context) (ostree_fetcher_session_thread)
- the main context is pushed as a new thread default before the session is created and is iterated later in the function (also ostree_fetcher_session_thread)
- the main context is created by the fetcher (_ostree_fetcher_constructed)
- soup_session_send_async is always invoked indirectly with g_main_context_invoke_full on the fetcher's main context, so it's executed in the thread where the soup requests are processed, so soup_session_get_async_context() returns the expected thread-default main context (that is, the one equal to item->async_context) (session_thread_idle_add in the ostree fetcher)

So, I'm not sure if this case is applicable here.

> c) The item uses SOUP_METHOD_CONNECT, which has already been discounted.
>
> #2: This could happen whenever the item->state is set to
> SOUP_MESSAGE_FINISHING without also re-queueing the item:
> a) message_completed() (as Krzesimir says, this looks suspicious).

OTOH, message_completed calls soup_session_kick_queue for an async item.

> b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> it’s called with !loop. We could check this if we knew the value of
> item->msg->status_code when it stalls.

Hm, the stalled item in my queue had the SOUP_MESSAGE_FINISHING state.

> I think #1b might be a problem, since the GMainContext used in
> soup_session_real_kick_queue() is not necessarily the same as the one
> checked against in async_run_queue().
> Patch coming. I also think #2a and #2b
> are worth investigating.
>
> An easy way to investigate this would be to printf(G_STRFUNC, old_state,
> new_state) on each change to item->state to work out which code path is
> causing this.

I'll try that. Earlier I tried using watchpoints in gdb, but that worked somewhat strangely and not really in a helpful way.

> ::: libsoup/soup-session.c
> @@ +144,3 @@
> +
> + GMutex queue_datas_lock;
> + GHashTable *queue_datas;
>
> (Technically the singular is ‘datum’, but nobody uses that in common usage.)
>
> @@ +266,3 @@
> +
> + g_mutex_init (&priv->queue_datas_lock);
> + priv->queue_datas = g_hash_table_new (NULL, NULL);
>
> Nitpick: Use g_hash_table_new_full() and explicitly set the free functions
> to NULL, to make it clear what they are (and that they haven’t been
> forgotten about).

Ok.

> @@ +2134,3 @@
> + if (session) {
> + remove_idle_run_queue_data (session, queue_data);
> + g_clear_object (&session);
>
> I agree with Krzesimir here: it’s a bit more future-proof to use
> g_clear_object(). Otherwise you do have a dangling GObject pointer (session)
> hanging around, which is messy even if it’s safely not used at the moment.

Ah, cool, I'll revert my change then.

> @@ +2341,3 @@
> + queue_data = g_slice_new (IdleRunQueueData);
> + g_weak_ref_init (&queue_data->wref, session);
> + queue_data->context = context;
>
> Don’t you need a strong ref on the GMainContext here? What guarantee do you
> have that it won’t disappear while in the queue?

When the GMainContext goes away, all the sources attached to it are destroyed. And our GSource removes the context from the queue data in the destroy notify function.

> @@ +2343,3 @@
> + queue_data->context = context;
> +
> + source = g_idle_source_new ();
>
> Call g_source_set_name() too, to make debugging easier.

Alright. What would the name be? "Soup idle queue run for context %p"? Should it have the address of the SoupSession in the name too?
> @@ +2345,3 @@ > + source = g_idle_source_new (); > + g_source_set_priority (source, G_PRIORITY_DEFAULT); > + g_source_set_callback (source, idle_run_queue, queue_data, > idle_run_queue_dnotify); > > Agreed, I can’t think of why calling the g_source_attach() inside the mutex > would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, > finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t > think of how any user code could be called with a GMainContext lock held. > > > I decided to attach the source with no queue data lock held, that shouldn't hurt either. > > Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove > the same context from the queue, then thread 1 to erroneously attach the > source for it? At this point - no, because the context is removed only in the GSource's destroy notify function and only if the IdleRunQueueData in the hash table is the same as the one associated to the GSource being destroyed. And here we already put our own, new IdleRunQueueData which is associated with the GSource we will attach. But anyway, if you think it is safe to attach the GSource inside the queue data lock then I'll just use the soup_add_completion_reffed().
(In reply to Philip Withnall from comment #12)
> (In reply to Philip Withnall from comment #10)
> > #2: This could happen whenever the item->state is set to
> > SOUP_MESSAGE_FINISHING without also re-queueing the item:
> > a) message_completed() (as Krzesimir says, this looks suspicious).
> > b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff
> > it’s called with !loop. We could check this if we knew the value of
> > item->msg->status_code when it stalls.
>
> Also:
> #2:
> c) cache_stream_finished() calls soup_session_kick_queue(), but if the item
> already has item->async_pending set, that will do nothing.

cache_stream_finished() is set as a signal callback in async_return_from_cache, which is called from two places:
- conditional_get_ready_cb, which is set as a callback in async_respond_from_cache, which is called only in soup_session_send_async. But the async_respond_from_cache function always returns false for me, so this path is not taken.
- idle_return_from_cache_cb, same story as above.

> d) cancel_cache_response() calls soup_session_kick_queue() similarly.

cancel_cache_response is called in two places:
- conditional_get_ready_cb, which is set as a callback in async_respond_from_cache, in the path not taken
- idle_return_from_cache_cb - same as above.

> They should probably be calling soup_session_process_queue_item() instead? I
> have not followed the async cache handling code through completely to see if
> some calling function takes care of this — for the moment it should be
> enough to work out whether the async cache code touches this item.

No idea if they should be calling soup_session_process_queue_item. But as noted above, we are not taking these code paths.
(In reply to Krzesimir Nowak from comment #13) > (In reply to Philip Withnall from comment #10) > > Review of attachment 345023 [details] [review] [review] [review]: > > > > Here’s my review of the existing patch. However, I have some thoughts about > > what could be causing this, which would result in a completely different > > patch. I think we should fix the underlying issue now, while we can > > reproduce it, rather than hiding it and it resurfacing at some point in the > > future. > > > > --- > > > > In order for item->async_pending to be set in the first place, > > idle_run_queue() must have been scheduled for the item at least once. So it > > must be the case that either: > > 1. idle_run_queue() runs and does not process the item; or > > 2. idle_run_queue() runs and processes the item, but does not complete it > > and does not re-schedule it. > > > > #1: This could happen if either: > > a) item->async changes to !item->async before async_run_queue() is called. > > This is not possible: it’s a construct-only field. > > b) item->async_context does not match the session’s context. See below. 
> > In my case: > - item->async_context comes from > soup_session_get_async_context(item->session) (soup_message_queue_append()) > - item->session comes from queue->session (also soup_message_queue_append()) > - queue->session comes from soup_message_queue_new() called from > soup_session_init() > - the session is created by the ostree fetcher with > soup_session_async_new_with_options() which also passes > SOUP_SESSION_USE_THREAD_CONTEXT option set as TRUE (so > soup_session_get_async_context() always returns the thread default main > context) (ostree_fetcher_session_thread) > - the main context is pushed as a new thread default before the session is > created and is iterated later in the function (also > ostree_fetcher_session_thread) > - the main context is created by the fetcher (_ostree_fetcher_constructed) > - the soup_session_send_async is always invoked indirectly with > g_main_context_invoke_full on the fetcher's main context, so it's executed > in the thread where the soup requests are processed, so > soup_session_get_async_context() returns the expected thread default main > context (that is the one equal to item->async_context) > (session_thread_idle_add in ostree fetcher) > > So, I'm not sure if this case is applicable here. OK. It still looks like there’s the potential for bugs here in other cases. > > c) The item uses SOUP_METHOD_CONNECT, which has already been discounted. > > > > #2: This could happen whenever the item->state is set to > > SOUP_MESSAGE_FINISHING without also re-queueing the item: > > a) message_completed() (as Krzesimir says, this looks suspicious). > > OTOH, the message_completed calls soup_session_kick_queue for an async item. Ah, yes. I discounted that because it’s before the item’s state is changed to SOUP_MESSAGE_FINISHING — but of course that doesn’t matter, because it’s queuing idle sources, not processing synchronously. > > b) SOUP_MESSAGE_READY handling in soup_session_process_queue_item(), iff > > it’s called with !loop. 
> > We could check this if we knew the value of
> > item->msg->status_code when it stalls.
>
> Hm, the stalled item in my queue had the SOUP_MESSAGE_FINISHING state.

Yes, the code which handles SOUP_MESSAGE_READY (around line 1980) potentially sets the item’s state to SOUP_MESSAGE_FINISHING and then might do nothing else:

if (item->msg->status_code == SOUP_STATUS_TRY_AGAIN) {
        soup_message_cleanup_response (item->msg);
        item->state = SOUP_MESSAGE_STARTING;
} else
        item->state = SOUP_MESSAGE_FINISHING;
break;

So I think there’s the potential for a bug here? It depends on whether soup_session_process_queue_item() is called with !loop.

> > I think #1b might be a problem, since the GMainContext used in
> > soup_session_real_kick_queue() is not necessarily the same as the one
> > checked against in async_run_queue(). Patch coming. I also think #2a and #2b
> > are worth investigating.
> >
> > An easy way to investigate this would be to printf(G_STRFUNC, old_state,
> > new_state) on each change to item->state to work out which code path is
> > causing this.
>
> I'll try that. Earlier I tried using watchpoints in gdb, but that worked
> somewhat strange and not really in a helpful way.

You’d have to find the right SoupMessageQueueItem then set up the watchpoint, which would be fiddly. I think it’s a lot easier to log all state changes for all items, then filter the logs afterwards for the item you’re interested in.

> > @@ +2341,3 @@
> > + queue_data = g_slice_new (IdleRunQueueData);
> > + g_weak_ref_init (&queue_data->wref, session);
> > + queue_data->context = context;
> >
> > Don’t you need a strong ref on the GMainContext here? What guarantee do you
> > have that it won’t disappear while in the queue?
>
> When GMainContext goes away, all the sources attached to it are destroyed.
> And our GSource removes the context from the queue data in the destroy
> notify function.

Of course, sorry.
> > @@ +2343,3 @@ > > + queue_data->context = context; > > + > > + source = g_idle_source_new (); > > > > Call g_source_set_name() too, to make debugging easier. > > > > Alright. What the name would it be? "Soup idle queue run for context %p"? > Should it have an address of the SoupSession in the name too? I don’t think it’s worth a g_strdup_printf() call. “Soup idle queue runner” would be fine, I think. The context can always be figured out from which thread it’s running in. > > @@ +2345,3 @@ > > + source = g_idle_source_new (); > > + g_source_set_priority (source, G_PRIORITY_DEFAULT); > > + g_source_set_callback (source, idle_run_queue, queue_data, > > idle_run_queue_dnotify); > > > > Agreed, I can’t think of why calling the g_source_attach() inside the mutex > > would cause any harm. All the vfuncs in a GSource (prepare, check, dispatch, > > finalize) are called by gmain.c with the GMainContext lock unlocked. I can’t > > think of how any user code could be called with a GMainContext lock held. > > > > > I decided to attach the source with no queue data lock held, that shouldn't hurt either. > > > > Wouldn’t it be possible for thread 1 to unlock here, then thread 2 to remove > > the same context from the queue, then thread 1 to erroneously attach the > > source for it? > > At this point - no, because the context is removed only in the GSource's > destroy notify function and only if the IdleRunQueueData in the hash table > is the same as the one associated to the GSource being destroyed. And here > we already put our own, new IdleRunQueueData which is associated with the > GSource we will attach. Right. > But anyway, if you think it is safe to attach the > GSource inside the queue data lock then I'll just use the > soup_add_completion_reffed().
Created attachment 345879 [details] [review] Alternative patch probably fixing the same issue, tons simpler

I did logging like Philip suggested and got an interesting nugget at the end:

===============
function: get_connection line: 1945 old: SOUP_MESSAGE_STARTING new: SOUP_MESSAGE_CONNECTING async pending: false
function: connect_complete line: 1750 old: SOUP_MESSAGE_CONNECTING new: SOUP_MESSAGE_CONNECTED async pending: false
function: soup_session_process_queue_item line: 1985 old: SOUP_MESSAGE_CONNECTED new: SOUP_MESSAGE_READY async pending: false
function: soup_session_process_queue_item line: 2003 old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false
function: soup_session_real_requeue_message line: 2209 old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_RESTARTING async pending: false
function: soup_session_process_queue_item line: 2029 old: SOUP_MESSAGE_RESTARTING new: SOUP_MESSAGE_STARTING async pending: false
function: get_connection line: 1939 old: SOUP_MESSAGE_STARTING new: SOUP_MESSAGE_READY async pending: false
function: soup_session_process_queue_item line: 2003 old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false
function: message_completed line: 1577 old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_FINISHING async pending: true
===============

In the last entry async_pending is set to true. Looking at the code in the message_completed function makes this obvious - we call soup_session_kick_queue(), which sets async_pending to TRUE for all the async items in the queue. So I experimentally moved the queue kick to after the item's state is changed, and that apparently seemed to work. I'm attaching a patch with the change. It will hopefully obsolete the complex one. It's quite late now and I haven't yet understood why this started working. Tomorrow I will look at it again, test it more and write the proper commit message.
I’m not sure how this can make things work. If I’m understanding the patch correctly, it essentially changes:

soup_session_kick_queue (item->session);
item->state = SOUP_MESSAGE_FINISHING;

into

item->state = SOUP_MESSAGE_FINISHING;
soup_session_kick_queue (item->session);

(after eliminating all paths which we know aren’t taken). And since soup_session_real_kick_queue() doesn’t query the item->state, and shouldn’t cause idle_run_queue() to be called until after message_completed() returns anyway, I don’t see how this can make a difference.

Given that item->async_pending is changing from FALSE to TRUE on this call to soup_session_kick_queue(), that means that either idle_run_queue() is newly scheduled, or another item in the same context already has item->async_pending set.

Can you add some debug printf()s to idle_run_queue() and async_run_queue() to see when they’re both called, and which items are seen and have soup_session_process_queue_item() called on them by async_run_queue()?
I did some debugging and found a culprit. Basically a race condition here, but funnily 100% reproducible? Will post the details together with the patch.
Hmph, looks like the fix is not very obvious. So here is the log of what happens, together with my comments:

function: soup_session_process_queue_item line: 2005 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock old: SOUP_MESSAGE_READY new: SOUP_MESSAGE_RUNNING async pending: false

Here we changed the state to RUNNING.

function: async_run_queue line: 2102 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state after processing: SOUP_MESSAGE_RUNNING

function: soup_message_io_finished line: 183 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock completion: complete

After a while we can see that downloading has finished. Note that this happens in a different thread. This function calls a callback which, in this case, is the message_completed function.

function: soup_session_real_kick_queue line: 2352 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock scheduled the gsource

The first thing we do in the message_completed function is kick the queue. The idle source for running the queue processing has just been scheduled.

function: async_run_queue line: 2086 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock method is connect: false item async: true same context: true item context: 0x562b6de5a3f0 session context: 0x562b6de5a3f0

And bang, the previous thread got preempted (?) and we started processing the queue. Dunno about the preemption - it always happens. Probably it tries to lock something that's already locked?
function: async_run_queue line: 2096 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to FALSE

We set async pending to FALSE (it was FALSE anyway…).

function: async_run_queue line: 2099 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state before processing: SOUP_MESSAGE_RUNNING

function: async_run_queue line: 2102 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock state after processing: SOUP_MESSAGE_RUNNING

As we can see, absolutely nothing happens - we are still in the RUNNING state. The processing is short-circuited by:

if (item->async)
        return;

function: soup_session_real_kick_queue line: 2362 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to TRUE

And we are back in the thread that was preempted; at this point the idle GSource callback has already been called, and we set async pending to TRUE.

function: message_completed line: 1579 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock old: SOUP_MESSAGE_RUNNING new: SOUP_MESSAGE_FINISHING async pending: true

And here we change the state to FINISHING. It's too late.

function: soup_message_io_finished line: 188 thread: 0x562b6db29150 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock completion done

The following calls to soup_session_real_kick_queue will show that async_pending is set to TRUE, despite no idle GSource being scheduled.
function: soup_session_real_kick_queue line: 2357 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock already pending

function: soup_session_real_kick_queue line: 2362 thread: 0x562b6db31140 uri: https://origin.ostree.endlessm.com/staging/dev/eos-apps/deltas/4L/9HinwITxx6zAv2iPXgkb+0pHRmFm5Rt3UzosB8hiM/superblock setting async pending to TRUE

So, my idea to fix it was basically to schedule an idle gsource that executes the same code as in the message_completed function if item->async is TRUE. But now I'm getting:

error: While pulling app/com.google.AndroidStudio/x86_64/eos3 from remote eos-apps: Server returned status 0: Unknown Error

Meh.
Created attachment 345959 [details] [review] Another alternative patch, this time with idle GSource.
Drive-by thoughts:

We’ve got calls to soup_session_real_kick_queue() from both threads, which doesn’t look right. SoupSession doesn’t use any locking for message processing (just connection handling), so which thread is *supposed* to be calling soup_session_real_kick_queue()? Is the code missing another soup_add_completion_reffed() call somewhere?

Was this trace recorded with attachment #345959 [details] applied, or with git master?
Sorry for not being clear with the logs. The logs come from the current master branch (plus the patch that adds the logs), without the fix. The patch I added here causes soup_session_real_kick_queue() to be invoked in the thread where item->context is iterated. But now soup_session_send_async seems to return a reply with an HTTP status of 0.

Here is the repo with my fork of libsoup I'm working on now - https://github.com/kinvolk/libsoup. Currently, the tracing and the not-fully-working fix are on the krnowak/tracing-msg-state-changes branch.
Some more findings (I don't even know where to begin):

- So far, the problems I had were with downloading the superblock. With the patch that basically moves the message_completed code into the idle GSource, the superblock downloads just fine and nothing hangs. The "error: While pulling app/com.google.AndroidStudio/x86_64/eos3 from remote eos-apps: Server returned status 0: Unknown Error" error happens later, for a different file (just some .filez).

- My "fix" patch breaks libsoup's "make check".

- Currently in my "fix" patch, the message_completed() function will defer kicking the queue and changing the message state if the item is async. Otherwise it will do both immediately. I added a hack that defers the actions only when the item is async AND the current thread is different from the thread where the session's async main context is iterated (well, actually different from the thread where the session was created, but I conflated those two for now; it's a hack anyway). Things seem to work again and libsoup's "make check" started to pass.

- Apparently there is an assumption that message_completed should sometimes run synchronously. Where this assumption is, I haven't yet pinpointed. Also, I haven't figured out why the message's status is still zero, but I have a feeling that it's some race condition.

- I was wondering what that second thread (the 0x562b6db29150 one) was that kept showing up in the logs and messing things up. I started flatpak again in gdb and set two breakpoints inside the message_completed function - one for when the item is async and the threads differ, and one for when the item is also async but the threads are the same. The backtrace for the same-threads case (so kicking the queue and changing the status was not deferred):
+ Trace 237155
The other thread is apparently the main thread. The message_completed function is called as a result of splicing the input stream returned by soup_session_send_finish to the memory output stream. The splicing closes the input stream, which in turn calls the message_completed function. So the input stream (SoupClientInputStream) is not thread-safe when my fix is not applied.

- Not sure what the proper fix for this case is. My first idea was to add another flag to SoupMessageQueueItem like "guint no_defer : 1" and set it to TRUE before calling g_input_stream_close in the send_async_spliced function. Then change the condition in message_completed to "if (item->async && !item->no_defer) { setup_gsource(); return; }". I haven't written it and haven't tested it, so it's all a wild guess. I'll continue the research on Monday.
Ugh, two backtraces were merged into one. Please view the raw version too, then.
message_completed() is being called as the completion_cb from soup-message-io.c, which also stores a `GMainContext *async_context`. The fact that soup_message_io_finished() can be called from anywhere (it’s non-static) and will directly call completion_cb() looks a bit fishy to me. Perhaps a better fix would be to add the g_main_context_invoke() call there (in soup_message_io_finished()) rather than in message_completed()? Looks to me like that would fix a more general problem. See also: get_headers_cb and parse_headers_cb in SoupMessageIOData.

Have you tried running libsoup with tsan or under Valgrind’s drd or helgrind? That might be interesting. (Running it under tsan would be faster than Valgrind, but means recompiling the whole stack from libsoup up to flatpak to use tsan.)
(In reply to Krzesimir Nowak from comment #19)
> Hmph, looks like the fix is not very obvious. So here is the log of what
> happens together with my comments:
> ...
> And we are back to the thread that was preempted, at this point the idle
> gsource callback was already called and we set the async pending to TRUE.

Right. That seems like the entire problem to me; you need to make sure that the idle_run_queue() calls queued up by soup_session_real_kick_queue() don't run until after soup_session_real_kick_queue() is done. Or at least, that the body of async_run_queue() doesn't run until soup_session_real_kick_queue() is done. So we need a mutex to protect against that.

(Actually, while you need to keep async_run_queue() and soup_session_real_kick_queue() from running at the same time, there's no problem with running async_run_queue() in two different threads simultaneously, since they'll operate on non-overlapping queue items. So you could use a GRWLock instead of a GMutex, with async_run_queue() getting a read lock, and soup_session_real_kick_queue() getting a write lock. I think.)
I didn't expect putting a GRWLock in to work, and indeed it deadlocked - see the backtrace (the reader lock was taken in async_run_queue, frame 18):
+ Trace 237162
I quickly tried to work around it by deferring the kicking with the idle GSource, but I got a lot of failures in make check, so I didn't try further.

I also tried running flatpak under valgrind's helgrind and drd, but I got something like 500k lines in the log files. Looking at some of the reports, it seems that GLib's locking utilities are somehow transparent(?) to helgrind. I tried to add suppressions, but after a while I concluded that there was just too much noise, so I gave up - it takes too much time.

Here, I am attaching a patch that seems to fix the issue for me - it basically defers the execution of the message_completed code when the current thread is not the owner of the item's async main context. Initially I did as Philip suggested - using g_main_context_invoke on the io->completion_cb - but I could only use io->async_context, as io->item is not always available. That patch didn't help - I was still getting failures in make check.

There is also a question about the thread-safety of the GInputStream object returned by soup_session_send_finish. The situation I have is that flatpak has two threads - the main thread and the fetcher thread. In the fetcher thread soup_session_send_async is called, so we eventually get the response body stream, which is then moved to the main thread and spliced to the memory buffer. Splicing closes the input stream, which results in the mess described in comment 19 (it eventually calls the soup_message_io_finished function). Is this a valid use of soup's input stream? If so, we have a bug in libsoup. Otherwise we have a bug in ostree and possibly a documentation bug in libsoup. I talked to Philip on IRC and he leaned toward the bug being in ostree (as I understood it). What do you think?
Created attachment 346412 [details] [review] Execute message_complete in the proper thread I actually forgot to add the patch. Also, for the code with GRWLock, please see https://github.com/kinvolk/libsoup/commit/65d90622ed9c9704b9f76abf26c39546150d47bd
So, I have a feeling that if the real culprit is that async_run_queue and soup_session_real_kick_queue shouldn't be running in parallel, then my first patch (https://bugzilla.gnome.org/attachment.cgi?id=345813) actually solves the issue properly, and does it without locking, because it gets rid of the async_pending flag, which is the thing subject to the race conditions. I guess that the second patch should be applied too, to avoid changes of the item state in a thread unrelated to the item's async context. Still, the question about the thread-safety of libsoup's G*Stream implementations needs to be answered and clarified in the documentation.

Also, a note about reproducibility - I could reproduce it with flatpak using Endless's version of ostree, which seems to be based on some older version of upstream ostree. Upstream ostree already consumes libsoup's streams in the same thread, so it's very likely not affected. For Endless's ostree I filed a PR to work around the bug in libsoup (https://github.com/endlessm/ostree/pull/59).
Review of attachment 346412 [details] [review]:

This looks generally good to me, and passes libsoup’s `make check`.

::: libsoup/soup-session.c
@@ +1607,3 @@
+ data = g_slice_new (MessageCompletedData);
+ data->completion = completion;
+ data->item = item;

Nitpick: I’d move the `soup_message_queue_item_ref()` here (i.e. `data->item = soup_message_queue_item_ref (item);`) to make it more obvious that ownership of the ref is being transferred to the `MessageCompletedData` instance.

@@ +1611,3 @@
+ message_completed_invoke_cb,
+ data);
+ return;

Nitpick: I’d use an `else` and no early `return`, to simplify the control flow slightly.
(In reply to Krzesimir Nowak from comment #27)
> Here, I am attaching a patch that seems to fix the issue for me - it
> basically defers the execution of the code of the message_completed current
> thread is not an owner of the item's async main context. Initially I did as
> Philip suggested - by using g_main_context_invoke on the io->completion_cb,
> but I only could use io->async_context, as io->item is not always available.
> The patch didn't help - I was still getting failures in make check.

If you were to use g_main_context_invoke() on the io->completion_cb, I think the conventional approach would be to use the GMainContext from g_main_context_get_thread_default() called in new_iostate(). The conventional assumption when passing a callback to a function (across a library boundary) is that the callback will be invoked in the GMainContext which is the thread default at the time of initially passing the callback to the library. (I hope that makes sense.) In this case, io->async_context probably is a different GMainContext instance — the call to soup_message_io_client() in soup_message_send_request() (which is how the client-side code enters soup-message-io.c) can pass one of a couple of GMainContexts in.

I just tried to put a patch together to use g_main_context_invoke() with io->completion_cb, and it’s a complete pain to do properly (for get_headers_cb() and parse_headers_cb() too) because the latter two expect to be able to return synchronously. I think your approach is better — we just need to make sure it’s extended to all code which passes a completion_cb into soup-message-io.c.
Krzesimir, is this still relevant? If so, would you mind addressing Philip's review notes?
Ah, I'm not sure, really. I'm not sure what and where the issue was exactly, but it was something wrong on the ostree side, where it made a request in one thread and handled the reply in another (or something like that) using the same object, which led to those issues with race conditions and whatnot. Clearly a scenario libsoup does not support and probably does not intend to support. This issue was fixed in ostree a long time ago, but I tripped over it because the Endless fork didn't have that fix at the time. So, not sure. All that steam for a NOTABUG? :)
(In reply to Krzesimir Nowak from comment #33)
> Ah, I'm not sure, really. I'm not sure what and where the issue was exactly,
> but it was something wrong on the ostree side, where it made a request in
> one thread and handled some reply in other one (or something like that)
> using the same object, which lead to those issues with race conditions and
> whatnot. Clearly a scenario libsoup does not support and probably does not
> intend to support.

But your patch still addresses a potential race in the libsoup code, right? Are you saying that the analysis for that patch is flawed?
(In reply to Philip Withnall from comment #34)
> (In reply to Krzesimir Nowak from comment #33)
> > Ah, I'm not sure, really. I'm not sure what and where the issue was exactly,
> > but it was something wrong on the ostree side, where it made a request in
> > one thread and handled some reply in other one (or something like that)
> > using the same object, which lead to those issues with race conditions and
> > whatnot. Clearly a scenario libsoup does not support and probably does not
> > intend to support.
>
> But your patch still addresses a potential race in the libsoup code, right?
> Are you saying that the analysis for that patch is flawed?

Whether the analysis was flawed or not depends on the answer to the question of whether a GInputStream - or rather libsoup's implementation of GInputStream - can be safely used in a different thread from the one it was created in. Because, IIRC, all this was caused by moving the input stream I received from libsoup to another thread and splicing it to a memory stream there, which in turn closed the input stream, which invoked some libsoup code in a racy way.

But well, if the patch closes some potential race and does not break anything, then it sounds like a win anyway (maybe at the cost of slightly complicating already complicated code). I'll try to allocate some time for the patch soon. Will let you know. I still need to recall whether both patches should be applied or only the last one.
(In reply to Krzesimir Nowak from comment #35)
> Whether the analysis was flawed or not depends on an answer to question
> whether GInputStream can be safely used in a different thread from the one
> it was created in? Or libsoup's implementation of GInputStream. Because,
> IIRC, all this was caused by moving the input stream I received from libsoup
> to another thread and splicing it to a memory stream there, which in turn
> closed the input stream, which invoked some libsoup code in a racy way.

Yes, it is an API guarantee of GInputStream that it is fully thread-safe and can be used on any thread, so long as only one input operation is in progress at a time. There is a note in the GInputStream documentation: "See the documentation for GIOStream for details of thread safety of streaming APIs." And it's explained there:

"""
Operations on GIOStreams cannot be started while another operation on the GIOStream or its substreams is in progress. Specifically, an application can read from the GInputStream and write to the GOutputStream simultaneously (either in separate threads, or as asynchronous operations in the same thread), but an application cannot start any GIOStream operation while there is a GIOStream, GInputStream or GOutputStream operation in progress, and an application can’t start any GInputStream or GOutputStream operation while there is a GIOStream operation in progress.

This is a product of individual stream operations being associated with a given GMainContext (the thread-default context at the time the operation was started), rather than entire streams being associated with a single GMainContext.

GIO may run operations on GIOStreams from other (worker) threads, and this may be exposed to application code in the behaviour of wrapper streams, such as GBufferedInputStream or GTlsConnection. With such wrapper APIs, application code may only run operations on the base (wrapped) stream when the wrapper stream is idle.
Note that the semantics of such operations may not be well-defined due to the state the wrapper stream leaves the base stream in (though they are guaranteed not to crash). """ So creating a GInputStream on thread A and then reading from it on thread B is required to be safe.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/libsoup/issues/100.