After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 768567 - session: Clean up handling of run queue source
session: Clean up handling of run queue source
Status: RESOLVED OBSOLETE
Product: libsoup
Classification: Core
Component: Misc
unspecified
Other All
: Normal normal
: ---
Assigned To: libsoup-maint@gnome.bugs
libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2016-07-08 13:09 UTC by Colin Walters
Modified: 2018-09-21 16:26 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
session: Clean up handling of run queue source (3.63 KB, patch)
2016-07-08 13:09 UTC, Colin Walters
none Details | Review
session: Clean up handling of run queue source (4.28 KB, patch)
2016-07-08 13:42 UTC, Colin Walters
none Details | Review
soup-misc: add a GDestroyNotify to soup_add_completion_reffed() (3.18 KB, patch)
2016-07-14 20:56 UTC, Dan Winship
committed Details | Review
soup-session: fix idle_run_queue() source handling (6.40 KB, patch)
2016-07-14 20:56 UTC, Dan Winship
committed Details | Review

Description Colin Walters 2016-07-08 13:09:25 UTC
We're seeing an occasional crash in ostree:
https://github.com/ostreedev/ostree/issues/373

This patch appears to fix it.  I can't say that I conclusively
know the exact bug, but there are a few things this patch cleans up.

First, if `self->disposed` is true, the source will be destroyed, but
remained on the list.  Fix that by moving down the `if
(self->disposed)` conditional.

Another thing I observed then is that actually, this list can have
at most one element in the current codebase.  So then we can just
store a `GSource*` reference in `priv`.

Then, we no longer need to search through all sources on the main
context for our `priv` pointer, we can just reference the pointer.
This is a very common pattern.

Now, we don't even need to call
`g_main_context_find_source_by_user_data()` or
`g_main_current_source()`.
Comment 1 Colin Walters 2016-07-08 13:09:33 UTC
Created attachment 331079 [details] [review]
session: Clean up handling of run queue source
Comment 2 Colin Walters 2016-07-08 13:32:01 UTC
This breaks tests/context-test.c =/  I see that each SoupMessageQueueItem may potentially have a different context.
Comment 3 Colin Walters 2016-07-08 13:42:25 UTC
Created attachment 331081 [details] [review]
session: Clean up handling of run queue source

We're seeing an occasional crash in ostree:
https://github.com/ostreedev/ostree/issues/373

This patch appears to fix it.  I can't say that I conclusively
know the exact bug, but there are a few things this patch cleans up.

First, if `self->disposed` is true, the source will be destroyed, but
remained on the list.  Fix that by moving down the `if
(self->disposed)` conditional.

Also, searching through all sources on the main context for our `priv`
pointer is ugly...it's more efficient for us to maintain a mapping,
and also ensure that mapping holds a reference to the main context
too.
Comment 4 Colin Walters 2016-07-08 13:42:46 UTC
Fixed that by having an explicit map.
Comment 5 Dan Winship 2016-07-11 21:30:29 UTC
Comment on attachment 331081 [details] [review]
session: Clean up handling of run queue source

>Also, searching through all sources on the main context for our `priv`
>pointer is ugly...it's more efficient for us to maintain a mapping,
>and also ensure that mapping holds a reference to the main context
>too.

Contexts hold refs on their sources, so sources can't hold refs on their contexts too or you'll end up leaking things. That's where all the ugliness in this code comes from. With your patch, if you finish with a GMainContext and unref it (and stop iterating it), but libsoup has queued a run_queue source on it, then the GMainContext (and everything attached to it) won't actually get freed until the session is destroyed. If you only ever create one SoupSession, and then repeatedly use GMainContexts only once [as WebKitGTK used to do and maybe still does] then you'll keep leaking more and more.


The current code is clearly not thread-safe though... idle_run_queue() could start running in one thread and get past the "if (priv->disposed)" check, and then the session could get disposed in another thread.
Comment 6 Dan Winship 2016-07-11 21:32:13 UTC
looking at https://github.com/ostreedev/ostree/issues/373#issuecomment-231341349 and trying to figure out how we could get there, I notice that there's no mutex protecting run_queue_sources, which is obviously no good.
Comment 7 Colin Walters 2016-07-12 02:56:58 UTC
(In reply to Dan Winship from comment #5)
> If you only ever create
> one SoupSession, and then repeatedly use GMainContexts only once [as
> WebKitGTK used to do and maybe still does] then you'll keep leaking more and
> more.

That sounds like a strange thing to do.

> The current code is clearly not thread-safe though... idle_run_queue() could
> start running in one thread and get past the "if (priv->disposed)" check,
> and then the session could get disposed in another thread.

That would indeed explain why the ostree change to predictably dispose it in one thread fixed the bug.

Anyways so there may be no immediate bug here, but in general I find it quite ugly how we currently search for a source via user data.  While it's obviously unlikely, what if someone was using GUINT_TO_POINTER() for their user data and it happened to be the same value as the priv pointer?

If webkit isn't any longer doing that weird multiple-context-for-single-session thing, I think it'd argue for this patch + a mutex?
Comment 8 Dan Winship 2016-07-14 14:18:01 UTC
(In reply to Colin Walters from comment #7)
> (In reply to Dan Winship from comment #5)
> > If you only ever create
> > one SoupSession, and then repeatedly use GMainContexts only once [as
> > WebKitGTK used to do and maybe still does] then you'll keep leaking more and
> > more.
> 
> That sounds like a strange thing to do.

XMLHttpRequest lets you send sync requests, which (per the spec) must block the completion of any pending async requests (and there are sites that will fail completely if you don't do it this way). Since it didn't used to be possible to mix sync and async requests on the same session, WebKit implemented this by creating a new GMainContext in the main thread, queueing an async request on that context, running that context until the request completed, and then destroying the new context.

> Anyways so there may be no immediate bug here, but in general I find it
> quite ugly how we currently search for a source via user data.  While it's
> obviously unlikely, what if someone was using GUINT_TO_POINTER() for their
> user data and it happened to be the same value as the priv pointer?

Well, that argument applies just as well to any other use of any find-by-user-data API (eg, g_signal_handlers_disconnect_by_data())...

> If webkit isn't any longer doing that weird
> multiple-context-for-single-session thing, I think it'd argue for this patch
> + a mutex?

Creating the reference counting loop is sketchy regardless of what WebKit is currently doing.


So, summing up the problems:

  1. priv->run_queue_sources is not handled thread-safely
  2. session disposal is not thread-safe wrt pending run_queue_sources
  3. g_main_context_find_source_by_user_data() is debatably a bad idea,
     but at any rate, it's dumb for us to be using it anyway since we
     could just scan priv->run_queue_sources, which is almost certainly
     going to be much smaller than context->sources
Comment 9 Dan Winship 2016-07-14 15:46:30 UTC
(working on a patch...)
Comment 10 Dan Winship 2016-07-14 20:56:51 UTC
Created attachment 331550 [details] [review]
soup-misc: add a GDestroyNotify to soup_add_completion_reffed()
Comment 11 Dan Winship 2016-07-14 20:56:59 UTC
Created attachment 331551 [details] [review]
soup-session: fix idle_run_queue() source handling

priv->run_queue_sources was un-thread-safe in multiple ways. Now:

1. We pass idle_run_queue() a GWeakRef to the SoupSession, rather than
   the SoupSession itself. So now it's not possible for the session to
   get disposed in another thread while idle_run_queue() is running,
   and it's safe to let the callback get run even after the session is
   freed.

2. The idle_run_queue() source is given a GDestroyNotify, so it can
   clean up the weakref even if the source is destroyed behind
   the session's back.

3. Since we no longer have to forcibly destroy the sources when the
   session is destroyed, and we don't have to manually clean up after
   them if they don't get run, we no longer have to explicitly
   remember the sources after creating them, and so we don't even need
   priv->run_queue_sources. However, we do still want to make sure
   that there's only ever one pending idle_run_queue() source per
   context (so we don't keep adding more and more if the context isn't
   currently being run), so we add a new "async_pending" field to
   SoupMessageQueueItem to keep track of that.
Comment 12 Dan Winship 2016-07-14 20:57:51 UTC
does that fix it for you?
Comment 13 Bastien Nocera 2016-07-18 15:09:03 UTC
I still get stuck fetches with those 2 patches on top of master.


With that running:
chip     12588  0.0  0.3   2620  1900 pts/0    S+   14:36   0:00 /bin/bash /home/chip/ostree/tests/test-remote-gpg-import.sh -k --tap
chip     12610  0.0  0.7  21624  3904 pts/0    Sl+  14:36   0:00 ostree trivial-httpd --autoexit --daemonize -p /var/tmp/tap-test.GdwdvX/httpd-port
chip     13058  0.0  1.9  27300  9924 pts/0    Sl+  14:36   0:00 ostree --repo=repo pull R2:main

(test-remote-gpg-import.sh, test-commit-sign.sh test-pull-archive-z.sh all hang that way, but killing the tests allows the test suite to carry on)
Comment 14 Dan Winship 2016-07-18 15:23:08 UTC
hm... no mutexes in that trace though. it looks like we're failing to add a source in some cases, so it just ends up sitting in poll() forever rather than starting/continuing the request
Comment 15 Simon McVittie 2016-11-30 15:39:49 UTC
Review of attachment 331551 [details] [review]:

::: libsoup/soup-message-queue.h
@@ +53,3 @@
 	guint priority          : 3;
+	guint resend_count      : 5;
+	guint _unused           : 18;

Is this a public struct? You've used 1 bit less here, and the meanings of several bits have changed.
Comment 16 Simon McVittie 2016-11-30 15:39:52 UTC
Review of attachment 331551 [details] [review]:

::: libsoup/soup-message-queue.h
@@ +53,3 @@
 	guint priority          : 3;
+	guint resend_count      : 5;
+	guint _unused           : 18;

Is this a public struct? You've used 1 bit less here, and the meanings of several bits have changed.
Comment 17 Simon McVittie 2016-11-30 15:41:35 UTC
(In reply to Bastien Nocera from comment #13)
> I still get stuck fetches with those 2 patches on top of master.

Is this a regression with those two patches, or are they merely not fixing a different symptom?
Comment 18 Dan Winship 2016-11-30 16:23:36 UTC
(In reply to Simon McVittie from comment #16)
> Is this a public struct?

no
Comment 19 Simon McVittie 2016-12-01 12:06:36 UTC
These patches look good to me, but they do not appear to fix this particular form of crashing in OSTree's tests (libsoup, ostree and glib master):

Thread 2 (Thread 0x7f3aed85f400 (LWP 7812))

  • #0 write
    at ../sysdeps/unix/syscall-template.S line 84
  • #1 g_wakeup_signal
    at /home/smcv/src/glib/glib/gwakeup.c line 239
  • #2 g_main_context_wakeup
    at /home/smcv/src/glib/glib/gmain.c line 4578
  • #3 _ostree_fetcher_finalize
    at /home/smcv/src/ostree/src/libostree/ostree-fetcher.c line 645
  • #4 g_object_unref
    at /home/smcv/src/glib/gobject/gobject.c line 3185
  • #5 ostree_repo_pull_with_options
    at /home/smcv/src/ostree/src/libostree/ostree-repo-pull.c line 3140
  • #6 ostree_builtin_pull
    at /home/smcv/src/ostree/src/ostree/ot-builtin-pull.c line 296
  • #7 ostree_run
    at /home/smcv/src/ostree/src/ostree/ot-main.c line 204
  • #8 main
    at /home/smcv/src/ostree/src/ostree/main.c line 78

I also sometimes get an "ostree pull" process hanging, but I don't think that's a regression with these patches.

I also sometimes get https://github.com/ostreedev/ostree/issues/583 but that is certainly not a regression.

All of these seem to reproduce more easily when spamming pull tests in parallel:

for x in `seq 1 100`; do echo "==$x=="; systemd-run --user --scope env VERBOSE=1 LD_LIBRARY_PATH=$prefix/lib make -j -C $builddir check-TESTS TESTS="tests/test-pull-repeated.sh tests/test-pull-archive-z.sh tests/test-pull-commit-only.sh tests/test-pull-contenturl.sh tests/test-pull-corruption.sh tests/test-pull-depth.sh tests/test-pull-metalink.sh tests/test-pull-large-metadata.sh tests/test-pull-mirrorlist.sh tests/test-pull-mirror-summary.sh tests/test-pull-override-url.sh tests/test-pull-resume.sh tests/test-pull-subpath.sh tests/test-pull-summary-sigs.sh tests/test-pull-untrusted.sh" || { echo "Error in iteration $x: $?"; break }; done
Comment 20 Simon McVittie 2016-12-01 20:45:02 UTC
(In reply to Simon McVittie from comment #19)
> These patches look good to me, but they do not appear to fix this particular
> form of crashing in OSTree's tests

Actually, that might just have been user error. When using -fsanitize=address, I definitely get an "ostree pull" process hanging/timing out quite often - I don't know whether that's a libsoup issue or not - but I haven't been able to reproduce the use-after-free. I could easily reproduce the use-after-free with asan but without these patches.
Comment 21 Colin Walters 2016-12-02 20:08:37 UTC
Review of attachment 331551 [details] [review]:

I can't say I fully grasp this code, but this patch looks good to me.
Comment 22 Dan Winship 2016-12-03 15:04:06 UTC
Pushed the patch but not sure if I should count the bug as FIXED?

Attachment 331550 [details] pushed as 8005ba4 - soup-misc: add a GDestroyNotify to soup_add_completion_reffed()
Attachment 331551 [details] pushed as 8e1b0f5 - soup-session: fix idle_run_queue() source handling
Comment 23 Michael Catanzaro 2017-02-14 18:21:10 UTC
This seems to have introduced a regression, bug #778135.
Comment 24 GNOME Infrastructure Team 2018-09-21 16:26:28 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/libsoup/issues/95.