GNOME Bugzilla – Bug 768567
session: Clean up handling of run queue source
Last modified: 2018-09-21 16:26:28 UTC
We're seeing an occasional crash in ostree: https://github.com/ostreedev/ostree/issues/373 This patch appears to fix it. I can't say that I conclusively know the exact bug, but there are a few things this patch cleans up. First, if `self->disposed` is true, the source will be destroyed, but remained on the list. Fix that by moving down the `if (self->disposed)` conditional. Another thing I observed then is that actually, this list can have at most one element in the current codebase. So then we can just store a `GSource*` reference in `priv`. Then, we no longer need to search through all sources on the main context for our `priv` pointer, we can just reference the pointer. This is a very common pattern. Now, we don't even need to call `g_main_context_find_source_by_user_data()` or `g_main_current_source()`.
Created attachment 331079 [details] [review] session: Clean up handling of run queue source
This breaks tests/context-test.c =/ I see that each SoupMessageQueueItem may potentially have a different context.
Created attachment 331081 [details] [review] session: Clean up handling of run queue source We're seeing an occasional crash in ostree: https://github.com/ostreedev/ostree/issues/373 This patch appears to fix it. I can't say that I conclusively know the exact bug, but there are a few things this patch cleans up. First, if `self->disposed` is true, the source will be destroyed, but remained on the list. Fix that by moving down the `if (self->disposed)` conditional. Also, searching through all sources on the main context for our `priv` pointer is ugly...it's more efficient for us to maintain a mapping, and also ensure that mapping holds a reference to the main context too.
Fixed that by having an explicit map.
Comment on attachment 331081 [details] [review] session: Clean up handling of run queue source >Also, searching through all sources on the main context for our `priv` >pointer is ugly...it's more efficient for us to maintain a mapping, >and also ensure that mapping holds a reference to the main context >too. Contexts hold refs on their sources, so sources can't hold refs on their contexts too or you'll end up leaking things. That's where all the ugliness in this code comes from. With your patch, if you finish with a GMainContext and unref it (and stop iterating it), but libsoup has queued a run_queue source on it, then the GMainContext (and everything attached to it) won't actually get freed until the session is destroyed. If you only ever create one SoupSession, and then repeatedly use GMainContexts only once [as WebKitGTK used to do and maybe still does] then you'll keep leaking more and more. The current code is clearly not thread-safe though... idle_run_queue() could start running in one thread and get past the "if (priv->disposed)" check, and then the session could get disposed in another thread.
looking at https://github.com/ostreedev/ostree/issues/373#issuecomment-231341349 and trying to figure out how we could get there, I notice that there's no mutex protecting run_queue_sources, which is obviously no good.
(In reply to Dan Winship from comment #5) > If you only ever create > one SoupSession, and then repeatedly use GMainContexts only once [as > WebKitGTK used to do and maybe still does] then you'll keep leaking more and > more. That sounds like a strange thing to do. > The current code is clearly not thread-safe though... idle_run_queue() could > start running in one thread and get past the "if (priv->disposed)" check, > and then the session could get disposed in another thread. That would indeed explain why the ostree change to predictably dispose it in one thread fixed the bug. Anyways so there may be no immediate bug here, but in general I find it quite ugly how we currently search for a source via user data. While it's obviously unlikely, what if someone was using GUINT_TO_POINTER() for their user data and it happened to be the same value as the priv pointer? If webkit isn't any longer doing that weird multiple-context-for-single-session thing, I think it'd argue for this patch + a mutex?
(In reply to Colin Walters from comment #7) > (In reply to Dan Winship from comment #5) > > If you only ever create > > one SoupSession, and then repeatedly use GMainContexts only once [as > > WebKitGTK used to do and maybe still does] then you'll keep leaking more and > > more. > > That sounds like a strange thing to do. XMLHttpRequest lets you send sync requests, which (per the spec) must block the completion of any pending async requests (and there are sites that will fail completely if you don't do it this way). Since it didn't used to be possible to mix sync and async requests on the same session, WebKit implemented this by creating a new GMainContext in the main thread, queueing an async request on that context, running that context until the request completed, and then destroying the new context. > Anyways so there may be no immediate bug here, but in general I find it > quite ugly how we currently search for a source via user data. While it's > obviously unlikely, what if someone was using GUINT_TO_POINTER() for their > user data and it happened to be the same value as the priv pointer? Well, that argument applies just as well to any other use of any find-by-user-data API (eg, g_signal_handlers_disconnect_by_data())... > If webkit isn't any longer doing that weird > multiple-context-for-single-session thing, I think it'd argue for this patch > + a mutex? Creating the reference counting loop is sketchy regardless of what WebKit is currently doing. So, summing up the problems: 1. priv->run_queue_sources is not handled thread-safely 2. session disposal is not thread-safe wrt pending run_queue_sources 3. g_main_context_find_source_by_user_data() is debatably a bad idea, but at any rate, it's dumb for us to be using it anyway since we could just scan priv->run_queue_sources, which is almost certainly going to be much smaller than context->sources
(working on a patch...)
Created attachment 331550 [details] [review] soup-misc: add a GDestroyNotify to soup_add_completion_reffed()
Created attachment 331551 [details] [review] soup-session: fix idle_run_queue() source handling priv->run_queue_sources was un-thread-safe in multiple ways. Now: 1. We pass idle_run_queue() a GWeakRef to the SoupSession, rather than the SoupSession itself. So now it's not possible for the session to get disposed in another thread while idle_run_queue() is running, and it's safe to let the callback get run even after the session is freed. 2. The idle_run_queue() source is given a GDestroyNotify, so it can clean up the weakref even if the source is destroyed behind the session's back. 3. Since we no longer have to forcibly destroy the sources when the session is destroyed, and we don't have to manually clean up after them if they don't get run, we no longer have to explicitly remember the sources after creating them, and so we don't even need priv->run_queue_sources. However, we do still want to make sure that there's only ever one pending idle_run_queue() source per context (so we don't keep adding more and more if the context isn't currently being run), so we add a new "async_pending" field to SoupMessageQueueItem to keep track of that.
does that fix it for you?
I still get stuck fetches with those 2 patches on top of master.
+ Trace 236467
With that running: chip 12588 0.0 0.3 2620 1900 pts/0 S+ 14:36 0:00 /bin/bash /home/chip/ostree/tests/test-remote-gpg-import.sh -k --tap chip 12610 0.0 0.7 21624 3904 pts/0 Sl+ 14:36 0:00 ostree trivial-httpd --autoexit --daemonize -p /var/tmp/tap-test.GdwdvX/httpd-port chip 13058 0.0 1.9 27300 9924 pts/0 Sl+ 14:36 0:00 ostree --repo=repo pull R2:main (test-remote-gpg-import.sh, test-commit-sign.sh test-pull-archive-z.sh all hang that way, but killing the tests allows the test suite to carry on)
hm... no mutexes in that trace though. it looks like we're failing to add a source in some cases, so it just ends up sitting in poll() forever rather than starting/continuing the request
Review of attachment 331551 [details] [review]: ::: libsoup/soup-message-queue.h @@ +53,3 @@ guint priority : 3; + guint resend_count : 5; + guint _unused : 18; Is this a public struct? You've used 1 bit less here, and the meanings of several bits have changed.
(In reply to Bastien Nocera from comment #13) > I still get stuck fetches with those 2 patches on top of master. Is this a regression with those two patches, or are they merely not fixing a different symptom?
(In reply to Simon McVittie from comment #16) > Is this a public struct? no
These patches look good to me, but they do not appear to fix this particular form of crashing in OSTree's tests (libsoup, ostree and glib master):
+ Trace 236910
Thread 2 (Thread 0x7f3aed85f400 (LWP 7812))
I also sometimes get an "ostree pull" process hanging, but I don't think that's a regression with these patches. I also sometimes get https://github.com/ostreedev/ostree/issues/583 but that is certainly not a regression. All of these seem to reproduce more easily when spamming pull tests in parallel: for x in `seq 1 100`; do echo "==$x=="; systemd-run --user --scope env VERBOSE=1 LD_LIBRARY_PATH=$prefix/lib make -j -C $builddir check-TESTS TESTS="tests/test-pull-repeated.sh tests/test-pull-archive-z.sh tests/test-pull-commit-only.sh tests/test-pull-contenturl.sh tests/test-pull-corruption.sh tests/test-pull-depth.sh tests/test-pull-metalink.sh tests/test-pull-large-metadata.sh tests/test-pull-mirrorlist.sh tests/test-pull-mirror-summary.sh tests/test-pull-override-url.sh tests/test-pull-resume.sh tests/test-pull-subpath.sh tests/test-pull-summary-sigs.sh tests/test-pull-untrusted.sh" || { echo "Error in iteration $x: $?"; break }; done
(In reply to Simon McVittie from comment #19) > These patches look good to me, but they do not appear to fix this particular > form of crashing in OSTree's tests Actually, that might just have been user error. When using -fsanitize=address, I definitely get an "ostree pull" process hanging/timing out quite often - I don't know whether that's a libsoup issue or not - but I haven't been able to reproduce the use-after-free. I could easily reproduce the use-after-free with asan but without these patches.
Review of attachment 331551 [details] [review]: I can't say I fully grasp this code, but this patch looks good to me.
Pushed the patch but not sure if I should count the bug as FIXED? Attachment 331550 [details] pushed as 8005ba4 - soup-misc: add a GDestroyNotify to soup_add_completion_reffed() Attachment 331551 [details] pushed as 8e1b0f5 - soup-session: fix idle_run_queue() source handling
This seems to have introduced a regression, bug #778135.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/libsoup/issues/95.