GNOME Bugzilla – Bug 753336
repo-queue: Add a queue for scanning
Last modified: 2015-08-26 19:20:05 UTC
We hit this at Endless. See patches.
Created attachment 308869 [details] [review] Update .gitignore
Created attachment 308870 [details] [review] libtest: Fix file paths of valgrind suppressions / LD_PRELOAD file
Created attachment 308871 [details] [review] repo-pull: Add a queue for scanning On systems with slow disks, the recursive scanning of directories can be expensive -- it takes upwards of 2 minutes on our systems. This can block the main loop for such a long time that it allows the download to time out... As such, move all the scanning of objects to a queue, processed from an idle, to make sure that we don't block the main loop when scanning.
Review of attachment 308870 [details] [review]: I don't understand this. In what cases is libtest.sh a symlink? Can you elaborate on the test setup? I personally am using: ./autogen.sh --prefix=/usr && make && sudo make install gnome-desktop-testing-runner -p 0 ostree/
Review of attachment 308871 [details] [review]: The way the pull API iterates the caller's mainloop is the root bug here. I screwed that up. One should expect a synchronous API to block, except...I wanted progress reporting, but that should probably have been done with an explicit callback API. One thing that would likely help here too is to cache some metadata somewhere when we've retrieved all of the child objects of a given tree. Anyways...offhand, I think this patch is OK, but I need to spend some more time looking at it.
Review of attachment 308869 [details] [review]: OK
(In reply to Colin Walters from comment #4) > Review of attachment 308870 [details] [review] [review]: > > I don't understand this. In what cases is libtest.sh a symlink? Can you > elaborate on the test setup? > > I personally am using: > ./autogen.sh --prefix=/usr && make && sudo make install > gnome-desktop-testing-runner -p 0 ostree/ This is to get an absolute path, so I can dirname it again to get the parent directory. I also realized it can also just be $SRCDIR/../.libs/libreaddir-rand.so. Your pick. (In reply to Colin Walters from comment #5) > Review of attachment 308871 [details] [review] [review]: > > The way the pull API iterates the caller's mainloop is the root bug here. I > screwed that up. One should expect a synchronous API to block, except...I > wanted progress reporting, but that should probably have been done with an > explicit callback API. Well, even with that, if you have both libsoup and scanning going at the same time, you're gonna need to cooperatively manage both, even if you use an internal mainloop. I also debated using a separate task thread for scanning. > One thing that would likely help here too is to cache some metadata > somewhere when we've retrieved all of the child objects of a given tree. Agreed. > Anyways...offhand, I think this patch is OK, but I need to spend some more > time looking at it. Now that you mention it, I think we need to explicitly destroy the idle src, otherwise it won't go away when the nested main loop exits, because it's attached to the same default context.
libreaddir-rand.so should be *installed* as part of installed-tests into (e.g.) /usr/libexec/ostree/installed-tests/libreaddir-rand.so. Does that model work for you? Were we to support running the tests uninstalled, I'd expect us to use libtool --mode=execute, not hardcode .libs.
Ah, I didn't install the tests, I simply ran gnome-desktop-testing-runner in the local directory. Which worked, but gave me errors about being unable to find libreaddir-rand.so, so I fixed that up.
Review of attachment 308871 [details] [review]: ::: src/libostree/ostree-repo-pull.c @@ +323,2 @@ check_outstanding_requests_handle_error (user_data, NULL); + return G_SOURCE_CONTINUE; Wait, I am confused now. The old idle source had `return FALSE` and AFAICS nothing requeued it, so it would just run at most once...how did this work? Were we possibly relying on the fact that idle sources have a lower priority, and the libsoup sources just happened to run first? =( =( Some initial investigation makes this appear to be the case...
No, okay we're calling `check_outstanding_requests_handle_error` in each callback - so the old idle source was just vestigal. Given that we don't actually want to have a high frequency polling loop here, let's just entirely remove the idle source?
We had a similar thought, but then figured it out. check_outstanding_requests was called after anything important happened (libsoup finished a request, the fetch changed), called as-needed. The idle source was always meant to run once, just in case no work was queued.
Created attachment 309184 [details] [review] see patch
Well, we need *something* to manage the scan queue. We could start and stop the idle source depending on when the queue has work and when it doesn't, or we could push scanning to a worker thread and have a condition wake up the thread when that thread has work on the queue, but the simplest thing was having the idle source just keep polling on the queue.
Review of attachment 309184 [details] [review]: As mentioned above, wouldn't this mean that nothing would stop the loop if we're doing an empty pull, where we already have everything we need?
Let me try fixing the current logic.
Created attachment 309189 [details] [review] stop using gmainloop, clean up termination
Review of attachment 309189 [details] [review]: This seems like it will have a similar breakage: if the fetcher is on its own mainloop, then pull could request some data on its mainloop, and time out while the fetcher is still spinning. Though I, admittedly, am a bit confused about why we have three nested mainloops going at once. ::: src/libostree/ostree-repo-pull.c @@ +1655,2 @@ pull_data->async_error = error; pull_data->main_context = g_main_context_ref_thread_default (); So we *don't* create our own main context here? We still use the user's main context? @@ +2181,3 @@ g_source_destroy (update_timeout); + if (pull_data->main_context) + g_main_context_unref (pull_data->main_context); You already unref it a few lines above...
So the terrible mess here again is that this API iterates the caller's context, because I tried to do progress reporting that way, but it was a mistake. The pull has two phases: 1) Downloading small metadata (summary file etc) 2) Downloading numerous/large content Currently #1 is coded as a series of synchronous steps because doing it async would be very tedious. #2 is async internally. For #1 we were previously creating a new main context, so no progress reporting. That isn't a regression. As for 3 contexts, well that's not uncommon currently with sync APIs that are written using async APIs internally. The cost is pretty small I think; we have an eventfd per context, but the nesting here is small.
Created attachment 309190 [details] [review] stop using gmainloop, clean up termination Fix double unref, squash patch with idle removal
Review of attachment 309190 [details] [review]: ::: src/libostree/ostree-repo-pull.c @@ +2181,3 @@ g_source_destroy (update_timeout); + if (pull_data->main_context) + g_main_context_unref (pull_data->main_context); Uh, I still see the double unref. Uploaded an old patch?
Review of attachment 309190 [details] [review]: So, this means that we're no longer spinning the main mainloop anymore, and that means that ostree_fetcher_request_uri_to_membuf is a fully blocking call. That means that the progress callback won't be called while the fetcher is going, and it also means that we're not going to fetch multiple URIs at once, which seems like it will slow things down -- we're back to being fully serialized. Not good for an HTTP/2 future, or retrieving multiple objects at once, which we should really do more of. Although, yes, the way that worked right now was extraordinarily hacky, relying on the fetcher spinning another request on the same loop. My bug about the queue is also fixed, because now the scanning code is never running at the same time as the fetching code, but that seems like a poor side-effect more than anything. I'm quite sure it would be faster to be hitting both the network and the disk at the same time. Weirdly enough, with this patch, I think the only time we actually use the main loop that the user passes is in is simply for the timeout source about progress callbacks. If we remove those from the pull loop and put them in the fetcher, that would fix the bug about the progress callback never being called, and then we can remove the main loop from the pull code entirely.
In the two phases I talked about, ostree_fetcher_request_uri_to_membuf is used for #1, which is synchronous. Phase 2 does not use it, instead calling _ostree_fetcher_request_uri_with_partial_async() which, being async, does not make its own main context. Except...empirical testing makes it look like you're right, we're running all of the requests serially for some reason. I think this may have to do with the fact that SoupSession is binding to the first main context...yes, this patch on top fixes it: $ git diff diff --git a/src/libostree/ostree-repo-pull.c b/src/libostree/ostree-repo-pull.c index 2c1809f..b754940 100644 --- a/src/libostree/ostree-repo-pull.c +++ b/src/libostree/ostree-repo-pull.c @@ -2003,6 +2003,14 @@ ostree_repo_pull_with_options (OstreeRepo *self, pull_data->phase = OSTREE_PULL_PHASE_FETCHING_OBJECTS; + /* Now discard the previous fetcher, as it was bound to a temporary main context + * for synchronous requests. + */ + g_clear_object (&pull_data->fetcher); + pull_data->fetcher = _ostree_repo_remote_new_fetcher (self, remote_name_or_baseurl, error); + if (pull_data->fetcher == NULL) + goto out; + if (!ostree_repo_prepare_transaction (pull_data->repo, &pull_data->transaction_resuming, cancellable, error)) goto out;
Created attachment 309206 [details] [review] stop using gmainloop, clean up termination
Review of attachment 309206 [details] [review]: Ah, I missed the fact that we are using the fetcher in an async fashion during the second phase. This makes sense to me, then. It still doesn't fix the scanning bug, but let's do that after we land this. ::: src/libostree/ostree-repo-pull.c @@ +235,3 @@ +/* The core logic function for whether we should continue the main loop */ +static gboolean +pull_termination_condition (OtPullData *pull_data) Not gonna bikeshed on this, but The name "pull_termination_condition" is a bit confusing for something that returns FALSE when it should terminate. Perhaps "should_continue_pull_loop" ?
Comment on attachment 309206 [details] [review] stop using gmainloop, clean up termination Thanks for the review, I ended up swapping the condition boolean return, then it's: while (!termination_condition()) loop() which reads more clearly to me. Want to rebase your patch on top now?
Created attachment 309421 [details] [review] repo-pull: Add a queue for scanning On systems with slow disks, the recursive scanning of directories can be expensive -- it takes upwards of 2 minutes on our systems. This can block the main loop for such a long time that it allows the download to time out... As such, move all the scanning of objects to a queue, processed from an idle, to make sure that we don't block the main loop when scanning.
Review of attachment 309421 [details] [review]: ::: src/libostree/ostree-repo-pull.c @@ +321,3 @@ + process_scan_queue (user_data); + check_outstanding_requests_handle_error (user_data, NULL); + return G_SOURCE_CONTINUE; Returning G_SOURCE_CONTINUE unconditionally from an idle handler will pointlessly occupy a core. We don't need to call check_outstanding_requests_handle_error() in an idle anymore. So what you want to do here is have the idle queued only when there is work to do in the scan queue. There's a variant of this in OstreeFetcher, and many other projects too.
Created attachment 309991 [details] [review] repo-pull: Add a queue for scanning On systems with slow disks, the recursive scanning of directories can be expensive -- it takes upwards of 2 minutes on our systems. This can block the main loop for such a long time that it allows the download to time out... As such, move all the scanning of objects to a queue, processed from an idle, to make sure that we don't block the main loop when scanning. Sorry, I took a long time to get this done. It should be OK -- it passes all the tests here on my machine. I did get a weird issue when I ran the tests once while developing, but I was never able to reproduce it again. If you can sanity-check it, that would be great. I saw this below: timestamp-orig.txt timestamp-new.txt differ: byte 10, line 1 failed to update mtime on repo
Review of attachment 309991 [details] [review]: I think this is OK, just one leak: ::: src/libostree/ostree-repo-pull.c @@ +2120,3 @@ while (!pull_termination_condition (pull_data)) g_main_context_iteration (pull_data->main_context, TRUE); + Need to have: g_clear_pointer (&pull_data->idle_src, g_source_destroy); here (or in the out: block)
Comment on attachment 309991 [details] [review] repo-pull: Add a queue for scanning Attachment 309991 [details] pushed as 20647ed - repo-pull: Add a queue for scanning Thanks -- pushed with the minor leak fix.