After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 753336 - repo-queue: Add a queue for scanning
repo-queue: Add a queue for scanning
Status: RESOLVED FIXED
Product: ostree
Classification: Infrastructure
Component: general
unspecified
Other All
: Normal normal
: ---
Assigned To: OSTree maintainer(s)
OSTree maintainer(s)
Depends on:
Blocks:
 
 
Reported: 2015-08-07 00:42 UTC by Jasper St. Pierre (not reading bugmail)
Modified: 2015-08-26 19:20 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Update .gitignore (938 bytes, patch)
2015-08-07 00:42 UTC, Jasper St. Pierre (not reading bugmail)
committed Details | Review
libtest: Fix file paths of valgrind suppressions / LD_PRELOAD file (1.11 KB, patch)
2015-08-07 00:43 UTC, Jasper St. Pierre (not reading bugmail)
needs-work Details | Review
repo-pull: Add a queue for scanning (11.82 KB, patch)
2015-08-07 00:43 UTC, Jasper St. Pierre (not reading bugmail)
none Details | Review
see patch (1.83 KB, patch)
2015-08-12 21:05 UTC, Colin Walters
none Details | Review
stop using gmainloop, clean up termination (26.67 KB, patch)
2015-08-13 03:38 UTC, Colin Walters
none Details | Review
stop using gmainloop, clean up termination (27.44 KB, patch)
2015-08-13 04:03 UTC, Colin Walters
none Details | Review
stop using gmainloop, clean up termination (28.08 KB, patch)
2015-08-13 13:12 UTC, Colin Walters
committed Details | Review
repo-pull: Add a queue for scanning (12.12 KB, patch)
2015-08-17 20:22 UTC, Jasper St. Pierre (not reading bugmail)
none Details | Review
repo-pull: Add a queue for scanning (11.80 KB, patch)
2015-08-26 03:37 UTC, Jasper St. Pierre (not reading bugmail)
committed Details | Review

Description Jasper St. Pierre (not reading bugmail) 2015-08-07 00:42:51 UTC
We hit this at Endless. See patches.
Comment 1 Jasper St. Pierre (not reading bugmail) 2015-08-07 00:42:57 UTC
Created attachment 308869 [details] [review]
Update .gitignore
Comment 2 Jasper St. Pierre (not reading bugmail) 2015-08-07 00:43:00 UTC
Created attachment 308870 [details] [review]
libtest: Fix file paths of valgrind suppressions / LD_PRELOAD file
Comment 3 Jasper St. Pierre (not reading bugmail) 2015-08-07 00:43:03 UTC
Created attachment 308871 [details] [review]
repo-pull: Add a queue for scanning

On systems with slow disks, the recursive scanning of directories can
be expensive -- it takes upwards of 2 minutes on our systems. This can
block the main loop for such a long time that it allows the download to
time out...

As such, move all the scanning of objects to a queue, processed from
an idle, to make sure that we don't block the main loop when scanning.
Comment 4 Colin Walters 2015-08-09 02:07:41 UTC
Review of attachment 308870 [details] [review]:

I don't understand this.  In what cases is libtest.sh a symlink?  Can you elaborate on the test setup?

I personally am using:
./autogen.sh --prefix=/usr && make && sudo make install
gnome-desktop-testing-runner -p 0 ostree/
Comment 5 Colin Walters 2015-08-09 02:23:27 UTC
Review of attachment 308871 [details] [review]:

The way the pull API iterates the caller's mainloop is the root bug here.  I screwed that up.  One should expect a synchronous API to block, except...I wanted progress reporting, but that should probably have been done with an explicit callback API.

One thing that would likely help here too is to cache some metadata somewhere when we've retrieved all of the child objects of a given tree.

Anyways...offhand, I think this patch is OK, but I need to spend some more time looking at it.
Comment 6 Colin Walters 2015-08-09 02:23:50 UTC
Review of attachment 308869 [details] [review]:

OK
Comment 7 Jasper St. Pierre (not reading bugmail) 2015-08-09 02:27:44 UTC
(In reply to Colin Walters from comment #4)
> Review of attachment 308870 [details] [review] [review]:
> 
> I don't understand this.  In what cases is libtest.sh a symlink?  Can you
> elaborate on the test setup?
> 
> I personally am using:
> ./autogen.sh --prefix=/usr && make && sudo make install
> gnome-desktop-testing-runner -p 0 ostree/

This is to get an absolute path, so I can dirname it again to get the parent directory. I also realized it can also just be $SRCDIR/../.libs/libreaddir-rand.so. Your pick.

(In reply to Colin Walters from comment #5)
> Review of attachment 308871 [details] [review] [review]:
> 
> The way the pull API iterates the caller's mainloop is the root bug here.  I
> screwed that up.  One should expect a synchronous API to block, except...I
> wanted progress reporting, but that should probably have been done with an
> explicit callback API.

Well, even with that, if you have both libsoup and scanning going at the same time, you're gonna need to cooperatively manage both, even if you use an internal mainloop. I also debated using a separate task thread for scanning.

> One thing that would likely help here too is to cache some metadata
> somewhere when we've retrieved all of the child objects of a given tree.

Agreed.

> Anyways...offhand, I think this patch is OK, but I need to spend some more
> time looking at it.

Now that you mention it, I think we need to explicitly destroy the idle src, otherwise it won't go away when the nested main loop exits, because it's attached to the same default context.
Comment 8 Colin Walters 2015-08-12 20:49:00 UTC
libreaddir-rand.so should be *installed* as part of installed-tests into (e.g.) /usr/libexec/ostree/installed-tests/libreaddir-rand.so.

Does that model work for you?

Were we to support running the tests uninstalled, I'd expect us to use libtool --mode=execute, not hardcode .libs.
Comment 9 Jasper St. Pierre (not reading bugmail) 2015-08-12 20:51:13 UTC
Ah, I didn't install the tests, I simply ran gnome-desktop-testing-runner in the local directory. Which worked, but gave me errors about being unable to find libreaddir-rand.so, so I fixed that up.
Comment 10 Colin Walters 2015-08-12 21:00:34 UTC
Review of attachment 308871 [details] [review]:

::: src/libostree/ostree-repo-pull.c
@@ +323,2 @@
   check_outstanding_requests_handle_error (user_data, NULL);
+  return G_SOURCE_CONTINUE;

Wait, I am confused now.  The old idle source had `return FALSE` and AFAICS nothing requeued it, so it would just run at most once...how did this work?

Were we possibly relying on the fact that idle sources have a lower priority, and the libsoup sources just happened to run first? =( =(

Some initial investigation makes this appear to be the case...
Comment 11 Colin Walters 2015-08-12 21:02:43 UTC
No, okay we're calling `check_outstanding_requests_handle_error` in each callback - so the old idle source was just vestigal.

Given that we don't actually want to have a high frequency polling loop here, let's just entirely remove the idle source?
Comment 12 Jasper St. Pierre (not reading bugmail) 2015-08-12 21:04:10 UTC
We had a similar thought, but then figured it out.

check_outstanding_requests was called after anything important happened (libsoup finished a request, the fetch changed), called as-needed. The idle source was always meant to run once, just in case no work was queued.
Comment 13 Colin Walters 2015-08-12 21:05:40 UTC
Created attachment 309184 [details] [review]
see patch
Comment 14 Jasper St. Pierre (not reading bugmail) 2015-08-12 21:05:58 UTC
Well, we need *something* to manage the scan queue. We could start and stop the idle source depending on when the queue has work and when it doesn't, or we could push scanning to a worker thread and have a condition wake up the thread when that thread has work on the queue, but the simplest thing was having the idle source just keep polling on the queue.
Comment 15 Jasper St. Pierre (not reading bugmail) 2015-08-12 21:06:30 UTC
Review of attachment 309184 [details] [review]:

As mentioned above, wouldn't this mean that nothing would stop the loop if we're doing an empty pull, where we already have everything we need?
Comment 16 Colin Walters 2015-08-12 21:11:09 UTC
Let me try fixing the current logic.
Comment 17 Colin Walters 2015-08-13 03:38:31 UTC
Created attachment 309189 [details] [review]
stop using gmainloop, clean up termination
Comment 18 Jasper St. Pierre (not reading bugmail) 2015-08-13 03:46:14 UTC
Review of attachment 309189 [details] [review]:

This seems like it will have a similar breakage: if the fetcher is on its own mainloop, then pull could request some data on its mainloop, and time out while the fetcher is still spinning. Though I, admittedly, am a bit confused about why we have three nested mainloops going at once.

::: src/libostree/ostree-repo-pull.c
@@ +1655,2 @@
   pull_data->async_error = error;
   pull_data->main_context = g_main_context_ref_thread_default ();

So we *don't* create our own main context here? We still use the user's main context?

@@ +2181,3 @@
     g_source_destroy (update_timeout);
+  if (pull_data->main_context)
+    g_main_context_unref (pull_data->main_context);

You already unref it a few lines above...
Comment 19 Colin Walters 2015-08-13 04:02:19 UTC
So the terrible mess here again is that this API iterates the caller's context, because I tried to do progress reporting that way, but it was a mistake.

The pull has two phases:
1) Downloading small metadata (summary file etc)
2) Downloading numerous/large content

Currently #1 is coded as a series of synchronous steps because doing it async would be very tedious.  #2 is async internally.

For #1 we were previously creating a new main context, so no progress reporting.  That isn't a regression.

As for 3 contexts, well that's not uncommon currently with sync APIs that are written using async APIs internally.  The cost is pretty small I think; we have an eventfd per context, but the nesting here is small.
Comment 20 Colin Walters 2015-08-13 04:03:08 UTC
Created attachment 309190 [details] [review]
stop using gmainloop, clean up termination

Fix double unref, squash patch with idle removal
Comment 21 Jasper St. Pierre (not reading bugmail) 2015-08-13 04:07:26 UTC
Review of attachment 309190 [details] [review]:

::: src/libostree/ostree-repo-pull.c
@@ +2181,3 @@
     g_source_destroy (update_timeout);
+  if (pull_data->main_context)
+    g_main_context_unref (pull_data->main_context);

Uh, I still see the double unref. Uploaded an old patch?
Comment 22 Jasper St. Pierre (not reading bugmail) 2015-08-13 04:59:34 UTC
Review of attachment 309190 [details] [review]:

So, this means that we're no longer spinning the main mainloop anymore, and that means that ostree_fetcher_request_uri_to_membuf is a fully blocking call. That means that the progress callback won't be called while the fetcher is going, and it also means that we're not going to fetch multiple URIs at once, which seems like it will slow things down -- we're back to being fully serialized. Not good for an HTTP/2 future, or retrieving multiple objects at once, which we should really do more of. Although, yes, the way that worked right now was extraordinarily hacky, relying on the fetcher spinning another request on the same loop.

My bug about the queue is also fixed, because now the scanning code is never running at the same time as the fetching code, but that seems like a poor side-effect more than anything. I'm quite sure it would be faster to be hitting both the network and the disk at the same time.

Weirdly enough, with this patch, I think the only time we actually use the main loop that the user passes is in is simply for the timeout source about progress callbacks. If we remove those from the pull loop and put them in the fetcher, that would fix the bug about the progress callback never being called, and then we can remove the main loop from the pull code entirely.
Comment 23 Colin Walters 2015-08-13 13:07:21 UTC
In the two phases I talked about, ostree_fetcher_request_uri_to_membuf is used for #1, which is synchronous.

Phase 2 does not use it, instead calling _ostree_fetcher_request_uri_with_partial_async() which, being async, does not make its own main context.

Except...empirical testing makes it look like you're right, we're running all of the requests serially for some reason.

I think this may have to do with the fact that SoupSession is binding to the first main context...yes, this patch on top fixes it:

$ git diff
diff --git a/src/libostree/ostree-repo-pull.c b/src/libostree/ostree-repo-pull.c
index 2c1809f..b754940 100644
--- a/src/libostree/ostree-repo-pull.c
+++ b/src/libostree/ostree-repo-pull.c
@@ -2003,6 +2003,14 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
 
   pull_data->phase = OSTREE_PULL_PHASE_FETCHING_OBJECTS;
 
+  /* Now discard the previous fetcher, as it was bound to a temporary main context
+   * for synchronous requests.
+   */
+  g_clear_object (&pull_data->fetcher);
+  pull_data->fetcher = _ostree_repo_remote_new_fetcher (self, remote_name_or_baseurl, error);
+  if (pull_data->fetcher == NULL)
+    goto out;
+
   if (!ostree_repo_prepare_transaction (pull_data->repo, &pull_data->transaction_resuming,
                                         cancellable, error))
     goto out;
Comment 24 Colin Walters 2015-08-13 13:12:18 UTC
Created attachment 309206 [details] [review]
stop using gmainloop, clean up termination
Comment 25 Jasper St. Pierre (not reading bugmail) 2015-08-13 19:48:00 UTC
Review of attachment 309206 [details] [review]:

Ah, I missed the fact that we are using the fetcher in an async fashion during the second phase. This makes sense to me, then. It still doesn't fix the scanning bug, but let's do that after we land this.

::: src/libostree/ostree-repo-pull.c
@@ +235,3 @@
+/* The core logic function for whether we should continue the main loop */
+static gboolean
+pull_termination_condition (OtPullData          *pull_data)

Not gonna bikeshed on this, but The name "pull_termination_condition" is a bit confusing for something that returns FALSE when it should terminate.

Perhaps "should_continue_pull_loop" ?
Comment 26 Colin Walters 2015-08-14 02:03:16 UTC
Comment on attachment 309206 [details] [review]
stop using gmainloop, clean up termination

Thanks for the review, I ended up swapping the condition boolean return, then it's:

while (!termination_condition())
  loop()

which reads more clearly to me.  Want to rebase your patch on top now?
Comment 27 Jasper St. Pierre (not reading bugmail) 2015-08-17 20:22:32 UTC
Created attachment 309421 [details] [review]
repo-pull: Add a queue for scanning

On systems with slow disks, the recursive scanning of directories can
be expensive -- it takes upwards of 2 minutes on our systems. This can
block the main loop for such a long time that it allows the download to
time out...

As such, move all the scanning of objects to a queue, processed from
an idle, to make sure that we don't block the main loop when scanning.
Comment 28 Colin Walters 2015-08-17 21:09:33 UTC
Review of attachment 309421 [details] [review]:

::: src/libostree/ostree-repo-pull.c
@@ +321,3 @@
+  process_scan_queue (user_data);
+  check_outstanding_requests_handle_error (user_data, NULL);
+  return G_SOURCE_CONTINUE;

Returning G_SOURCE_CONTINUE unconditionally from an idle handler will pointlessly occupy a core.

We don't need to call check_outstanding_requests_handle_error() in an idle anymore.

So what you want to do here is have the idle queued only when there is work to do in the scan queue.  There's a variant of this in OstreeFetcher, and many other projects too.
Comment 29 Jasper St. Pierre (not reading bugmail) 2015-08-26 03:37:16 UTC
Created attachment 309991 [details] [review]
repo-pull: Add a queue for scanning

On systems with slow disks, the recursive scanning of directories can
be expensive -- it takes upwards of 2 minutes on our systems. This can
block the main loop for such a long time that it allows the download to
time out...

As such, move all the scanning of objects to a queue, processed from
an idle, to make sure that we don't block the main loop when scanning.





Sorry, I took a long time to get this done. It should be OK -- it passes all the tests here on my machine. I did get a weird issue when I ran the tests once while developing, but I was never able to reproduce it again. If you can sanity-check it, that would be great. I saw this below:

timestamp-orig.txt timestamp-new.txt differ: byte 10, line 1
failed to update mtime on repo
Comment 30 Colin Walters 2015-08-26 18:48:34 UTC
Review of attachment 309991 [details] [review]:

I think this is OK, just one leak:

::: src/libostree/ostree-repo-pull.c
@@ +2120,3 @@
   while (!pull_termination_condition (pull_data))
     g_main_context_iteration (pull_data->main_context, TRUE);
+

Need to have:

g_clear_pointer (&pull_data->idle_src, g_source_destroy);

here (or in the out: block)
Comment 31 Jasper St. Pierre (not reading bugmail) 2015-08-26 19:16:15 UTC
Comment on attachment 309991 [details] [review]
repo-pull: Add a queue for scanning

Attachment 309991 [details] pushed as 20647ed - repo-pull: Add a queue for scanning


Thanks -- pushed with the minor leak fix.