GNOME Bugzilla – Bug 771390
Do not read the whole Google Drive on enumerate.
Last modified: 2018-09-21 12:13:25 UTC
On larger Google Drives this makes the backend unusable. Given the following folder structure:

/
|-> foo
    |-> a
    |-> b
|-> bar
    |-> 1
    |-> 2
    |-> 3
    ...
    |-> 4999
    |-> 5000

`ls /` and even `ls /foo` currently take more than 4 minutes in our setup. Speeding up `ls /bar` might also be possible but is far less important.
Thanks for your bug report. I have around 500 files and it takes about 20 seconds, which is somewhat acceptable, but 4 minutes is really not acceptable... I remember that I proposed an idea to not load all files every time while the backend was still in development, but there were some problems that made it impossible... I think that libgdata had some problems with filtering. Rishi, don't you recall why this behavior was chosen and whether it is possible to change it?
First, Google Drive has the Changes API [1], which lets us query only the delta since the last 'change ID' seen by us. Unless the Drive is changing rapidly, which is unusual, I think this should speed up our attempts to update the cache. Right now we re-download the metadata for the entire Drive. Unfortunately, this isn't implemented in libgdata [2], so we need to do that first.

Second, maybe we can do something to speed up the initial metadata fetching. I can think of some options, but I haven't tried any of them. We could probably have an offline cache of the Drive. Everything is expressed in JSON, so maybe we can just save it to disk and re-use the metadata from the last mount until we have completed the network I/O. Or maybe we can just query whatever is needed for the directory in question, and then query the rest in the background. I kind of like having the entire Drive's metadata cached, since subsequent operations are less likely to suffer a cache miss.

The reasons we didn't do this initially are that (i) libgdata didn't offer the Changes API, which is a relatively easy optimization, and (ii) things kind of worked OK for the Drives that we tested it against.

[1] https://developers.google.com/drive/v2/reference/changes
[2] https://bugzilla.gnome.org/show_bug.cgi?id=654652
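For illustration only (this is not something libgdata provides, per [2]), a minimal C sketch of the delta-query idea using libsoup, assuming an OAuth access token is already at hand; fetch_changes_since, access_token and last_change_id are hypothetical names, and error handling is omitted:

#include <libsoup/soup.h>

/* Hypothetical sketch: ask Drive only for changes newer than the last
 * change ID we have seen, instead of re-downloading all metadata. */
static gchar *
fetch_changes_since (const gchar *access_token, gint64 last_change_id)
{
  SoupSession *session;
  SoupMessage *msg;
  gchar *uri;
  gchar *auth;
  gchar *json = NULL;

  uri = g_strdup_printf ("https://www.googleapis.com/drive/v2/changes"
                         "?startChangeId=%" G_GINT64_FORMAT,
                         last_change_id + 1);

  session = soup_session_new ();
  msg = soup_message_new ("GET", uri);

  /* Drive expects the OAuth token in the Authorization header. */
  auth = g_strdup_printf ("Bearer %s", access_token);
  soup_message_headers_append (msg->request_headers, "Authorization", auth);
  g_free (auth);

  if (soup_session_send_message (session, msg) == SOUP_STATUS_OK)
    json = g_strndup (msg->response_body->data, msg->response_body->length);

  g_object_unref (msg);
  g_object_unref (session);
  g_free (uri);

  return json; /* JSON list of changes, to be merged into the local cache */
}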
Thank you for the detailed information. Querying only what is needed for the directory in question seems to be the easiest fix for this problem. Building a cache which can be updated using the Changes API and even saved to disk could be added back later. Unfortunately I cannot write any C code, but I am happy to test any patches someone might come up with.
Created attachment 352157 [details] [review]
google: Improve performance on Drives with many files

The backend is totally unusable if you have too many files on your Drive. This happens because the backend caches the whole Drive. Various operations require rereading the cache, because proper invalidation is not possible. Let's change the workflow in order to improve the performance. The summary of changes done by this patch:
- Do not cache the whole Drive, but cache just the requested folders.
- Do not use the cache for enumeration, but just for additional operations. It is not common that applications call enumeration for the same folder several times in a short time...
- Use cache entries only if they are not older than 100 seconds, which reflects the Google quotas, which are specified per 100 seconds.
- Remove the functions get_parent_id and get_entry_path, which are wrong because one GDataEntry may have multiple parents.

As a result, the backend works smoothly regardless of the total number of Drive files, because the total amount of transmitted data is significantly reduced. It is true that more requests to Drive are made, but I think that the Drive quotas are big enough.
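As a rough illustration of the per-folder cache described in this summary (a sketch only, not the actual patch): dir_entries, entries and DirEntriesData follow the structures quoted in the review further below, while cache_lookup and CACHE_MAX_AGE_USEC are hypothetical names, and the 100-second window mirrors the quota-based invalidation mentioned above.

#include <gio/gio.h>
#include <gdata/gdata.h>

#define CACHE_MAX_AGE_USEC (100 * G_USEC_PER_SEC) /* roughly the 100-second quota window */

typedef struct {
  GHashTable *entries;   /* gchar *child_title_or_id -> GDataEntry * */
  gint64      timestamp; /* time when this folder's listing was fetched */
} DirEntriesData;

/* Look up a child in the per-folder cache, treating stale folders as misses.
 * dir_entries maps gchar *parent_id -> DirEntriesData *. */
static GDataEntry *
cache_lookup (GHashTable  *dir_entries,
              const gchar *parent_id,
              const gchar *title_or_id)
{
  DirEntriesData *data;

  data = g_hash_table_lookup (dir_entries, parent_id);
  if (data == NULL)
    return NULL; /* folder never fetched */

  if (g_get_real_time () - data->timestamp > CACHE_MAX_AGE_USEC)
    return NULL; /* folder listing considered stale, refetch from Drive */

  return g_hash_table_lookup (data->entries, title_or_id);
}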
It is not based on the Changes API; anyway, I think we should build the cache incrementally, and we can use the Changes API in the future for proper invalidation. Initially I wanted to make just a simple improvement based on the etags provided by Google. Unfortunately, the etags change with each query regardless of actual changes, so I ended up with this bigger patch. Rishi, what do you think?
Ben, can you please test it in your environment?
(In reply to Ondrej Holy from comment #4)
> Created attachment 352157 [details] [review]
<snip>
> - Do not cache the whole Drive, but cache just the requested folders.
> - Do not use the cache for enumeration, but just for additional operations.
>   It is not common that applications call enumeration for the same folder
>   several times in a short time...

I think you can carry on caching for enumerations, if the caching is actually caching and not preloading, as the code used to do according to this changelog.
Created attachment 352422 [details] [review]
google: Improve performance on Drives with many files

Rebased to master.
(In reply to Bastien Nocera from comment #7)
> (In reply to Ondrej Holy from comment #4)
> > Created attachment 352157 [details] [review]
> <snip>
> > - Do not cache the whole Drive, but cache just the requested folders.
> > - Do not use the cache for enumeration, but just for additional operations.
> >   It is not common that applications call enumeration for the same folder
> >   several times in a short time...
>
> I think you can carry on caching for enumerations, if the caching is
> actually caching and not preloading, as the code used to do according to
> this changelog.

Hmm, actually why not, I will update the patch accordingly...
Created attachment 352498 [details] [review]
google: Improve performance for Drives with many files

The backend is totally unusable if you have too many files on your Drive. This happens because the backend preloads the whole Drive's metadata. Various operations require rereading the whole cache, because proper invalidation is currently not possible. Let's change the workflow in order to improve the performance. The summary of changes done by this patch:
- Do not preload the whole Drive, but cache just the requested folders. Invalidate by time separately for each folder, not for the whole cache.
- Remove the functions get_parent_id and get_entry_path, which are totally wrong because one GDataEntry may have multiple parents.
- Invalidate cache entries for all parents, not just for the concrete folder, because operations affect the files in all of their locations.
- Files without a parent folder are not listed anymore; they are not shown on the web anyway. They used to be shown as hidden in the root folder.

As a result, the backend works smoothly regardless of the total number of Drive files, because the total amount of transmitted data is significantly reduced. It is true that slightly more requests to Drive have to be made, but I think that the Drive quotas are big enough.
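A hypothetical sketch of the "invalidate for all parents" point above (not code from the patch): dir_entries is the per-folder cache table from the earlier sketch, and the caller is assumed to pass the IDs of all parents of the affected entry.

#include <glib.h>

/* Drop the cached listings of every folder that contains the entry, so the
 * next enumeration of any of its parents refetches fresh data from Drive. */
static void
invalidate_parents (GHashTable *dir_entries, GList *parent_ids)
{
  GList *l;

  for (l = parent_ids; l != NULL; l = l->next)
    {
      const gchar *parent_id = l->data;

      /* Removing the DirEntriesData frees its entries hash table via the
       * destroy notify installed when dir_entries was created. */
      g_hash_table_remove (dir_entries, parent_id);
    }
}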
Review of attachment 352498 [details] [review]:

Sorry for the delay! The performance problem needs to be tackled, so thanks for looking into this.

::: daemon/gvfsbackendgoogle.c
@@ +55,3 @@
   GDataDocumentsService *service;
   GDataEntry *root;
+  GHashTable *dir_entries; /* gchar *parent_id -> DirEntriesData */

This and ...

@@ +88,3 @@
+  GHashTable *entries; /* gchar *child_title_or_id -> GDataEntry */
+  gint64 timestamp;
+} DirEntriesData;

... this makes me think that the fundamental data structures have significantly changed. That worries me because they were chosen to make Drive's database-backed semantics work with GIO's POSIX-based abstraction.

I wish my discussions with Alex were not scattered across IRC and Bugzilla, but I will try to come up with an ultra-brief summary. This will give you a feel for the problem:
https://debarshiray.wordpress.com/2015/09/13/google-drive-and-gnome-what-is-a-volatile-path/

The backend generally tries its best to be robust against an arbitrary sequence of GIO operations. It is a square peg in a round hole, so there are some trade-offs, and some oddities of Drive are ignored to varying degrees.

First, we ignore that an entry can theoretically be part of multiple parents. The Drive API allows this, but, as far as I have seen, there is no way to achieve that using the web UI.

Second, the backend tries to degrade gracefully when there are two entries with the same title in the same location. It is also not possible to use GIO to re-title an entry if the name is already taken.

The backend has two main data structures: a hash table that maps IDs to entries, and another that maps (parent ID, ID or title) tuples to entries. So, for each entry you could have:
* entry-ID -> entry
* (parent-ID, entry-ID) -> entry
* (parent-ID, entry-title) -> entry

This design lets us resolve different kinds of paths leading to the same entry. For example, all these paths can be equivalent:
* /parent-id1/parent-id2/entry-id
* /parent-id1/parent-title2/entry-title
* /parent-title1/parent-id2/entry-id

This is important because of the problem with volatile paths described above.

I am not saying that the changes in this patch are completely wrong. I will have to scrutinise them in more detail to arrive at a conclusion.
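For context, the tuple-keyed mappings described in this review might look roughly like the following illustrative sketch (not the actual gvfsbackendgoogle.c code; a real implementation would likely use a dedicated key struct rather than concatenated strings):

#include <gdata/gdata.h>

/* Illustrative only: register one entry under the three keys the review
 * describes, so any of the equivalent paths can resolve to it.
 * entries:     gchar *entry-id                        -> GDataEntry *
 * dir_entries: gchar *"parent-id/entry-id-or-title"   -> GDataEntry * */
static void
insert_entry (GHashTable  *entries,
              GHashTable  *dir_entries,
              const gchar *parent_id,
              GDataEntry  *entry)
{
  const gchar *id = gdata_entry_get_id (entry);
  const gchar *title = gdata_entry_get_title (entry);

  g_hash_table_insert (entries, g_strdup (id), g_object_ref (entry));
  g_hash_table_insert (dir_entries,
                       g_strdup_printf ("%s/%s", parent_id, id),
                       g_object_ref (entry));
  g_hash_table_insert (dir_entries,
                       g_strdup_printf ("%s/%s", parent_id, title),
                       g_object_ref (entry));
}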
(In reply to Debarshi Ray from comment #11)
> Review of attachment 352498 [details] [review]:
>
> Sorry for the delay! The performance problem needs to be tackled, so thanks
> for looking into this.

Thanks for looking at this...

> ::: daemon/gvfsbackendgoogle.c
> @@ +55,3 @@
>    GDataDocumentsService *service;
>    GDataEntry *root;
> +  GHashTable *dir_entries; /* gchar *parent_id -> DirEntriesData */
>
> This and ...
>
> @@ +88,3 @@
> +  GHashTable *entries; /* gchar *child_title_or_id -> GDataEntry */
> +  gint64 timestamp;
> +} DirEntriesData;
>
> ... this makes me think that the fundamental data structures have
> significantly changed. That worries me because they were chosen to make
> Drive's database-backed semantics work with GIO's POSIX-based abstraction.

It is true that the fundamental data structures have changed. Trust me, that was not the initial intention, but I think that in the end the patch significantly simplifies the cache handling in various aspects, and I hope that it still fulfills all the needs...

> I wish my discussions with Alex were not scattered across IRC and Bugzilla,
> but I will try to come up with an ultra-brief summary.
>
> This will give you a feel for the problem:
> https://debarshiray.wordpress.com/2015/09/13/google-drive-and-gnome-what-is-a-volatile-path/
>
> The backend generally tries its best to be robust against an arbitrary
> sequence of GIO operations. It is a square peg in a round hole, so there are
> some trade-offs, and some oddities of Drive are ignored to varying degrees.
>
> First, we ignore that an entry can theoretically be part of multiple
> parents. The Drive API allows this, but, as far as I have seen, there is no
> way to achieve that using the web UI.

It is not just a theory. As far as I know, there are several ways it can be done over the web UI, e.g.:
1/ drag some entry + Ctrl and drop
2/ right click -> Move to... + Ctrl
It is quite hidden, but it is possible (and intuitive - at least for me). I think it used to be easier to create such entries earlier. It is also possible to create such entries through other apps... I think that we can't simply ignore this fact. It might lead to data loss, see Bug 783358 (I will fix that on top of these changes).

> Second, the backend tries to degrade gracefully when there are two entries
> with the same title in the same location. It is also not possible to use GIO
> to re-title an entry if the name is already taken.
>
> The backend has two main data structures: a hash table that maps IDs to
> entries, and another that maps (parent ID, ID or title) tuples to entries.
> So, for each entry you could have:
> * entry-ID -> entry
> * (parent-ID, entry-ID) -> entry
> * (parent-ID, entry-title) -> entry

This patch introduces a "hierarchical" cache instead:
parent_id -> child_title_or_id -> entry
So you need two lookups in most cases instead of just one, but I don't think this is a problem with hash tables... On the other hand, we can efficiently find all children of a given directory, which is the main reason for introducing the new structures... and it is needed if we want per-directory caching.

> This design lets us resolve different kinds of paths leading to the same
> entry. For example, all these paths can be equivalent:
> * /parent-id1/parent-id2/entry-id
> * /parent-id1/parent-title2/entry-title
> * /parent-title1/parent-id2/entry-id
>
> This is important because of the problem with volatile paths described above.

I see, but all of these cases should still work nicely with the new data structures as far as I can tell.
> I am not saying that the changes in this patch are completely wrong. I will
> have to scrutinise them in more detail to arrive at a conclusion.

Please take a deeper look... (I know that it would be better to split the cache-related changes and the multiple-parents fixes into two patches, but it would take a lot of time and I don't have the capacity for it right now...)
*** Bug 757219 has been marked as a duplicate of this bug. ***
*** Bug 774591 has been marked as a duplicate of this bug. ***
*** Bug 765608 has been marked as a duplicate of this bug. ***
*** Bug 790155 has been marked as a duplicate of this bug. ***
Thank you, the patch works like a charm.

dnf copr enable damianatorrpm/gvfsd_drive_patch
Why isn't this merged? The patch may not be perfect, but without it GVFS-GOOGLE is completely useless in some cases. I have been using it with the patch productively for months.
Because it changes fundamental parts of the backend and needs to be properly tested and reviewed before being pushed. Unfortunately, GVfs lacks manpower, and the Google backend is kind of a non-standard filesystem. I am not 100% sure that this change won't cause some issues elsewhere. Thanks for testing, this is valuable feedback for us (but be aware that this is just a proposal which has not yet been fully verified). Rishi, is there any chance that you will find time for a proper review?
I've just merged the reworked proposal: https://gitlab.gnome.org/GNOME/gvfs/merge_requests/9