GNOME Bugzilla – Bug 771390
Do not read the whole Google Drive on enumerate.
Last modified: 2018-09-21 12:13:25 UTC
On larger Google Drives this makes the backend unusable. Given the following folder structure:

/
|-> foo
    |-> a
    |-> b
|-> bar
    |-> 1
    |-> 2
    |-> 3
    ...
    |-> 4999
    |-> 5000

`ls /` and even `ls /foo` currently take more than 4 minutes in our setup. Speeding up `ls /bar` might also be possible but is far less important.
Thanks for your bug report. I have around 500 files and it takes about 20 seconds, which is somewhat acceptable, but 4 minutes is really not acceptable... I remember that I proposed an idea to not load all files every time while the backend was still in development, but there were some problems that made it impossible... I think that libgdata had some problems with filtering. Rishi, don't you recall why this behavior was chosen and whether it is possible to change it?
First, Google Drive has the Changes API [1], which lets us query only the delta since the last 'change ID' seen by us. Unless the Drive is changing rapidly, which is unusual, I think this should speed up our attempts to update the cache. Right now we re-download the metadata for the entire Drive. Unfortunately, this isn't implemented in libgdata [2], so we need to do that first.

Second, maybe we can do something to speed up the initial metadata fetching. I can think of some options, but I haven't tried any of them. We could probably have an offline cache of the Drive. Everything is expressed in JSON, so maybe we can just save it to disk and re-use the metadata from the last mount until we have completed the network I/O. Or maybe we can just query whatever is needed for the directory in question, and then query the rest in the background. I kind of like having the entire Drive's metadata cached, since subsequent operations are less likely to suffer a cache miss.

The reasons we didn't do this initially are that (i) libgdata didn't offer the Changes API, which is a relatively easy optimization, and (ii) things kind of worked OK for the Drives that we tested it against.

[1] https://developers.google.com/drive/v2/reference/changes
[2] https://bugzilla.gnome.org/show_bug.cgi?id=654652
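For illustration only (this is not something libgdata provides, per [2]), a minimal C sketch of the delta-query idea using libsoup, assuming an OAuth access token is already at hand; fetch_changes_since, access_token and last_change_id are hypothetical names, and error handling is omitted:

#include <libsoup/soup.h>

/* Hypothetical sketch: ask Drive only for changes newer than the last
 * change ID we have seen, instead of re-downloading all metadata. */
static gchar *
fetch_changes_since (const gchar *access_token, gint64 last_change_id)
{
  SoupSession *session;
  SoupMessage *msg;
  gchar *uri;
  gchar *auth;
  gchar *json = NULL;

  uri = g_strdup_printf ("https://www.googleapis.com/drive/v2/changes"
                         "?startChangeId=%" G_GINT64_FORMAT,
                         last_change_id + 1);

  session = soup_session_new ();
  msg = soup_message_new ("GET", uri);

  /* Drive expects the OAuth token in the Authorization header. */
  auth = g_strdup_printf ("Bearer %s", access_token);
  soup_message_headers_append (msg->request_headers, "Authorization", auth);
  g_free (auth);

  if (soup_session_send_message (session, msg) == SOUP_STATUS_OK)
    json = g_strndup (msg->response_body->data, msg->response_body->length);

  g_object_unref (msg);
  g_object_unref (session);
  g_free (uri);

  return json; /* JSON list of changes, to be merged into the local cache */
}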
Thank you for the detailed information. Querying only what is needed for the directory in question seems to be the easiest fix for this problem. Building a cache which can be updated using the Changes API and even saved to disk could be added back later. Unfortunately I cannot write any C code, but I am happy to test any patches someone might come up with.
Created attachment 352157 [details] [review]
google: Improve performance on Drives with many files

The backend is totally unusable if you have too many files on your Drive. This happens because the backend caches the whole Drive. Various operations require rereading the cache, because proper invalidation is not possible. Let's change the workflow in order to improve the performance. The summary of changes done by this patch:
- Do not cache the whole Drive, but cache just the requested folders.
- Do not use the cache for enumeration, but just for additional operations. It is not common that applications call enumeration for the same folder several times in a short time...
- Use cache entries only if they are not older than 100 seconds, which reflects the Google quotas, which are specified per 100 seconds.
- Remove the functions get_parent_id and get_entry_path, which are wrong because one GDataEntry may have multiple parents.

As a result, the backend works smoothly regardless of the total number of Drive files, because the total amount of transmitted data is significantly reduced. It is true that more requests to Drive are made, but I think that the Drive quotas are big enough.
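As a rough illustration of the per-folder cache described in this summary (a sketch only, not the actual patch): dir_entries, entries and DirEntriesData follow the structures quoted in the review further below, while cache_lookup and CACHE_MAX_AGE_USEC are hypothetical names, and the 100-second window mirrors the quota-based invalidation mentioned above.

#include <gio/gio.h>
#include <gdata/gdata.h>

#define CACHE_MAX_AGE_USEC (100 * G_USEC_PER_SEC) /* roughly the 100-second quota window */

typedef struct {
  GHashTable *entries;   /* gchar *child_title_or_id -> GDataEntry * */
  gint64      timestamp; /* time when this folder's listing was fetched */
} DirEntriesData;

/* Look up a child in the per-folder cache, treating stale folders as misses.
 * dir_entries maps gchar *parent_id -> DirEntriesData *. */
static GDataEntry *
cache_lookup (GHashTable  *dir_entries,
              const gchar *parent_id,
              const gchar *title_or_id)
{
  DirEntriesData *data;

  data = g_hash_table_lookup (dir_entries, parent_id);
  if (data == NULL)
    return NULL; /* folder never fetched */

  if (g_get_real_time () - data->timestamp > CACHE_MAX_AGE_USEC)
    return NULL; /* folder listing considered stale, refetch from Drive */

  return g_hash_table_lookup (data->entries, title_or_id);
}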
It is not based on the Changes API; anyway, I think we should build the cache incrementally, and we can use the Changes API in the future for proper invalidation. Initially I wanted to make just a simple improvement based on the etags provided by Google. Unfortunately, the etags change with each query regardless of actual changes, so I ended up with this bigger patch. Rishi, what do you think?
Ben, can you please test it in your environment?
(In reply to Ondrej Holy from comment #4)
> Created attachment 352157 [details] [review]
<snip>
> - Do not cache the whole Drive, but cache just the requested folders.
> - Do not use the cache for enumeration, but just for additional operations.
>   It is not common that applications call enumeration for the same folder
>   several times in a short time...

I think you can carry on caching for enumerations, if the caching is actually caching and not preloading, as the code used to do according to this changelog.
Created attachment 352422 [details] [review]
google: Improve performance on Drives with many files

Rebased to master.
(In reply to Bastien Nocera from comment #7)
> (In reply to Ondrej Holy from comment #4)
> > Created attachment 352157 [details] [review]
> <snip>
> > - Do not cache the whole Drive, but cache just the requested folders.
> > - Do not use the cache for enumeration, but just for additional operations.
> >   It is not common that applications call enumeration for the same folder
> >   several times in a short time...
>
> I think you can carry on caching for enumerations, if the caching is
> actually caching and not preloading, as the code used to do according to
> this changelog.

Hmm, actually why not, I will update the patch accordingly...
Created attachment 352498 [details] [review]
google: Improve performance for Drives with many files

The backend is totally unusable if you have too many files on your Drive. This happens because the backend preloads the whole Drive's metadata. Various operations require rereading the whole cache, because proper invalidation is currently not possible. Let's change the workflow in order to improve the performance. The summary of changes done by this patch:
- Do not preload the whole Drive, but cache just the requested folders. Invalidate by time separately for each folder, not for the whole cache.
- Remove the functions get_parent_id and get_entry_path, which are totally wrong because one GDataEntry may have multiple parents.
- Invalidate cache entries for all parents, not just for the concrete folder, because operations affect the files in all of their locations.
- Files without a parent folder are not listed anymore; they are not shown on the web anyway. They used to be shown as hidden in the root folder.

As a result, the backend works smoothly regardless of the total number of Drive files, because the total amount of transmitted data is significantly reduced. It is true that slightly more requests to Drive have to be made, but I think that the Drive quotas are big enough.
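A hypothetical sketch of the "invalidate for all parents" point above (not code from the patch): dir_entries is the per-folder cache table from the earlier sketch, and the caller is assumed to pass the IDs of all parents of the affected entry.

#include <glib.h>

/* Drop the cached listings of every folder that contains the entry, so the
 * next enumeration of any of its parents refetches fresh data from Drive. */
static void
invalidate_parents (GHashTable *dir_entries, GList *parent_ids)
{
  GList *l;

  for (l = parent_ids; l != NULL; l = l->next)
    {
      const gchar *parent_id = l->data;

      /* Removing the DirEntriesData frees its entries hash table via the
       * destroy notify installed when dir_entries was created. */
      g_hash_table_remove (dir_entries, parent_id);
    }
}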
Review of attachment 352498 [details] [review]:

Sorry for the delay! The performance problem needs to be tackled, so thanks for looking into this.

::: daemon/gvfsbackendgoogle.c
@@ +55,3 @@
   GDataDocumentsService *service;
   GDataEntry *root;
+  GHashTable *dir_entries; /* gchar *parent_id -> DirEntriesData */

This and ...

@@ +88,3 @@
+  GHashTable *entries; /* gchar *child_title_or_id -> GDataEntry */
+  gint64 timestamp;
+} DirEntriesData;

... this makes me think that the fundamental data structures have significantly changed. That worries me because they were chosen to make Drive's database-backed semantics work with GIO's POSIX-based abstraction.

I wish my discussions with Alex were not scattered across IRC and Bugzilla, but I will try to come up with an ultra-brief summary. This will give you a feel for the problem:
https://debarshiray.wordpress.com/2015/09/13/google-drive-and-gnome-what-is-a-volatile-path/

The backend generally tries its best to be robust against an arbitrary sequence of GIO operations. It is a square peg in a round hole, so there are some trade-offs, and some oddities of Drive are ignored to varying degrees.

First, we ignore that an entry can theoretically be part of multiple parents. The Drive API allows this, but, as far as I have seen, there is no way to achieve that using the web UI.

Second, the backend tries to degrade gracefully when there are two entries with the same title in the same location. It is also not possible to use GIO to re-title an entry if the name is already taken.

The backend has two main data structures: a hash table that maps IDs to entries, and another that maps (parent ID, ID or title) tuples to entries. So, for each entry you could have:
* entry-ID -> entry
* (parent-ID, entry-ID) -> entry
* (parent-ID, entry-title) -> entry

This design lets us resolve different kinds of paths leading to the same entry. For example, all these paths can be equivalent:
* /parent-id1/parent-id2/entry-id
* /parent-id1/parent-title2/entry-title
* /parent-title1/parent-id2/entry-id

This is important because of the problem with volatile paths described above.

I am not saying that the changes in this patch are completely wrong. I will have to scrutinise them in more detail to arrive at a conclusion.
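For context, the tuple-keyed mappings described in this review might look roughly like the following illustrative sketch (not the actual gvfsbackendgoogle.c code; a real implementation would likely use a dedicated key struct rather than concatenated strings):

#include <gdata/gdata.h>

/* Illustrative only: register one entry under the three keys the review
 * describes, so any of the equivalent paths can resolve to it.
 * entries:     gchar *entry-id                        -> GDataEntry *
 * dir_entries: gchar *"parent-id/entry-id-or-title"   -> GDataEntry * */
static void
insert_entry (GHashTable  *entries,
              GHashTable  *dir_entries,
              const gchar *parent_id,
              GDataEntry  *entry)
{
  const gchar *id = gdata_entry_get_id (entry);
  const gchar *title = gdata_entry_get_title (entry);

  g_hash_table_insert (entries, g_strdup (id), g_object_ref (entry));
  g_hash_table_insert (dir_entries,
                       g_strdup_printf ("%s/%s", parent_id, id),
                       g_object_ref (entry));
  g_hash_table_insert (dir_entries,
                       g_strdup_printf ("%s/%s", parent_id, title),
                       g_object_ref (entry));
}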
(In reply to Debarshi Ray from comment #11)
> Review of attachment 352498 [details] [review]:
>
> Sorry for the delay! The performance problem needs to be tackled, so thanks
> for looking into this.

Thanks for looking at this...

> ::: daemon/gvfsbackendgoogle.c
> @@ +55,3 @@
>    GDataDocumentsService *service;
>    GDataEntry *root;
> +  GHashTable *dir_entries; /* gchar *parent_id -> DirEntriesData */
>
> This and ...
>
> @@ +88,3 @@
> +  GHashTable *entries; /* gchar *child_title_or_id -> GDataEntry */
> +  gint64 timestamp;
> +} DirEntriesData;
>
> ... this makes me think that the fundamental data structures have
> significantly changed. That worries me because they were chosen to make
> Drive's database-backed semantics work with GIO's POSIX-based abstraction.

It is true that the fundamental data structures have changed. Trust me, that was not the initial intention, but I think that in the end the patch significantly simplifies the cache handling in various aspects, and I hope that it still fulfills all the needs...

> I wish my discussions with Alex were not scattered across IRC and Bugzilla,
> but I will try to come up with an ultra-brief summary.
>
> This will give you a feel for the problem:
> https://debarshiray.wordpress.com/2015/09/13/google-drive-and-gnome-what-is-a-volatile-path/
>
> The backend generally tries its best to be robust against an arbitrary
> sequence of GIO operations. It is a square peg in a round hole, so there are
> some trade-offs, and some oddities of Drive are ignored to varying degrees.
>
> First, we ignore that an entry can theoretically be part of multiple
> parents. The Drive API allows this, but, as far as I have seen, there is no
> way to achieve that using the web UI.

It is not just a theory. As far as I know, there are several ways it can be done over the web UI, e.g.:
1/ drag some entry + Ctrl and drop
2/ right click -> Move to... + Ctrl
It is quite hidden, but it is possible (and intuitive - at least for me). I think it used to be easier to create such entries earlier. It is also possible to create such entries through other apps... I think that we can't simply ignore this fact. It might lead to data loss, see Bug 783358 (I will fix that on top of these changes).

> Second, the backend tries to degrade gracefully when there are two entries
> with the same title in the same location. It is also not possible to use GIO
> to re-title an entry if the name is already taken.
>
> The backend has two main data structures: a hash table that maps IDs to
> entries, and another that maps (parent ID, ID or title) tuples to entries.
> So, for each entry you could have:
> * entry-ID -> entry
> * (parent-ID, entry-ID) -> entry
> * (parent-ID, entry-title) -> entry

This patch introduces a "hierarchical" cache instead:
parent_id -> child_title_or_id -> entry
So you need two lookups in most cases instead of just one, but I don't think this is a problem with hash tables... On the other hand, we can efficiently find all children of a given directory, which is the main reason for introducing the new structures... and it is needed if we want per-directory caching.

> This design lets us resolve different kinds of paths leading to the same
> entry. For example, all these paths can be equivalent:
> * /parent-id1/parent-id2/entry-id
> * /parent-id1/parent-title2/entry-title
> * /parent-title1/parent-id2/entry-id
>
> This is important because of the problem with volatile paths described above.

I see, but all of these cases should still work nicely with the new data structures as far as I can tell.
> I am not saying that the changes in this patch are completely wrong. I will
> have to scrutinise them in more detail to arrive at a conclusion.

Please take a deeper look... (I know that it would be better to split the cache-related changes and the multiple-parents fixes into two patches, but it would take a lot of time and I don't have the capacity for it right now...)
*** Bug 757219 has been marked as a duplicate of this bug. ***
*** Bug 774591 has been marked as a duplicate of this bug. ***
*** Bug 765608 has been marked as a duplicate of this bug. ***
*** Bug 790155 has been marked as a duplicate of this bug. ***
Thank you, the patch works like a charm.

dnf copr enable damianatorrpm/gvfsd_drive_patch
Why isn't this merged? The patch may not be perfect, but without it GVFS-GOOGLE is completely useless in some cases. I have been using it with the patch productively for months.
Because it changes fundamental parts of the backend and needs to be properly tested and reviewed before being pushed. Unfortunately, GVfs lacks manpower, and the Google backend is kind of a non-standard filesystem. I am not 100% sure that this change won't cause some issues elsewhere. Thanks for testing, this is valuable feedback for us (but be aware that this is just a proposal which has not yet been fully verified). Rishi, is there any chance that you will find time for a proper review?
I've just merged the reworked proposal: https://gitlab.gnome.org/GNOME/gvfs/merge_requests/9