GNOME Bugzilla – Bug 728878
Add a PicasaWeb miner
Last modified: 2014-07-23 23:07:37 UTC
I'm working on the PicasaWeb support for GNOME Photos and will create a PicasaWeb miner for the same. This bug report will keep track of the changes.
Created attachment 278122 [details] [review] Add argument to RefreshDB This is how I thought of handling requests from a client for indexing different types of data. Please let me know what you think of this implementation.
Review of attachment 278122 [details] [review]: Now that our API is no longer a simple 'void RefreshDB (GError **error)', I am wondering if we should ship a gdbus-codegen-ed client library. That way we can have proper sonames and applications can better express their dependency on gnome-online-miners. Just a passing thought. ::: src/gom-miner-main.c @@ +131,1 @@ g_dbus_method_invocation_return_value (invocation, NULL); Even though this will work, it is wrong. The invocation is tied to the RPC made by the client. So you should not return anything for the invocation until it is complete. In this case, you should only call g_dbus_method_invocation_return_value after all the elements in the array have been handled. Having said that, I think we should take a slightly different approach. See below. @@ +156,3 @@ + index_type = NULL; + --current_size; + } Instead of looping over the elements of the array and repeatedly calling gom_miner_refresh_db_async, I think we should extract the array from the parameters GVariant pass it to the miner. That way if an application has sent ['documents', 'photos'] to the ownCloud miner, it can cover both types in a single pass. Otherwise it will have to go through the remote storage twice.
By the way, why did you not use 'git format-patch' or 'git bz'? :)
(In reply to comment #3) > By the way, why did you not use 'git format-patch' or 'git bz'? :) Sorry about that. I had a lot of other changes in the commit and somehow didn't realise then that I could create a patch from selected files after a commit.
(In reply to comment #2) Thanks for the review, Debarshi. A few things that came to my mind after our discussion on IRC yesterday: *After thinking of what we discussed yesterday in terms of code, I realised GData is the only miner which has different services for Photos and Documents. All the other miners have the same service for both Photos and Documents. So, creating a hash table for services in the create_service () method instead of the current implementation doesn't have benefits for any of the other miners. *All other miners do their tasks in a single pass e.g. if we want to index photos or documents or both for ownCloud, it will only take one pass irrespective of what we pass as index types, since there is no way to process only one type currently. Same is the case for other miners. > Instead of looping over the elements of the array and repeatedly calling > gom_miner_refresh_db_async, I think we should extract the array from the > parameters GVariant pass it to the miner. That way if an application has sent > ['documents', 'photos'] to the ownCloud miner, it can cover both types in a > single pass. Otherwise it will have to go through the remote storage twice. True. But maybe we can determine the provider type and do the double pass (if both 'photos' and 'documents' are passed as arguments) only for the case of GData. In all other cases, as discussed in the above point, even if a single argument is passed (like only 'photos' or only 'documents'), the miner will index everything since there is no mechanism currently to selectively index only photos or only documents. Please feel free to correct me if I went wrong anywhere. I thought it would be good to discuss these before we move towards what we discussed yesterday.
(In reply to comment #5) > But maybe we can determine the provider type and do the double pass (if > both 'photos' and 'documents' are passed as arguments) only for the case of > GData. In all other cases, as discussed in the above point, even if a single > argument is passed (like only 'photos' or only 'documents'), the miner will > index everything since there is no mechanism currently to selectively index > only photos or only documents. I prefer to keep all the provider specific details in each miner and not pollute the base classes with any provider specific code: - It makes it easier to add a new miner because you don't have to go and edit files all over the place. Instead you hook into a well-defined interface or base class. - It keeps the all the providers nicely isolated from each other. Editing the base class to achieve something in one miner can have an adverse effect on another one. In other words, I want to keep the common code as generic as possible and put the provider specific stuff into the child classes. After all that is what base classes, virtual methods and child classes are for.
Created attachment 278578 [details] [review] Add argument to RefreshDB Thanks for the explanation, Debarshi. Please have a look at this implementation which we discussed earlier.
Review of attachment 278578 [details] [review]: Thanks for the updated patch, Saurav. Some comments below. How do you handle the case where a request is already mid-flight and another one comes in with a different service type? ::: src/gom-facebook-miner.c @@ +330,3 @@ GError *local_error = NULL; + me = gfbgraph_user_get_me (GFBGRAPH_AUTHORIZER (g_hash_table_lookup (job->services, "photos")), &local_error); Probably nicer to save a pointer to this at the beginning of the function (also in the other subclasses) @@ +378,3 @@ authorizer = gfbgraph_goa_authorizer_new (object); + for (l = types; l != NULL; l = l->next) Since you duplicate this code everywhere, I think it would be useful to have a gom_miner_supports_type kind of function in the base class. ::: src/gom-gdata-miner.c @@ +385,3 @@ + for (l = types; l != NULL; l = l->next) + { + if (g_strcmp0 (l->data, "photos") == 0 && goa_object_peek_photos (object) != NULL) Don't you already do the "peek" checks in the base class? @@ +389,3 @@ + service = GDATA_SERVICE (gdata_picasaweb_service_new (GDATA_AUTHORIZER (authorizer))); + } + if (g_strcmp0 (l->data, "documents") == 0 && goa_object_peek_documents (object) != NULL) Should be else if ::: src/gom-miner-main.c @@ +164,3 @@ + while (g_variant_iter_loop (iter_index_types, "&s", &type)) + { + index_types = g_list_append (index_types, type); Memory management here is wrong; "type" will is free-d after the loop iteration, so you should g_strdup() it, or use g_variant_iter_next(). @@ +167,3 @@ + } + + gom_miner_set_index_types (miner, index_types); Should also free the list here; gom_miner_set_index_types() should make a copy. ::: src/gom-miner.c @@ +85,2 @@ g_free (self->priv->display_name); + g_list_free (self->priv->index_types); Should own the strings too - see other comment. So this would become g_list_free_full(..., g_free) @@ +651,2 @@ GomMinerClass *miner_class = GOM_MINER_GET_CLASS (self); + gboolean skipPhotos, skipDocuments; Coding style: no camel case @@ +690,3 @@ + } + + if (g_strcmp0 (type->data, "documents") == 0) Why do you need these variables instead of the former == NULL check? @@ +768,3 @@ +gom_miner_set_index_types (GomMiner *self, GList *index_types) +{ + self->priv->index_types = index_types; Should copy the list and all of its contents.
Review of attachment 278578 [details] [review]: Cosimo already covered most of what I was about to say. ::: src/gom-gdata-miner.c @@ +347,3 @@ +{ + /* TODO */ +} Can you please move this to a separate patch? Then this patch can be just about changing the DBus signature and associated adjustments. @@ +362,3 @@ + { + query_gdata_photos (job, error); + } Ditto.
(In reply to comment #8) Thanks for the review, Cosimo. > How do you handle the case where a request is already mid-flight and another > one comes in with a different service type? That was on my TODO list, then. I will put in a queuing system in the next patch for review. > ::: src/gom-gdata-miner.c > @@ +385,3 @@ > + for (l = types; l != NULL; l = l->next) > + { > + if (g_strcmp0 (l->data, "photos") == 0 && goa_object_peek_photos > (object) != NULL) > > Don't you already do the "peek" checks in the base class? Yes. But that is to determine whether we want to skip an account or not. Let's say we have ['documents', 'photos'] passed as argument to RefreshDB. And Photos switch is turned off and Documents' is turned on in GOA. The check in base class will allow that account. However, we should only create services for items which have their switch in GOA turned on. Hence this check. > > @@ +690,3 @@ > + } > + > + if (g_strcmp0 (type->data, "documents") == 0) > > Why do you need these variables instead of the former == NULL check? In our discussion on IRC, Debarshi rightly pointed out that the earlier check wouldn't work because now we have more than one possible types for a miner. To reiterate the example which he gave me, if we think of "google", documents [off], photos [on] and the app is gnome-documents (i.e. we pass ['documents'] as the argument), then the earlier code wouldn't skip it when it should. I will take care of the rest of the issues in the next patch.
Created attachment 278837 [details] [review] Add argument to RefreshDB I have changed the patch as per the review and also moved out the relevant code for Photos from gom-gdata-miner.c as Debarshi suggested. I have also put in a system to queue the requests. Currently, it queues all the requests irrespective of the index types. I did this just to show my implementation. We can improve upon this by only queuing those requests which are not a subset of the current request being refreshed.
Created attachment 279114 [details] [review] Add a Picasa Web miner Apart from a few EXIF details, I have tried to cover everything.
Review of attachment 278837 [details] [review]: Much better, Saurav. ::: src/gom-facebook-miner.c @@ +333,3 @@ GError *local_error = NULL; + authorizer = GFBGRAPH_AUTHORIZER (g_hash_table_lookup (job->services, "photos")); We should guard against the authorizer == NULL. Otherwise if you pass some garbage string in index_types, you will get a CRITICAL. @@ +377,1 @@ + table_services = g_hash_table_new (g_str_hash, g_str_equal); You need g_hash_table_new_full here. Otherwise, you are leaking the authorizer because nobody knows how to destroy it. ::: src/gom-flickr-miner.c @@ +473,3 @@ + GHashTable *table_services; + + table_services = g_hash_table_new (g_str_hash, g_str_equal); Ditto. ::: src/gom-gdata-miner.c @@ +316,3 @@ + GDataDocumentsService *service; + + service = GDATA_DOCUMENTS_SERVICE (g_hash_table_lookup (job->services, "documents")); We should guard against the service == NULL. Otherwise if you pass some garbage string in index_types, you will get a CRITICAL. @@ +352,1 @@ + table_services = g_hash_table_new (g_str_hash, g_str_equal); Ditto. ::: src/gom-owncloud-miner.c @@ +351,3 @@ data.job = job; volumes = g_volume_monitor_get_volumes (priv->monitor); + object = g_hash_table_lookup (job->services, "documents"); Ditto. @@ +414,3 @@ + GHashTable *table_services; + + table_services = g_hash_table_new (g_str_hash, g_str_equal); Ditto. ::: src/gom-zpj-miner.c @@ +216,1 @@ + zpjskydrive = ZPJ_SKYDRIVE (g_hash_table_lookup (job->services, "documents")); Ditto. @@ +270,3 @@ + GHashTable *table_services; + + table_services = g_hash_table_new (g_str_hash, g_str_equal); Ditto.
Created attachment 279135 [details] [review] RefreshDB API: Add argument to RefreshDB Rebased against master for bug 731976
Created attachment 279201 [details] [review] RefreshDB API: Add argument to RefreshDB
Created attachment 279202 [details] [review] picasaweb: Add a Picasa Web miner Updated after modifying the RefreshDB API patch.
(In reply to comment #14) > Created an attachment (id=279135) [details] [review] > RefreshDB API: Add argument to RefreshDB > > Rebased against master for bug 731976 Thanks a lot, Debarshi. I have updated this to include your suggestions from the review.
Review of attachment 279201 [details] [review]: ::: src/gom-facebook-miner.c @@ +336,3 @@ + + if (authorizer == NULL) + return; Instead of just returning, I think it would be better to set the error. You have already done that in gom-flickr-miner.c.
Created attachment 279213 [details] [review] src: Pass the content type to RefreshDB Fixed and committed. Could not spot any breakage after a few rounds of testing.
Created attachment 279886 [details] [review] picasaweb: Add a Picasa Web miner I have made a few improvements to the earlier patch. * Included the equipment make and model, width and height EXIF data. * I realised that I had earlier put the photo creator's name same as the album owner's in the contact resource. The right way to do it would be to use the photo's credit (which returns a name) as the photo creator's name.
Created attachment 280373 [details] [review] src: Add a Picasa Web miner based on GData
Review of attachment 280373 [details] [review]: Thanks for the fantastic patch, Saurav. A few minor comments. ::: configure.ac @@ +20,3 @@ AC_HEADER_STDC +GDATA_MIN_VERSION=0.15.2 We should try to get a libgdata release with the patches that we need. ::: src/gom-gdata-miner.c @@ +351,3 @@ + photo_summary = gdata_entry_get_summary ((GDATA_ENTRY (photo))); + photo_timestamp = gdata_picasaweb_file_get_timestamp (photo); + photo_edited = gdata_picasaweb_file_get_edited (photo); Could you please move these where they are actually used like in the Google Drive code? @@ +368,3 @@ + photo_width = g_strdup_printf ("%d", gdata_picasaweb_file_get_width (photo)); + photo_height = g_strdup_printf ("%d", gdata_picasaweb_file_get_height (photo)); + photo_credit = gdata_picasaweb_file_get_credit (photo); And these too. @@ +370,3 @@ + photo_credit = gdata_picasaweb_file_get_credit (photo); + + identifier = g_strdup_printf ("picasaweb:%s", photo_id); What about making it 'google:picasaweb:%s' ? eg., for SkyDrive we use 'windows-live:skydrive:%s'. @@ +634,3 @@ + album_summary = gdata_entry_get_summary ((GDATA_ENTRY (album))); + link = gdata_entry_look_up_link (GDATA_ENTRY (album), GDATA_LINK_ALTERNATE); + album_uri = gdata_link_get_uri (link); Could you please move these where they are actually used like in the Google Drive code? That way it is easier to see how they are used in the database. Otherwise one has to scroll up and down the huge function. Lazy allocation is also good because if you are short circuited out of the function then some of these statements might not need to be executed. eg., in the case of errors or if the album has not changed. @@ +636,3 @@ + album_uri = gdata_link_get_uri (link); + album_timestamp = gdata_picasaweb_album_get_timestamp (album); + album_edited = gdata_picasaweb_album_get_edited (album); Is gdata_entry_get_updated not applicable here? Does _get_edited tell you when the album's meta-data is changed, and does _get_updated tell you when the contents of the album has changed? @@ +637,3 @@ + album_timestamp = gdata_picasaweb_album_get_timestamp (album); + album_edited = gdata_picasaweb_album_get_edited (album); + album_user = gdata_picasaweb_album_get_nickname (album); Does it fallback to the username if nickname is not set? Can't we use gdata_entry_get_authors? @@ +642,3 @@ + album_timestamp_iso8601 = gom_iso8601_from_timestamp (album_timestamp / 1000); + + identifier = g_strdup_printf ("photos:collection:picasaweb:%s", album_id); What about 'photos:collection:google:picasaweb:%s'? @@ +743,3 @@ + album_photos: + query = gdata_picasaweb_query_new (NULL); + gdata_picasaweb_query_set_image_size (query, "d"); Umm... is the link to the online PicasaWeb documentation given in libgdata's API documentation outdated? @@ +785,1 @@ GError **error) The next line needs to be aligned. @@ +869,3 @@ + GList *keys, *l; + + keys = g_hash_table_get_keys (job->services); keys is getting leaked. @@ +880,3 @@ + { + query_gdata_documents (job, error); + } Or, instead of the loop we could do g_hash_table_lookup, and even pass the service object to the query functions. All of them are having to do the lookup, anyway.
(In reply to comment #22) Thanks for the review, Debarshi. > We should try to get a libgdata release with the patches that we need. I'll talk to Philip Withnall regarding this. > What about making it 'google:picasaweb:%s' ? eg., for SkyDrive we use > 'windows-live:skydrive:%s'. Changed to 'google' instead of 'picasaweb' > Is gdata_entry_get_updated not applicable here? Does _get_edited tell you when > the album's meta-data is changed, and does _get_updated tell you when the > contents of the album has changed? Both of them work in the same way. Changed to gdata_entry_get_updated for consistency with the Drive code. > Does it fallback to the username if nickname is not set? Can't we use > gdata_entry_get_authors? Changed to gdata_entry_get_authors for consistency with the Drive code. > What about 'photos:collection:google:picasaweb:%s'? Changed to 'google' instead of 'picasaweb' I have taken care of other changes, too, in the new patch.
Created attachment 281468 [details] [review] src: Add a Picasa Web miner based on GData
Review of attachment 281468 [details] [review]: ::: src/gom-gdata-miner.c @@ +318,3 @@ + const gchar *photo_uri; + const gchar *photo_title; + const gchar *photo_summary; It is quite obvious that we are dealing with photos, so the prefix isn't necessary. It just makes the names longer. @@ +345,3 @@ + + photo_id = gdata_entry_get_id (GDATA_ENTRY (photo)); + identifier = g_strdup_printf ("google:%s", photo_id); 'google:picasaweb:%s' Since the ID is a URL starting with docs.google.com or picasaweb.google.com, we are safe even without any prefix, which is what the Drive code does. However, we don't control the value of ID. Having a prefix guards us from any trouble if Google chooses to change the format of its IDs. @@ +603,3 @@ + GError **error) +{ + GDataFeed *feed; We must initialize 'feed' to NULL, otherwise it will have garbage when exiting through the error path. @@ +611,3 @@ + const gchar *album_summary; + const gchar *album_uri; + const gchar *album_user; Apart from 'album_id' there is no need for the 'album_' prefix because we are in account_miner_job_process_album and it is obvious that we are dealing with albums. @@ +617,3 @@ + + gchar *identifier; + const gchar *class = "nfo:DataContainer"; No need for the variable. Use the string literal directly. @@ +623,3 @@ + + album_id = gdata_entry_get_id (GDATA_ENTRY (album)); + identifier = g_strdup_printf ("photos:collection:google:%s", album_id); 'photos:collection:google:picasaweb:%s' @@ +731,3 @@ + job->cancellable, error, + job->datasource_urn, resource, + "nie:contentCreated", album_timestamp_iso8601); Free album_timestamp_iso8601 here. @@ +737,3 @@ + + /* Album photos */ + album_photos: The indentation for the goto label should be one, not two, spaces. Also, it is not very useful to have a comment which has exactly the same words as the label. :) @@ +748,3 @@ + goto out; + + photos = gdata_feed_get_entries (GDATA_FEED (feed)); 'feed' is already a GDataFeed, so no need to typecast it. @@ +752,3 @@ + for (l = photos; l != NULL; l = l->next) + { + account_miner_job_process_photo (job, GDATA_PICASAWEB_FILE (l->data), resource, error); Since we are typecasting it more than once, how about doing it once and caching it? I know that the Drive code does this, but we can fix it up in a separate patch. @@ +767,3 @@ + g_free (resource); + g_free (identifier); + g_free (album_timestamp_iso8601); You can't free album_timestamp_iso8601 here because it can have garbage when exiting through the error path. @@ +768,3 @@ + g_free (identifier); + g_free (album_timestamp_iso8601); + g_object_unref (feed); Use g_clear_object to drop the reference on 'feed'. @@ +847,3 @@ + for (l = albums; l != NULL; l = l->next) + { + account_miner_job_process_album (job, service, GDATA_PICASAWEB_ALBUM (l->data), error); Ditto, about caching the typecast. @@ +868,3 @@ + gpointer service; + + if (g_hash_table_lookup_extended (job->services, "photos", NULL, &service)) Just use g_hash_table_lookup. We need not worry about the gdata_*_service_new methods failing. GDataService does not implement GInitable and the call to g_object_new won't fail unless you run out of virtual memory or something equally catastrophic happens.
Review of attachment 281468 [details] [review]: ::: src/gom-gdata-miner.c @@ +342,3 @@ + gboolean resource_exists, mtime_changed; + gchar *contact_resource; + gchar *equipment_resource; Initializing equipment_resource to NULL will make our lives much easier. @@ +460,3 @@ + job->cancellable, error, + job->datasource_urn, resource, + "nmm:exposureTime", photo_exposure); Free exposure here. @@ +470,3 @@ + job->cancellable, error, + job->datasource_urn, resource, + "nmm:focalLength", photo_focal_length); Free focal_length here. @@ +480,3 @@ + job->cancellable, error, + job->datasource_urn, resource, + "nmm:fnumber", photo_fstop); Ditto. @@ +485,3 @@ + goto out; + + photo_iso = g_strdup_printf ("%ld", gdata_picasaweb_file_get_iso (photo)); The function returns a gint, but the property is glong. Looks like a libgdata bug. I think glong is better because Google says that ISO is unsigned int, and we want -1 for 'unknown'. Lets typecast the value to glong till this is fixed. @@ +490,3 @@ + job->cancellable, error, + job->datasource_urn, resource, + "nmm:isoSpeed", photo_iso); Ditto. @@ +505,3 @@ + goto out; + + if (photo_make != NULL || photo_model != NULL) This does not look right. Those pointers have garbage at this point. Why do we need to check their value? @@ +547,3 @@ + } + + photo_width = g_strdup_printf ("%d", gdata_picasaweb_file_get_width (photo)); The width is unsigned int. @@ +557,3 @@ + goto out; + + photo_height = g_strdup_printf ("%d", gdata_picasaweb_file_get_height (photo)); The height is unsigned int. @@ +585,3 @@ + g_free (photo_focal_length); + g_free (photo_fstop); + g_free (photo_iso); We can't free these here. They might have garbage when exiting through the error path.
Created attachment 281487 [details] [review] gdata: Add a PicasaWeb miner
(In reply to comment #27) > Created an attachment (id=281487) [details] [review] > gdata: Add a PicasaWeb miner Thanks for the review and patch, Debarshi. Looks good to me and works perfectly on my system. Just one thing. I am not sure but like we discussed on IRC, the service == NULL check shouldn't be required anymore in query_gdata_photos and query_gdata_documents.
(In reply to comment #28) > Just one thing. I am not sure but like we discussed on IRC, the service == NULL > check shouldn't be required anymore in query_gdata_photos and > query_gdata_documents. Yes, you are right. I will remove it.
Created attachment 281539 [details] [review] gdata: Add a PicasaWeb miner
Thanks for all your work on this, Saurav.