Bug 728072 – [PATCH] Add Metadata methods to get all tags and values

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 728072 - [PATCH] Add Metadata methods to get all tags and values


Summary:	[PATCH] Add Metadata methods to get all tags and values


Status:	RESOLVED WONTFIX

Product:	gexiv2
Classification:	Other
Component:	implementation
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Gexiv2 Maintainers
QA Contact:	Gexiv2 Maintainers

URL:
Whiteboard:	review

Depends on:
Blocks:

Reported:	2014-04-12 02:54 UTC by Michael Pratt
Modified:	2014-09-02 18:43 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
GExiv2.Metadata vs MemoizedMetadata (1.61 KB, text/x-python) 2014-04-12 02:55 UTC, Michael Pratt		Details
Add methods to get all tag, value pairs (7.31 KB, patch) 2014-04-12 02:56 UTC, Michael Pratt	needs-work	Details \| Review

Description Michael Pratt 2014-04-12 02:54:20 UTC

The attached patch adds three methods to GExiv2: get_xmp_tags_string(), get_exif_tags_string(), get_iptc_tags_string().  These methods return a GHashTable of all XMP, EXIF, IPTC tags and their (string) values.  Using the Python bindings, the GHashTable becomes a dict.

I have not written anything using GLib before, so let me know if I have misused anything.

The motivation behind this patch was to help overcome performance issues I have been experiencing.  I have applications that read every tag (about 300, mostly XMP) from many images.  Based on my profiling, this took about 50ms per image, which was becoming a significant burden, as I frequently handle 1000+ images.  Further profiling showed that the majority of time was in actual tag reading.  Exiv2, and thus also GExiv2, provides O(n) lookup time for individual tags.  Thus, looking up all tags is O(n^2) time.

These new methods perform a single pass over all tags, saving tags and values to a hash table.  This hash table can be used as a tag cache, providing O(1) lookup times, and overall O(n) time to lookup all tags.

The attached benchmark.py compares the standard GExiv2.Metadata object with a subtype that uses the new methods to "memoize" tags at instantiation.  The benchmark performs a dummy read of all tags in the image.  On my machine, with an image with about 300 tags, GExiv2.Metadata averages about 50ms, while MemoizedMetadata averages about 5ms.

Comment 1 Michael Pratt 2014-04-12 02:55:06 UTC

Created attachment 274145 [details]
GExiv2.Metadata vs MemoizedMetadata

Comment 2 Michael Pratt 2014-04-12 02:56:29 UTC

Created attachment 274146 [details] [review]
Add methods to get all tag, value pairs

Comment 3 Jim Nelson 2014-05-12 23:35:50 UTC

Thanks for your patience, Michael.  Looking over your patch, I don't see any code issues per se.  This is a clean patch.

My concern stems from API pollution.  I'm not absolutely convinced these calls are necessary.  The first problem is that Exiv2 has two notions of string values: raw strings (where the metadata value is converted straight to string, kind of like duck typing) and interpreted strings (where well-known metadata keys are converted into human-readable strings).  For example, EXIF orientation is an integer (1) which can be converted to a raw string ("1") and an interpreted string ("Top Left").  Values can also be long integers, rationals, date/times, and more.  Converting all the values to raw strings make sense for some applications (like yours, obviously) but may not be useful for other applications.  We really don't want to get in the business of creating API variants for all value types ("gexiv2_metadata_get_exif_tags_long" et al) as that will lead to an interface explosion.

That doesn't mean we shouldn't attempt to improve performance.  I'm wondering if there's other work we can do to achieve those gains without adding another set of calls into gexiv2.  The advantage of this approach is that everyone benefits.

Looking at gexiv2_metadata_has_exif_tag() for example, it appears that call is walking the entire ExifData record looking for a matching tag name by case-insensitive comparison.  This is obviously a big problem with your use-case because the Python bindings make heavy use of get_tag() (for example, with get(), which I assume is the method that makes Metadata look like a dict).  has_tag() should use the same accessor that get_tag_string() uses, namely findKey(), which I bet is a dictionary lookup under-the-covers.

The Python bindings also call has_tag() before calling get_tag_string() in a few places.  Instead, they should simply call get_tag_string() and if it returns NULL follow the error case.

In short, I'm encouraging you to attempt to fix the existing API/bindings before introducing new ones to work around the performance problems.  I basically think you should see similar performance gains simply by calling get_tags() and iterating over that, calling get_tag_string() for each.

Comment 4 Jim Nelson 2014-05-12 23:36:21 UTC

Review of attachment 274146 [details] [review]:

As noted in previous comment.

Comment 5 Michael Pratt 2014-07-06 02:22:38 UTC

Jim,

Thanks for your feedback.  You are right that this could get very messy if we try to support this for all of the various types.

There are probably some small things we would do on the gexiv2 side to improve performance a bit.  I also finally had time to dig further into Exiv2 itself.  findKey() is in fact *not* a dictionary under the covers, but a vector which must be walked for each lookup, hence the abysmal performance, especially with lots of tags.

It looks me a while, but I finally managed to completely switch Exiv2 to a hash map under the covers, which is the true solution for this work around.  I have submitted a patch to Exiv2, which they will hopefully accept.

http://dev.exiv2.org/issues/967

Comment 6 Jim Nelson 2014-08-29 23:16:19 UTC

Is it okay to close this ticket as WONTFIX then?  It sounds like this issue is resolved (although, looking at the Exiv2 ticket, their interface may change due to it).

Comment 7 Michael Pratt 2014-08-30 17:24:46 UTC

I think it is OK to close.  Unfortunately, I'm not sure if the Exiv2 patch will be accepted, as there are concerns about ABI changes.  I made sure that the API did not change, but the ABI did, so users of the library would need to recompile their application for the new version.

Comment 8 Jim Nelson 2014-09-02 18:43:10 UTC

Well, I hope your work lands in one form or another, as it looks like an important (and desirable) change.