GNOME Bugzilla – Bug 730136
provide a way to handle badly encoded strings in python3
Last modified: 2015-11-06 11:24:42 UTC
Created attachment 276541 [details]
image with bad 'Exif.Image.Artist'

With the attached image, in Python 3, it is not possible to get the decoded string with errors='replace' handling.

>>> from gi.repository import GExiv2
>>> m = GExiv2.Metadata('sample-author-badencoding.jpg')
>>> m.get('Exif.Image.Artist')
Traceback (most recent call last):
return self.get_tag_string(key) if self.has_tag(key) else default
return info.invoke(*args, **kwargs)
In Python 2, the string was decoded in the application, but in Python 3 there is no access to the raw bytes. The GExiv2 Python bindings should either return Unicode strings or provide access to the raw bytes. Or am I missing some available API?
Looking at the artist field in this file, it's these bytes:

C0h EBh E5h EAh F1h E0h EDh E4h F0h 20h CAh EEh F8h E5h EBh E5h E2h

Converting this into Unicode raises the question of what encoding is being used in the first place, and it assumes that gexiv2 has easy access to a decoder for that encoding. The above doesn't appear to have any Unicode BOM I can find.

gexiv2 could provide a mechanism to get the raw bytes of a field. I believe Exiv2::ExifDatum::dataArea() will provide that. However, there doesn't appear to be a corresponding dataArea() for IptcDatum or XmpDatum, which limits the utility of such a call.

Even if gexiv2 did provide such a thing, how would you use it in this case?
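To make the encoding question concrete, here is a small Python sketch that works only on the bytes quoted above. The choice of Windows-1251 (cp1251) is purely a guess for illustration; nothing in the file declares an encoding, and a readable result does not prove the guess is right.

```python
# The Artist bytes quoted above, copied from the hex dump.
raw = bytes.fromhex("C0 EB E5 EA F1 E0 ED E4 F0 20 CA EE F8 E5 EB E5 E2")

# Strict UTF-8 decoding fails: 0xC0 is never a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# errors="replace" never raises; undecodable sequences become U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# ASSUMPTION: cp1251 is a guess, not something the file states.
guessed = raw.decode("cp1251")

print(utf8_ok)
print(replaced)
print(guessed)
```

With the cp1251 guess the bytes decode to readable Cyrillic text, while the UTF-8 path only yields replacement characters. This is why access to the raw bytes is useful: the application, not the library, is in the best position to pick an encoding.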
My use case would be to catch the UnicodeDecodeError and handle it by getting the raw bytes (the Python 3 bytes type):

    from gi.repository import GExiv2
    m = GExiv2.Metadata('sample-author-badencoding.jpg')
    try:
        v = m.get('Exif.Image.Artist')
    except UnicodeDecodeError:
        artist_bytes = m.get_raw('Exif.Image.Artist')
        v = artist_bytes.decode('utf-8', errors='replace')

or, even simpler:

    from gi.repository import GExiv2
    m = GExiv2.Metadata('sample-author-badencoding.jpg')
    v = m.get('Exif.Image.Artist', errors='replace')
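The first variant above can be tried out without the real bindings. In the sketch below, FakeMetadata is a hypothetical stand-in for GExiv2.Metadata, get_raw is the method name proposed in this comment (not an existing GExiv2 call at this point in the thread), and the tag values are made up.

```python
class FakeMetadata:
    """Hypothetical stand-in for GExiv2.Metadata, for illustration only."""

    def __init__(self, tags):
        self._tags = tags  # maps tag name -> raw bytes

    def get(self, key):
        # Mimics the failing path: a strict UTF-8 decode of the stored bytes.
        return self._tags[key].decode("utf-8")

    def get_raw(self, key):
        # Mimics the proposed raw accessor (name taken from the comment above).
        return self._tags[key]


def get_text(metadata, key):
    """Return the tag as text, replacing undecodable bytes instead of raising."""
    try:
        return metadata.get(key)
    except UnicodeDecodeError:
        return metadata.get_raw(key).decode("utf-8", errors="replace")


m = FakeMetadata({
    "Exif.Image.Artist": bytes.fromhex("C0EBE5EA"),  # not valid UTF-8
    "Exif.Image.Make": b"Canon",                     # plain ASCII
})
print(get_text(m, "Exif.Image.Make"))
print(get_text(m, "Exif.Image.Artist"))
```

Keeping the fallback in one helper means well-formed tags take the normal path, and only the badly encoded ones pay for the second lookup.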
Two things would be required:

(a) Add a new method to gexiv2, gexiv2_metadata_get_raw_exif(const gchar *)
(b) Add a Python binding (although, if properly annotated, get_raw_exif() should be available via GObject Introspection).

I don't believe I'm going to have time to do this in the near future. Patches are certainly welcome.
Created attachment 287973 [details] [review] proposed patch The attached patch accomplishes what I want. Any comment appreciated as I am completely new to GLib hacking.
Review of attachment 287973 [details] [review]:

This is the right idea, but there are a few more things that need to be done here:

* First, thank you for using GBytes. We're using it in Geary and it's certainly a saner way to deal with moving buffers around and transferring ownership. So, that's the right approach. However, note that g_bytes_new_take's second argument is not the size of the pointer but the number of bytes in the array it's taking. That would be strlen+1 for a string, but:

* gexiv2_metadata_get_tag_string is not the right way to get the raw bytes of a tag value. That call assumes the returned buffer is a NUL-terminated string and hence may return a partial buffer for binary data. Yes, that will work for UTF-8, but if gexiv2 is going to return all the raw bytes, it has to return everything (including the terminating NUL if a UTF-8 or ASCII string).

Looking closely at the Exiv2 docs, I see that ExifDatum has a dataArea() method that returns the raw bytes for a tag. IptcDatum is structured a little differently; you have to get its Value object, which in turn has a dataArea() method. XmpDatum works similarly. The DataBuf returned by these dataArea() calls has its own ownership semantics, so be careful when moving/copying the byte array into GBytes.

Thus, much like the other calls, you'll need to build separate EXIF, IPTC, and XMP handlers that are fronted by a generic get_raw() function that calls each depending on the tag the caller supplies. (Look at gexiv2_metadata_get_tag_type for an example of what I'm talking about.) I know it's a bit more work, but that's how these things go with gexiv2.
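The two pitfalls named in this review (a pointer's size is not the buffer's length, and NUL-terminated string access truncates binary data) can be illustrated from Python with ctypes. This is only an analogy to the C code under review, not the gexiv2 implementation, and the sample payload is made up.

```python
import ctypes

# Made-up tag payload with an embedded NUL, as binary tag data may contain.
payload = b"AB\x00CD"

# A C char array holding exactly the payload bytes.
buf = (ctypes.c_char * len(payload)).from_buffer_copy(payload)

# Reading it as a C string stops at the first NUL, like a strlen-based copy.
as_c_string = buf.value

# Reading it with its full, explicit length returns every byte.
as_raw = buf.raw

# The size of a pointer says nothing about how many bytes it points to.
pointer_size = ctypes.sizeof(ctypes.c_char_p)

print(as_c_string, as_raw)
print(pointer_size, len(payload))
```

A raw accessor therefore has to carry the real byte count alongside the data, which is what a GBytes created with the correct length does.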
Created attachment 288810 [details] [review]
proposed patch v2

I could not get dataArea() to work, but copy() works well.
Great work! Pushed to master, commit a7e10b
Sorry to chip in here a year later but I've just started trying to use the "new" get_tag_raw method. Would it be possible to add a "get_tag_raw_multiple" method to get the unmodified data that get_tag_multiple is converting to utf-8? Some IPTC tags of type "String" can repeat and I can't get at any but the first string with get_tag_raw.