Bug 730136 – provide a way to handle badly encoded strings in python3


Summary:	provide a way to handle badly encoded strings in python3


Status:	RESOLVED FIXED

Product:	gexiv2
Classification:	Other
Component:	implementation
Version:	0.10.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	0.12.0
Assigned To:	Gexiv2 Maintainers
QA Contact:	Gexiv2 Maintainers

Reported:	2014-05-14 15:25 UTC by Alexandre Rossi
Modified:	2015-11-06 11:24 UTC

Attachments
image with bad 'Exif.Image.Artist' (614 bytes, image/jpeg) 2014-05-14 15:25 UTC, Alexandre Rossi
proposed patch (1.82 KB, patch) 2014-10-07 15:57 UTC, Alexandre Rossi	needs-work
proposed patch v2 (10.25 KB, patch) 2014-10-18 15:25 UTC, Alexandre Rossi	none

Description Alexandre Rossi 2014-05-14 15:25:03 UTC

Created attachment 276541
image with bad 'Exif.Image.Artist'

Given the attached image, in Python 3 it is not possible to get the decoded string, for example by passing an errors='replace' argument.

>>> from gi.repository import GExiv2
>>> m = GExiv2.Metadata('sample-author-badencoding.jpg')
>>> m.get('Exif.Image.Artist')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/gi/overrides/GExiv2.py", line 80, in get
    return self.get_tag_string(key) if self.has_tag(key) else default
  File "/usr/lib/python3/dist-packages/gi/types.py", line 43, in function
    return info.invoke(*args, **kwargs)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte


In Python 2, the string was decoded by the application, but in Python 3 there is no access to the raw bytes.

The gexiv2 Python bindings should either return Unicode strings or provide access to the raw bytes. Or am I missing some available API?

Comment 1 Jim Nelson 2014-08-08 19:46:15 UTC

Looking at the artist field in this file, it's these bytes:

C0h EBh E5h EAh F1h E0h EDh E4h F0h 20h CAh EEh F8h E5h EBh E5h E2h

Converting this into Unicode begs the question of what encoding is being used in the first place and the assumption that gexiv2 has easy access to a decoder for that encoding.  The above doesn't appear to have any Unicode BOM I can find.

gexiv2 could provide a mechanism to get the raw bytes of a field.  I believe Exiv2::ExifDatum::dataArea() will provide that.  However, there doesn't appear to be a corresponding dataArea() for IptcDatum or XmpDatum, which limits the utility of such a call.

Even if gexiv2 did provide such a thing, how would you use it in this case?
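
For illustration, the bytes listed above can be inspected with plain Python, without gexiv2. This is only a sketch: the candidate code page tried here is an assumption (nothing in the file declares an encoding), and getting readable text out of a guess does not prove it is the encoding the writing software actually used.

```python
# The 17 bytes of Exif.Image.Artist quoted in this comment.
raw = bytes.fromhex("C0 EB E5 EA F1 E0 ED E4 F0 20 CA EE F8 E5 EB E5 E2")

# Strict UTF-8 decoding fails on the very first byte, as in the traceback.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding never raises, but undecodable bytes become U+FFFD,
# so the original text is lost.
replaced = raw.decode("utf-8", errors="replace")

# Guessing a legacy single-byte code page is an assumption, not metadata.
# cp1251 (Windows Cyrillic) is tried only because these byte values fall
# in the range where that code page stores Cyrillic letters.
guess = raw.decode("cp1251")

print(strict_ok)
```

Under the cp1251 guess the result is readable Cyrillic text, which suggests, but does not confirm, a Windows-1251 origin. Either way, both the lenient decode and the guess need the raw bytes, which is what the bindings do not expose.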

Comment 2 Alexandre Rossi 2014-08-09 10:58:37 UTC

My use case would be to catch the UnicodeDecodeError and handle it by getting the raw bytes (the Python 3 bytes type):

from gi.repository import GExiv2
m = GExiv2.Metadata('sample-author-badencoding.jpg')
try:
    v = m.get('Exif.Image.Artist')
except UnicodeDecodeError:
    artist_bytes = m.get_raw('Exif.Image.Artist')
    v = artist_bytes.decode('utf-8', errors='replace')

or even simpler

from gi.repository import GExiv2
m = GExiv2.Metadata('sample-author-badencoding.jpg')
v = m.get('Exif.Image.Artist', errors='replace')

Comment 3 Jim Nelson 2014-08-12 20:20:51 UTC

Two things would be required:

(a) Add a new method to gexiv2, gexiv2_metadata_get_raw_exif(const gchar *)

(b) Add a Python binding (although, if properly annotated, get_raw_exif() should be available via GObject Introspection).

I don't believe I'm going to have time to do this in the near future.  Patches are certainly welcome.

Comment 4 Alexandre Rossi 2014-10-07 15:57:43 UTC

Created attachment 287973
proposed patch

The attached patch accomplishes what I want.

Any comment appreciated as I am completely new to GLib hacking.

Comment 5 Jim Nelson 2014-10-08 01:38:43 UTC

Review of attachment 287973:

This is the right idea, but there are a few more things that need to be done here:

* First, thank you for using GBytes.  We're using it in Geary and it's certainly a saner way to deal with moving buffers around and transferring ownership.  So, that's the right approach.  However, note that g_bytes_new_take's second argument is not the sizeof the pointer but the number of bytes in the array it's taking.  That would be strlen+1 for a string, but:

* gexiv2_metadata_get_tag_string is not the right way to get the raw bytes of a tag value.  That call assumes the returned buffer is a NUL-terminated string and hence may return a partial buffer for binary data.  Yes, that will work for UTF-8, but if gexiv2 is going to return all the raw bytes, it has to return everything (including the terminating NUL if a UTF-8 or ASCII string).

Looking closely at the Exiv2 docs, I see that ExifDatum has a dataArea() method that returns the raw bytes for a tag.  IptcDatum is structured a little differently; you have to get its Value object, which in turn has a dataArea() method.  XmpDatum works similarly.  The DataBuf returned by these dataArea() calls has its own ownership semantics, so be careful when moving/copying the byte array into GBytes.

Thus, much like the other calls, you'll need to build separate EXIF, IPTC, and XMP handlers that are fronted by a generic get_raw() function that calls each depending on the tag the caller supplies.  (Look at gexiv2_metadata_get_tag_type for an example of what I'm talking about.)  I know it's a bit more work, but that's how these things go with gexiv2.
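
The second point in the review above, that a NUL-terminated read can silently truncate a value, can be seen without gexiv2 at all. The sketch below uses Python's ctypes to read one buffer both ways; the sample bytes are made up for illustration and are not taken from the attached image.

```python
import ctypes

# Made-up tag value containing an embedded NUL byte (illustrative only).
payload = b"\xc0\xeb\x00\xe5\xea"

# create_string_buffer appends one terminating NUL after the payload.
buf = ctypes.create_string_buffer(payload)

# .value reads the buffer as a C string and stops at the first NUL,
# analogous to a getter that assumes a NUL-terminated string.
as_c_string = buf.value

# .raw returns every byte in the buffer, including the terminating NUL,
# analogous to what a raw-bytes getter has to hand back.
as_raw = buf.raw

print(len(as_c_string), len(as_raw))
```

The string-style read keeps only the bytes before the embedded NUL, while the raw read keeps the whole buffer. This is also why the length passed along with a byte array has to be the number of bytes in the array and not the size of the pointer to it.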

Comment 6 Alexandre Rossi 2014-10-18 15:25:49 UTC

Created attachment 288810
proposed patch v2

I cannot make dataArea() work.

But copy() works well.

Comment 7 Jim Nelson 2014-10-23 19:17:31 UTC

Great work!

Pushed to master, commit a7e10b
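
With the change merged, the fallback described in Comment 2 might look roughly like the sketch below. It is written under assumptions: the new method is taken to be reachable from Python as get_tag_raw (the name used in Comment 8), its return value is taken to be either a GLib.Bytes (read via get_data()) or a plain bytes object, and get_tag_text and fallback_encodings are hypothetical names introduced here, not part of gexiv2.

```python
def get_tag_text(metadata, key, fallback_encodings=("utf-8",)):
    """Return a tag as str, falling back to the raw bytes if decoding fails.

    Assumes `metadata` offers get_tag_string(key) and get_tag_raw(key).
    This helper is hypothetical and not part of the gexiv2 API.
    """
    try:
        return metadata.get_tag_string(key)
    except UnicodeDecodeError:
        pass  # Badly encoded value: fall through to the raw bytes.

    raw = metadata.get_tag_raw(key)
    # GLib.Bytes exposes get_data(); plain bytes objects do not need it.
    data = raw.get_data() if hasattr(raw, "get_data") else bytes(raw)
    # Treat a missing payload as empty and drop any trailing NUL terminator.
    data = (data or b"").rstrip(b"\x00")

    # Try each application-chosen encoding in order.
    for encoding in fallback_encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace")
```

With a real image this would be called along the lines of get_tag_text(GExiv2.Metadata(path), 'Exif.Image.Artist', ('utf-8', 'cp1251')), where the cp1251 entry is only an example of a guess the application might choose. Keeping the list of encodings in the application matches the earlier point that gexiv2 itself cannot know which encoding was used.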

Comment 8 Jim Easterbrook 2015-11-06 11:24:42 UTC

Sorry to chip in here a year later but I've just started trying to use the "new" get_tag_raw method.

Would it be possible to add a "get_tag_raw_multiple" method to get the unmodified data that get_tag_multiple is converting to utf-8? Some IPTC tags of type "String" can repeat and I can't get at any but the first string with get_tag_raw.