GNOME Bugzilla – Bug 747315
EXIF tags: should write strings as UTF-8 by default, not Latin1
Last modified: 2018-11-03 11:36:33 UTC
+++ This bug was initially created as a clone of Bug #723252 +++

Currently we seem to write all EXIF tag strings in ASCII/Latin1 format, which is rather suboptimal. We should investigate how to write strings in UTF-8 or some other Unicode variant.
Created attachment 330094
UTF8 strings in exif

I investigated the issue, and it turns out that the EXIF standard is silent on character encodings other than 7-bit ASCII. This is largely ignored by the industry, as in the case of GStreamer, which uses 8-bit ASCII/Latin1 encoding.

Given that, encoding the tags in UTF-8 would not result in a visible difference from what is implemented now, because UTF-8 is identical to Latin1 in the single-byte (7-bit ASCII) range. The differences would only show up for characters outside the ASCII range, which were not supported previously. Even when multi-byte characters are used, UTF-8 is endianness-independent.

Another argument for making the GStreamer EXIF implementation write UTF-8 tags is that it already supports parsing UTF-8 encoded tags (see gstexiftag.c@parse_exif_ascii_tag), as verified by tests.
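To make the byte-level claim above concrete, here is a minimal sketch in plain C of a Latin1 to UTF-8 transcoder. It is not taken from the patch or from GStreamer; the function name `latin1_to_utf8` and the caller-provided output buffer are assumptions made for illustration. It shows that bytes below 0x80 are copied unchanged, while every byte from 0x80 upwards becomes a two-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GStreamer or GLib function): transcode n
 * Latin1 bytes into UTF-8. The caller must provide an output buffer of
 * at least 2 * n bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);

    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            /* 7-bit ASCII: the byte is identical in both encodings. */
            out[o++] = in[i];
        } else {
            /* U+0080..U+00FF: two bytes, 110000xx 10xxxxxx. */
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}
```

For a pure ASCII string such as "EXIF" the output is byte-for-byte the same as the input, which is why existing ASCII-only tags would look identical after the change. Because UTF-8 is defined as a sequence of bytes, the output does not depend on the byte order of the host or of the file.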
Review of attachment 330094:

::: gst-libs/gst/tag/gstexiftag.c
@@ +814,3 @@
+
+  /* UTF8 is endianness independent */
+  if (g_utf8_validate (str, -1, &str_end))

Shouldn't all tags *we* get from a taglist and write into EXIF be valid UTF8 in any case? The old code also assumes that.

@@ -832,3 @@
-  else
-    ascii_str =
-        g_convert (str, -1, "latin1", "utf8", NULL, &ascii_size, &error);

IMHO if the standard does not define anything, latin1 is as valid as UTF8 but I guess nowadays UTF8 is more common. It would be good if a new version of the EXIF standard could define this (how can you define a standard containing text without thinking of the character set encoding... who doesn't speak/write English anyway?)

::: tests/check/libs/tag.c
@@ +1447,3 @@
+  /* utf8 characters */
+  g_value_set_static_string (&value,
+      "Τη γλώσσα μου έδωσαν ελληνική");

Might make sense to also add a test for >2 byte characters. Where does this one come from btw? Just curious :)
UTF-8 makes slightly more sense because it can be detected more reliably than Latin1 (where pretty much any byte is valid).
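The detection argument can be sketched with a small strict UTF-8 validator in plain C. This is an illustrative stand-in, not the GLib `g_utf8_validate` implementation; the name `is_valid_utf8` and the explicit length parameter are assumptions. Text that is really Latin1 and contains accented characters will almost always fail this check, whereas a Latin1 "validator" would have nothing comparable to test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if the n bytes at s form well-formed
 * UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). */
bool is_valid_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL);

    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len;
        unsigned long cp, min;

        if (b < 0x80) {            /* single-byte ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return false;          /* stray continuation or invalid lead byte */
        }

        if (len > n - i)
            return false;          /* sequence truncated at end of input */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;      /* continuation bytes must be 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += len;
    }
    return true;
}
```

For example, the Latin1 bytes E9 74 E9 ("été") are rejected because E9 announces a three-byte sequence that never follows, while the UTF-8 bytes C3 A9 ("é") are accepted.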
@Sebastian:

Indeed the GST_TAGs are assumed to be UTF-8, but are they validated when written everywhere in the code? If the assumption is strong, then I can remove the g_utf8_validate conditional in a new patch.

About the latin1: the conversion previously done was surely not founded on anything stated in the standard. As I said, only ANSI ASCII was foreseen. It is indeed ironic that a Japanese industry standard depends on ANSI ASCII. One of the possible explanations I read was that the TIFF standard is the base of the JIF standard, which in turn is an ANSI standard, ergo ANSI ASCII.

I thought that the Greek characters would be inherently multi-byte. Can you confirm that they are not? The text is from an Odysseus Elytis poem, I think. Sorry, but the choice has nothing to do with my understanding of Greek, which is zero. I found it on a page which has lots of poems in various languages that use Unicode: http://www.columbia.edu/~fdc/utf8/. I think I read on the GStreamer conference page that lots of GStreamer developers live in Greece.

@Tim-Philipp

I hope this patch can solve a problem then. If you have more problems regarding tags that need to be solved, let me know. I worked on this bug because I skimmed your Bugzilla profile looking for a place to help.
(In reply to Paulo Neves from comment #4)
> I thought that the Greek characters would be inherently multi byte. Can you
> confirm me they are not?

They are, but just like the German umlauts in the test just above, they are only 2 bytes per character and not more.
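A quick way to check how many bytes per character a candidate test string uses is to look at the lead byte of each UTF-8 sequence. The sketch below is plain C and purely illustrative; `utf8_seq_len` and `utf8_max_seq_len` are hypothetical helper names, and the input is assumed to be a NUL-terminated string whose lead bytes are well-formed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence announced by
 * a lead byte, or 0 if the byte cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: e.g. umlauts, Greek */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: e.g. the euro sign, CJK */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: e.g. emoji */
    return 0;
}

/* Hypothetical helper: longest sequence length found in a NUL-terminated
 * string, or 0 if a lead byte is invalid or a sequence is cut short. */
size_t utf8_max_seq_len(const char *s)
{
    assert(s != NULL);

    size_t max = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            return 0;
        /* Make sure the terminator does not fall inside the sequence. */
        for (size_t k = 1; k < len; k++) {
            if (s[k] == '\0')
                return 0;
        }
        if (len > max)
            max = len;
        s += len;
    }
    return max;
}
```

With this check, "ü" (C3 BC) and "Τη" (CE A4 CE B7) both report 2, so a test string for the >2 byte case would need something like "€" (E2 82 AC, 3 bytes) or a character outside the Basic Multilingual Plane (4 bytes).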
(In reply to Paulo Neves from comment #4)
> @Sebastian:
>
> Indeed the GST_TAGs are assumed to be UTF-8, but are they validated when
> written in all the code? If the assumption is strong then i can remove the
> g_utf8_validate conditional in a new patch.

If the strings we pass around are not UTF8, then lots of other things will fall apart already :) You can assume that.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/177.