Bug 747315 – EXIF tags: should write strings as UTF-8 by default, not Latin1




Summary:	EXIF tags: should write strings as UTF-8 by default, not Latin1


Status:	RESOLVED OBSOLETE

Product:	GStreamer
Classification:	Platform
Component:	gst-plugins-base
Version:	git master
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	git master
Assigned To:	GStreamer Maintainers
QA Contact:	GStreamer Maintainers


Reported:	2015-04-03 20:11 UTC by Tim-Philipp Müller
Modified:	2018-11-03 11:36 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
UTF8 strings in exif (4.13 KB, patch) 2016-06-20 16:57 UTC, Paulo Neves	needs-work

Description Tim-Philipp Müller 2015-04-03 20:11:12 UTC

+++ This bug was initially created as a clone of Bug #723252 +++

Currently we seem to write all EXIF tag strings in ASCII/Latin1 format, which is rather suboptimal. We should investigate how to write strings in UTF-8 or some other Unicode variant.

Comment 1 Paulo Neves 2016-06-20 16:57:19 UTC

Created attachment 330094
UTF8 strings in exif

I investigated the issue. The EXIF standard is silent on
character encodings other than 7-bit ASCII. The industry
largely ignores this restriction; GStreamer, for example,
writes 8-bit Latin1.

Given that, encoding the tags as UTF8 would make no
visible difference for plain ASCII text, because UTF8 is
byte-identical to ASCII (and therefore to Latin1) in the
7-bit range. Differences would only show up for
characters outside the ASCII range: Latin1 characters
above 0x7F become two-byte sequences, and characters
outside Latin1, which were previously not supported,
become representable. Even when multi-byte characters
are used, UTF8 is a byte-oriented encoding, so it is
endian independent.

Another argument for making GStreamer's EXIF
implementation write UTF8 tags is that it already
supports parsing UTF8-encoded tags (see
parse_exif_ascii_tag in gstexiftag.c),
as verified by tests.
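
To make the byte-level argument above concrete, here is a minimal sketch in plain C of a Latin1 to UTF8 conversion. The helper name latin1_to_utf8 is hypothetical and is not part of GStreamer or of the attached patch; it only illustrates that 7-bit ASCII bytes pass through unchanged while every Latin1 byte above 0x7F turns into a two-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not GStreamer API): converts a NUL-terminated
 * Latin1 string to UTF8. `out` must have room for 2 * strlen(in) + 1
 * bytes. Returns the number of bytes written, excluding the NUL. */
static size_t latin1_to_utf8(const unsigned char *in, unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (; *in != 0; in++) {
        if (*in < 0x80) {
            /* 7-bit ASCII: identical byte in Latin1 and UTF8. */
            out[n++] = *in;
        } else {
            /* U+0080..U+00FF: two-byte UTF8 sequence 110xxxxx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (*in >> 6));
            out[n++] = (unsigned char)(0x80 | (*in & 0x3F));
        }
    }
    out[n] = 0;
    return n;
}
```

For example, the Latin1 byte 0xFC (ü) becomes the two bytes 0xC3 0xBC, whereas the string "EXIF" is copied byte for byte.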

Comment 2 Sebastian Dröge (slomo) 2016-06-21 08:40:42 UTC

Review of attachment 330094 [details] [review]:

::: gst-libs/gst/tag/gstexiftag.c
@@ +814,3 @@
+
+  /* UTF8 is endianness independent */
+  if (g_utf8_validate (str, -1, &str_end))

Shouldn't all tags *we* get from a taglist and write into EXIF be valid UTF8 in any case? The old code also assumes that.

@@ -832,3 @@
-  else
-    ascii_str =
-        g_convert (str, -1, "latin1", "utf8", NULL, &ascii_size, &error);

IMHO if the standard does not define anything, latin1 is as valid as UTF8 but I guess nowadays UTF8 is more common. It would be good if a new version of the EXIF standard could define this (how can you define a standard containing text without thinking of the character set encoding... who doesn't speak/write English anyway?)

::: tests/check/libs/tag.c
@@ +1447,3 @@
+  /* utf8 characters */
+  g_value_set_static_string (&value,
+      "Τη γλώσσα μου έδωσαν ελληνική");

Might make sense to also add a test for >2 byte characters. Where does this one come from btw? Just curious :)

Comment 3 Tim-Philipp Müller 2016-06-21 08:53:14 UTC

UTF-8 makes slightly more sense because it can be detected more reliably than Latin1 (where pretty much any byte is valid).
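
As an illustration of why UTF-8 is easier to detect, here is a simplified validity check in plain C. It is a sketch under stated assumptions, not the implementation of g_utf8_validate: the function name is_valid_utf8 is hypothetical, and the check covers lead/continuation byte structure, overlong forms, surrogates and the U+10FFFF upper bound. Latin1 text containing bytes above 0x7F typically fails such a check, while any byte string at all is "valid" Latin1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s[0..len) is well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);

    while (i < len) {
        unsigned char b = s[i];
        size_t extra;           /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded code point and smallest legal value */

        if (b < 0x80) {         /* plain ASCII byte */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;       /* stray continuation byte or invalid lead */
        }

        if (len - i <= extra)
            return false;       /* sequence truncated at end of input */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;   /* not a 10xxxxxx continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

With this sketch, "café" encoded as UTF-8 (63 61 66 C3 A9) is accepted, while the same word in Latin1 (63 61 66 E9) is rejected, because 0xE9 announces a three-byte sequence that never follows.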

Comment 4 Paulo Neves 2016-06-21 13:47:02 UTC

@Sebastian:

Indeed the GST_TAGs are assumed to be UTF-8, but are they validated everywhere in the code where they are written? If the assumption is strong, then I can remove the g_utf8_validate conditional in a new patch.

About Latin1: the conversion done previously was surely not based on anything stated in the standard. As I said, only ANSI ASCII was foreseen. It is indeed ironic that a Japanese industry standard depends on ANSI ASCII. One possible explanation I read is that the TIFF standard is the base of the JIF standard, which in turn is an ANSI standard, ergo ANSI ASCII.

I thought that the Greek characters would be inherently multi-byte. Can you confirm that they are not?

The text is from an Odysseus Elytis poem, I think. Sorry, the choice has nothing to do with any understanding of Greek on my part, which is zero. I found it on a page that has lots of poems in various languages that use Unicode: http://www.columbia.edu/~fdc/utf8/. I think I read on the GStreamer conference page that lots of GStreamer developers are living in Greece.

@Tim-Philipp I hope this patch can solve a problem then. If you have more problems regarding tags that need to be solved let me know. I worked on this bug because I skimmed your bugzilla profile looking for a place to help.

Comment 5 Sebastian Dröge (slomo) 2016-06-21 14:59:14 UTC

(In reply to Paulo Neves from comment #4)

> I thought that the Greek characters would be inherently multi-byte. Can you
> confirm that they are not?

They are, but just like the German umlauts in the test just above they are only 2 bytes per character and not more.
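
A small sketch of the byte counts being discussed, in plain C. The helper name utf8_encoded_len is hypothetical; it simply applies the standard UTF-8 range boundaries to a code point. Greek letters and German umlauts both fall below U+0800 and so take two bytes, which is why a test for characters longer than two bytes needs something like U+20AC (euro sign, three bytes) or a code point above U+FFFF (four bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 uses to encode code point cp.
 * Surrogates (U+D800..U+DFFF) are not valid scalar values; this sketch
 * does not reject them and only checks the upper bound. */
static int utf8_encoded_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x80)
        return 1;   /* ASCII */
    if (cp < 0x800)
        return 2;   /* includes Latin1 supplement and Greek */
    if (cp < 0x10000)
        return 3;   /* rest of the Basic Multilingual Plane */
    return 4;       /* supplementary planes */
}
```

For instance, U+03A4 (Τ) and U+00FC (ü) both give 2, U+20AC gives 3, and U+1F600 gives 4.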

Comment 6 Sebastian Dröge (slomo) 2016-06-21 14:59:51 UTC

(In reply to Paulo Neves from comment #4)
> @Sebastian:
> 
> Indeed the GST_TAGs are assumed to be UTF-8, but are they validated
> everywhere in the code where they are written? If the assumption is strong,
> then I can remove the g_utf8_validate conditional in a new patch.

If the strings we pass around are not UTF8, then lots of other things will fall apart already :) You can assume that.

Comment 7 GStreamer system administrator 2018-11-03 11:36:33 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to the new bug and participate further via this link to our GitLab instance: https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/issues/177.