Bug 772221 – Take advantage of Unicode



Summary:	Take advantage of Unicode


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:	772263 772870

Reported:	2016-09-30 03:48 UTC by Piotr Drąg
Modified:	2018-05-24 19:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Use Unicode in translatable strings (166.48 KB, patch) 2016-09-30 03:48 UTC, Piotr Drąg	committed

Description Piotr Drąg 2016-09-30 03:48:24 UTC

Created attachment 336576
Use Unicode in translatable strings

Attached patch converts ASCII characters to Unicode, as recommended by <https://developer.gnome.org/hig/stable/typography.html>.

Comment 1 Matthias Clasen 2016-10-05 18:32:24 UTC

Since these are all string changes, this needs to wait until we branch.

Comment 2 Matthias Clasen 2016-10-05 18:32:47 UTC

Review of attachment 336576:

.

Comment 3 Piotr Drąg 2016-10-12 19:31:54 UTC

Comment on attachment 336576
Use Unicode in translatable strings

Pushed, thank you!

Comment 4 Matthias Clasen 2016-10-24 13:45:31 UTC

I reverted the gmarkup changes, since they broke most of our gmarkup tests. This will need to be redone with the necessary fixes for the tests.

Comment 5 Simon McVittie 2016-10-24 18:36:39 UTC

(In reply to Matthias Clasen from comment #4)
> This will need to be redone with the necessary fixes for the tests.

There's a partial patch on Bug #772870; I also ran out of time to get through all the test failures, but perhaps it could be a starting point for someone.

Comment 6 Scott Hutton 2018-05-17 15:19:32 UTC

This update has some unintended side effects.

We use glib2 extensively as a foundation library for service development.  Our unit tests and test automation suites watch for messages (often generated from glib2) to perform qualification.  We've certainly had this break in the past when some messages subtly changed (e.g., when more information was added to the messages reported from gkeyfile.c), but the inclusion of "smart quotes" in error messages that are getting sent to systems that expect 7-bit ASCII (e.g., some implementations of syslog, stripped-down embedded systems, dumb terminals, etc.) is kind of making a mess of things.

Our bad for not noticing this change when it got introduced in 2.53.1 (our CentOS systems were locked at 2.50 until last week, when 2.54 filtered in via updates), but now we're dealing with it, and likely going to have to write filters to "fix" all the non-ASCII output.

My suggestion?  Limit the C code to 7-bit ASCII.  If you really want Unicode in messages, do it via the gettext() transforms.

Comment 7 Philip Withnall 2018-05-17 16:18:35 UTC

(In reply to Scott Hutton from comment #6)
> We use glib2 extensively as a foundation library for service development. 
> Our unit tests and test automation suites watch for messages (often
> generated from glib2) to perform qualification.

If you are basing test suites on human-readable messages which are output by code, you have to accept that those messages might change. GLib does not provide API guarantees for its translatable strings/messages.

Changing quotes in translatable messages is, while a bit more pervasive, theoretically no different from adding more debugging information to messages or changing the order in which messages are emitted.

> the
> inclusion of "smart quotes" in error messages that are getting sent to
> systems that expect 7-bit ASCII (e.g., some implementations of syslog,
> stripped-down embedded systems, dumb terminals, etc.) is kind of making a
> mess of things.

If your locale is set up as (for example) C in ASCII, then GLib should transliterate its output to valid ASCII. If that’s not happening, please file a separate bug report about it.

However, if you are using 7-bit systems which don’t correctly set their locale as non-UTF-8, then you can’t expect GLib to output messages in anything other than UTF-8.

> My suggestion?  Limit the C code to 7-bit ASCII.  If you really want Unicode
> in messages, do it via the gettext() transforms.

No, that would mean twice as much work maintaining the untranslated strings in GLib. GLib (and anything which uses it) is defined as using UTF-8 internally, which means we can put fancy quotes in our internal strings. Transliteration to ASCII should happen correctly on output if the environment sets its locale correctly.

Comment 8 Scott Hutton 2018-05-17 21:50:53 UTC

Perhaps I'm missing something, but it appears that *all* of the translation files are UTF-8 now.  Some (e.g., po/en_CA.po) do still have the non-UTF-8 variants of these messages, but I assume that's an oversight.

Even if the locale is set to "C", you'll still get back the msgid (i.e., from the C source), which contains the unicode.  So, there doesn't appear to be any way (going forward) to obtain clean messages, even by fiddling with the locale.

Comment 9 Philip Withnall 2018-05-18 10:29:02 UTC

(In reply to Scott Hutton from comment #8)
> Perhaps I'm missing something, but it appears that *all* of the translation
> files are UTF-8 now.  Some (e.g., po/en_CA.po) do still have the non-UTF-8
> variants of these messages, but I assume that's an oversight.
> 
> Even if the locale is set to "C", you'll still get back the msgid (i.e.,
> from the C source), which contains the unicode.  So, there doesn't appear to
> be any way (going forward) to obtain clean messages, even by fiddling with
> the locale.

Whenever GLib prints anything (for example, using g_message()), it converts from UTF-8 to the current locale’s character set (as obtained by calling g_get_charset()) on output. See strdup_convert() in gmessages.c, for example.

As long as your locale variables are set correctly, this should work. It’s possible there are some places where GLib doesn’t convert before output (which would be a bug), but generally I think we convert in all the right places.

What are your locale variables set to?
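
One way to answer that question on a POSIX system is to ask the C library which codeset the active locale implies. The sketch below uses only setlocale() and nl_langinfo(CODESET), not GLib; codeset_for_locale() is a hypothetical helper introduced for illustration, and treating its result as an approximation of what g_get_charset() reports is an assumption rather than something taken from the GLib sources.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not a GLib API): switch the process to the named
 * locale and return the codeset name the C library associates with it.
 * Passing "" selects the locale from LC_ALL, LC_CTYPE and LANG.
 * Returns NULL if the locale cannot be set. The returned string belongs
 * to the C library and may be overwritten by later calls. */
const char *
codeset_for_locale (const char *locale_name)
{
    assert (locale_name != NULL);

    if (setlocale (LC_ALL, locale_name) == NULL)
        return NULL;

    return nl_langinfo (CODESET);
}
```

Calling codeset_for_locale("") in the affected environment and printing the result shows whether the process is running with a UTF-8 codeset or a 7-bit one; the exact name returned for the "C" locale differs between C libraries.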

Comment 10 Scott Hutton 2018-05-18 21:33:24 UTC

Thanks to the character conversions, we're seeing a couple of things.  In stripped-down server environments, which only support the portable C (non-UTF-8) locale, the smart quotes are translated into question marks, which is almost as bad, and possibly worse (since at least the smart quotes *might* render if someone's terminal is set properly).

Test program, which forces one of the offending error messages:

-----
#include <glib.h>
#include <locale.h>

int
main(int argc, char *argv[])
{
    const gchar *charset = NULL;
    /* Adopt the locale from the environment (LC_ALL, LC_CTYPE, LANG). */
    gchar *cur_locale = setlocale(LC_ALL, "");
    g_autoptr(GKeyFile) keyfile = NULL;
    g_autoptr(GError) err = NULL;

    g_printerr("setlocale(LC_ALL, \"\") => \"%s\"\n", cur_locale);
    /* g_get_charset() returns TRUE only when the locale charset is UTF-8. */
    if (g_get_charset(&charset)) {
        g_print("g_get_charset() => \"%s\"\n", charset);
    }
    else {
        g_print("g_get_charset() => FALSE\n");
    }

    keyfile = g_key_file_new();

    /* /etc/profile is not a valid key file, so loading it forces a parse
     * error whose message quotes the offending line. */
    if (!g_key_file_load_from_file(keyfile, "/etc/profile", G_KEY_FILE_NONE, &err)) {
        g_print("error = (%s)\n", err->message);
    }

    return 0;
}
-----

Output (in a clean CentOS-7 Docker container):

-----
setlocale(LC_ALL, "") => "C"
g_get_charset() => FALSE
error = (Key file contains line ?pathmunge () {? which is not a key-value pair, group, or comment)
-----

So, it's working as you suggested it would (in that the unicode characters are getting "fixed"), just not really what you'd expect.  Before glib 2.53.1, those presented as proper quotation marks.

It doesn't look like strdup_convert() really knows how to do anything other than convert to question marks.  Given that there are really only a few offending characters in the log messages (mainly quotes and ellipses), it seems like it might be worth looking into remapping those few.  Unfortunately, it doesn't look like this function can be overridden from the current messages APIs.
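
The remapping idea can be sketched in plain C, independent of GLib. This is only an illustration under stated assumptions: ascii_fallback() is a hypothetical helper, not a GLib function, and the five characters it handles are simply the kinds mentioned above (curly quotes and the ellipsis). It would have to be applied to the UTF-8 message before any charset conversion turns those characters into question marks.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of GLib): return a newly allocated copy of
 * a UTF-8 string in which a few typographic characters are replaced by
 * ASCII look-alikes. All other bytes are copied unchanged. The caller
 * frees the result with free(). Returns NULL if allocation fails. */
char *
ascii_fallback (const char *utf8)
{
    static const struct { const char *from; const char *to; } map[] = {
        { "\xe2\x80\x98", "'"   },  /* U+2018 left single quotation mark  */
        { "\xe2\x80\x99", "'"   },  /* U+2019 right single quotation mark */
        { "\xe2\x80\x9c", "\""  },  /* U+201C left double quotation mark  */
        { "\xe2\x80\x9d", "\""  },  /* U+201D right double quotation mark */
        { "\xe2\x80\xa6", "..." },  /* U+2026 horizontal ellipsis         */
    };
    const size_t n_map = sizeof map / sizeof map[0];
    char *out, *dst;

    assert (utf8 != NULL);

    /* No replacement is longer than the 3-byte sequence it replaces, so
     * the output never needs more room than the input. */
    out = malloc (strlen (utf8) + 1);
    if (out == NULL)
        return NULL;
    dst = out;

    while (*utf8 != '\0') {
        size_t i;

        for (i = 0; i < n_map; i++) {
            /* strncmp() stops at the terminating NUL, so a short tail of
             * the input is never read past its end. */
            if (strncmp (utf8, map[i].from, 3) == 0) {
                size_t len = strlen (map[i].to);

                memcpy (dst, map[i].to, len);
                dst += len;
                utf8 += 3;
                break;
            }
        }
        if (i == n_map)
            *dst++ = *utf8++;  /* no mapping matched: copy the byte as-is */
    }
    *dst = '\0';

    return out;
}
```

A fixed table like this only covers the characters listed in it; any other non-ASCII character passes through untouched and would still be subject to whatever the later charset conversion does.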

Comment 11 GNOME Infrastructure Team 2018-05-24 19:06:44 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1205.