GNOME Bugzilla – Bug 772221
Take advantage of Unicode
Last modified: 2018-05-24 19:06:44 UTC
Created attachment 336576
Use Unicode in translatable strings

The attached patch converts ASCII characters to Unicode, as recommended by <https://developer.gnome.org/hig/stable/typography.html>.
Since these are all string changes, this needs to wait until we branch.
Review of attachment 336576: .
Comment on attachment 336576
Use Unicode in translatable strings

Pushed, thank you!
I reverted the gmarkup changes, since they broke most of our gmarkup tests. This will need to be redone with the necessary fixes for the tests.
(In reply to Matthias Clasen from comment #4)
> This will need to be redone with the necessary fixes for the tests.

There's a partial patch on Bug #772870; I also ran out of time to get through all the test failures, but perhaps it could be a starting point for someone.
This update has some unintended side effects.

We use glib2 extensively as a foundation library for service development. Our unit tests and test automation suites watch for messages (often generated from glib2) to perform qualification. We've certainly had this break in the past when some messages subtly changed (e.g., when more information was added to the messages reported from gkeyfile.c), but the inclusion of "smart quotes" in error messages that are getting sent to systems that expect 7-bit ASCII (e.g., some implementations of syslog, stripped-down embedded systems, dumb terminals, etc.) is kind of making a mess of things.

Our bad for not noticing this change when it got introduced in 2.53.1 (our CentOS systems were locked at 2.50 until last week, when 2.54 filtered in via updates), but now we're dealing with it, and likely going to have to write filters to "fix" all the non-ASCII output.

My suggestion? Limit the C code to 7-bit ASCII. If you really want Unicode in messages, do it via the gettext() transforms.
(In reply to Scott Hutton from comment #6)
> We use glib2 extensively as a foundation library for service development.
> Our unit tests and test automation suites watch for messages (often
> generated from glib2) to perform qualification.

If you are basing test suites on human-readable messages which are output by code, you have to accept that those messages might change. GLib does not provide API guarantees for its translatable strings/messages. Changing quotes in translatable messages is (while a bit more pervasive) theoretically no different from adding more debugging information to messages or changing the order in which messages are emitted.

> the
> inclusion of "smart quotes" in error messages that are getting sent to
> systems that expect 7-bit ASCII (e.g., some implementations of syslog,
> stripped-down embedded systems, dumb terminals, etc.) is kind of making a
> mess of things.

If your locale is set up as (for example) C in ASCII, then GLib should transliterate its output to valid ASCII. If that’s not happening, please file a separate bug report about it. However, if you are using 7-bit systems which don’t correctly set their locale as non-UTF-8, then you can’t expect GLib to output messages not in UTF-8.

> My suggestion? Limit the C code to 7-bit ASCII. If you really want Unicode
> in messages, do it via the gettext() transforms.

No, that would mean twice as much work maintaining the untranslated strings in GLib. GLib (and anything which uses it) is defined as using UTF-8 internally, which means we can put fancy quotes in our internal strings. Transliteration to ASCII should happen correctly on output if the environment sets its locale correctly.
Perhaps I'm missing something, but it appears that *all* of the translation files are UTF-8 now. Some (e.g., po/en_CA.po) do still have the non-UTF-8 variants of these messages, but I assume that's an oversight.

Even if the locale is set to "C", you'll still get back the msgid (i.e., from the C source), which contains the unicode. So, there doesn't appear to be any way (going forward) to obtain clean messages, even by fiddling with the locale.
(In reply to Scott Hutton from comment #8)
> Perhaps I'm missing something, but it appears that *all* of the translation
> files are UTF-8 now. Some (e.g., po/en_CA.po) do still have the non-UTF-8
> variants of these messages, but I assume that's an oversight.
>
> Even if the locale is set to "C", you'll still get back the msgid (i.e.,
> from the C source), which contains the unicode. So, there doesn't appear to
> be any way (going forward) to obtain clean messages, even by fiddling with
> the locale.

Whenever GLib prints anything (for example, using g_message()), it converts from UTF-8 to the current locale’s character set (as obtained by calling g_get_charset()) on output. See strdup_convert() in gmessages.c, for example. As long as your locale variables are set correctly, this should work.

It’s possible there are some places where GLib doesn’t convert before output (which would be a bug), but generally I think we convert in all the right places.

What are your locale variables set to?
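One way to check what the locale is actually set to is to ask the C library which codeset it reports once the environment's locale has been adopted. The following is a minimal sketch in plain C, assuming a POSIX system with nl_langinfo(); the helper name current_codeset is hypothetical, and this makes no claim about how g_get_charset() is implemented internally.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper (not a GLib API): adopt the locale named by the
 * environment (LC_ALL, LC_CTYPE, LANG, ...) and return the codeset name
 * the C library reports for it. The string is owned by the C library. */
const char *current_codeset(void)
{
    if (setlocale(LC_ALL, "") == NULL) {
        /* The environment names a locale that is not installed;
         * select the portable "C" locale explicitly. */
        setlocale(LC_ALL, "C");
    }

    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL); /* POSIX specifies a string is always returned */
    return codeset;
}
```

Printing the result, for example with printf("%s\n", current_codeset()), shows whether the process sees a UTF-8 codeset or something else. On glibc systems the plain C locale typically reports ANSI_X3.4-1968, an alias for ASCII.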
Thanks to the character conversions, we're seeing a couple of things.

In stripped-down server environments, which only support the portable C (non-UTF-8) locale, the smart quotes are translated into question marks, which is almost as bad, and possibly worse (since at least the smart quotes *might* render if someone's terminal is set properly).

Test program, which forces one of the offending error messages:

-----
#include <glib.h>
#include <locale.h>

int main(int argc, char *argv[])
{
    const gchar *charset = NULL;
    /* Adopt the locale from the environment. */
    gchar *cur_locale = setlocale(LC_ALL, "");
    g_autoptr(GKeyFile) keyfile = NULL;
    g_autoptr(GError) err = NULL;

    g_printerr("setlocale(LC_ALL, \"\") => \"%s\"\n", cur_locale);

    /* g_get_charset() returns TRUE only when the locale charset is UTF-8. */
    if (g_get_charset(&charset)) {
        g_print("g_get_charset() => \"%s\"\n", charset);
    } else {
        g_print("g_get_charset() => FALSE\n");
    }

    /* /etc/profile is a shell script, not a key file, so loading fails. */
    keyfile = g_key_file_new();
    if (!g_key_file_load_from_file(keyfile, "/etc/profile", G_KEY_FILE_NONE, &err)) {
        g_print("error = (%s)\n", err->message);
    }

    return 0;
}
-----

Output (in a clean CentOS-7 Docker container):

-----
setlocale(LC_ALL, "") => "C"
g_get_charset() => FALSE
error = (Key file contains line ?pathmunge () {? which is not a key-value pair, group, or comment)
-----

So, it's working as you suggested it would (in that the Unicode characters are getting "fixed"), just not really what you'd expect. Before glib 2.53.1, those presented as proper quotation marks.

It doesn't look like strdup_convert() really knows how to do anything other than convert to question marks. Given that there are really only a few offending characters in the log messages (mainly quotes and ellipses), it seems like it might be worth looking into remapping those few. Unfortunately, it doesn't look like this function can be overridden from the current messages APIs.
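The remapping idea from the previous comment could look roughly like the sketch below. It is plain C with no GLib dependency, and it is only an illustration: ascii_remap_punctuation is a hypothetical helper, not an existing GLib function, and the set of characters covered (curly single and double quotes plus the ellipsis) is an assumption based on the "mainly quotes and ellipses" remark above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of GLib): return a newly malloc()ed copy
 * of a UTF-8 string in which a few typographic characters are replaced by
 * ASCII equivalents. All other bytes, including any other non-ASCII
 * sequences, are copied through unchanged. Returns NULL if allocation
 * fails; the caller frees the result with free(). */
char *ascii_remap_punctuation(const char *utf8)
{
    static const struct {
        const char *from; /* 3-byte UTF-8 sequence */
        const char *to;   /* ASCII replacement */
    } map[] = {
        { "\xE2\x80\x98", "'"   }, /* U+2018 LEFT SINGLE QUOTATION MARK */
        { "\xE2\x80\x99", "'"   }, /* U+2019 RIGHT SINGLE QUOTATION MARK */
        { "\xE2\x80\x9C", "\""  }, /* U+201C LEFT DOUBLE QUOTATION MARK */
        { "\xE2\x80\x9D", "\""  }, /* U+201D RIGHT DOUBLE QUOTATION MARK */
        { "\xE2\x80\xA6", "..." }, /* U+2026 HORIZONTAL ELLIPSIS */
    };
    const size_t n_map = sizeof map / sizeof map[0];

    assert(utf8 != NULL);

    /* Every replacement is at most as long as the 3-byte sequence it
     * replaces, so the output never exceeds the input length. */
    char *out = malloc(strlen(utf8) + 1);
    if (out == NULL)
        return NULL;

    char *dst = out;
    const char *src = utf8;
    while (*src != '\0') {
        size_t i;
        for (i = 0; i < n_map; i++) {
            /* strncmp() stops at the terminating NUL, so a truncated
             * sequence at the end of the input cannot match. */
            if (strncmp(src, map[i].from, 3) == 0) {
                size_t to_len = strlen(map[i].to);
                memcpy(dst, map[i].to, to_len);
                dst += to_len;
                src += 3;
                break;
            }
        }
        if (i == n_map)
            *dst++ = *src++; /* no mapping matched: copy one byte */
    }
    *dst = '\0';
    return out;
}
```

A caller would apply this to a message before converting it to the locale's character set, so that the key file error above would come out with plain apostrophes instead of question marks. Whether such a step belongs inside GLib's output path or in an application-level log filter is a separate design question that this sketch does not settle.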
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1205.