GNOME Bugzilla – Bug 408637
g_date_strftime failure
Last modified: 2018-05-24 10:58:15 UTC
Add the following snippet to testgdate.c:

    setlocale(LC_TIME,"fi_FI");
    g_date_set_dmy(d, 12, 1, 2006);
    g_date_strftime(buf,100,"Today is a %b\n", d);
    g_print ("[%s]\n", buf);

and observe:

    (process:19948): GLib-WARNING **: gdate.c:1493Error converting results of strftime to UTF-8: Invalid byte sequence in conversion input

[which is an ugly error message, btw.]

The problem is that strftime looks at LC_TIME whereas g_locale_to_utf8 looks at something else. My LANG is en_US.UTF-8.
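To see what strftime itself returns without GLib in the way, a plain C helper along these lines may be useful. It is only a sketch: the function name month_abbrev_in_locale is hypothetical, and the bytes it produces for a locale such as "fi_FI" depend on which locales are installed and on the C library version.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format the abbreviated name of `month` (1..12) with
 * strftime("%b") while LC_TIME is temporarily set to `time_locale`.
 * Returns the number of bytes written, or -1 on failure.
 * Only LC_TIME is changed; the other categories are left alone. */
int month_abbrev_in_locale(const char *time_locale, int month,
                           char *out, size_t out_len)
{
    char saved[256];
    const char *cur = setlocale(LC_TIME, NULL);
    struct tm tm;
    size_t n;

    if (cur == NULL || strlen(cur) >= sizeof saved || month < 1 || month > 12)
        return -1;
    strcpy(saved, cur);            /* the returned string may be overwritten later */

    if (setlocale(LC_TIME, time_locale) == NULL)
        return -1;                 /* locale not available on this system */

    memset(&tm, 0, sizeof tm);
    tm.tm_mday = 1;
    tm.tm_mon = month - 1;         /* struct tm months are 0-based */
    tm.tm_year = 2006 - 1900;

    n = strftime(out, out_len, "%b", &tm);

    setlocale(LC_TIME, saved);     /* restore the caller's LC_TIME */
    return n > 0 ? (int) n : -1;
}
```

Calling it with "fi_FI" and dumping the result byte by byte should show whether the string is in the encoding that LC_CTYPE or LANG would suggest.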
Don't do that then?
Why would you say that? I need to get translated month names in a data-dependent locale without interfering with number formatting, so LC_ALL is out of the question. strftime is documented to use LC_TIME, so the code shouldn't assume it uses anything else. Or are you saying that people shouldn't set different LC_* variables to different values? Doing so is quite common, although typically it is LC_NUMERIC that is set.
I mean, doesn't the warning mean that LC_TIME=fi_FI uses a different charset than that which g_get_charset() returns (presumably UTF-8 in your case, as you have LANG set to use UTF-8)? Isn't such a combination broken by design? What happens if you set LC_TIME to fi_FI.UTF-8 instead?
There is nothing broken about setting LC_whatever to something that uses a different character set than other LC settings. I set it, the C library uses it and returns the right value. And that's it. (Well, in this case the C library doesn't return the right value -- the character that trips up glib is 0xa0 which shouldn't have been there to begin with. But June and July would cause actual problems since they contain ä.)

The problem arises inside glib when it assumes that all strings from the C library come back in the same encoding. Well, strftime is the documented exception to that rule.

> What happens if you set LC_TIME to fi_FI.UTF-8 instead?

Then, of course, I don't get an error message. That's beside the point, though. But locale values are not something I get to pick and choose. There is a fixed set of valid values. Worse, the codeset part is not even standardized, see http://www.debian.org/doc/manuals/intro-i18n/ch-locale.en.html:

[...] There are no standard for codeset and modifier. [...]

My language strings are data dependent so I can't punt it out to the user. And I cannot simply tag on the result of g_get_charset because that doesn't work due to aliases, notably hyphens-vs-dashes-vs-nothing in ISO-8859-1. I cannot parse the result of "setlocale (LC_MESSAGES, NULL)" because that is (in theory and practice) an opaque string.
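To illustrate why bytes such as 0xa0 or a Latin-1 ä are rejected when the result is assumed to be UTF-8, here is a minimal UTF-8 validity check. It is a sketch only: GLib's actual conversion path is not reproduced here, and the function name is_valid_utf8 is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if s[0..len) is well-formed UTF-8, else 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) {                 /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation byte (e.g. 0xa0) or bad lead */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence truncated at end of input */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* lead byte not followed by a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* reject overlong forms, surrogates and values beyond U+10FFFF */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```

A lone 0xa0 is a continuation byte with no lead byte, and a Latin-1 0xe4 (ä) looks like the start of a three-byte sequence that the following ASCII letter does not continue, so both fail the check.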
> There is nothing broken about setting LC_whatever to something that uses
> a different character set than other LC settings. I set it, the C
> library uses it and returns the right value. And that's it.

Well, one thing that's broken is that there is nl_langinfo (CODESET) which returns "the character encoding used in the selected locale". There is no similar function to get "the character encoding used for the parts of localedata which happens to depend on LC_TIME". So at least nl_langinfo seems to promote the idea that there should be a single charset for all aspects of localedata.
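One possible way around the missing per-category query is to point LC_CTYPE at the LC_TIME locale for a moment and ask nl_langinfo (CODESET) then. The sketch below assumes that the string returned by setlocale (LC_TIME, NULL) is accepted as a locale name for LC_CTYPE, which the C standard does not promise. The function name lc_time_codeset is hypothetical, and because setlocale changes process-wide state this is not safe while other threads are running.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the codeset of the current LC_TIME locale into
 * `out`. Returns 1 on success, 0 on failure. Temporarily changes LC_CTYPE,
 * so it must not run concurrently with other locale-dependent code. */
int lc_time_codeset(char *out, size_t out_len)
{
    char saved_ctype[256];
    char time_locale[256];
    const char *cur;
    int ok = 0;

    if (out == NULL || out_len == 0)
        return 0;

    /* Copy both names: later setlocale calls may overwrite the returned strings. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved_ctype)
        return 0;
    strcpy(saved_ctype, cur);

    cur = setlocale(LC_TIME, NULL);
    if (cur == NULL || strlen(cur) >= sizeof time_locale)
        return 0;
    strcpy(time_locale, cur);

    /* Assumption: the LC_TIME name is usable as an LC_CTYPE locale name. */
    if (setlocale(LC_CTYPE, time_locale) != NULL) {
        snprintf(out, out_len, "%s", nl_langinfo(CODESET));
        ok = 1;
    }

    setlocale(LC_CTYPE, saved_ctype);   /* restore the original LC_CTYPE */
    return ok;
}
```

The returned name could then be passed to a converter as the source encoding for strftime output, instead of the charset derived from LC_CTYPE.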
I know this is a very old report, but I ran into the issue today. I started to use g_date_set_parse() and saw it works fine if I don't touch any locale settings, but fails miserably if I set LANG=C, which is a very common thing to do to get an untranslated program.

Maybe, as you suggest, the libc is broken in not allowing the per-category charset to be fetched, but this is a really annoying issue.

A workaround for some (most?) situations would perhaps be to get g_get_charset() to guess UTF-8 as a fallback instead of US-ASCII -- although it's not a real solution since it wouldn't fix it on systems using a non-UTF-8 locale. However, it's most likely a harmless thing because UTF-8 is compatible with US-ASCII and has a very strict and unambiguous representation, so it's really unlikely a non-UTF-8 charset could be successfully parsed as UTF-8. What I mean is that if LC_TIME actually uses ISO-8859-1, it's most likely that g_locale_to_utf8() will just fail like it currently does; while if it is UTF-8 it'd work just fine.

A more real fix would maybe be to parse LC_TIME to get the charset, and if none is found fall back on g_get_charset(). I don't know the complete rules for locale settings, but extracting the encoding from something like fr_FR.UTF-8 is mostly a matter of:

    const char *lc_time = g_getenv("LC_TIME");
    const char *p = lc_time ? strchr(lc_time, '.') : NULL;

    if (p && p[1]) {
      lc_time_charset = &p[1];
    } else {
      g_get_charset(&lc_time_charset);
    }
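The parsing idea can be sketched without GLib as a small standalone function. This assumes the common language_TERRITORY.codeset@modifier layout, which, as noted earlier in this report, is not standardized, so it should be treated as a heuristic. The name codeset_from_locale_name is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: extract the codeset part of a locale name such as
 * "fr_FR.UTF-8" or "de_DE.ISO-8859-15@euro" into `out`.
 * Returns 1 if a codeset was found, 0 otherwise (the caller would then fall
 * back on something like g_get_charset()). */
int codeset_from_locale_name(const char *name, char *out, size_t out_len)
{
    const char *start, *end;
    size_t n;

    if (name == NULL || out == NULL || out_len == 0)
        return 0;

    start = strchr(name, '.');
    if (start == NULL)
        return 0;                   /* no codeset in the name, e.g. "fi_FI" */
    start++;                        /* skip the '.' */

    end = strchr(start, '@');       /* an optional modifier ends the codeset */
    n = end ? (size_t) (end - start) : strlen(start);

    if (n == 0 || n >= out_len)
        return 0;                   /* empty codeset or buffer too small */

    memcpy(out, start, n);
    out[n] = '\0';
    return 1;
}
```

The input could come from the LC_TIME environment variable, but a complete lookup would also need to consider LC_ALL and LANG, which take part in determining the effective LC_TIME locale.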
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/81.