Bug 766092 – Incorrect locale handling in g_date_time_format_locale()

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 766092 - Incorrect locale handling in g_date_time_format_locale()


Summary:	Incorrect locale handling in g_date_time_format_locale()


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	win32
Version:	unspecified
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	gtk-win32 maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-05-07 02:08 UTC by LRN
Modified:	2016-05-16 05:19 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Testcase (1.11 KB, text/plain) 2016-05-07 02:08 UTC, LRN		Details
g_date_time_format_locale: ensure locale encoding is used (4.98 KB, patch) 2016-05-07 17:09 UTC, LRN	none	Details \| Review
g_date_time_format_locale: ensure locale encoding is used (v. 1.5) (5.35 KB, patch) 2016-05-13 09:21 UTC, Fan, Chun-wei	none	Details \| Review
g_date_time_format_locale: ensure locale encoding is used (6.71 KB, patch) 2016-05-13 11:47 UTC, LRN	committed	Details \| Review
Output on zh-TW Windows (CodePage 950) (61.76 KB, image/png) 2016-05-16 04:50 UTC, Fan, Chun-wei		Details

Description LRN 2016-05-07 02:08:01 UTC

Created attachment 327422 [details]
Testcase

How to reproduce:
* Compile and run the attached test program

Expected result:
g_date_time_format() returns "Вс" (Sunday)

Actual result:
g_date_time_format() returns gibberish

What happens:
* g_date_time_format() calls g_get_charset()
* g_get_charset() calls GetACP()
* GetACP() returns current Windows system codepage (1251, if Windows locale is Russian)
* g_get_charset() returns "CP1251" charset in its output buffer and FALSE in its return value, indicating that locale is not UTF-8
(if Windows locale is not Russian, it probably doesn't matter, as it will practically never be UTF-8, and that is the important part)
* g_date_time_format() calls g_date_time_format_locale()
* g_date_time_format_locale() calls WEEKDAY_ABBR (datetime),
* which expands to nl_langinfo (weekday_item[0][g_date_time_get_day_of_week (d) - 1])
  OR (when langinfo is missing - that is the case on Windows):
  which expands to (get_weekday_name_abbr (g_date_time_get_day_of_week (d)))
* get_weekday_name_abbr() calls C_("abbreviated weekday name", "Sun")
* C_() reads the Russian string for "Sun" from the appropriate .mo file, which is UTF-8-encoded
* g_date_time_format_locale() returns "Вс", UTF-8-encoded
* g_date_time_format() checks whether OS locale is UTF-8 (see above)
  if OS locale is not UTF-8 (it never is), g_date_time_format() calls g_locale_to_utf8() to convert UTF-8-encoded-string (casted as locale-encoded) to UTF-8

What glib does wrong:
* Calls gettext, which returs strings in encoding that has absolutely nothing to do with the OS locale, and then assumes that the result is locale-encoded (as if it was returned from nl_langinfo(), which indeed would have returned locale-encoded string).

How to fix:
A) Implement a minimal W32 version of nl_langinfo() that relies on Windows API to get the appropriate strings. This is doable.

OR

B) Change g_date_time_format(), forcing it to assume the output of g_date_time_format_locale() is UTF-8-encoded when there's no langinfo available (and thus gettext is used; assuming that gettext always returns UTF-8)

OR

C) Change g_date_time_format_locale() to convert gettext output into locale encoding (assuming that gettext always returns UTF-8)


Testcase also calls C runtime (for comparison, its output would be locale-encoded, depending on Windows locale) and also prints the expected UTF-8-encoded output (for comparison).

Comment 1 LRN 2016-05-07 02:09:56 UTC

This bug manifests itself in non-native GtkFileChooser (i.e. not reproducible in gtk3-widget-factory, thank you, mclasen!), where it uses g_date_time_format() to create strings for the "Date" column.

Comment 2 LRN 2016-05-07 17:09:19 UTC

Created attachment 327446 [details] [review]
g_date_time_format_locale: ensure locale encoding is used

This patch implements the fix (C) as outlined above.

I think it's appropriate, as some code for charset conversion already
existed in this function (for other purposes).

Comment 3 Matthias Clasen 2016-05-07 20:29:47 UTC

(In reply to LRN from comment #0)

> 
> C) Change g_date_time_format_locale() to convert gettext output into locale
> encoding (assuming that gettext always returns UTF-8)
> 

No need to assume: gettext returns utf8 because we are calling bind_textdomain_codeset

Comment 4 Ignacio Casal Quinteiro (nacho) 2016-05-12 08:17:27 UTC

Review of attachment 327446 [details] [review]:

Seems fine to me

Comment 5 Fan, Chun-wei 2016-05-13 04:05:40 UTC

Review of attachment 327446 [details] [review]:

Hi,

I had a spin at this, but I think option (B) would be the best for me.  Option (C) will still return gibberish if I run it in Chinese (obviously with the text/codepage in Chinese, so this may mean that the g_utf8_to_locale() would need to take into account of the nul characther that may appear in such an encoding).

Sorry for the delay, with blessings, and cheers!

Comment 6 LRN 2016-05-13 04:29:37 UTC

(In reply to Fan, Chun-wei from comment #5)
> Review of attachment 327446 [details] [review] [review]:
> I had a spin at this, but I think option (B) would be the best for me. 
> Option (C) will still return gibberish if I run it in Chinese (obviously
> with the text/codepage in Chinese, so this may mean that the
> g_utf8_to_locale() would need to take into account of the nul characther
> that may appear in such an encoding).

Um...so what you're saying is that
>	  name = WEEKDAY_ABBR (datetime);
          ^^ this code returns Chinese weekday string, UTF-8-encoded, and then
>#if !defined (HAVE_LANGINFO_TIME)
>	  if (!locale_is_utf8)
          ^^ this condition evaluates to TRUE for Chinese locale, and
>	    {
>	      tmp = g_locale_from_utf8 (name, -1, NULL, NULL, NULL);
              ^^ this call produces Chinese weekday string in some kind of local Chinese encoding (what are you guys using there anyway?), which can have NUL characters (??? no NUL-terminated strings?), and because of that
>	      if (!tmp)
>		return FALSE;
>	      g_string_append (outstr, tmp);
              ^^ this code fails to property append locale-encoded string, because it interprets valid NUL-character as a string terminator
>	      g_free (tmp);
>	    }
>	  else
>#endif
>	    {
>	      g_string_append (outstr, name);
>	    }

...or am i wrong?
I had hoped that this code would work on all non-US-ASCII locales.

Option (B) shouldn't be much different from option (C), it just makes one global assumption (*all* output of g_date_time_format_locale() is UTF-8-encoded when not using langinfo) instead of handling encoding on a call-by-call basis (where we know for sure that the string just returned from gettext has UTF-8 encoding). At this point i am not entirely sure that we *can* make such global assumption (and that future changes to this function won't break this assumption).

Comment 7 LRN 2016-05-13 04:33:52 UTC

Also, i think it's worth mentioning that my testcase has lots of hardcoded stuff, and its output might look like gibberish on some locales (this is related to console text encoding). The important part is that the "expected" and "glib" outputs should look identical.

If needed, i can make a GTK+-based testcase (which uses non-native filechooser dialog, where UTF-8 strings can be correctly displayed), for which you would only need to set LANG=... to make it work.

Comment 8 Fan, Chun-wei 2016-05-13 09:21:45 UTC

Created attachment 327768 [details] [review]
g_date_time_format_locale: ensure locale encoding is used (v. 1.5)

(In reply to LRN from comment #6)
> 
> Um...so what you're saying is that
> >	  name = WEEKDAY_ABBR (datetime);
>           ^^ this code returns Chinese weekday string, UTF-8-encoded

Yes.

> >#if !defined (HAVE_LANGINFO_TIME)
> >	  if (!locale_is_utf8)
>           ^^ this condition evaluates to TRUE for Chinese locale, and

No, it returned FALSE.

> >	    {
> >	      tmp = g_locale_from_utf8 (name, -1, NULL, NULL, NULL);
>               ^^ this call produces Chinese weekday string in some kind of
> local Chinese encoding (what are you guys using there anyway?), which can
> have NUL characters (??? no NUL-terminated strings?), and because of that

>               ^^ this code fails to property append locale-encoded string,
> because it interprets valid NUL-character as a string terminator

zh_TW, Codepage 950.  Note that due to the complexity of the characters, each UTF-8 character of this (and Japanese and Korean characters, at least) can take from 2-4 bytes.  There could then be a nul byte within.  It seems that I was mistaken--it seems that a combination of (B) and (C) are needed for me, so that things will work.

(In reply to LRN from comment #7)
> Also, i think it's worth mentioning that my testcase has lots of hardcoded
> stuff, and its output might look like gibberish on some locales (this is
> related to console text encoding). The important part is that the "expected"
> and "glib" outputs should look identical.

Yup, I understand that.  To double-check this, I ran a g_message on the name that was returned and the text matched the expected text.  I am totally fine with this test.

---
I updated your patch a bit as a result-does it still work alright with yours?

Comment 9 LRN 2016-05-13 11:47:26 UTC

Created attachment 327783 [details] [review]
g_date_time_format_locale: ensure locale encoding is used

v2:
* Keep the length of locale-encoded string after UTF8->locale conversion,
  use that length when appending.

(In reply to Fan, Chun-wei from comment #8)
>>>#if !defined (HAVE_LANGINFO_TIME)
>>>	  if (!locale_is_utf8)
>>           ^^ this condition evaluates to TRUE for Chinese locale, and
> 
> No, it returned FALSE.

Does "locale_is_utf8" evaluate to FALSE or does "!locale_is_utf8" evaluate to FALSE?

>>>	    {
>>>	      tmp = g_locale_from_utf8 (name, -1, NULL, NULL, NULL);
>>               ^^ this call produces Chinese weekday string in some kind of
>> local Chinese encoding (what are you guys using there anyway?), which can
>> have NUL characters (??? no NUL-terminated strings?), and because of that
>
>>               ^^ this code fails to property append locale-encoded string,
>> because it interprets valid NUL-character as a string terminator
> 
> zh_TW, Codepage 950.  Note that due to the complexity of the characters,
> each UTF-8 character of this (and Japanese and Korean characters, at least)
> can take from 2-4 bytes.  There could then be a nul byte within.  It seems
> that I was mistaken--it seems that a combination of (B) and (C) are needed
> for me, so that things will work.
Okay, so these are multibyte encodings. With NUL bytes.
What it means is that instead of
>   tmp = g_locale_from_utf8 (name, -1, NULL, NULL, NULL);
>   g_string_append (outstr, tmp);
we should do
>   tmp = g_locale_from_utf8 (name, -1, NULL, &tmp_len, NULL);
>   g_string_append_len (outstr, tmp, tmp_len);
as g_string_append_len() can handle embedded NULs. To be honest, i'm surprised that this code worked for multibyte locales *at all*.

>
>(In reply to LRN from comment #7)
>> Also, i think it's worth mentioning that my testcase has lots of hardcoded
>> stuff, and its output might look like gibberish on some locales (this is
>> related to console text encoding). The important part is that the "expected"
>> and "glib" outputs should look identical.
>
> Yup, I understand that.  To double-check this, I ran a g_message on the name
> that was returned and the text matched the expected text.  I am totally fine
> with this test.
> 
> ---
> I updated your patch a bit as a result-does it still work alright with yours?

Your version of the patch does not work ("expected" and "GLib" output are different).

Your changes do not appear to make sense at the moment. What your addition in g_date_time_format() does is it makes it always assume that locale is UTF-8, but only at the end, when it needs to return the result.
But your locale is not UTF-8. Or, at least, it shouldn't be. This, given your earlier statement regarding "locale_is_utf8", is very suspicious.

Would you please debug the code and see what happens in g_get_charset()? What does GetACP() returns on your system?

I'm attaching the new version of my original patch (without your changes from 1.5), which should handle embedded NULs correctly.

I've tried to modify my testcase to get strings in zh_TW.UTF-8 locale, but g_locale_from_utf8() just fails with EILSEQ error, as it tries to convert zh_TW UTF-8 characters into CP1251 and, obviously, fails spectacularly; i don't see a convenient way to change locale (what GetACP() returns) programmatically.
You might still want to change the testcase (switch LANG=ru_RU.UTF-8 to LANG=zh_TW.UTF-8 - that way both UTF-8 strings from gettext and the end result are representable in your locale encoding).

Comment 10 Fan, Chun-wei 2016-05-14 03:27:05 UTC

(In reply to LRN from comment #9)

> Does "locale_is_utf8" evaluate to FALSE or does "!locale_is_utf8" evaluate
> to FALSE?

Sorry, I was not clear in my previous reply.  This means locale_is_utf8 is FALSE.

> 
> >
> >(In reply to LRN from comment #7)
> >> Also, i think it's worth mentioning that my testcase has lots of hardcoded
> >> stuff, and its output might look like gibberish on some locales (this is
> >> related to console text encoding). The important part is that the "expected"
> >> and "glib" outputs should look identical.
> >
> > Yup, I understand that.  To double-check this, I ran a g_message on the name
> > that was returned and the text matched the expected text.  I am totally fine
> > with this test.
> > 
> > ---
> > I updated your patch a bit as a result-does it still work alright with yours?
> 
> Your version of the patch does not work ("expected" and "GLib" output are
> different).
> 
> Your changes do not appear to make sense at the moment. What your addition
> in g_date_time_format() does is it makes it always assume that locale is
> UTF-8, but only at the end, when it needs to return the result.
> But your locale is not UTF-8. Or, at least, it shouldn't be. This, given
> your earlier statement regarding "locale_is_utf8", is very suspicious.
> 
> Would you please debug the code and see what happens in g_get_charset()?
> What does GetACP() returns on your system?

GetACP() returns 950, so _g_locale_charset_raw() would return CP950.

I do agree it would be tough for you to get a Chinese locale working on your system. :)  Thanks though :).

It did seem to me that when we don't have HAVE_LANGINFO_TIME, if locale_is_utf8 is FALSE (which is going to be our case on Windows), the outstr that is "returned" by g_date_time_format() would contain a string that is already encoded in UTF-8, so doing g_locale_to_utf8() would break the string, causing it to be gibberish (but it will match the expected string of the test program if the test program was encoded in UTF-8 without BOM, but this is most probably not we are ultimately looking for).  The "correct" string, as listed in zh_TW.po ("週日", the abbreviated form of "Sunday" for me), would be displayed correctly if we just do "return g_string_free (outstr, FALSE);" (hence my add-on patch in v1.5).

The "expected" string is going to differ depending on what encoding the test program is encoded, at least, and what locale your system is set to be when that code was built.  For me, if the test source is not encoded in UTF-8 but in ANSI/CodePage 950, the GLib output string and the expected string will match when we just directly do "return g_string_free (outstr, FALSE);" without doing g_locale_to_utf8() on outstr->str in g_date_time_format().

My take for this: what I think we really need to look for is whether GLib outputs the string that matches the one we need to look for in the corresponding .po that is going to be used (i.e. ru.po for you, zh_TW.po for me).

Hope this explains the situation better.

With blessings, thank you, and cheers!

Comment 11 LRN 2016-05-14 05:15:43 UTC

(In reply to Fan, Chun-wei from comment #10)
> (In reply to LRN from comment #9)
>> 
>>>
>>>(In reply to LRN from comment #7)
>>>> Also, i think it's worth mentioning that my testcase has lots of hardcoded
>>>> stuff, and its output might look like gibberish on some locales (this is
>>>> related to console text encoding). The important part is that the "expected"
>>>> and "glib" outputs should look identical.
>>>
>>> Yup, I understand that.  To double-check this, I ran a g_message on the name
>>> that was returned and the text matched the expected text.  I am totally fine
>>> with this test.
>>> 
>>> ---
>>> I updated your patch a bit as a result-does it still work alright with yours?
>> 
>> Your version of the patch does not work ("expected" and "GLib" output are
>> different).
>> 
>> Your changes do not appear to make sense at the moment. What your addition
>> in g_date_time_format() does is it makes it always assume that locale is
>> UTF-8, but only at the end, when it needs to return the result.
>> But your locale is not UTF-8. Or, at least, it shouldn't be. This, given
>> your earlier statement regarding "locale_is_utf8", is very suspicious.
>> 
>> Would you please debug the code and see what happens in g_get_charset()?
>> What does GetACP() returns on your system?
> 
> GetACP() returns 950, so _g_locale_charset_raw() would return CP950.

Okay, this is expected and correct.

> It did seem to me that when we don't have HAVE_LANGINFO_TIME, if
> locale_is_utf8 is FALSE (which is going to be our case on Windows), the
> outstr that is "returned" by g_date_time_format() would contain a string
> that is already encoded in UTF-8,
You meant g_date_time_format_locale(), obviously.
And no, that would not be correct. By definition, g_date_time_format_locale() should return a string in locale-dependent encoding (that is why it has "locale" in its name). That is why (B) seemed unattractive - it changes the semantics of this function by having it return UTF-8-encoded string. (C) ensures that g_date_time_format_locale() *does* return string in locale-dependent encoding, even if we have to convert some UTF-8-encoded pieces we get from gettext to locale-dependent encoding inside of it (which is, obviously, somewhat inefficient, as later on we convert the whole string back to UTF-8).

> The "expected" string is going to differ depending on what encoding the test
> program is encoded, at least, and what locale your system is set to be when
> that code was built.  For me, if the test source is not encoded in UTF-8 but
> in ANSI/CodePage 950, the GLib output string and the expected string will
> match when we just directly do "return g_string_free (outstr, FALSE);"
> without doing g_locale_to_utf8() on outstr->str in g_date_time_format().

Whoa. Right, sorry, i forgot that not everyone uses GCC, which supports UTF-8-encoded string constants natively. I should have specified the string constant as a combination of escape sequences so that it would work in MSVC. Would you please replace the string with "\xD0\x92\xD1\x81" or (if you modified the test to set LANG=zh_TW.UTF-8) "\xE9\x80\xB1\xE6\x97\xA5" and try again, with my patch?

Comment 12 Fan, Chun-wei 2016-05-16 04:50:48 UTC

Created attachment 327957 [details]
Output on zh-TW Windows (CodePage 950)

Hi LRN,

It now seems that the code changes in GLib are correct, thanks!

What really happened was that the for loops in the test program broke the unicode string somehow, due to the complexity in UTF-8 characters which can be variable-length for CJK strings (for example, there are 3 bytes each involved for "週" and "日").  Please see the enclosed image for the representation for this.

I tested this with Visual Studio 2008 and 2015, so for Visual Studio versions in between I would expected to be alright.

So, as a quick work around I reversed the order of printing in the test program: print the actual string first and then the guint8 representation of each UTF-8 character, so that things are okay.

As a side note, on Visual Studio, the source files that use UTF-8 strings directly should still be saved as UTF-8 without BOM.

With blessings, thank you, and cheers!

Comment 13 Fan, Chun-wei 2016-05-16 04:54:19 UTC

Review of attachment 327783 [details] [review]:

.

Comment 14 LRN 2016-05-16 05:19:47 UTC

Attachment 327783 [details] pushed as 6a1e8e8 - g_date_time_format_locale: ensure locale encoding is used