GNOME Bugzilla – Bug 793578
gdatetime tests depend on Japanese translation of month names
Last modified: 2018-03-14 12:04:21 UTC
ERROR: gdatetime - Bail out! GLib:ERROR:gdatetime.c:1439:test_non_utf8_printf: assertion failed (__p == ("10\346\234\210")): ("Oct" == "10\346\234\210") I get this error running make test on the master branch commit 96f39990f7fc5f15ab17f1e292c5d0ec506a6329 from the github mirror GNOME/Glib. Built on Fedora 27 X86_64.
This is because the ja.po translation hasn’t been updated since commit be4f96b6502c01d2a51d60b7a669c8ef82e22a4d landed (bug #749206), which split the month translations into nominative and genitive versions (with different translation contexts). Takayuki, would it be possible for the Japanese translation for GLib to be updated as a priority, please?
I think we need a patch which will add msgctxt "full month name with day" and msgctxt "abbreviated month name with day" as a copy of msgctxt "full month name" and msgctxt "abbreviated month name", respectively. For all not yet updated locales. Urgently, as an example of Japanese shows.
Right. ja.po can be fixed to add "full month name with day" and "abbreviated month name with day" contexts. --------------- >8 =============== #: ../glib/gdatetime.c:459 msgctxt "full month name with day" msgid "October" msgstr "10月" #: ../glib/gdatetime.c:542 msgctxt "abbreviated month name with day" msgid "Oct" msgstr "10月" --------------- >8 =============== But g_date_time_format_locale() still have a problem: --------------- >8 =============== case 'b': name = alt_digits ? MONTH_ABBR_STANDALONE (datetime) : MONTH_ABBR_WITH_DAY (datetime); if (g_strcmp0 (name, "") == 0) return FALSE; #if !defined (HAVE_LANGINFO_TIME) if (!locale_is_utf8) { tmp = g_locale_from_utf8 (name, -1, NULL, &tmp_len, NULL); if (!tmp) return FALSE; g_string_append_len (outstr, tmp, tmp_len); g_free (tmp); } else #endif --------------- >8 =============== MONTH_ABBR_STANDALONE() returns non-utf8 if HAVE_LANGINFO_TIME is defined but utf-8 if HAVE_LANGINFO_TIME is not defined. MONTH_ABBR_WITH_DAY() returns non-utf-8 if HAVE_LANGINFO_ABALTMON is defined but utf-8 if HAVE_LANGINFO_ABALTMON is not defined. In this case, alt_digit == FALSE and if HAVE_LANGINFO_ABALTMON is not defined, %b, %B, %h are still failed in glib/glib/tests/gdatetime.c
Thanks for the analysis. I’ve got a patch which I’ll attach shortly which should fix the conditional encoding problems. --- Here’s my analysis of the conditions, for future reference: MONTH_ABBR_STANDALONE: - defined(HAVE_LANGINFO_ABALTMON): locale - !defined(HAVE_LANGINFO_ABALTMON) && defined(HAVE_LANGINFO_TIME): locale - !defined(HAVE_LANGINFO_ABALTMON) && !defined(HAVE_LANGINFO_TIME): UTF-8 MONTH_ABBR_DAY: - defined(HAVE_LANGINFO_ABALTMON) && defined(HAVE_LANGINFO_TIME): locale - defined(HAVE_LANGINFO_ABALTMON) && !defined(HAVE_LANGINFO_TIME): UTF-8 - !defined(HAVE_LANGINFO_ABALTMON): UTF-8
Created attachment 368645 [details] [review] gdatetime: Fix locale handling for nl_langinfo() calls With the various macros we use to provide fallbacks for missing nl_langinfo() fields, the locale handling can become quite complex: nl_langinfo() returns strings encoded in the current locale, but C_() returns strings encoded in UTF-8 (by GLib convention — you do actually need to call bind_textdomain_codeset() to achieve this). There are various format placeholders, especially with the new %Ob, %OB, %Oh placeholders, which conditionally call nl_langinfo() or something based on C_(). This makes encoding handling difficult. Add additional macros which indicate whether the macros they’re paired with return something encoded in the current locale, or encoded in UTF-8. The user of the macro can then use these to work out whether to re-encode. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 368647 [details] [review] tests: Ensure gettext strings are loaded in UTF-8 in gdatetime.c See the previous commit. By convention, GLib assumes strings loaded from gettext are always in UTF-8, but we do need to tell gettext this. In most other tests, it doesn’t matter; but in the gdatetime test, we test re-encoding month names from EUC-JP, so we need to ensure the translations start in UTF-8 correctly. Signed-off-by: Philip Withnall <withnall@endlessm.com>
With a locally-modified ja.po, these patches fix the issue for me. I’m not going to push any changes to ja.po, since I am not a Japanese speaker. I’ll leave that to someone from the Japanese translation team.
> -#if !defined (HAVE_LANGINFO_TIME) > - if (!locale_is_utf8) > + if (!locale_is_utf8 && > + ((alt_digits && !MONTH_ABBR_STANDALONE_IS_LOCALE) || > + (!alt_digits && !MONTH_ABBR_WITH_DAY_IS_LOCALE))) What about a shorter expression like: alt_digits ? !MONTH_ABBR_STANDALONE_IS_LOCALE : !MONTH_ABBR_WITH_DAY_IS_LOCALE I have not tested this but I'm not sure if you guys have taken this into account: if month (weekday, etc.) names are provided by the libc library then their locale is controlled by LC_TIME environment variable, otherwise it is controlled by LC_MESSAGES. What if these two variables are set to different languages? different charsets? (Is it legal to set them to different charsets?)
(In reply to Rafal Luzynski from comment #8) > > -#if !defined (HAVE_LANGINFO_TIME) > > - if (!locale_is_utf8) > > + if (!locale_is_utf8 && > > + ((alt_digits && !MONTH_ABBR_STANDALONE_IS_LOCALE) || > > + (!alt_digits && !MONTH_ABBR_WITH_DAY_IS_LOCALE))) > > What about a shorter expression like: > > alt_digits ? !MONTH_ABBR_STANDALONE_IS_LOCALE : > !MONTH_ABBR_WITH_DAY_IS_LOCALE I don’t think mixing ternary conditionals and normal boolean expressions makes things easier to read, but I realise this is a matter of taste. > I have not tested this but I'm not sure if you guys have taken this into > account: if month (weekday, etc.) names are provided by the libc library > then their locale is controlled by LC_TIME environment variable, otherwise > it is controlled by LC_MESSAGES. What if these two variables are set to > different languages? different charsets? (Is it legal to set them to > different charsets?) Hmm. If that’s a problem, it’s always been a problem, since gdatetime.c (and, aiui, the rest of GLib) only ever handles one locale character set — whatever’s returned by g_get_charset(). That should safely support differing values of LC_TIME and LC_MESSAGES *as long as they use the same character set*. If they use different character sets, I suspect various parts of gdatetime.c are already broken (and this patch will not fix that, but it will also not make it significantly worse). See also: bug #688139, bug #687945. Bug #687945, especially, makes me fairly happy to consider LC_TIME and LC_MESSAGES having different character sets being out of scope.
Rafal, would you be able to review these patches? The other GLib maintainers are a bit busy at the moment (frolicking at a hackfest). Takao Fujiwara, can someone from the Japanese team please update the ja.po translation for GLib? Alternatively, let me know if https://gitlab.gnome.org/GNOME/glib/commit/8d6504e9a36b9d38ae8488ab25ad312eb25343b5 is grammatically correct (I think it is from what you said in comment #3, but since I do not speak Japanese I cannot be sure); if so, I will push that soon. Thank you.
(In reply to Philip Withnall from comment #10) > Rafal, would you be able to review these patches? The other GLib maintainers > are a bit busy at the moment (frolicking at a hackfest). I'm not a GLib maintainer and I have no power to tell if the patch should be accepted or rejected but I'll just do my best. > Takao Fujiwara, can someone from the Japanese team please update the ja.po > translation for GLib? Alternatively, let me know if > https://gitlab.gnome.org/GNOME/glib/commit/ > 8d6504e9a36b9d38ae8488ab25ad312eb25343b5 is grammatically correct (I think > it is from what you said in comment #3, but since I do not speak Japanese I > cannot be sure); if so, I will push that soon. Thank you. The commit link now has become: https://gitlab.gnome.org/GNOME/glib/commit/e4d193fa In case of a doubt this is the ultimate source: http://st.unicode.org/cldr-apps/v#/ja/Gregorian/ and it seems that Piotr's patch is correct. Although of course it'd better if an actual Japanese translator fixed this. (BTW, I wonder if it is correct that all abbreviated month names below 10 have a leading space; CLDR does not have it.)
Review of attachment 368645 [details] [review]: > With the various macros we use to provide fallbacks for missing > nl_langinfo() fields, the locale handling can become quite complex: > nl_langinfo() returns strings encoded in the current locale, but C_() I'm not sure this comment is correct. Is the problem maybe that nl_langinfo() uses LC_TIME environment variable to determine the current locale and C_() uses LC_MESSAGES, and they may be the same or different? Or is that even if LC_MESSAGES sets a non-UTF-8 locale then GLib's implementation of gettext always returns UTF-8? > There are various format placeholders, especially with the new %Ob, %OB, > %Oh placeholders, [...] Is this what you wanted to write or a typo? "placeholders... with the new placeholders..." - I've always called them "format specifiers" or "format specifiers with modifiers" but I am not an English native speaker so I may be wrong. > Add additional macros which indicate whether the macros they’re paired > with return something encoded in the current locale, or encoded in > UTF-8. This is probably wrong: the other macros are always in the current locale (if you set the new locale it becomes the current locale). I guess the problem is whether the current locale is UTF-8 or not. Also maybe whether you mean the locale set by LC_TIME or LC_MESSAGES or anything else. ::: glib/gdatetime.c @@ +197,2 @@ #define WEEKDAY_ABBR(d) nl_langinfo (weekday_item[0][g_date_time_get_day_of_week (d) - 1]) +#define WEEKDAY_ABBR_IS_LOCALE TRUE 1. Shouldn't you provide the same additional macro for GET_AMPM (just several lines above) and then use it in format_ampm()? 2. I think it's difficult to understand what the suffix "_IS_LOCALE" means. My first thought was "locale is a symbol of a language, like en_US or ja_UP.utf8, but why a weekday is a locale?" I guess that your idea is that WEEKDAY_ABBR is non-UTF-8 or may be non-UTF-8 and needs a conversion, or maybe you should use the pattern like "WEEKDAY_ABBR_IS_UTF8" and swap the TRUE/FALSE values? I have no good solution here, please take this as a weak concern and feel free to ignore if you think this is correct. @@ +199,2 @@ #define WEEKDAY_FULL(d) nl_langinfo (weekday_item[1][g_date_time_get_day_of_week (d) - 1]) +#define WEEKDAY_FULL_IS_LOCALE TRUE Same comment here and everywhere about "…_IS_LOCALE", I will not repeat this to avoid noise. @@ -2911,3 @@ if (g_strcmp0 (name, "") == 0) return FALSE; -#if !defined (HAVE_LANGINFO_TIME) Why remove this line? Is it because HAVE_LANGINFO_TIME now determines the value of WEEKDAY_ABBR_IS_LOCALE? Would it work if you wrote instead: #if !WEEKDAY_ABBR_IS_LOCALE and left everything else in this chunk unchanged? @@ +2943,3 @@ if (g_strcmp0 (name, "") == 0) return FALSE; + if (!locale_is_utf8 && !WEEKDAY_FULL_IS_LOCALE) Same here and everywhere else. @@ +2963,3 @@ + if (!locale_is_utf8 && + ((alt_digits && !MONTH_ABBR_STANDALONE_IS_LOCALE) || + (!alt_digits && !MONTH_ABBR_WITH_DAY_IS_LOCALE))) Those two lines are badly formatted and for this reason I mark this patch as needs-work. The first line has 1 tab and 2 spaces (10 columns), which is OK, the second has 8 spaces and I think it should have 1 tab + 6 spaces (to align with the '!' in the first line), the third line has 9 spaces and I think this is wrong but I don't have a strong preference, maybe 1 tab + 7 spaces (to align with the second parenthesis in the 2nd line). No matter what, this formatting is not nice. @@ +2983,3 @@ + if (!locale_is_utf8 && + ((alt_digits && !MONTH_FULL_STANDALONE_IS_LOCALE) || + (!alt_digits && !MONTH_FULL_WITH_DAY_IS_LOCALE))) Same remarks about the formatting issue here. @@ +3038,3 @@ + if (!locale_is_utf8 && + ((alt_digits && !MONTH_ABBR_STANDALONE_IS_LOCALE) || + (!alt_digits && !MONTH_ABBR_WITH_DAY_IS_LOCALE))) And again same here.
Review of attachment 368647 [details] [review]: ::: glib/tests/gdatetime.c @@ +2279,2 @@ /* GDateTime Tests */ + bind_textdomain_codeset ("glib20", "UTF-8"); I don't understand why this is necessary. In my machine it works correctly both with and without this line. But my machine is so tainted that lots of things work while they should not work. This is not a strong objection, feel free to commit if you think it is OK and if it fails in your machine without this line.
I must also add that the patches compile and the test cases pass in my computer. It is very difficult for me to emulate the environment where the glibc library does not support all alternative month names because all my machines have been using the patched glibc for ~2 years now. So no wonder that lots of things in this GLib work for me because they just fall back to the underlying glibc which works correctly. But I think that at least some of my tests were correct because I had to amend ja.jp file according to Takao Fujiwara's quote in comment 3. Thank you for your patches, Philip. My concerns are minor and feel free to disagree with them.
(In reply to Rafal Luzynski from comment #12) > Review of attachment 368645 [details] [review] [review]: > > > With the various macros we use to provide fallbacks for missing > > nl_langinfo() fields, the locale handling can become quite complex: > > nl_langinfo() returns strings encoded in the current locale, but C_() > > I'm not sure this comment is correct. Is the problem maybe that > nl_langinfo() uses LC_TIME environment variable to determine the > current locale and C_() uses LC_MESSAGES, and they may be the same > or different? Or is that even if LC_MESSAGES sets a non-UTF-8 locale > then GLib's implementation of gettext always returns UTF-8? Right, gettext() always returns UTF-8 for non-UTF-8 locales too in GNOME probject. You could find the charset in PO file is UTF-8. (In reply to Rafal Luzynski from comment #11) > The commit link now has become: > https://gitlab.gnome.org/GNOME/glib/commit/e4d193fa Great. > (BTW, I wonder if it is correct that all abbreviated month names below 10 > have a leading space; CLDR does not have it.) I'm also not sure. Probably the space is a policy in GNOME Japanese translators. Probably I think it's fine if the glib2 test program does not crash until Japanese translators update the PO file. (In reply to Rafal Luzynski from comment #13) > Review of attachment 368647 [details] [review] [review]: > > ::: glib/tests/gdatetime.c > @@ +2279,2 @@ > /* GDateTime Tests */ > + bind_textdomain_codeset ("glib20", "UTF-8"); > > I don't understand why this is necessary. In my machine it works correctly > both with and without this line. But my machine is so tainted that lots of > things work while they should not work. This is not a strong objection, feel > free to commit if you think it is OK and if it fails in your machine without > this line. GTK applications use UTF-8 for text widgets and read/write strings from/to files with the current encoding and bind_textdomain_codeset() would be needed for BSD system AFAIR.
(In reply to Rafal Luzynski from comment #11) > (In reply to Philip Withnall from comment #10) > > Rafal, would you be able to review these patches? The other GLib maintainers > > are a bit busy at the moment (frolicking at a hackfest). > > I'm not a GLib maintainer and I have no power to tell if the patch should be > accepted or rejected but I'll just do my best. Indeed, but I am a maintainer and I would trust your review on this. :-) > > Takao Fujiwara, can someone from the Japanese team please update the ja.po > > translation for GLib? Alternatively, let me know if > > https://gitlab.gnome.org/GNOME/glib/commit/ > > 8d6504e9a36b9d38ae8488ab25ad312eb25343b5 is grammatically correct (I think > > it is from what you said in comment #3, but since I do not speak Japanese I > > cannot be sure); if so, I will push that soon. Thank you. > > The commit link now has become: > https://gitlab.gnome.org/GNOME/glib/commit/e4d193fa I have merged it to master (see bug #793645). (In reply to Rafal Luzynski from comment #12) > Review of attachment 368645 [details] [review] [review]: > > > With the various macros we use to provide fallbacks for missing > > nl_langinfo() fields, the locale handling can become quite complex: > > nl_langinfo() returns strings encoded in the current locale, but C_() > > I'm not sure this comment is correct. Is the problem maybe that > nl_langinfo() uses LC_TIME environment variable to determine the > current locale and C_() uses LC_MESSAGES, and they may be the same > or different? Or is that even if LC_MESSAGES sets a non-UTF-8 locale > then GLib's implementation of gettext always returns UTF-8? GLib’s implementation of C_() guarantees to return UTF-8. nl_langinfo() doesn’t. > > There are various format placeholders, especially with the new %Ob, %OB, > > %Oh placeholders, [...] > > Is this what you wanted to write or a typo? "placeholders... with the new > placeholders..." - I've always called them "format specifiers" or "format > specifiers with modifiers" but I am not an English native speaker so > I may be wrong. I’d consider ‘placeholder’ and ‘format specifier’ to be synonymous, but I’ll change to ‘format specifiers’. > > Add additional macros which indicate whether the macros they’re paired > > with return something encoded in the current locale, or encoded in > > UTF-8. > > This is probably wrong: the other macros are always in the current > locale (if you set the new locale it becomes the current locale). > I guess the problem is whether the current locale is UTF-8 or not. > Also maybe whether you mean the locale set by LC_TIME or LC_MESSAGES > or anything else. As per comment #9 I don’t currently care about supporting different character encodings for LC_TIME and LC_MESSAGES (or other LC_* variables), so the ‘current locale’ is a collective term for whichever of LC_* is relevant. > ::: glib/gdatetime.c > @@ +197,2 @@ > #define WEEKDAY_ABBR(d) nl_langinfo > (weekday_item[0][g_date_time_get_day_of_week (d) - 1]) > +#define WEEKDAY_ABBR_IS_LOCALE TRUE > > 1. Shouldn't you provide the same additional macro for GET_AMPM (just > several lines above) and then use it in format_ampm()? Yes, good catch. Added. > 2. I think it's difficult to understand what the suffix "_IS_LOCALE" means. > My first thought was "locale is a symbol of a language, like en_US or > ja_UP.utf8, but why a weekday is a locale?" I guess that your idea is that > WEEKDAY_ABBR is non-UTF-8 or may be non-UTF-8 and needs a conversion, or > maybe you should use the pattern like "WEEKDAY_ABBR_IS_UTF8" and swap the > TRUE/FALSE values? I have no good solution here, please take this as a weak > concern and feel free to ignore if you think this is correct. I decided on `_IS_LOCALE` rather than `_IS_UTF8` because the locale could be (and often is) UTF-8, so having a compile-time value which says `_IS_UTF8` = `FALSE` for some macro would be misleading. I’m going to leave this unchanged. > @@ -2911,3 @@ > if (g_strcmp0 (name, "") == 0) > return FALSE; > -#if !defined (HAVE_LANGINFO_TIME) > > Why remove this line? Is it because HAVE_LANGINFO_TIME now determines the > value of WEEKDAY_ABBR_IS_LOCALE? Correct, that line is removed because HAVE_LANGINFO_TIME now determines the value of WEEKDAY_ABBR_IS_LOCALE. > Would it work if you wrote instead: > > #if !WEEKDAY_ABBR_IS_LOCALE > > and left everything else in this chunk unchanged? It would, but I would like to reduce the amount of preprocessor modifications which happen to the code, since that makes it easier to test. > @@ +2963,3 @@ > + if (!locale_is_utf8 && > + ((alt_digits && !MONTH_ABBR_STANDALONE_IS_LOCALE) || > + (!alt_digits && !MONTH_ABBR_WITH_DAY_IS_LOCALE))) > > Those two lines are badly formatted and for this reason I mark this patch as > needs-work. The first line has 1 tab and 2 spaces (10 columns), which is OK, > the second has 8 spaces and I think it should have 1 tab + 6 spaces (to > align with the '!' in the first line), the third line has 9 spaces and I > think this is wrong but I don't have a strong preference, maybe 1 tab + 7 > spaces (to align with the second parenthesis in the 2nd line). No matter > what, this formatting is not nice. Uff, that arose because tabs are set to 2 spaces in my editor for some reason in that file. Fixed.
Created attachment 369339 [details] [review] gdatetime: Fix locale handling for nl_langinfo() calls With the various macros we use to provide fallbacks for missing nl_langinfo() fields, the locale handling can become quite complex: nl_langinfo() returns strings encoded in the current locale, but C_() returns strings encoded in UTF-8 (by GLib convention — you do actually need to call bind_textdomain_codeset() to achieve this). There are various format specifiers, especially with the new %Ob, %OB, %Oh specifiers, which conditionally call nl_langinfo() or something based on C_(). This makes encoding handling difficult. Add additional macros which indicate whether the macros they’re paired with return something encoded in the current locale, or encoded in UTF-8. The user of the macro can then use these to work out whether to re-encode. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 368647 [details] [review]: Accepted by mclasen and ebassi on IRC.
Review of attachment 369339 [details] [review]: Accepted by mclasen and ebassi on IRC.
Pushed to master for 2.56.0. Attachment 368647 [details] pushed as 36af567 - tests: Ensure gettext strings are loaded in UTF-8 in gdatetime.c Attachment 369339 [details] pushed as 12f1109 - gdatetime: Fix locale handling for nl_langinfo() calls
Thank you Philip for writing this patch and thank you mclasen and ebassi for reviewing. However, I must reopen it due to another minor bug. Hopefully it will actually fail (to build) only on Windows.
Created attachment 369597 [details] [review] gdatetime: Add missing #define WEEKDAY_FULL_IS_LOCALE
Review of attachment 369597 [details] [review]: Ouch.
Pushed to master. We have not branched yet, so this should go into 2.56.1. Attachment 369597 [details] pushed as 987bf5b - gdatetime: Add missing #define WEEKDAY_FULL_IS_LOCALE
*** Bug 794322 has been marked as a duplicate of this bug. ***