GNOME Bugzilla – Bug 793550
g_date_set_parse: Parses "September" in Polish incorrectly
Last modified: 2018-05-24 20:14:13 UTC
Created attachment 368490 [details] Test case: format September and try to parse it back See the test case. The reason is that the algorithm tries to find a month whose full or abbreviated name is a substring of the parsed string. So in Polish, "September" is "wrzesień"; as "August" is "sierpień", abbreviated as "sie", it detects that "sie" is a substring of "wrzesień" ("wrzeSIEń") and reports the month as August. The algorithm should match all months and report the one which is the longest matching substring. Somebody please fix it, I'm out of time.
Created attachment 368899 [details] [review] Use longest matching month name in g_date_set_parse. You can find a patch attached that always uses the month with the longest matching name. It also fixes another latent bug related to dates and locale, where information obtained from locale is not correctly recomputed after changing it. The current parser is also unprepared to dealing with cases where month names themselves contain numbers in ASCII. Though, that is not something that is being addressed in attached patch.
Review of attachment 368899 [details] [review]: This is not a formal review because I don't have a sufficient power to decide about whether the patch should be accepted/rejected/etc. ::: glib/gdate.c @@ +928,3 @@ typedef struct _GDateParseTokens GDateParseTokens; +static inline gboolean I think that "inline" is not a good idea here. Hopefully it would be ignored by the optimizer. @@ +929,3 @@ +static inline gboolean +update_month_match (gsize *best_match, const gchar *s, const gchar *candidate) Good idea to factor out the common code. But I was thinking about a different solution: implement a structure (a class in GLib/GObject terminology) which would encapsulate the string and its length, also the constructor would perform (and encapsulate) the calls to g_utf8_casefold() and g_utf8_normalize(). This would be helpful since we have supported 4 forms of the month name now. Also, storing the string length as the object member would make us calculate the length only while initializing the list of the month names rather than calculating it on each call of g_date_set_parse(). Pro: as stated above. Con: it is unlikely that more than one substring would be found and more than one string length would be calculated during one g_date_set_parse() call. @@ +939,3 @@ + return FALSE; + + length = strlen (candidate); Shouldn't it be g_utf8_strlen()? Pro: strlen() will incorrectly promote the strings which have more national characters because it counts bytes rather than characters. Con: it is likely that the longest string measured by g_utf8_strlen() is still the longest measured by strlen(). ::: glib/tests/date.c @@ +175,3 @@ +static void +test_parse_month_year (gconstpointer data) I don't think we actually need this test case. It is really thorough but this test case is already covered by test_month_names(). @@ +197,3 @@ + + g_date_set_dmy (&date, d, m, y); + written = g_date_strftime (buf, sizeof(buf), "%b %Y", &date); This will not expose the bug because it needs a full name of Polish word for "September" ("wrzesień") which is formatted by "%B" in older systems and "%OB" with the newest glibc and in whole BSD family. Also we need tests for all format modifiers: "%B", "%b", "%OB", "%Ob" (with "O" only on supported systems) or drop whole this test case.
Thanks for review Rafal. Regarding the g_utf8_strlen vs strlen, my intention was to only handle the case of substrings, where strlen works perfectly fine. If you think there are other cases that we would like to handle, please elaborate (for example it is far from obvious to me, that in case where input contains two different month names, where neither is substring of another, the one with more characters should be preferred). Good point about "%b" not being a valid test case for original problem in Polish. If new tests were to remain, they should include both abbreviated and long month names, both failed previously for some of included languages (the substring problem was hardly limited to Polish). I will wait for additional input from glib maintainers before iterating on this patch.
(In reply to Tomasz Miąsko from comment #3) > [...] > Regarding the g_utf8_strlen vs strlen, my intention was to only handle the > case > of substrings, where strlen works perfectly fine. If you think there are > other > cases that we would like to handle, please elaborate [...] I can't find any good example why g_utf8_strlen is better, maybe you are right. My initial thought was that "we should match the longest substring, the longest counting characters rather than bytes" but maybe you are right, the longest substring in characters is also the longest in bytes. I must rethink this. > [...] > I will wait for additional input from glib maintainers before iterating on > this > patch. Same here.
Created attachment 369005 [details] [review] An alternative solution Here I attach my solution. I'm not telling that mine is better, I'm not even sure it is 100% correct. Just to let the reviewers know what is my current work in progress. Maybe we'll merge these two and create the best one.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1343.