GNOME Bugzilla – Bug 678273
unicode othercasing is wrong in gregex
Last modified: 2012-06-23 21:55:22 UTC
The _pcre_ucp_othercase() function isn't working as expected by the PCRE code, since it turns characters that don't change by upper- and lower-casing into NOTACHAR (0xFFFFFFFF). This leads to PCRE internally using incorrect (or at least inefficient) character classes when using G_REGEX_CASELESS. E.g. [Z-\x{100}] turned into: [Z\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{39c}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{178}z-\x{101}] instead of the expected and efficient [Z\x{39c}\x{178}z-\x{101}]
Created attachment 216623
regex: Fix unicode othercasing

The old _pcre_ucp_othercase() function was wrong in returning NOTACHAR (0xffffffff) for characters that aren't changed by upper- and lower-casing. This led to PCRE internally using incorrect (or at least inefficient) character classes when using G_REGEX_CASELESS.
(Disregard the test added to glib/tests/regex.c; it's bogus, since it passes even without the patch.)
The following fix has been pushed: 53b48df regex: Fix unicode othercasing
Created attachment 217095
regex: Fix unicode othercasing
Actually, I had updated my patch to exchange the order of these calls:

+ if ((oc = g_unichar_tolower(c)) != c)
+   return oc;
+ if ((oc = g_unichar_toupper(c)) != c)
+   return oc;

i.e. to do toupper first, then tolower. This made the function bug-for-bug compatible with the internal PCRE tables.
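For readers following along, here is a minimal sketch of the behaviour discussed in this bug, not the pushed patch itself. It assumes the fixed function returns the character unchanged when it has no other case. toy_toupper() and toy_tolower() are hypothetical stand-ins for g_unichar_toupper() and g_unichar_tolower() that only know ASCII letters and the three code points used here, and othercase_sentinel() models only the reported symptom, not the old implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu  /* sentinel value named in the report */

/* Hypothetical stand-in for g_unichar_toupper(): ASCII plus two samples. */
static uint32_t toy_toupper(uint32_t c)
{
    if (c >= 'a' && c <= 'z') return c - 'a' + 'A';
    if (c == 0x00b5) return 0x039c;  /* MICRO SIGN -> GREEK CAPITAL MU */
    if (c == 0x00ff) return 0x0178;  /* y with diaeresis -> capital form */
    return c;
}

/* Hypothetical stand-in for g_unichar_tolower(). */
static uint32_t toy_tolower(uint32_t c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A' + 'a';
    if (c == 0x0178) return 0x00ff;
    if (c == 0x039c) return 0x03bc;  /* GREEK CAPITAL MU -> small mu */
    return c;
}

/* Models the reported symptom: caseless characters map to NOTACHAR. */
static uint32_t othercase_sentinel(uint32_t c)
{
    uint32_t oc;
    if ((oc = toy_toupper(c)) != c) return oc;
    if ((oc = toy_tolower(c)) != c) return oc;
    return NOTACHAR;
}

/* Assumed fixed shape: toupper first, then tolower, else c itself. */
static uint32_t othercase_fixed(uint32_t c)
{
    uint32_t oc;
    if ((oc = toy_toupper(c)) != c) return oc;
    if ((oc = toy_tolower(c)) != c) return oc;
    return c;  /* no other case: hand the character back unchanged */
}

/* Small self-check covering the code points from the example class. */
void othercase_selfcheck(void)
{
    assert(othercase_fixed('Z') == 'z');
    assert(othercase_fixed(0x00b5) == 0x039c);
    assert(othercase_fixed(0x00ff) == 0x0178);
    assert(othercase_fixed('[') == '[');
    assert(othercase_sentinel('[') == NOTACHAR);
}
```

With the toy tables, U+00B5 and U+00FF yield U+039C and U+0178, the two extra code points in the expected class [Z\x{39c}\x{178}z-\x{101}], while a caseless character such as '[' comes back as itself instead of a sentinel.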