Bug 678273 - unicode othercasing is wrong in gregex
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gregex
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
Reported: 2012-06-17 21:19 UTC by Christian Persch
Modified: 2012-06-23 21:55 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
regex: Fix unicode othercasing (2.89 KB, patch)
2012-06-17 21:19 UTC, Christian Persch
none
regex: Fix unicode othercasing (2.07 KB, patch)
2012-06-23 21:31 UTC, Matthias Clasen
committed

Description Christian Persch 2012-06-17 21:19:22 UTC
The _pcre_ucp_othercase() function doesn't behave the way the PCRE code expects: for characters that upper- and lower-casing leave unchanged, it returns NOTACHAR (0xFFFFFFFF). As a result, PCRE internally builds incorrect (or at least inefficient) character classes when G_REGEX_CASELESS is used.
    
E.g. [Z-\x{100}] turned into:

[Z\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{39c}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{178}z-\x{101}]
    
instead of the expected and efficient
    
[Z\x{39c}\x{178}z-\x{101}]
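
To illustrate the mechanism, here is a toy model in C. It is not PCRE's or GLib's actual code. It assumes that the caller detects "this character has another case" by comparing the returned value with the input. The names othercase_old, othercase_expected and collect_othercases are hypothetical, and the case mapping is restricted to ASCII via <ctype.h> so the sketch needs no Unicode tables.

```c
#include <ctype.h>
#include <stddef.h>

#define NOTACHAR 0xffffffffu

typedef unsigned int (*othercase_fn) (unsigned int);

/* Old contract (as described in this report): NOTACHAR for characters
 * that have no case counterpart. ASCII only; c must be <= 0x7f. */
static unsigned int
othercase_old (unsigned int c)
{
  if (islower ((int) c))
    return (unsigned int) toupper ((int) c);
  if (isupper ((int) c))
    return (unsigned int) tolower ((int) c);
  return NOTACHAR;
}

/* Contract the caller is assumed to expect: return c itself when there
 * is no case counterpart. ASCII only; c must be <= 0x7f. */
static unsigned int
othercase_expected (unsigned int c)
{
  if (islower ((int) c))
    return (unsigned int) toupper ((int) c);
  if (isupper ((int) c))
    return (unsigned int) tolower ((int) c);
  return c;
}

/* Hypothetical caller: walk [lo, hi] and record every value that differs
 * from its input, writing at most cap entries to out. */
static size_t
collect_othercases (unsigned int lo, unsigned int hi, othercase_fn fn,
                    unsigned int *out, size_t cap)
{
  size_t n = 0;

  for (unsigned int c = lo; c <= hi && n < cap; c++)
    {
      unsigned int oc = fn (c);

      if (oc != c)        /* NOTACHAR also passes this test */
        out[n++] = oc;
    }
  return n;
}
```

In this model, scanning the digits '0' to '9' with othercase_old records ten sentinel entries, while othercase_expected records none. That matches the shape of the polluted class shown above, though the real PCRE range handling is more involved.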
Comment 1 Christian Persch 2012-06-17 21:19:56 UTC
Created attachment 216623
regex: Fix unicode othercasing

The old _pcre_ucp_othercase() function was wrong in returning
NOTACHAR (0xffffffff) for characters that aren't changed by upper-
and lower-casing. This led to PCRE internally using incorrect (or
at least inefficient) character classes when using G_REGEX_CASELESS.

E.g. [Z-\x{100}] turned into:

[Z\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{39c}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{178}z-\x{101}]

instead of the expected and efficient

[Z\x{39c}\x{178}z-\x{101}]
Comment 2 Christian Persch 2012-06-17 21:23:18 UTC
(Disregard the test added to glib/tests/regex.c; it's bogus, since it passes even without the patch.)
Comment 3 Matthias Clasen 2012-06-23 21:31:43 UTC
The following fix has been pushed:
53b48df regex: Fix unicode othercasing
Comment 4 Matthias Clasen 2012-06-23 21:31:45 UTC
Created attachment 217095
regex: Fix unicode othercasing

Comment 5 Christian Persch 2012-06-23 21:55:22 UTC
Actually, I had updated my patch to swap the order of these calls:

+  if ((oc = g_unichar_tolower(c)) != c)
+    return oc;
+  if ((oc = g_unichar_toupper(c)) != c)
+    return oc;

i.e. to call g_unichar_toupper() first and g_unichar_tolower() second. That order makes the function bug-for-bug compatible with the internal PCRE tables.
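
A minimal sketch of the swapped order, under stated assumptions: othercase_sketch is a hypothetical stand-in for the patched _pcre_ucp_othercase(), the standard towupper()/towlower() replace g_unichar_toupper()/g_unichar_tolower() so the example builds without GLib, and the final fallback of returning c itself is assumed rather than quoted from the patch.

```c
#include <wctype.h>

/* Hypothetical stand-in for the patched function; not the committed code. */
static unsigned int
othercase_sketch (unsigned int c)
{
  unsigned int oc;

  /* Swapped order described above: try uppercase first... */
  if ((oc = (unsigned int) towupper ((wint_t) c)) != c)
    return oc;
  /* ...then lowercase. */
  if ((oc = (unsigned int) towlower ((wint_t) c)) != c)
    return oc;

  /* Assumed fallback: no case counterpart, so return c rather than
   * a sentinel such as NOTACHAR. */
  return c;
}
```

The order only matters for characters whose uppercase and lowercase mappings both differ from the character itself, such as titlecase letters like U+01C5. For ordinary letters and caseless characters both orders give the same result.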