Bug 678273 – unicode othercasing is wrong in gregex




Summary:	unicode othercasing is wrong in gregex


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	gregex
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev


Reported:	2012-06-17 21:19 UTC by Christian Persch
Modified:	2012-06-23 21:55 UTC


Attachments
regex: Fix unicode othercasing (2.89 KB, patch) 2012-06-17 21:19 UTC, Christian Persch (status: none)
regex: Fix unicode othercasing (2.07 KB, patch) 2012-06-23 21:31 UTC, Matthias Clasen (status: committed)

Description Christian Persch 2012-06-17 21:19:22 UTC

The _pcre_ucp_othercase() function does not behave the way the PCRE code expects: it turns characters that are unchanged by upper- and lower-casing into NOTACHAR (0xFFFFFFFF). This leads PCRE to build incorrect (or at least inefficient) character classes internally when G_REGEX_CASELESS is used.
    
E.g. [Z-\x{100}] turned into:

[Z\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{39c}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{178}z-\x{101}]
    
instead of the expected and efficient
    
[Z\x{39c}\x{178}z-\x{101}]
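
To illustrate why the sentinel matters, here is a minimal sketch. It is not PCRE's or GLib's actual code. It assumes the caller decides whether a character has another case by testing othercase(c) != c, which is consistent with the symptom above but is not shown in this report. The names toy_pairs, toy_lookup, othercase_old, othercase_new and count_extra_entries are hypothetical, and the toy table lists only the case pairs visible in the example, not real Unicode data.

```c
#include <stddef.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu  /* sentinel value named in the report */

typedef uint32_t (*othercase_fn)(uint32_t);

/* Hypothetical toy table; stands in for real Unicode case data. */
static const uint32_t toy_pairs[][2] = {
  { 0x005A, 0x007A },  /* Z and z */
  { 0x00B5, 0x039C },  /* micro sign and Greek capital mu */
  { 0x00FF, 0x0178 },  /* y with diaeresis, small and capital */
  { 0x0100, 0x0101 },  /* A with macron, capital and small */
};
#define N_PAIRS (sizeof toy_pairs / sizeof toy_pairs[0])

/* Returns the partner of c from the toy table, or fallback if c is not listed. */
static uint32_t toy_lookup(uint32_t c, uint32_t fallback)
{
  for (size_t i = 0; i < N_PAIRS; i++) {
    if (toy_pairs[i][0] == c) return toy_pairs[i][1];
    if (toy_pairs[i][1] == c) return toy_pairs[i][0];
  }
  return fallback;
}

/* Behaviour described as wrong: unchanged characters map to NOTACHAR. */
static uint32_t othercase_old(uint32_t c) { return toy_lookup(c, NOTACHAR); }

/* Assumed expected behaviour: unchanged characters map to themselves. */
static uint32_t othercase_new(uint32_t c) { return toy_lookup(c, c); }

/* Counts the extra class entries a caller would add for [first-last]
   if it treats "othercase(c) != c" as "c has another case".
   Assumes last < 0xffffffff so the loop terminates. */
static size_t count_extra_entries(uint32_t first, uint32_t last, othercase_fn oc)
{
  size_t n = 0;
  for (uint32_t c = first; c <= last; c++)
    if (oc(c) != c)
      n++;
  return n;
}
```

Under these assumptions, count_extra_entries(0x5A, 0x100, othercase_old) counts all 167 characters in the range, because NOTACHAR never equals c, while othercase_new counts only the 5 characters the toy table knows about.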

Comment 1 Christian Persch 2012-06-17 21:19:56 UTC

Created attachment 216623
regex: Fix unicode othercasing

The old _pcre_ucp_othercase() function was wrong in returning
NOTACHAR (0xffffffff) for characters that aren't changed by upper-
and lower-casing. This led to PCRE internally using incorrect (or
at least inefficient) character classes when using G_REGEX_CASELESS.

E.g. [Z-\x{100}] turned into:

[Z\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{39c}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{fffe}\x{178}z-\x{101}]

instead of the expected and efficient

[Z\x{39c}\x{178}z-\x{101}]

Comment 2 Christian Persch 2012-06-17 21:23:18 UTC

Disregard the test added to glib/tests/regex.c; it's bogus, since it passes even without the patch.

Comment 3 Matthias Clasen 2012-06-23 21:31:43 UTC

The following fix has been pushed:
53b48df regex: Fix unicode othercasing

Comment 4 Matthias Clasen 2012-06-23 21:31:45 UTC

Created attachment 217095
regex: Fix unicode othercasing


Comment 5 Christian Persch 2012-06-23 21:55:22 UTC

Actually, I had updated my patch to swap the order of these calls:

+  if ((oc = g_unichar_tolower(c)) != c)
+    return oc;
+  if ((oc = g_unichar_toupper(c)) != c)
+    return oc;

i.e. to do toupper first, then tolower. This made the function bug-for-bug compatible with the internal PCRE tables.
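
A minimal sketch of the two orderings follows. The thread does not say which characters motivated the swap. One case where the order changes the result is a titlecase character such as U+01C5, which differs from both its uppercase (U+01C4) and lowercase (U+01C6) forms. The toy_toupper() and toy_tolower() helpers are hypothetical stand-ins for g_unichar_toupper() and g_unichar_tolower() that cover only those three code points, and the final return c is an assumption inferred from the bug description, not copied from the patch.

```c
#include <stdint.h>

/* Hypothetical stand-ins for g_unichar_toupper()/g_unichar_tolower(),
   covering only U+01C4..U+01C6 (the DZ-with-caron digraph family). */
static uint32_t toy_toupper(uint32_t c)
{
  return (c == 0x01C5 || c == 0x01C6) ? 0x01C4 : c;
}

static uint32_t toy_tolower(uint32_t c)
{
  return (c == 0x01C4 || c == 0x01C5) ? 0x01C6 : c;
}

/* Order described for the updated patch: toupper first, then tolower. */
static uint32_t othercase_upper_first(uint32_t c)
{
  uint32_t oc;

  if ((oc = toy_toupper(c)) != c)
    return oc;
  if ((oc = toy_tolower(c)) != c)
    return oc;
  return c;  /* assumption: unchanged characters are returned as-is */
}

/* Order shown in the quoted fragment: tolower first, then toupper. */
static uint32_t othercase_lower_first(uint32_t c)
{
  uint32_t oc;

  if ((oc = toy_tolower(c)) != c)
    return oc;
  if ((oc = toy_toupper(c)) != c)
    return oc;
  return c;  /* assumption: unchanged characters are returned as-is */
}
```

With these toy helpers, the two functions agree on U+01C4 and U+01C6 but differ on the titlecase U+01C5: the toupper-first variant returns U+01C4, the tolower-first variant returns U+01C6.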