Bug 455640 - Something fishy with GRegex and unicode
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: gregex
Version: 2.13.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
 
 
Reported: 2007-07-10 18:26 UTC by Yevgen Muntyan
Modified: 2007-09-10 17:30 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test case for glib (404 bytes, text/plain)
2007-07-10 18:26 UTC, Yevgen Muntyan
test case for pcretest (UTF-8 encoding) (12 bytes, text/plain)
2007-07-10 18:27 UTC, Yevgen Muntyan
pcre testcase (518 bytes, text/x-csrc)
2007-07-10 19:59 UTC, Matthias Clasen
Test for regex-test.c (575 bytes, patch)
2007-09-10 17:30 UTC, Marco Barisione

Description Yevgen Muntyan 2007-07-10 18:26:03 UTC
Attached is a test case that matches the string "ễ" (a non-ASCII character) against the pattern ".*$"; the match fails. The same test case shows that the string "a" (an ASCII letter) does match. Also attached is a file for pcretest with the same strings and pattern; pcretest doesn't fail. No clue what's going on.
Note that it's the $ that breaks it; ".*" alone is fine.
Comment 1 Yevgen Muntyan 2007-07-10 18:26:52 UTC
Created attachment 91560
test case for glib
Comment 2 Yevgen Muntyan 2007-07-10 18:27:28 UTC
Created attachment 91561
test case for pcretest (UTF-8 encoding)
Comment 3 Matthias Clasen 2007-07-10 19:58:47 UTC
Seems to be a pcre bug with PCRE_NEWLINE_ANY. Here is a pcre-only testcase that
exhibits the behaviour. It matches only if you comment out the PCRE_NEWLINE_ANY.
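
One plausible reading of this symptom, which the thread itself does not confirm, comes from the UTF-8 encoding of the test character. Assuming the string is the precomposed character U+1EC5, its UTF-8 bytes are E1 BB 85, and the last byte has the same value as NEL (U+0085), one of the newlines recognized under PCRE_NEWLINE_ANY. The sketch below only illustrates that coincidence. The helper names are hypothetical and are not taken from PCRE or glib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point from the Basic Multilingual Plane as UTF-8.
 * Returns the number of bytes written (1 to 3), or 0 if cp is outside
 * what this sketch handles (above U+FFFF, or a surrogate). */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical byte-wise check: treats a single byte as a newline when it
 * has the value of LF, VT, FF, CR or NEL. This is wrong for UTF-8 text,
 * because 0x85 can also be a continuation byte of a longer sequence. */
int byte_looks_like_newline(unsigned char b)
{
    return (b >= 0x0A && b <= 0x0D) || b == 0x85;
}

/* Code-point check for the single-character members of the "any" newline
 * set: LF, VT, FF, CR, NEL, LS, PS. CRLF is omitted because it is a
 * two-character sequence. */
int codepoint_is_newline(uint32_t cp)
{
    return (cp >= 0x0A && cp <= 0x0D) || cp == 0x85 ||
           cp == 0x2028 || cp == 0x2029;
}
```

Encoding 0x1EC5 with `utf8_encode_bmp` and passing the third byte to `byte_looks_like_newline` reports a newline, while `codepoint_is_newline(0x1EC5)` does not. That would be consistent with "a" matching and "ễ" failing, but it remains an assumption about the cause.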
Comment 4 Matthias Clasen 2007-07-10 19:59:15 UTC
Created attachment 91572
pcre testcase
Comment 5 Yevgen Muntyan 2007-07-10 20:28:33 UTC
Indeed, and the same thing happens in the newest pcre. A workaround is to specify G_REGEX_NEWLINE_LF. Reporting this upstream. Thanks for figuring this out, I was going to debug it with a hammer and drill!
Comment 6 Yevgen Muntyan 2007-07-11 21:55:59 UTC
How is it NOTGNOME if glib includes a copy of pcre and uses it by default? It's YESGNOME all right; it showed up as a gtksourceview highlighting bug. And the fix isn't "update your system", because glib builds its own copy and won't use a hypothetical pcre-7.3 anyway. Moreover, if it's really as bad as it seems (as in, unicode handling depends on your luck), then it's WAYYESGNOME, since *glib* provides GRegex (hopefully it's not as bad as it seems).
Comment 7 Matthias Clasen 2007-07-11 23:16:18 UTC
It is NOTGNOME insofar as we are not going to deviate from upstream pcre and instead just wait for a fix to appear in a pcre release, and then import that.

Bugs are most useful if they point out something to do. 

If you feel strongly about it, we can leave this bug open as a reminder to move 
to the next version of pcre when it becomes available. It won't change the outcome
though.
Comment 8 Marco Barisione 2007-09-10 17:26:52 UTC
PCRE 7.3 fixes this bug, but the internal version is still 7.2.
Comment 9 Yevgen Muntyan 2007-09-10 17:29:57 UTC
So it's that bug with newlines and UTF-8 mentioned in the ChangeLog. Let's have 7.3 in glib!
Comment 10 Marco Barisione 2007-09-10 17:30:16 UTC
Created attachment 95295
Test for regex-test.c