GNOME Bugzilla – Bug 455640
Something fishy with GRegex and unicode
Last modified: 2007-09-10 17:30:16 UTC
Attached is a test case that matches the string "ễ" (a fancy non-ASCII character) against the pattern ".*$"; the match fails. The same test case shows that the string "a" (an ASCII letter) does match. Also attached is a file for pcretest with the same strings and pattern; pcretest doesn't fail. No clue what's going on. Note that it's the $ that breaks it; ".*" alone is fine.
Created attachment 91560: test case for glib
Created attachment 91561: test case for pcretest (UTF-8 encoding)
Seems to be a pcre bug with PCRE_NEWLINE_ANY. Here is a pcre-only testcase that exhibits the behaviour. It matches only if you comment out the PCRE_NEWLINE_ANY.
Created attachment 91572: pcre testcase
Indeed, and same thing in the newest pcre. A workaround is to specify G_REGEX_NEWLINE_LF. Reporting this upstream. Thanks for figuring this out, I was going to debug it with hammer and drill!
How is it NOTGNOME if glib includes copy of pcre and uses it by default? It's YESGNOME all right, showed up as a gtksourceview highlighting bug. And the fix isn't "update your system", since glib won't use hypothetical pcre-7.3 anyway, since it builds its own copy. Moreover, if it's really as bad as it seems (like unicode handling is depends-on-your-luck), then it's WAYYESGNOME, since *glib* provides GRegex (hopefully it's not as bad as it seems).
It is NOTGNOME insofar as we are not going to deviate from upstream pcre and instead just wait for a fix to appear in a pcre release, and then import that. Bugs are most useful if they point out something to do. If you feel strongly about it, we can leave this bug open as a reminder to move to the next version of pcre when it becomes available. It won't change the outcome though.
PCRE 7.3 fixes this bug, but the internal version is still 7.2.
So it's that bug with newlines and UTF8 mentioned in ChangeLog. Let's have 7.3 in glib!
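For readers wondering why this particular character trips over the newline handling, here is a small C sketch. It assumes the "ễ" in the report is the precomposed code point U+1EC5, whose UTF-8 encoding is the three bytes E1 BB 85. The final byte, 0x85, has the same value as NEL (U+0085), one of the newline characters that PCRE_NEWLINE_ANY recognizes. Whether that coincidence is exactly what the 7.2 code stumbles on is not confirmed anywhere in this thread, so treat it as a plausible reading rather than a diagnosis. The helper names are made up for illustration and are not part of glib or pcre.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the final byte of a NUL-terminated string,
 * or 0 if the string is empty. */
static unsigned char final_byte(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    return len ? (unsigned char)s[len - 1] : 0;
}

/* Hypothetical helper: true if the string's final byte is 0x85, the same
 * value as NEL (U+0085). A byte-wise newline check that ignores UTF-8
 * character boundaries could mistake such a trailing byte for a newline
 * (assumption, not verified against the pcre source). */
static int ends_in_nel_byte(const char *s)
{
    return final_byte(s) == 0x85;
}
```

With these helpers, `ends_in_nel_byte("\xE1\xBB\x85")` (the assumed bytes of "ễ") returns 1, while `ends_in_nel_byte("a")` returns 0, which lines up with the observation that "a" matches ".*$" and "ễ" does not.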
Created attachment 95295: Test for regex-test.c