GNOME Bugzilla – Bug 733325
Several regex tests fail with pcre3 8.35
Last modified: 2015-06-09 17:20:13 UTC
Debian unstable got pcre3 8.35. GLib's testsuite isn't very happy with this version. A few of the regex tests now fail: assertion failed (match == data->expected): (1 == 0) TEST_MATCH("[DŽ]", G_REGEX_CASELESS, 0, "Dž", -1, 0, 0, FALSE); GLib-CRITICAL **: g_regex_match_full: assertion 'regex != NULL' failed TEST_EXPAND("(.)(?P<1>.)", "ab", "\\1", FALSE, "a"); TEST_EXPAND("(.)(?P<1>.)", "ab", "\\g<1>", FALSE, "a"); assertion failed (match_count == g_slist_length (data->expected)): (1 == 2) TEST_MATCH_ALL2("a+", "aa", -1, 0, "aa", 0, 2, "a", 0, 1); assertion failed (match_count == g_slist_length (data->expected)): (1 == 2) TEST_MATCH_ALL2(".+", ENG EURO, -1, 0, ENG EURO, 0, 5, ENG, 0, 2); assertion failed (match_count == g_slist_length (data->expected)): (1 == 3) TEST_MATCH_ALL3("a+", "aaa", -1, 0, "aaa", 0, 3, "aa", 0, 2, "a", 0, 1); The last ones look suspicious-perhaps there's a pcre bug.
(In reply to comment #0) > Debian unstable got pcre3 8.35. GLib's testsuite isn't very happy with this > version. A few of the regex tests now fail: > > assertion failed (match == data->expected): (1 == 0) > TEST_MATCH("[DŽ]", G_REGEX_CASELESS, 0, "Dž", -1, 0, 0, FALSE); This is asserting that PCRE has a particular bug. That bug was fixed in 8.32; GLib should not assert that a system PCRE has it. > GLib-CRITICAL **: g_regex_match_full: assertion 'regex != NULL' failed > TEST_EXPAND("(.)(?P<1>.)", "ab", "\\1", FALSE, "a"); > TEST_EXPAND("(.)(?P<1>.)", "ab", "\\g<1>", FALSE, "a"); PCRE 8.34 release notes say: . Perl no longer allows group names to start with digits, so I have made this change also in PCRE. > assertion failed (match_count == g_slist_length (data->expected)): (1 == 2) > TEST_MATCH_ALL2("a+", "aa", -1, 0, "aa", 0, 2, "a", 0, 1); Adding more diagnostics yields: ** Message: regex: a+ ** Message: string: aa ** Message: matched strings: ** Message: 0. 0-2 'aa' ** Message: expected strings: ** Message: 0. 0-2 'aa' ** Message: 1. 0-1 'a' > assertion failed (match_count == g_slist_length (data->expected)): (1 == 2) > TEST_MATCH_ALL2(".+", ENG EURO, -1, 0, ENG EURO, 0, 5, ENG, 0, 2); ** Message: regex: .+ ** Message: string: ŋ€ ** Message: matched strings: ** Message: 0. 0-5 'ŋ€' ** Message: expected strings: ** Message: 0. 0-5 'ŋ€' ** Message: 1. 0-2 'ŋ' > assertion failed (match_count == g_slist_length (data->expected)): (1 == 3) > TEST_MATCH_ALL3("a+", "aaa", -1, 0, "aaa", 0, 3, "aa", 0, 2, "a", 0, 1); ** Message: regex: a+ ** Message: string: aaa ** Message: matched strings: ** Message: 0. 0-3 'aaa' ** Message: expected strings: ** Message: 0. 0-3 'aaa' ** Message: 1. 0-2 'aa' ** Message: 2. 0-1 'a'
Created attachment 281259 [details] [review] regex test: improve diagnostics for some failures with system PCRE 8.35
Created attachment 281260 [details] [review] regex test: do not assert that system PCRE still has an 8.31 bug This was fixed in 8.32.
Created attachment 281261 [details] [review] regex test: do not assert that system PCRE allows "(?P<1>)" Perl >= 5.18, and PCRE >= 8.34, disallow this. --- I'm unsure about this one: technically, it's an ABI break. On the other hand, the options are basically either do this, or stop supporting system PCRE and consider GLib's internal PCRE to be a full fork...
(In reply to comment #0) > The last ones look suspicious-perhaps there's a pcre bug. Perhaps. I don't know PCRE well enough to comment.
(In reply to comment #0) > The last ones look suspicious-perhaps there's a pcre bug. Yes, they are reproducible "in pure PCRE" (at least with Debian's pcre3). http://bugs.exim.org/show_bug.cgi?id=1504
Created attachment 281286 [details] [review] regex: if PCRE is 8.34 or later, go back to older DFA-matching behaviour This currently only affects a system PCRE, but would also work fine for an internal PCRE 8.34 or later if the embedded copy is updated. --- I've asked on http://bugs.exim.org/show_bug.cgi?id=1504 whether it was intentional that this is basically an ABI break. However, if PCRE wants the change but we do not, this effectively reverts it.
(In reply to comment #4) > regex test: do not assert that system PCRE allows "(?P<1>)" ... > I'm unsure about this one: technically, it's an ABI break. On the other hand, > the options are basically either do this, or stop supporting system PCRE and > consider GLib's internal PCRE to be a full fork... I think the way forward is likely to be to apply this anyway. (In reply to comment #7) > regex: if PCRE is 8.34 or later, go back to older DFA-matching behaviour ... > I've asked on http://bugs.exim.org/show_bug.cgi?id=1504 whether it was > intentional that this is basically an ABI break. However, if PCRE wants the > change but we do not, this effectively reverts it. Philip Hazel (PCRE maintainer) says: > If the intention of the GLib function is always to provide all possible > matches, then I would always recommend using PCRE_NO_AUTO_POSSESS. - <http://bugs.exim.org/show_bug.cgi?id=1504> so yes I think this patch should be applied.
Review of attachment 281286 [details] [review]: The flag is actually called PCRE_NO_AUTO_POSSESS, not PCRE_NO_AUTO_POSSESSIFY.
(In reply to comment #9) > The flag is actually called PCRE_NO_AUTO_POSSESS, not PCRE_NO_AUTO_POSSESSIFY. Making the obvious change to my earlier patch doesn't seem to be sufficient either: /regex/lookbehind: OK /regex/subpattern: ** ERROR:/home/smcv/src/gnome/glib/glib/tests/regex.c:1752:test_subpattern: assertion failed: (res) The failing line is marked "here" below: error = NULL; regex = g_regex_new ("cat(aract|erpillar|)", G_REGEX_OPTIMIZE, 0, &error); g_assert (regex); g_assert_no_error (error); g_assert_cmpint (g_regex_get_capture_count (regex), ==, 1); g_assert_cmpint (g_regex_get_max_backref (regex), ==, 0); res = g_regex_match_all (regex, "caterpillar", 0, &match); g_assert (res); // <- here g_assert (g_match_info_matches (match)); g_assert_cmpint (g_match_info_get_match_count (match), ==, 2); str = g_match_info_fetch (match, 0); g_assert_cmpstr (str, ==, "caterpillar"); g_free (str); str = g_match_info_fetch (match, 1); g_assert_cmpstr (str, ==, "cat"); g_free (str); g_match_info_free (match); g_regex_unref (regex);
(In reply to comment #10) > The failing line is marked "here" below: > > error = NULL; > regex = g_regex_new ("cat(aract|erpillar|)", G_REGEX_OPTIMIZE, 0, &error); > g_assert (regex); > g_assert_no_error (error); > g_assert_cmpint (g_regex_get_capture_count (regex), ==, 1); > g_assert_cmpint (g_regex_get_max_backref (regex), ==, 0); > res = g_regex_match_all (regex, "caterpillar", 0, &match); > g_assert (res); // <- here Interestingly, if I revert the use of PCRE_NO_AUTO_POSSESS, this passes with the new PCRE. So it seems we can have either this or the tests noted in Comment #1 passing, but not both...
Review of attachment 281286 [details] [review]: (In reply to comment #11) > Interestingly, if I revert the use of PCRE_NO_AUTO_POSSESS, this passes with > the new PCRE. So it seems we can have either this or the tests noted in Comment > #1 passing, but not both... Ugh. This patch is in fact structurally wrong, too: PCRE_NO_AUTO_POSSESS is meant to be passed at compile-time, not execute-time. The same numeric value of flag passed to pcre_dfa_exec() as a match option is PCRE_DFA_RESTART, which does not mean the same thing. This means that g_match_all(_full) needs to recompile the regex with different options in order to work as it did before. Using PCRE_NO_AUTO_POSSESS unconditionally might be a possibility, but comes at a performance penalty. Philip Hazel said this on <http://bugs.exim.org/show_bug.cgi?id=1504>: """ The improved auto-possessification does give substantial performance gains in non-DFA matching - which is why Zoltan implemented it - and it should (though I am not aware of any measurements) also gives gains in DFA matching (though you only get the longest match). However: previously there was *some* auto-possessification already in the code, and it has been there for some years (since December 2006). For example, a+b got (and gets) compiled as a++b. The test that you quoted, ending in a+ is a new case that is recognized by the improved code. So this issue is not, in fact, a new one, just that there are now more cases. """
Created attachment 281461 [details] [review] regex: if PCRE is 8.34 or later, disable auto-possessification for DFA Normally, recent PCRE behaves as if certain patterns were replaced by a more "possessive" pattern that gives the same answer for normal regex matching, but is more efficient. However, the modified pattern produces fewer results under DFA. If we want the full set of results we have to apply PCRE_NO_AUTO_POSSESS, and that's a compile-time flag. This currently only affects a system PCRE, but would also work fine for an internal PCRE 8.34 or later if the embedded copy is updated. --- This is the best I've been able to come up with. I deliberately didn't cache the compiled DFA-mode regex, because GRegex is documented to be thread-safe, and it seems pointless to implement a thread-safe cache - according to codesearch.debian.net basically nobody uses this function anyway, other than the GLib regression tests, and language bindings that include it for completeness. (It's possible that it has real users via those language bindings, but it seems unlikely.) I think the other valid option here would be to re-document g_regex_match_all(_full) as "returns some more matches, but not necessarily every possible match" (since those seem to be the semantics of pcre_dfa_exec) and delete the failing test-cases.
GLib maintainers (both in Debian and upstream), I would really appreciate guidance on which solutions you would prefer here, even if the specific patches are not entirely right: the test failure is currently breaking the GLib build in Debian, so it is considered release-critical, and we're going to have to do something about it soon. I would prefer not to make a GLib upload to Debian that has semantics that are later considered to be unacceptable for upstream. --- Issue 1: bug-for-bug casemapping, Attachment #281260 [details]. I'm pretty sure the test is just wrong for a newer-than-8.31 PCRE, so I'm confident that disabling that test is the right thing to do. Please say if you disagree. --- Issue 2: (?P<1>), Attachment #281261 [details]. This is a semantic change, but it makes PCRE consistent with the regular expressions in Perl >= 5.18, Python 2.7 and Python 3.4 (Perl, Python and GRegex regexes are all meant to be basically the same). I'm quite confident that disabling this test is the right thing. The only uses of this pattern in Debian, according to codesearch.debian.net [1], are in regression tests for GLib, regression tests for PHP (which will probably need the same change to keep up with modern PCRE), regression tests for Python (asserting that it is invalid, i.e. the new-PCRE behaviour), and samples in something called cutter-testing-framework (which appear to be a straight port of the GLib tests to a different C testing API). In particular, the search didn't find any uses of this pattern "in real life". "Everything in Debian" seems a reasonably good approximation of "most of open source". [1] searched for \(\?P<[0-9] to find all patterns that were valid in older PCRE but not 8.34: http://codesearch.debian.net/search?q=\%28\%3FP%3C[0-9] --- Issue 3: DFA matching producing fewer results due to optimization. I see three possibilities here: * Attachment #281461 [details] or something with the same effect * turn off the auto-possessification optimization in all cases, not just for DFA matching * declare that the set of matches produced by g_regex_match_all[_full] is implementation-defined, and remove the specific test cases that fail under 8.34+ My preferred option is Attachment #281461 [details] but if you disagree, I would like to know about it. --- Independent of the actual bug-fixes, you can either take Attachment #281259 [details] to improve diagnostics, or not; I don't really mind either way.
Created attachment 281857 [details] [review] regex: if PCRE is 8.34 or later, disable auto-possessification for DFA Normally, recent PCRE behaves as if certain patterns were replaced by a more "possessive" pattern that gives the same answer for normal regex matching, but is more efficient. However, the modified pattern produces fewer results under DFA. If we want the full set of results we have to apply PCRE_NO_AUTO_POSSESS, and that's a compile-time flag. This currently only affects a system PCRE, but would also work fine for an internal PCRE 8.34 or later if the embedded copy is updated. --- Attachment #281461 [details] incorrectly depended on Attachment #281286 [details] having been applied first (it essentially contained a revert of Attachment #281286 [details]). This version does not.
Any feedback on these patches? I intend to apply them in Debian soon.
(In reply to comment #16) > Any feedback on these patches? I intend to apply them in Debian soon. We are now shipping these in Debian. Upstream review would be much appreciated.
*** Bug 682074 has been marked as a duplicate of this bug. ***
Comment on attachment 281259 [details] [review] regex test: improve diagnostics for some failures with system PCRE 8.35 (I'd just reword the summary line; this is applicable regardless of system pcre version.)
Comment on attachment 281260 [details] [review] regex test: do not assert that system PCRE still has an 8.31 bug In the pcre >= 8.32 case, just changing FALSE to TRUE is right. So let's do that, and also bump the system required pcre to 8.32.
Comment on attachment 281857 [details] [review] regex: if PCRE is 8.34 or later, disable auto-possessification for DFA Now this doesn't look right. You're re-compiling the regex with PCRE_NO_AUTO_POSSESS, but then using pcre_dfa_exec with that plus the pcre_extra that belongs to the *original regex*. That's not supported by pcre. Also doing it with the #ifdef PCRE_NO_AUTO_POSSESS means that it works right when compiling against a recent pcre, not when building against an older one but running with a newer one...
(In reply to comment #21) > Now this doesn't look right. You're re-compiling the regex with > PCRE_NO_AUTO_POSSESS, but then using pcre_dfa_exec with that plus the > pcre_extra that belongs to the *original regex*. That's not supported by pcre. Ah, good catch. I think the fact that nobody has noticed my mistake in Debian indicates how few callers g_regex_match_all_full() has... I assume the right thing to do here is pass NULL instead of regex->extra - no premature optimization and all that. > Also doing it with the #ifdef PCRE_NO_AUTO_POSSESS means that it works right > when compiling against a recent pcre, not when building against an older one > but running with a newer one... I think this is unavoidable. If PCRE_NO_AUTO_POSSESS is not defined, then we have no way to emit code that will work the same with old or new pcre; I think my change is not perfect, but also the best we can do. In practice, GLib releases are rather more frequent than pcre releases, so distributions that use the system libpcre are going to be recompiling GLib against their current pcre somewhat frequently. (Yes, I think this means this is technically an ABI break in pcre, but the pcre developer is not going to change it, so we're stuck with it.)
Comment on attachment 281259 [details] [review] regex test: improve diagnostics for some failures with system PCRE 8.35 committed as 1fdece4f for 2.45.0
Comment on attachment 281261 [details] [review] regex test: do not assert that system PCRE allows "(?P<1>)" Committed as 4e29e9a0 for 2.45.0
Created attachment 302451 [details] [review] regex test: do not assert that system PCRE still has an 8.31 bug This was fixed in 8.32. As a slight simplification, I'm assuming and requiring that a system pcre is >= 8.32 even though the internal prce is 8.31, as requested by Christian Persch in reviews on the bug. --- chpe, is this what you wanted? It seems weird to me to be bumping the dependency for an external pcre when the internal pcre is still 8.31, but if that's what you want, I'll go with it.
Created attachment 302453 [details] [review] regex: if PCRE is 8.34 or later, disable auto-possessification for DFA Normally, recent PCRE behaves as if certain patterns were replaced by a more "possessive" pattern that gives the same answer for normal regex matching, but is more efficient. However, the modified pattern produces fewer results under DFA. If we want the full set of results we have to apply PCRE_NO_AUTO_POSSESS, and that's a compile-time flag. This currently only affects a system PCRE, but would also work fine for an internal PCRE 8.34 or later if the embedded copy is updated. --- v2: do not pass the non-DFA regex's optimization blob when executing the DFA regex. chpe noticed that this could happen if someone compiles a regex with G_REGEX_OPTIMIZE and then uses the slower DFA matching algorithm - that seems unlikely, which is why I'm not putting any effort into optimizing it, but it was certainly a bug that I didn't handle it correctly.
(In reply to Simon McVittie from comment #25) > chpe, is this what you wanted? It seems weird to me to be bumping the > dependency for an external pcre when the internal pcre is still 8.31, but if > that's what you want, I'll go with it. Hmm… AFAIR I was thinking more along the lines of #include <pcre/pcre.h>, #if PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 32), etc., but this is fine too, since practically speaking no-one should be on pcre < 8.32 anymore. What I *really* want is the internal copy of pcre to go away: bug 740573.
(In reply to Christian Persch from comment #27) > Hmm… AFAIR I was thinking more along the lines of #include <pcre/pcre.h>, > #if PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 32), etc. Ah, sorry, I'd misunderstood you. I'll do a replacement patch that behaves like this instead, since that seems a lot less weird. > What I *really* want is the internal copy of pcre to go away: bug 740573. That makes sense to me. Debian and its derivatives (including Ubuntu) haven't used the internal copy for the main GLib build since 2007, although we do still use it for the cut-down "udeb" build used in debian-installer (libpcre doesn't have a udeb variant yet).
(In reply to Simon McVittie from comment #22) > (In reply to comment #21) > > Also doing it with the #ifdef PCRE_NO_AUTO_POSSESS means that it works right > > when compiling against a recent pcre, not when building against an older one > > but running with a newer one... > > I think this is unavoidable. If PCRE_NO_AUTO_POSSESS is not defined, then we > have no way to emit code that will work the same with old or new pcre; I > think my change is not perfect, but also the best we can do. I have not done anything to address this, because I don't see anything we can do to address this. New pcre is just not 100% compatible with old pcre; but the change happened (upstream) years ago, so it's rather too late to do anything about this.
Created attachment 302506 [details] [review] regex test: do not assert that system PCRE still has an 8.31 bug This was fixed in 8.32, so if we have that version, assert that it is fixed; if we don't (e.g. the current internal pcre), still don't assert that it *isn't* fixed. --- Tested with internal pcre 8.31, and Debian 8 pcre 2:8.35-3.3.
Comment on attachment 302506 [details] [review] regex test: do not assert that system PCRE still has an 8.31 bug Yes, that's exactly what I had in mind :-)
Review of attachment 302453 [details] [review]: Looks fine to me, with just the one comment on the code below. ::: glib/gregex.c @@ +1928,3 @@ + + if (pcre_re == NULL) + return FALSE; This function can return a GError **. The gboolean return value FALSE is overloaded with 'no matches' and 'error'. So perhaps this should set the error before returning FALSE. OTOH, compiling this same regex that g_regex_new already has compiled, just with this new option, shouldn't fail ever, so it may be ok not to set the error.
Review of attachment 302453 [details] [review]: ::: glib/gregex.c @@ +1928,3 @@ + + if (pcre_re == NULL) + return FALSE; Er, it does set the error? pcre_re = regex_compile (blah blah blah, error); if (pcre_re == NULL) return FALSE; I don't use an intermediate GError on the stack, because there's no need - I'm not testing the error within this function, or prefixing a message, or anything like that.
Review of attachment 302506 [details] [review]: Committed as ace7846 for 2.45.2
(In reply to Simon McVittie from comment #33) > Er, it does set the error? Sorry, that was in reply to Comment #32 - I thought Splinter would give more context.
Right, I missed that. Sorry!
Comment on attachment 302453 [details] [review] regex: if PCRE is 8.34 or later, disable auto-possessification for DFA committed as bf181a3ac78e824ca7e67ecfb2ba957e740594d7 for 2.45.3
Fixed in master for 2.45.3