GNOME Bugzilla – Bug 770147
Regexes started behaving weird
Last modified: 2016-08-24 14:21:05 UTC
I've just compiled vte 0.45.90 and built my app (xfce4-terminal) against it, without changing any piece of code. What I'm seeing is that the regexes that are used for URI matching started behaving weird: it seems that there's a limit for URI length. For example, I'm starting typing an URL and checking whether it's recognized by the regex: https://www.google.com/12345678901234567890123456789012345678901234567890123 - this URL is valid (according to the regex), but adding a single character makes it invalid. I'm using vte_terminal_match_add_gregex() API which as I see is converting GRegex to VteRegex internally. When built against vte 0.44.2, the same regexes are working fine. Thanks.
Not so long ago (0.43-ish perhaps) a new set of APIs were introduced to use pcre2 instead of glib's regex. It was compile-time optional. I vaguely recall it was disabled for stable releases, but I'm not sure. Just recently in 0.45 it was made mandatory and some old APIs began to get deprecated. So probably the change you see is triggered by internally switching to a new engine. I really don't get this story (neither the concrete technical details, nor the rationale behind the change), but I'm sure Christian will shed some light on this. Pretty much unrelated to this, gnome-terminal commit https://git.gnome.org/browse/gnome-terminal/commit/?id=1a94499 / bug 756038 introduced a complete rewrite of g-t's regex strings (including unittests). x4-t's regexes pretty much resemble g-t's old ones. You might take a look at how old g-t behaved. What I can confirm right now is that with 0.45.2, looong URLs are recognized properly by g-t, yet behave as you described in x4-t. I'm not sure if it's a vte bug (I wouldn't think so), or in pcre2 (could be), or a bug in x4-t's regex literals which happens to get triggered by pcre2, or anything even weirder than these ones... Needs some crazy cross-component debugging I'm afraid.
Playing a little bit with the maximum length... The scheme (ftp/http/https) does not influence the maximum number of path characters (the 53 digits in your example). Neither does changing any of the URL components. Changing the number of the URL components, however, increases/decreases by 2 path characters for each domain component. E.g. change ".com" to ".co.uk", and you'll only be able to have 51 digits. However, the "www." prefix only counts as 1 instead of 2. I have a vague feeling that perhaps the number of capturing groups is limited to 64(ish).
This is due to the recursion limit set in vteterminalprivate::create_match_context. Imho the regex you use should be fixed.
Hi Egmont and Christian, and thanks for the responses. I'm now trying to first make use of the new vte regex API and see if that will resolve the issue for me. What is the reason why vteregex.h is not including pcre2.h? That would save terminal apps from including it.
Nope, just switching to the new API didn't help (and that's understood). Looks like I have to change my regexes in a similar manner to how Christian has done this in the commit mentioned by Egmont above.
(In reply to Christian Persch from comment #3) > This is due to the recursion limit set in > vteterminalprivate::create_match_context. Imho the regex you use should be > fixed. Yup matching the next character probably shouldn't create a new recursion level. I wonder how glib did it, maybe it was smarter to optimize the recursion away. Or it still had that recursion with a deeper stack. Increasing 64 to a somewhat higher number obviously wouldn't solve the problem. Maybe increasing to a much higher number, e.g. 4096? Or would it work if we entirely omitted the set_recursion_level() call? (Of course, poorly written or poorly executed regexes, just as the one in x4-t, combined with long paragraphs could lead to terrible performance, so it's good to have a safeguard in place, and preferably a reasonably low one.) (In reply to Igor from comment #5) > Looks like I have to change my regexes in a similar manner to how Christian > has done this in the commit mentioned by Egmont above. That was actually my work :D, just committed by Christian for reasons I cannot recall. And he even set the author as me, but then it had to be reverted (for reasons I don't know) and applied again and there he forgot to set it again :D Both our projects are GPL, so I guess you can legally just take our regexes as-is, should you want to do that. I'd personally be totally fine with that. Christian, let us know if you have any objections or bad feelings about it.
(In reply to Egmont Koblinger from comment #6) > (In reply to Christian Persch from comment #3) > > This is due to the recursion limit set in > > vteterminalprivate::create_match_context. Imho the regex you use should be > > fixed. > > Yup matching the next character probably shouldn't create a new recursion > level. I wonder how glib did it, maybe it was smarter to optimize the > recursion away. Or it still had that recursion with a deeper stack. GRegex uses PCRE1 internally, so it's unlikely it does anything fundamentally differently. It probably just sets the limits differently (at compile time, iirc). > Increasing 64 to a somewhat higher number obviously wouldn't solve the > problem. Maybe increasing to a much higher number, e.g. 4096? Or would it > work if we entirely omitted the set_recursion_level() call? (Of course, > poorly written or poorly executed regexes, just as the one in x4-t, combined > with long paragraphs could lead to terrible performance, so it's good to > have a safeguard in place, and preferably a reasonably low one.) I picked those numbers in create_match_context() arbitrarily, but tight limits are prefereable here. Since the regex in question (and the old g-t regexes too) are obviously wrong here, I prefer not to change the limit on their account. > (In reply to Igor from comment #5) > > Looks like I have to change my regexes in a similar manner to how Christian > > has done this in the commit mentioned by Egmont above. > > That was actually my work :D, just committed by Christian for reasons I > cannot recall. And he even set the author as me, but then it had to be > reverted (for reasons I don't know) and applied again and there he forgot to > set it again :D I borked the commit somehow, but iirc the only changes were to glue/build, and definitely no changes to the regexes. So they're all yours. They're his anyway under the GPL3, of course :-)
(In reply to Christian Persch′ from comment #7) > Since the regex in question (and the old g-t > regexes too) are obviously wrong here, I prefer not to change the limit on > their account. Yup I totally agree. PS1. It was unintentional that my rewrite fixed this issue, I didn't realize this problem with the old one, nor did I know the new one won't suffer from this. PS2. We should add a unittest for a very long match.
PS2 addressed.
(In reply to Egmont Koblinger from comment #6) > That was actually my work :D, just committed by Christian for reasons I > cannot recall. And he even set the author as me, but then it had to be > reverted (for reasons I don't know) and applied again and there he forgot to > set it again :D > > Both our projects are GPL, so I guess you can legally just take our regexes > as-is, should you want to do that. I'd personally be totally fine with that. > Christian, let us know if you have any objections or bad feelings about it. Thank you then :) It looks like you won't mind if I just take terminal-regex.h and put it into xfce4-terminal code base, right? (In reply to Igor from comment #6) > What is the reason why vteregex.h is not including pcre2.h? That would save > terminal apps from including it. Still, why wouldn't you include pcre2.h from vteregex.h?
(In reply to Igor from comment #10) > (In reply to Igor from comment #6) > > What is the reason why vteregex.h is not including pcre2.h? That would save > > terminal apps from including it. > > Still, why wouldn't you include pcre2.h from vteregex.h? Because one cannot include pcre2.h without first defining PCRE2_CODE_UNIT_WIDTH, and we cannot do so without interfering with whatever the app itself wants to use there. Now, all you need pcre2.h for is the #define'd constants (PCRE2_UTF etc), so it would make sense to split that off into a pcre2-constants.h file that is independent of the code unit width, but that's for pcre upstream to decide.
(In reply to Christian Persch′ from comment #11) > > Still, why wouldn't you include pcre2.h from vteregex.h? > > Because one cannot include pcre2.h without first defining > PCRE2_CODE_UNIT_WIDTH, and we cannot do so without interfering with whatever > the app itself wants to use there. > > Now, all you need pcre2.h for is the #define'd constants (PCRE2_UTF etc), so > it would make sense to split that off into a pcre2-constants.h file that is > independent of the code unit width, but that's for pcre upstream to decide. Makes sense, thanks.
I've closed this bug, thank you Egmont and Christian.