After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 770147 - Regexes started behaving weird
Regexes started behaving weird
Status: RESOLVED NOTABUG
Product: vte
Classification: Core
Component: general
0.45.x
Other Linux
: Normal normal
: ---
Assigned To: VTE Maintainers
VTE Maintainers
Depends on:
Blocks:
 
 
Reported: 2016-08-19 14:45 UTC by Igor
Modified: 2016-08-24 14:21 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Igor 2016-08-19 14:45:54 UTC
I've just compiled vte 0.45.90 and built my app (xfce4-terminal) against it, without changing any piece of code. What I'm seeing is that the regexes that are used for URI matching started behaving weird: it seems that there's a limit for URI length.
For example, I'm starting typing an URL and checking whether it's recognized by the regex:
https://www.google.com/12345678901234567890123456789012345678901234567890123 - this URL is valid (according to the regex), but adding a single character makes it invalid.
I'm using vte_terminal_match_add_gregex() API which as I see is converting GRegex to VteRegex internally.
When built against vte 0.44.2, the same regexes are working fine.

Thanks.
Comment 1 Egmont Koblinger 2016-08-19 20:45:07 UTC
Not so long ago (0.43-ish perhaps) a new set of APIs were introduced to use pcre2 instead of glib's regex. It was compile-time optional. I vaguely recall it was disabled for stable releases, but I'm not sure. Just recently in 0.45 it was made mandatory and some old APIs began to get deprecated. So probably the change you see is triggered by internally switching to a new engine.

I really don't get this story (neither the concrete technical details, nor the rationale behind the change), but I'm sure Christian will shed some light on this.

Pretty much unrelated to this, gnome-terminal commit https://git.gnome.org/browse/gnome-terminal/commit/?id=1a94499 / bug 756038 introduced a complete rewrite of g-t's regex strings (including unittests). x4-t's regexes pretty much resemble g-t's old ones. You might take a look at how old g-t behaved.

What I can confirm right now is that with 0.45.2, looong URLs are recognized properly by g-t, yet behave as you described in x4-t.

I'm not sure if it's a vte bug (I wouldn't think so), or in pcre2 (could be), or a bug in x4-t's regex literals which happens to get triggered by pcre2, or anything even weirder than these ones... Needs some crazy cross-component debugging I'm afraid.
Comment 2 Egmont Koblinger 2016-08-19 20:56:58 UTC
Playing a little bit with the maximum length...

The scheme (ftp/http/https) does not influence the maximum number of path characters (the 53 digits in your example). Neither does changing any of the URL components.

Changing the number of the URL components, however, increases/decreases by 2 path characters for each domain component. E.g. change ".com" to ".co.uk", and you'll only be able to have 51 digits. However, the "www." prefix only counts as 1 instead of 2.

I have a vague feeling that perhaps the number of capturing groups is limited to 64(ish).
Comment 3 Christian Persch 2016-08-20 06:34:40 UTC
This is due to the recursion limit set in vteterminalprivate::create_match_context. Imho the regex you use should be fixed.
Comment 4 Igor 2016-08-20 07:36:16 UTC
Hi Egmont and Christian, and thanks for the responses.

I'm now trying to first make use of the new vte regex API and see if that will resolve the issue for me.
What is the reason why vteregex.h is not including pcre2.h? That would save terminal apps from including it.
Comment 5 Igor 2016-08-20 08:06:57 UTC
Nope, just switching to the new API didn't help (and that's understood).
Looks like I have to change my regexes in a similar manner to how Christian has done this in the commit mentioned by Egmont above.
Comment 6 Egmont Koblinger 2016-08-20 08:43:29 UTC
(In reply to Christian Persch from comment #3)
> This is due to the recursion limit set in
> vteterminalprivate::create_match_context. Imho the regex you use should be
> fixed.

Yup matching the next character probably shouldn't create a new recursion level. I wonder how glib did it, maybe it was smarter to optimize the recursion away. Or it still had that recursion with a deeper stack.

Increasing 64 to a somewhat higher number obviously wouldn't solve the problem. Maybe increasing to a much higher number, e.g. 4096? Or would it work if we entirely omitted the set_recursion_level() call? (Of course, poorly written or poorly executed regexes, just as the one in x4-t, combined with long paragraphs could lead to terrible performance, so it's good to have a safeguard in place, and preferably a reasonably low one.)

(In reply to Igor from comment #5)
> Looks like I have to change my regexes in a similar manner to how Christian
> has done this in the commit mentioned by Egmont above.

That was actually my work :D, just committed by Christian for reasons I cannot recall. And he even set the author as me, but then it had to be reverted (for reasons I don't know) and applied again and there he forgot to set it again :D

Both our projects are GPL, so I guess you can legally just take our regexes as-is, should you want to do that. I'd personally be totally fine with that. Christian, let us know if you have any objections or bad feelings about it.
Comment 7 Christian Persch′ 2016-08-20 19:19:19 UTC
(In reply to Egmont Koblinger from comment #6)
> (In reply to Christian Persch from comment #3)
> > This is due to the recursion limit set in
> > vteterminalprivate::create_match_context. Imho the regex you use should be
> > fixed.
> 
> Yup matching the next character probably shouldn't create a new recursion
> level. I wonder how glib did it, maybe it was smarter to optimize the
> recursion away. Or it still had that recursion with a deeper stack.

GRegex uses PCRE1 internally, so it's unlikely it does anything fundamentally differently. It probably just sets the limits differently (at compile time, iirc).

> Increasing 64 to a somewhat higher number obviously wouldn't solve the
> problem. Maybe increasing to a much higher number, e.g. 4096? Or would it
> work if we entirely omitted the set_recursion_level() call? (Of course,
> poorly written or poorly executed regexes, just as the one in x4-t, combined
> with long paragraphs could lead to terrible performance, so it's good to
> have a safeguard in place, and preferably a reasonably low one.)

I picked those numbers in create_match_context() arbitrarily, but tight limits are prefereable here. Since the regex in question (and the old g-t regexes too) are obviously wrong here, I prefer not to change the limit on their account.
 
> (In reply to Igor from comment #5)
> > Looks like I have to change my regexes in a similar manner to how Christian
> > has done this in the commit mentioned by Egmont above.
> 
> That was actually my work :D, just committed by Christian for reasons I
> cannot recall. And he even set the author as me, but then it had to be
> reverted (for reasons I don't know) and applied again and there he forgot to
> set it again :D

I borked the commit somehow, but iirc the only changes were to glue/build, and definitely no changes to the regexes. So they're all yours. They're his anyway under the GPL3, of course :-)
Comment 8 Egmont Koblinger 2016-08-20 21:21:34 UTC
(In reply to Christian Persch′ from comment #7)

> Since the regex in question (and the old g-t
> regexes too) are obviously wrong here, I prefer not to change the limit on
> their account.

Yup I totally agree.

PS1. It was unintentional that my rewrite fixed this issue, I didn't realize this problem with the old one, nor did I know the new one won't suffer from this.

PS2. We should add a unittest for a very long match.
Comment 9 Egmont Koblinger 2016-08-20 21:43:00 UTC
PS2 addressed.
Comment 10 Igor 2016-08-22 11:04:04 UTC
(In reply to Egmont Koblinger from comment #6)
> That was actually my work :D, just committed by Christian for reasons I
> cannot recall. And he even set the author as me, but then it had to be
> reverted (for reasons I don't know) and applied again and there he forgot to
> set it again :D
> 
> Both our projects are GPL, so I guess you can legally just take our regexes
> as-is, should you want to do that. I'd personally be totally fine with that.
> Christian, let us know if you have any objections or bad feelings about it.

Thank you then :)
It looks like you won't mind if I just take terminal-regex.h and put it into xfce4-terminal code base, right?

(In reply to Igor from comment #6)
> What is the reason why vteregex.h is not including pcre2.h? That would save
> terminal apps from including it.

Still, why wouldn't you include pcre2.h from vteregex.h?
Comment 11 Christian Persch′ 2016-08-23 06:23:38 UTC
(In reply to Igor from comment #10)
> (In reply to Igor from comment #6)
> > What is the reason why vteregex.h is not including pcre2.h? That would save
> > terminal apps from including it.
> 
> Still, why wouldn't you include pcre2.h from vteregex.h?

Because one cannot include pcre2.h without first defining PCRE2_CODE_UNIT_WIDTH, and we cannot do so without interfering with whatever the app itself wants to use there. 

Now, all you need pcre2.h for is the #define'd constants (PCRE2_UTF etc), so it would make sense to split that off into a pcre2-constants.h file that is independent of the code unit width, but that's for pcre upstream to decide.
Comment 12 Igor 2016-08-24 14:20:31 UTC
(In reply to Christian Persch′ from comment #11)
> > Still, why wouldn't you include pcre2.h from vteregex.h?
> 
> Because one cannot include pcre2.h without first defining
> PCRE2_CODE_UNIT_WIDTH, and we cannot do so without interfering with whatever
> the app itself wants to use there. 
> 
> Now, all you need pcre2.h for is the #define'd constants (PCRE2_UTF etc), so
> it would make sense to split that off into a pcre2-constants.h file that is
> independent of the code unit width, but that's for pcre upstream to decide.

Makes sense, thanks.
Comment 13 Igor 2016-08-24 14:21:05 UTC
I've closed this bug, thank you Egmont and Christian.