GNOME Bugzilla – Bug 756038
Rewrite URL regex from scratch
Last modified: 2016-04-28 21:32:26 UTC
The current URL and e-mail regexes contain plenty of bugs (see the "blocks" field). Some of these bugs have patches attached that would probably heavily conflict with each other, and some even try to solve much more than what's in that bug report. So reviewing and applying those patches one by one is not feasible. We should rewrite the pattern from scratch, addressing all these issues once and for all. With tons of comments, because regexes are very hard to read. And unittests.

For excluding the terminating dot (introduced by the patch to bug 571002, which I'm afraid is the culprit for some other issues) we should use negative lookahead (http://www.rexegg.com/regex-lookarounds.html); that's the clean solution without side effects. I wish I had known about this regex feature a long time ago!

Conditionals (http://www.rexegg.com/regex-conditionals.html) will probably help us recognize URLs enclosed between certain marks, e.g. if the match goes like "«http..." then it ends at a "»", otherwise "»" is a valid inner character. I'm not sure at this point whether lookbehind-lookahead can be combined with conditionals (along the lines of: the lookbehind is [«⟨({], and the lookahead is: if the lookbehind matched "«" then "»", elif it matched "⟨" then "⟩", elif...). If it doesn't work out, we can still add a separate regex for each opening-closing pair we'd like to recognize; there'll be no more than maybe 5-10 of them.

How compatible are GRegex and PCRE2 with each other?
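For what it's worth, Python's re engine supports the same conditional syntax as PCRE, so here's a minimal sketch (my own toy pattern, not anything from g-t) of how a conditional can tie the closing mark to an optional opening one:

```python
import re

# Toy sketch (not g-t's actual pattern): group 1 optionally captures an
# opening "«"; the conditional (?(1)...) then demands the closing "»" only
# if group 1 participated in the match. Otherwise the conditional matches
# the empty string and no closing mark is required.
pat = re.compile(r'(«)?(https?://[^\s«»]+)(?(1)»)')

# Both return 'http://example.com/x':
pat.search('see «http://example.com/x» here').group(2)
pat.search('see http://example.com/x here').group(2)
```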
GRegex is PCRE1, and PCRE1 and PCRE2 aren't much different. So we can just target PCRE2 and it should work with GRegex too. For the full set of features PCRE2 supports, see man:pcre2pattern(3), or for PCRE1, man:pcrepattern(3).
Created attachment 313015 [details] [review] Unittest v0 Preliminary regex unittesting infrastructure (on top of bug 741728 patch v7)
Review of attachment 313015 [details] [review]: I like adding tests :-) However, as you can see from your Makefile.am changes, doing it like this isn't exactly convenient... how about this: move all of gnome-terminal-server's sources to a noinst_LTLIBRARIES convenience library, link g-t-s to that, and link the tests to that? And move the precompile_regex function from terminal-screen to terminal-regex.[ch] and the test to src/tests/regex.c (linking to the convenience lib, if necessary).
I've restructured the build on master now.
I'm really not an autoconf-automake-whatever guy, I generally prefer if you do these kinds of things for me :) I'll try to do what you outlined above, but I might ask for further help, e.g. for hooking things up so that the test is run at tarball creation time. I'm not even sure what the command is that's supposed to run all the tests; "make check" already does certain things but fails for me in g-t as well as vte.

Your recent change broke g-t for me:

(gnome-terminal-server:15190): Gtk-CRITICAL **: Unable to load resource for composite template for type 'TerminalWindow': The resource at '/org/gnome/terminal/ui/window.ui' does not exist
(gnome-terminal-server:15190): Gtk-CRITICAL **: gtk_widget_init_template: assertion 'template != NULL' failed
Segmentation fault
Reverted until I can figure this out. So you could just move the regex stuff to terminal-regex.[ch] and have test-regex.c build with just these two files instead of that gigantic SOURCES. For tests, you just do

check_PROGRAMS = test-regex
test_regex_SOURCES = ...

etc. as usual.
#define USERPASS USERCHARS_CLASS "+(?:" PASSCHARS_CLASS "+)?"

Username and password are _almost_ separated by a colon in current g-t :-) (That ":" belongs to the non-capturing group syntax; the literal separator is missing.)

I guess I'll rework the unittests to work on explicitly given regex strings rather than on the url_regexes array, and have defines for the final giant patterns, i.e. change

static const TerminalRegexPattern url_regex_patterns[] = {
  { SCHEME "//(?:" USERPASS "\\@)?" HOST PORT URLPATH, FLAVOR_AS_IS, TRUE },
  ...

to

#define URL SCHEME "//(?:" USERPASS "\\@)?" HOST PORT URLPATH

static const TerminalRegexPattern url_regex_patterns[] = {
  { URL, FLAVOR_AS_IS, TRUE },
  ...

This will have multiple advantages:
- We can test smaller building blocks like USERPASS, not just the final bloat;
- More decoupling from the rest of g-t (only the defines are needed, and they can go to a separate .h file);
- We can test against both regex engines in a single run.
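To illustrate testing a building block in isolation, here's a sketch in Python (the character classes below are simplified stand-ins I made up, since Python's re has no POSIX classes like [[:alnum:]]; the real USERCHARS_CLASS/PASSCHARS_CLASS differ):

```python
import re

# Simplified stand-ins for the real classes (assumptions, for illustration):
USERCHARS_CLASS = r"[-.0-9A-Za-z]"
PASSCHARS_CLASS = r"[-.0-9A-Za-z%!]"

# Note the "(?::" -- the first ':' opens the non-capturing group, the second
# one is the literal username:password separator that was missing above.
USERPASS = USERCHARS_CLASS + r"+(?::" + PASSCHARS_CLASS + r"+)?"

assert re.fullmatch(USERPASS, "joe")          # username alone is fine
assert re.fullmatch(USERPASS, "joe:s3cret")   # username:password
assert not re.fullmatch(USERPASS, "joe:")     # dangling separator rejected
```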
(In reply to Egmont Koblinger from comment #0)
> For excluding the terminating dot (introduced by the patch to bug 571002,
> which I'm afraid is the culprit for some other issues) we should use
> negative lookahead

No clue why I thought lookahead can do it. Now it seems it cannot. (Or am I wrong now?)
I wasn't that wrong after all. We need a look_behind_ at the end of the regex: let it consume all kinds of alphanumeric and punctuation characters, including dots etc., and at the very end look back to ensure the last character isn't a dot. It works, it's simple, straightforward and awesome :)
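A quick illustration of the lookbehind trick, sketched with Python's re (the character class here is a toy stand-in, not the real URLPATH):

```python
import re

# Consume dots freely, then look back at the very end of the match to
# refuse a final dot; the engine backtracks one character if needed.
pat = re.compile(r'https?://[-\w./]+(?<!\.)')

m = pat.search('See http://example.com/a. Next sentence.')
m.group(0)   # 'http://example.com/a' -- the sentence-ending dot is excluded
```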
Created attachment 313996 [details] [review] Regex fixes + unittest, v1 (work in progress)
FYI: I'm planning to remove the caseless boolean field and use (?i:...) instead. Rationale: it allows better unittesting (e.g. of the scheme only) and theoretically an entire giant regex could be built from a mixture of case-sensitive and case-insensitive blocks. (In practice, right now we only use case insensitivity for the scheme.) Let me know if you have concerns.
Style question: I have this right now (and it might get worse):

#define XYZ "(?:[-[:alnum:]]|(?![[:ascii:]])[[:graph:]])"

Do you think adding spaces (eliminated by cpp) would improve readability, or just increase visual clutter?

#define XYZ "(?:" "[-[:alnum:]]" "|" "(?!" "[[:ascii:]]" ")" "[[:graph:]]" ")"

How about certain defines, something like this?

#define NC  "(?:"  // non-capturing
#define NLA "(?!"  // negative lookahead
#define CL  ")"    // close
#define OR  "|"

#define XYZ NC "[-[:alnum:]]" OR NLA "[[:ascii:]]" CL "[[:graph:]]" CL

They'd probably help prevent bugs like the one at the beginning of comment 7 (where that ":" was mistaken for a literal colon whereas it's part of the non-capturing group syntax). On the other hand, it's almost a new language to learn :D
Oh, there's something called "free-spacing mode", I'll need that! :)
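To show what free-spacing buys us, here's the XYZ fragment from above rewritten in Python's re.VERBOSE mode (the classes are rough stand-ins I picked, since Python's re lacks [[:alnum:]] and [[:ascii:]]; PCRE's (?x) flag behaves the same way):

```python
import re

# In free-spacing mode, whitespace and "#" comments inside the pattern are
# ignored (except within character classes), so the regex can be laid out
# and annotated like ordinary code.
XYZ = re.compile(r"""
    (?:                    # either
        [-0-9A-Za-z]       #   a dash or an ASCII alphanumeric,
      | (?! [\x00-\x7F] )  # or, provided the next char is not ASCII,
        \S                 #   any non-whitespace char (roughly [[:graph:]])
    )
""", re.VERBOSE)

assert XYZ.fullmatch("a")
assert XYZ.fullmatch("é")      # non-ASCII letter accepted
assert not XYZ.fullmatch("!")  # ASCII punctuation rejected
```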
Created attachment 314100 [details] [review]
Regex fixes + unittest, v2

The architecture was reworked; it's pretty much what I had in mind. I think it's almost complete. Two small issues remain to be addressed:
- Run the regexes against pcre2 in addition to glib;
- Hook up to distcheck or whatever so that it's run at tarball creation.

The regexes themselves begin to look promising, but they're still far from ideal.
Created attachment 314102 [details] [review]
Regex fixes + unittest, v2

Haha, attached a totally different file :) Here it is.
Created attachment 314548 [details] [review]
Regex fixes + unittest, v3

Here's where I am right now. It's still work in progress, but it's quite decent now.

It might look terribly complex, sometimes a total overkill. E.g. why have an exact IPv6 regex? Well, for one, I had fun implementing it. The thing is: we could go in the opposite direction and say that anything from a http:// to the next whitespace or trailing dot is a URL, and the rest (recognizing hostname, port etc.) is not our business. This could result in even simpler regexes than current g-t has.

It's hard to find a balance, and in fact, I believe that the half-decent solution we had so far is the hardest to maintain. It _sometimes_ understands the semantics of the string, but makes so many mistakes. E.g. one I just discovered: currently "file:///mnt/lost+found" is recognized as a valid URI (we mistakenly believe that "mnt" is a hostname), but "file:///lost+found" is not (similarly, we'd mistakenly parse "lost+found" as the hostname, whereas '+' is not a valid hostname character).

From this halfway decent solution (where it's damn hard to fix such bugs) we can go to the rough'n'simple approach of accepting pretty much everything (where I'm afraid it's inevitable to have almost-impossible-to-fix minor bugs), or try to be as pedantic as reasonably possible.

Unittesting also benefits from being pedantic. It's a huge win that we can test not only valid matches but also invalid ones. Keeping the patterns within strict boundaries helps us ensure that we indeed correctly parse the semantics of URLs.
Hi, oops! I didn't search bugzilla beforehand, so duplicate work :-( I did have a look at your patch; does your patch support the IPvFuture syntax as described in https://tools.ietf.org/html/rfc3986 ? Maybe supporting https://tools.ietf.org/html/rfc6874 is also a good thing, but maybe it's not really needed as an extension to RFC 3986.
(In reply to Thomas Meyer from comment #17)
> I did have a look at your patch; does your patch support the IPvFuture
> syntax as described in https://tools.ietf.org/html/rfc3986 ?

Nope, I've never heard of this. Feel free to improve my patch (along with the unittests), thanks! Although I doubt it's of any practical importance.
I didn't find RFC 3986's IPv6 BNF earlier, and all the IPv6 regexes I found on the net were terrible, so I came up with my own. However, I tend to prefer your approach of implementing the RFC's BNF.
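For reference, RFC 3986's IPvFuture production is tiny; here's a rough Python sketch, with the class contents hand-translated by me from the ABNF (in a full URI the address would additionally sit inside square brackets; this matches only the bare production):

```python
import re

# IPvFuture per RFC 3986:
#   IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
# sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# ('-' is placed last in the class so it stays literal.)
IPVFUTURE = re.compile(r"v[0-9A-Fa-f]+\.[0-9A-Za-z._~!$&'()*+,;=:-]+")

assert IPVFUTURE.fullmatch("v1.fe80::a+en1")
assert not IPVFUTURE.fullmatch("v.abc")   # at least one HEXDIG is required
```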
Feel free to use my code! Does gnome have a Gerrit instance somewhere? Patch tracking in bugzilla is a bit awkward... or isn't it? I never used it this way.
I never used Gerrit :P No clue if GNOME has one, I'm not aware of it.
IMHO the patch is ready; it certainly can't be worse than the current code :-)
:-)
Let's double check that we address bug 762151 too.
I committed this to master; let's call this FIXED and open new bugs if anything remains.
(In reply to Egmont Koblinger from comment #16)
> From this halfway decent solution (where it's damn hard to fix such bugs) we
> can go to the rough n simple approach of accepting pretty much everything
> (where I'm afraid it's inevitable to have almost-impossible-to-fix minor
> bugs), or try to be as pedantic as reasonably possible.
>
> Unittesting also benefits from being pedantic. It's a huge win that we can
> not only test valid matches but also invalid ones. Keeping it within strict
> boundaries helps us ensure that we indeed correctly parse the semantics of
> the URLs.

http://askubuntu.com/q/671278/398785 brings up another excellent data point for why going pedantic was probably the right choice. Here someone complains that an NPM package such as "react@0.13.3" is highlighted as an email address. With the new regexes, even if it's modified to be a valid IP (e.g. "react@1.2.3.4") it's still not recognized, because IP addresses need to be enclosed within brackets in email addresses. So, by being pedantic, we highlight fewer false positives and hence cause less frustration to those working with something totally different that just happens to look similar to URLs or email addresses.
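A toy sketch of that pedantic rule in Python (the shapes below are my own simplifications, not g-t's real regexes): the host part of an email is either a dotted name whose last label starts with a letter, or an IPv4 address in square brackets.

```python
import re

EMAIL = re.compile(
    r"[0-9A-Za-z._%+-]+@"
    r"(?:(?:[0-9A-Za-z-]+\.)+[A-Za-z][0-9A-Za-z-]*"   # named host
    r"|\[\d{1,3}(?:\.\d{1,3}){3}\])"                  # bracketed IPv4
)

assert EMAIL.fullmatch("user@example.com")
assert EMAIL.fullmatch("user@[1.2.3.4]")
assert not EMAIL.fullmatch("react@0.13.3")   # NPM version, not an address
assert not EMAIL.fullmatch("react@1.2.3.4")  # bare IP: brackets required
```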