GNOME Bugzilla – Bug 756038
Rewrite URL regex from scratch
Last modified: 2016-04-28 21:32:26 UTC
The current URL and e-mail regexes contain plenty of bugs (see the "blocks" field). Some of these bugs have patches attached that would probably heavily conflict with each other, and some even try to solve much more than what's in that bug report. So reviewing and applying those patches one by one is not feasible. We should rewrite the pattern from scratch, addressing all these issues once and for all. With tons of comments, because regexes are very hard to read. And unittests.

For excluding the terminating dot (introduced by the patch to bug 571002, which I'm afraid is the culprit for some other issues) we should use negative lookahead (http://www.rexegg.com/regex-lookarounds.html); that's the clean solution without side effects. I wish I had known about this regex feature a long time ago!

Conditionals (http://www.rexegg.com/regex-conditionals.html) will probably help us recognize URLs enclosed between certain marks, e.g. if the match goes like "«http..." then it ends at a "»", otherwise "»" is a valid inner character. I'm not sure at this point whether lookbehind-lookahead can be combined with conditionals (along the lines of: the lookbehind is [«⟨({], and the lookahead is: if the lookbehind matched "«" then "»", elif it matched "⟨" then "⟩", elif...). If it doesn't work out, we can still add a separate regex for each opening-closing pair we'd like to recognize; there'll be no more than maybe 5-10 of them.

How compatible are GRegex and PCRE2 with each other?
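For what it's worth, Python's re engine supports the same conditional syntax as PCRE, so here's a minimal sketch (my own toy pattern, not anything from g-t) of how a conditional can tie the closing mark to an optional opening one:

```python
import re

# Toy sketch (not g-t's actual pattern): group 1 optionally captures an
# opening "«"; the conditional (?(1)...) then demands the closing "»" only
# if group 1 participated in the match. Otherwise the conditional matches
# the empty string and no closing mark is required.
pat = re.compile(r'(«)?(https?://[^\s«»]+)(?(1)»)')

# Both return 'http://example.com/x':
pat.search('see «http://example.com/x» here').group(2)
pat.search('see http://example.com/x here').group(2)
```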
GRegex is PCRE1, and PCRE1 and PCRE2 aren't much different. So we can just target PCRE2 and it should work with GRegex too. For the full set of features PCRE2 supports, see man:pcre2pattern(3), or for PCRE1, man:pcrepattern(3).
Created attachment 313015 [details] [review] Unittest v0 Preliminary regex unittesting infrastructure (on top of bug 741728 patch v7)
Review of attachment 313015 [details] [review]: I like adding tests :-) However, as you can see from your Makefile.am changes, doing it like this isn't exactly convenient... how about this: move all of gnome-terminal-server's sources to a noinst_LTLIBRARIES convenience library, link g-t-s to that, and link the tests to that? And move the precompile_regex function from terminal-screen to terminal-regex.[ch] and the test to src/tests/regex.c (linking to the convenience lib, if necessary).
I've restructured the build on master now.
I'm really not an autoconf-automake-whatever guy, I generally prefer if you do these kinds of things for me :) I'll try to do what you outlined above, but I might ask for further help, e.g. for hooking things up so that the test is run at tarball creation time. I'm not even sure what the command is that's supposed to run all the tests; "make check" already does certain things but fails for me in g-t as well as vte.

Your recent change broke g-t for me:

(gnome-terminal-server:15190): Gtk-CRITICAL **: Unable to load resource for composite template for type 'TerminalWindow': The resource at '/org/gnome/terminal/ui/window.ui' does not exist
(gnome-terminal-server:15190): Gtk-CRITICAL **: gtk_widget_init_template: assertion 'template != NULL' failed
Segmentation fault
Reverted until I can figure this out. So you could just move the regex stuff to terminal-regex.[ch] and have test-regex.c build with just these two files instead of that gigantic SOURCES. For tests, you just do

check_PROGRAMS = test-regex
test_regex_SOURCES = ...

etc. as usual.
#define USERPASS USERCHARS_CLASS "+(?:" PASSCHARS_CLASS "+)?"

Username and password are _almost_ separated by a colon in current g-t :-) (That ":" belongs to the non-capturing group syntax; the literal separator is missing.)

I guess I'll rework the unittests to work on explicitly given regex strings rather than on the url_regexes array, and have defines for the final giant patterns, i.e. change

static const TerminalRegexPattern url_regex_patterns[] = {
  { SCHEME "//(?:" USERPASS "\\@)?" HOST PORT URLPATH, FLAVOR_AS_IS, TRUE },
  ...

to

#define URL SCHEME "//(?:" USERPASS "\\@)?" HOST PORT URLPATH

static const TerminalRegexPattern url_regex_patterns[] = {
  { URL, FLAVOR_AS_IS, TRUE },
  ...

This will have multiple advantages:
- We can test smaller building blocks like USERPASS, not just the final bloat;
- More decoupling from the rest of g-t (only the defines are needed, and they can go to a separate .h file);
- We can test against both regex engines in a single run.
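To illustrate testing a building block in isolation, here's a sketch in Python (the character classes below are simplified stand-ins I made up, since Python's re has no POSIX classes like [[:alnum:]]; the real USERCHARS_CLASS/PASSCHARS_CLASS differ):

```python
import re

# Simplified stand-ins for the real classes (assumptions, for illustration):
USERCHARS_CLASS = r"[-.0-9A-Za-z]"
PASSCHARS_CLASS = r"[-.0-9A-Za-z%!]"

# Note the "(?::" -- the first ':' opens the non-capturing group, the second
# one is the literal username:password separator that was missing above.
USERPASS = USERCHARS_CLASS + r"+(?::" + PASSCHARS_CLASS + r"+)?"

assert re.fullmatch(USERPASS, "joe")          # username alone is fine
assert re.fullmatch(USERPASS, "joe:s3cret")   # username:password
assert not re.fullmatch(USERPASS, "joe:")     # dangling separator rejected
```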
(In reply to Egmont Koblinger from comment #0)
> For excluding the terminating dot (introduced by the patch to bug 571002,
> which I'm afraid is the culprit for some other issues) we should use
> negative lookahead

No clue why I thought lookahead can do it. Now it seems it cannot. (Or am I wrong now?)
I wasn't that wrong after all. We need a look_behind_ at the end of the regex: let it consume all kinds of alphanumeric and punctuation characters, including dots etc., and at the very end look back to ensure the last character isn't a dot. It works, it's simple, straightforward and awesome :)
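A quick illustration of the lookbehind trick, sketched with Python's re (the character class here is a toy stand-in, not the real URLPATH):

```python
import re

# Consume dots freely, then look back at the very end of the match to
# refuse a final dot; the engine backtracks one character if needed.
pat = re.compile(r'https?://[-\w./]+(?<!\.)')

m = pat.search('See http://example.com/a. Next sentence.')
m.group(0)   # 'http://example.com/a' -- the sentence-ending dot is excluded
```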
Created attachment 313996 [details] [review] Regex fixes + unittest, v1 (work in progress)
FYI: I'm planning to remove the caseless boolean field and use (?i:...) instead. Rationale: it allows better unittesting (e.g. of the scheme only) and theoretically an entire giant regex could be built from a mixture of case-sensitive and case-insensitive blocks. (In practice, right now we only use case insensitivity for the scheme.) Let me know if you have concerns.
Style question: I have this right now (and it might get worse):

#define XYZ "(?:[-[:alnum:]]|(?![[:ascii:]])[[:graph:]])"

Do you think adding spaces (eliminated by cpp) would improve readability, or just increase visual clutter?

#define XYZ "(?:" "[-[:alnum:]]" "|" "(?!" "[[:ascii:]]" ")" "[[:graph:]]" ")"

How about certain defines, something like this?

#define NC  "(?:"  // non-capturing
#define NLA "(?!"  // negative lookahead
#define CL  ")"    // close
#define OR  "|"

#define XYZ NC "[-[:alnum:]]" OR NLA "[[:ascii:]]" CL "[[:graph:]]" CL

They'd probably help prevent bugs like the one at the beginning of comment 7 (where that ":" was mistaken for a literal colon whereas it's part of the non-capturing group syntax). On the other hand, it's almost a new language to learn :D
Oh, there's something called "free-spacing mode", I'll need that! :)
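To show what free-spacing buys us, here's the XYZ fragment from above rewritten in Python's re.VERBOSE mode (the classes are rough stand-ins I picked, since Python's re lacks [[:alnum:]] and [[:ascii:]]; PCRE's (?x) flag behaves the same way):

```python
import re

# In free-spacing mode, whitespace and "#" comments inside the pattern are
# ignored (except within character classes), so the regex can be laid out
# and annotated like ordinary code.
XYZ = re.compile(r"""
    (?:                    # either
        [-0-9A-Za-z]       #   a dash or an ASCII alphanumeric,
      | (?! [\x00-\x7F] )  # or, provided the next char is not ASCII,
        \S                 #   any non-whitespace char (roughly [[:graph:]])
    )
""", re.VERBOSE)

assert XYZ.fullmatch("a")
assert XYZ.fullmatch("é")      # non-ASCII letter accepted
assert not XYZ.fullmatch("!")  # ASCII punctuation rejected
```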
Created attachment 314100 [details] [review]
Regex fixes + unittest, v2

The architecture was reworked; it's pretty much what I had in mind. I think it's almost complete. Two small issues remain to be addressed:
- Run the regexes against pcre2 in addition to glib;
- Hook up to distcheck or whatever so that it's run at tarball creation.

The regexes themselves begin to look promising, but they're still far from ideal.
Created attachment 314102 [details] [review]
Regex fixes + unittest, v2

Haha, attached a totally different file :) Here it is.
Created attachment 314548 [details] [review]
Regex fixes + unittest, v3

Here's where I am right now. It's still work in progress, but it's quite decent now.

It might look terribly complex, sometimes a total overkill. E.g. why have an exact IPv6 regex? Well, for one, I had fun implementing it. The thing is: we could go in the opposite direction and say that anything from a http:// to the next whitespace or trailing dot is a URL, and the rest (recognizing hostname, port etc.) is not our business. This could result in even simpler regexes than current g-t has.

It's hard to find a balance, and in fact, I believe that the half-decent solution we had so far is the hardest to maintain. It _sometimes_ understands the semantics of the string, but makes so many mistakes. E.g. one I just discovered: currently "file:///mnt/lost+found" is recognized as a valid URI (we mistakenly believe that "mnt" is a hostname), but "file:///lost+found" is not (similarly, we'd mistakenly parse "lost+found" as the hostname, whereas '+' is not a valid hostname character).

From this halfway decent solution (where it's damn hard to fix such bugs) we can go to the rough'n'simple approach of accepting pretty much everything (where I'm afraid it's inevitable to have almost-impossible-to-fix minor bugs), or try to be as pedantic as reasonably possible.

Unittesting also benefits from being pedantic. It's a huge win that we can test not only valid matches but also invalid ones. Keeping the patterns within strict boundaries helps us ensure that we indeed correctly parse the semantics of URLs.
Hi, oops! I didn't search bugzilla beforehand, so duplicate work :-( I did have a look at your patch; does your patch support the IPvFuture syntax as described in https://tools.ietf.org/html/rfc3986 ? Maybe supporting https://tools.ietf.org/html/rfc6874 is also a good thing, but maybe it's not really needed as an extension to RFC 3986.
(In reply to Thomas Meyer from comment #17)
> I did have a look at your patch; does your patch support the IPvFuture
> syntax as described in https://tools.ietf.org/html/rfc3986 ?

Nope, I've never heard of this. Feel free to improve my patch (along with the unittests), thanks! Although I doubt it's of any practical importance.
I didn't find RFC 3986's IPv6 BNF earlier, and all the IPv6 regexes I found on the net were terrible, so I came up with my own. However, I tend to prefer your approach of implementing the RFC's BNF.
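For reference, RFC 3986's IPvFuture production is tiny; here's a rough Python sketch, with the class contents hand-translated by me from the ABNF (in a full URI the address would additionally sit inside square brackets; this matches only the bare production):

```python
import re

# IPvFuture per RFC 3986:
#   IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
# sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
# ('-' is placed last in the class so it stays literal.)
IPVFUTURE = re.compile(r"v[0-9A-Fa-f]+\.[0-9A-Za-z._~!$&'()*+,;=:-]+")

assert IPVFUTURE.fullmatch("v1.fe80::a+en1")
assert not IPVFUTURE.fullmatch("v.abc")   # at least one HEXDIG is required
```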
Feel free to use my code! Does gnome have a Gerrit instance somewhere? Patch tracking in bugzilla is a bit awkward... or isn't it? I never used it this way.
I never used Gerrit :P No clue if GNOME has one, I'm not aware of it.
IMHO the patch is ready; it certainly can't be worse than the current code :-)
:-)
Let's double check that we address bug 762151 too.
I committed this to master; let's call this FIXED and open new bugs if anything remains.
(In reply to Egmont Koblinger from comment #16)
> From this halfway decent solution (where it's damn hard to fix such bugs) we
> can go to the rough n simple approach of accepting pretty much everything
> (where I'm afraid it's inevitable to have almost-impossible-to-fix minor
> bugs), or try to be as pedantic as reasonably possible.
>
> Unittesting also benefits from being pedantic. It's a huge win that we can
> not only test valid matches but also invalid ones. Keeping it within strict
> boundaries helps us ensure that we indeed correctly parse the semantics of
> the URLs.

http://askubuntu.com/q/671278/398785 brings up another excellent data point for why going pedantic was probably the right choice. Here someone complains that an NPM package such as "react@0.13.3" is highlighted as an email address. With the new regexes, even if it's modified to be a valid IP (e.g. "react@1.2.3.4") it's still not recognized, because IP addresses need to be enclosed within brackets in email addresses. So, by being pedantic, we highlight fewer false positives and hence cause less frustration to those working with something totally different that just happens to look similar to URLs or email addresses.
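A toy sketch of that pedantic rule in Python (the shapes below are my own simplifications, not g-t's real regexes): the host part of an email is either a dotted name whose last label starts with a letter, or an IPv4 address in square brackets.

```python
import re

EMAIL = re.compile(
    r"[0-9A-Za-z._%+-]+@"
    r"(?:(?:[0-9A-Za-z-]+\.)+[A-Za-z][0-9A-Za-z-]*"   # named host
    r"|\[\d{1,3}(?:\.\d{1,3}){3}\])"                  # bracketed IPv4
)

assert EMAIL.fullmatch("user@example.com")
assert EMAIL.fullmatch("user@[1.2.3.4]")
assert not EMAIL.fullmatch("react@0.13.3")   # NPM version, not an address
assert not EMAIL.fullmatch("react@1.2.3.4")  # bare IP: brackets required
```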