GNOME Bugzilla – Bug 763980
URL containing parentheses
Last modified: 2017-12-17 22:35:17 UTC
Found at https://bugs.launchpad.net/terminator/+bug/1559635 This should be recognized as a complete URL, including the pair of parentheses: https://en.wikipedia.org/wiki/The_Offspring_(album) However, when the entire URL is enclosed in parentheses, they should be ignored: (https://en.wikipedia.org/wiki/The_Offspring) ... And then handle the combination of the two correctly, ouch!!! (https://en.wikipedia.org/wiki/The_Offspring_(album))
A few other user complaints pro/con including/excluding trailing parentheses: https://askubuntu.com/questions/895710/gnome-terminal-highlight-urls-ending-in-close-paren/895813 also mentions wikipedia articles with disambiguation, e.g.: http://zelda.gamepedia.com/Ocarina_of_Time_(Item) https://github.com/thestinger/termite/issues/472 mentions markdown syntax, e.g. [description](http://www.link.com)
I can think of two possible approaches to address this problem: 1. In the regex that specifies the path characters, somehow explicitly denote that '(' and ')' are only allowed if they are balanced. Something along the lines of PATHCHARS_CLASS* ( \( PATHCHARS_CLASS* \) PATHCHARS_CLASS* )? 2. Lookbehind for a '(' and if found, require a lookahead ')'. Not sure which one is the easier to implement or would give the better results. My guts feeling tell me to try the first one first, especially if we try to care about nested parentheses as well (not sure if we should, and I'm a bit afraid that the required recursive regex might cause exponential slowdown).
Created attachment 350253 [details] [review] Fix v1 Here's an implementation of allowing nested parentheses. Thinking about it more, I don't think we should be afraid of exponential cost. That would require recursion and branches (like "(a|b)"). Technically we do have a recursion here, but it's always one matching path that's viable, so practically there are no multiple branches to track (or only tiny local ones that don't make it into recursion). I'm no longer sure if it's nice to exclude trailing dots using a lookbehind at the end, maybe an explicit set for the allowed trailing characters is cleaner. At least the combination of recursion and lookaround caused me a bit of headache. Let me see...
Should any other chars, e.g. [ ] or { } get this treatment?
Created attachment 350271 [details] [review] Fix v2 As promised, here's the second version. Instead of lookbehind, this one uses an explicit set for the last character. I like this one better.
Comment on attachment 350271 [details] [review] Fix v2 Nice use of PCRE features :-) Let's try this for master.
Comment on attachment 350271 [details] [review] Fix v2 Committed to master.
Still not sure about [ ] and { }. Currently they are not URL characters. If I recall correctly, [] are used by PHP in query parameters whose values are arrays. Not sure if we should simply add these two characters to the allowed set, or again accept balanced pairs only. Anyways... Let's close this one, and come back to [] if anyone cares.
[] are actually quite frequent, and according to bug 448044 comment 4 they're fine unencoded (while {} are not mentioned there). So let's add support for balanced pairs of [], assuming they don't overlap () pairs.
Created attachment 365663 [details] [review] Square brackets
Comment on attachment 365663 [details] [review] Square brackets Ok.
Submitted.