Bug 763980 – URL containing parentheses

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 763980 - URL containing parentheses


Summary:	URL containing parentheses


Status:	RESOLVED FIXED

Product:	gnome-terminal
Classification:	Core
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal minor
Target Milestone:	---
Assigned To:	GNOME Terminal Maintainers
QA Contact:	GNOME Terminal Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-03-21 10:38 UTC by Egmont Koblinger
Modified:	2017-12-17 22:35 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Fix v1 (4.10 KB, patch) 2017-04-22 20:31 UTC, Egmont Koblinger	none	Details \| Review
Fix v2 (4.76 KB, patch) 2017-04-23 19:04 UTC, Egmont Koblinger	committed	Details \| Review
Square brackets (2.13 KB, patch) 2017-12-17 22:06 UTC, Egmont Koblinger	committed	Details \| Review

Description Egmont Koblinger 2016-03-21 10:38:04 UTC

Found at https://bugs.launchpad.net/terminator/+bug/1559635

This should be recognized as a complete URL, including the pair of parentheses:
https://en.wikipedia.org/wiki/The_Offspring_(album)

However, when the entire URL is enclosed in parentheses, they should be ignored:
(https://en.wikipedia.org/wiki/The_Offspring)

... And then handle the combination of the two correctly, ouch!!!
(https://en.wikipedia.org/wiki/The_Offspring_(album))

Comment 1 Egmont Koblinger 2017-04-21 21:46:19 UTC

A few other user complaints pro/con including/excluding trailing parentheses:

https://askubuntu.com/questions/895710/gnome-terminal-highlight-urls-ending-in-close-paren/895813
also mentions wikipedia articles with disambiguation, e.g.:
http://zelda.gamepedia.com/Ocarina_of_Time_(Item)

https://github.com/thestinger/termite/issues/472
mentions markdown syntax, e.g.
[description](http://www.link.com)

Comment 2 Egmont Koblinger 2017-04-21 21:55:51 UTC

I can think of two possible approaches to address this problem:

1. In the regex that specifies the path characters, somehow explicitly denote that '(' and ')' are only allowed if they are balanced. Something along the lines of
  PATHCHARS_CLASS* ( \( PATHCHARS_CLASS* \) PATHCHARS_CLASS* )?

2. Lookbehind for a '(' and if found, require a lookahead ')'.

Not sure which one is the easier to implement or would give the better results. My guts feeling tell me to try the first one first, especially if we try to care about nested parentheses as well (not sure if we should, and I'm a bit afraid that the required recursive regex might cause exponential slowdown).

Comment 3 Egmont Koblinger 2017-04-22 20:31:36 UTC

Created attachment 350253 [details] [review]
Fix v1

Here's an implementation of allowing nested parentheses.

Thinking about it more, I don't think we should be afraid of exponential cost. That would require recursion and branches (like "(a|b)"). Technically we do have a recursion here, but it's always one matching path that's viable, so practically there are no multiple branches to track (or only tiny local ones that don't make it into recursion).

I'm no longer sure if it's nice to exclude trailing dots using a lookbehind at the end, maybe an explicit set for the allowed trailing characters is cleaner. At least the combination of recursion and lookaround caused me a bit of headache. Let me see...

Comment 4 Egmont Koblinger 2017-04-22 20:36:09 UTC

Should any other chars, e.g. [ ] or { } get this treatment?

Comment 5 Egmont Koblinger 2017-04-23 19:04:27 UTC

Created attachment 350271 [details] [review]
Fix v2

As promised, here's the second version. Instead of lookbehind, this one uses an explicit set for the last character. I like this one better.

Comment 6 Christian Persch 2017-04-23 20:20:40 UTC

Comment on attachment 350271 [details] [review]
Fix v2

Nice use of PCRE features :-)  Let's try this for master.

Comment 7 Egmont Koblinger 2017-04-23 21:45:34 UTC

Comment on attachment 350271 [details] [review]
Fix v2

Committed to master.

Comment 8 Egmont Koblinger 2017-04-23 21:52:21 UTC

Still not sure about [ ] and { }. Currently they are not URL characters.

If I recall correctly, [] are used by PHP in query parameters whose values are arrays. Not sure if we should simply add these two characters to the allowed set, or again accept balanced pairs only. Anyways... Let's close this one, and come back to [] if anyone cares.

Comment 9 Egmont Koblinger 2017-12-17 22:05:49 UTC

[] are actually quite frequent, and according to bug 448044 comment 4 they're fine unencoded (while {} are not mentioned there).

So let's add support for balanced pairs of [], assuming they don't overlap () pairs.

Comment 10 Egmont Koblinger 2017-12-17 22:06:42 UTC

Created attachment 365663 [details] [review]
Square brackets

Comment 11 Christian Persch 2017-12-17 22:31:00 UTC

Comment on attachment 365663 [details] [review]
Square brackets

Ok.

Comment 12 Egmont Koblinger 2017-12-17 22:35:17 UTC

Submitted.