Bug 448044 - Problem with apostrophe in URL
Status: RESOLVED FIXED
Product: gnome-terminal
Classification: Core
Component: general
Version: 2.18.x
Hardware: Other Linux
Importance: Normal minor
Target Milestone: ---
Assigned To: GNOME Terminal Maintainers
QA Contact: GNOME Terminal Maintainers
Depends on: 756038
Blocks:
Reported: 2007-06-15 22:42 UTC by Sven Arvidsson
Modified: 2018-01-01 15:01 UTC
See Also:
GNOME target: ---
GNOME version: 2.17/2.18


Attachments
v0 (1.92 KB, patch)
2017-12-18 08:46 UTC, Egmont Koblinger
committed

Description Sven Arvidsson 2007-06-15 22:42:58 UTC
[ Forwarded from http://bugs.debian.org/426592 ]

Put "http://en.wikipedia.org/wiki/Moore's_law" in a terminal without the quotes. Move cursor over URL. "'s_law" is not part of the URL according to gnome-terminal. This is a flaw.
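
The truncation described above can be reproduced with a toy matcher. This is only a sketch: the pattern `URL_NO_APOS` is a hypothetical, heavily simplified stand-in and not gnome-terminal's actual regex. It assumes nothing more than that the apostrophe is missing from the set of characters allowed in the URL.

```python
import re

# Hypothetical, simplified URL matcher: the character class deliberately
# omits the apostrophe, mimicking the heuristic reported in this bug.
URL_NO_APOS = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&()*+,;=%-]+")

text = "see http://en.wikipedia.org/wiki/Moore's_law for details"
match = URL_NO_APOS.search(text)

# The match stops right before the apostrophe, so "'s_law" is left out.
print(match.group(0))
```

Under these assumptions the match is `http://en.wikipedia.org/wiki/Moore`, which is the behaviour the report complains about.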
Comment 1 Behdad Esfahbod 2007-06-15 22:51:02 UTC
Well, it's a heuristic: if we allow single-quote, then a URL in a quotation goes wrong... We need to make the cut somewhere, and leaving out single-quote made more sense previously. Times may have changed though with things like Wikipedia, I agree.
Comment 2 Havoc Pennington 2007-06-15 22:54:56 UTC
single quote would usually be escaped in an url wouldn't it?
Comment 3 Behdad Esfahbod 2007-06-15 23:02:55 UTC
That was what I first thought, but Firefox seems to happily keep it. Same for a bunch of others, like parentheses. It escapes backslash though, for example. So it seems like apps have become more relaxed these days.
Comment 4 Daniel Micay 2012-09-29 12:38:58 UTC
Obviously it's not possible to match every *intended* URL in unstructured text, but I think all truly valid ones (per the URL RFC) should be matched as a baseline.

From my perspective, treating single quotes and parentheses in a special way results in far more URLs being matched incorrectly than the alternative. The older RFC suggested surrounding URLs in unstructured text with < and >, which is exactly what ReStructured Text and many IRC bots do.

According to the URL RFC, all of these are valid characters to leave unencoded in a URL (but can be used with a special meaning within the URL scheme):

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Along with the fully unreserved characters:

      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

https://www.ietf.org/rfc/rfc1738.txt
https://tools.ietf.org/html/rfc3986
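
The character sets quoted above translate directly into a membership check. The sketch below is illustrative only: the helper name `may_appear_unencoded` is made up for this example, and it ignores both percent-encoding and the per-component rules that RFC 3986 layers on top of these sets.

```python
import string

# Character sets transcribed from the RFC 3986 grammar quoted above.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def may_appear_unencoded(ch: str) -> bool:
    """Hypothetical helper: is ch in one of the three sets above?"""
    return ch in GEN_DELIMS or ch in SUB_DELIMS or ch in UNRESERVED


url = "http://en.wikipedia.org/wiki/Moore's_law"
# Collect every character of the URL that falls outside the three sets.
offending = [ch for ch in url if not may_appear_unencoded(ch)]
print(offending)
```

For the Wikipedia URL from the description the list is empty: the apostrophe is a sub-delim, so nothing in that URL needs escaping according to these sets. A double quote, by contrast, is in none of them.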
Comment 5 Christian Persch 2016-02-21 15:09:46 UTC
Still happening with the regex rewrite from bug 756038.
Comment 6 Egmont Koblinger 2017-12-17 23:37:21 UTC
Apostrophes are not uncommon at the end of URLs either, e.g.
  https://en.wikipedia.org/wiki/Cryin'

In Chromium, go to wikipedia.org, type Cryin' and press Enter. The URL bar will show Cryin%27 and it copy-pastes accordingly. Go again to wikipedia.org and type again Cryin' and this time choose the first autocomplete dropdown with the mouse. The URL bar will say Cryin' and it'll copy-paste accordingly as apostrophe. (You're taken to the same page. I haven't compared the HTTP traffic.)

In Firefox, the URL bar shows an apostrophe but the address copy-pastes as %27. That is as long as you don't tamper with the URL bar. Retype any part of the URL (either the apostrophe or an irrelevant segment) and from then on it's an apostrophe. Then try to undo it: remove the apostrophe from the URL and start typing %27, as soon as you press % it recognizes it has the apostrophe-version in its history and replaces the % by that ' so you cannot type %27.

Double quotes seem to always get copy-pasted from browsers' URL bar as %22, so hopefully we don't need to worry about them. Wikipedia doesn't like it either, e.g. 12" redirects to 12-inch_single, etc.

Well, I was copying the entire URL in all these examples. Copying a part of them behaves differently, then it's the visible string that's copied.
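
The two spellings seen in the browsers above are the same character in encoded and literal form. A minimal sketch with Python's standard `urllib.parse` shows the round trip; it says nothing about what the browsers do internally, and Python's default escaping rules are just one possible policy.

```python
from urllib.parse import quote, unquote

# quote() only leaves letters, digits, "_.-~" and "/" alone by default,
# so both the apostrophe and the double quote get percent-encoded.
encoded_apostrophe = quote("Cryin'")
encoded_double_quote = quote('12"')

# unquote() turns the escaped form back into the literal character.
decoded = unquote("Cryin%27")

print(encoded_apostrophe, encoded_double_quote, decoded)
```

Here `quote("Cryin'")` gives `Cryin%27` and `unquote("Cryin%27")` gives `Cryin'` back, so a matcher that wants to catch both copy-paste variants has to accept the literal apostrophe as well as `%27`.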

Such a freaking mess... I'm cryin'...

Anyway, our takeaway is that we should probably add support for apostrophes (and probably double quotes aren't needed).

At the same time, it's essential for URLs within single quotes, e.g. 'http://example.com', not to grab the trailing apostrophe.

The balanced pair approach (bug 763980) obviously cannot work.

I can see two possible approaches:

- Have a branch at the outermost level, pretty much duplicating the entire big regex. One variant that doesn't allow embedded apostrophes, OR another variant that allows them, but begins with a lookbehind asserting that there isn't a preceding apostrophe.

- Regex conditionals, http://www.rexegg.com/regex-conditionals.html. At the beginning "define a variable" containing whether there's a preceding apostrophe (a named capture group doing lookbehind, or something like that) and then based on this variable do inner small local branches to allow/forbid apostrophes.
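
The first approach (two top-level branches) can be sketched in a few lines. This is a toy under stated assumptions: `NO_APOS`, `WITH_APOS` and `URL_BRANCHED` are hypothetical names, the character classes are far simpler than the real regex, and Python's `re` module stands in for the regex engine gnome-terminal actually uses.

```python
import re

# Toy character classes (assumptions, not the real gnome-terminal ones).
NO_APOS = r"[A-Za-z0-9._~:/?#@!$&()*+,;=%-]"
WITH_APOS = r"[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]"

URL_BRANCHED = re.compile(
    # Variant A: no apostrophe right before the URL, so apostrophes may
    # appear inside it (and at its end).
    rf"(?<!')https?://{WITH_APOS}+"
    # Variant B: fallback for any context; apostrophes never belong to the URL.
    rf"|https?://{NO_APOS}+"
)

quoted = URL_BRANCHED.search("say 'http://example.com' now").group(0)
embedded = URL_BRANCHED.search("http://en.wikipedia.org/wiki/Moore's_law").group(0)
print(quoted, embedded)
```

The quoted URL comes out as `http://example.com` and the Wikipedia one keeps its `'s_law` tail. The branches are ordered with the permissive variant first because alternation picks the first branch that matches. The cost is visible even in the toy: the whole URL pattern is written twice.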
Comment 7 Egmont Koblinger 2017-12-18 00:37:05 UTC
A third one, sounds simpler:

- Using regex conditionals, check at the very beginning whether there's a lookbehind apostrophe, and if so, require at the end that there isn't a lookbehind (i.e. trailing) apostrophe. (If there's one, backtracking will leave that out but match the rest, I hope.)
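
A toy version of this third idea, again in Python's `re` as a stand-in and with made-up names (`PATH_CHAR`, `URL_COND`, the group `apos_start`). It is a sketch of the technique only: the URL body here is one flat character class, so per-character backtracking is available, which is an assumption that need not hold for a more structured regex.

```python
import re

# Toy class of allowed URL characters, apostrophe included (an assumption).
PATH_CHAR = r"[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]"

URL_COND = re.compile(
    # Optional named group: it participates only if an apostrophe precedes.
    r"(?P<apos_start>(?<='))?"
    rf"https?://{PATH_CHAR}+"
    # Conditional: if the group participated, the match must not end in an
    # apostrophe; backtracking then gives up the trailing one.
    r"(?(apos_start)(?<!'))"
)

quoted = URL_COND.search("'http://example.com'").group(0)
trailing = URL_COND.search("https://en.wikipedia.org/wiki/Cryin'").group(0)
print(quoted, trailing)
```

With a leading apostrophe the match is `http://example.com`; without one, the trailing apostrophe of `Cryin'` stays part of the URL.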
Comment 8 Egmont Koblinger 2017-12-18 08:46:09 UTC
Created attachment 365673
v0

Here's a draft patch that seems to be working.

It's based on the 3rd approach. At the beginning there's a named capturing group APOS_START with a lookbehind to see whether there's a leading apostrophe. The whole thing is made optional because we mustn't bail out if there isn't.

At the end, lookbehind + backtracking cannot work because backtracking works on the level of regex blocks, not individual characters. A single backtracking step decides to omit the whole optional URLPATH. Maybe the regex could be reworked so that it works, but doesn't look easy.

Instead, luckily, we already define a different set of characters that can terminate the path (to exclude dot and comma, and I even sneaked in semicolon here yesterday). The path needs to end in one like this, unless it ends in a closing parenthesis or square bracket (or is empty), in which case we don't care about apostrophe here at all.

So here we branch on whether the optional opening apostrophe's named capturing group matched or not, and depending on that, we forbid or allow the apostrophe.
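
The branching described in this comment can be sketched roughly as follows. This is not the attached patch: the group name `apos_start` mirrors APOS_START, but `PATH_CHAR`, `TERM_PLAIN`, `TERM_APOS` and `URL_SKETCH` are hypothetical, the host and path classes are drastically simplified, the parenthesis and square-bracket cases are left out, and Python's `re` is used instead of the real engine.

```python
import re

# Characters allowed inside the path (toy class, an assumption).
PATH_CHAR = r"[A-Za-z0-9_~:/?#@!$&'*+=%.,;-]"
# Characters allowed to END the path: no dot, comma or semicolon.
TERM_PLAIN = r"[A-Za-z0-9_~/#@$&*+=%-]"    # apostrophe forbidden
TERM_APOS = r"[A-Za-z0-9_~/#@$&*+=%'-]"    # apostrophe allowed

URL_SKETCH = re.compile(
    # Named group that participates only when an apostrophe precedes the URL.
    r"(?P<apos_start>(?<='))?"
    # Toy scheme and host.
    r"https?://[A-Za-z0-9.-]+"
    # Optional path; its last character comes from a terminator class that
    # is chosen by whether apos_start participated.
    rf"(?:/(?:{PATH_CHAR}*(?(apos_start){TERM_PLAIN}|{TERM_APOS}))?)?"
)

quoted = URL_SKETCH.search("'http://example.com/a'").group(0)
trailing = URL_SKETCH.search("https://en.wikipedia.org/wiki/Cryin'").group(0)
sentence = URL_SKETCH.search("http://en.wikipedia.org/wiki/Moore's_law.").group(0)
print(quoted, trailing, sentence)
```

In this sketch the quoted URL ends at `a`, `Cryin'` keeps its apostrophe, and the sentence-final dot after `Moore's_law` is dropped. The two terminator classes differ only in the apostrophe, which is the duplication of path terminator characters mentioned later in the thread.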
Comment 9 Christian Persch 2017-12-31 12:05:09 UTC
Comment on attachment 365673
v0

You know the regexes better, so if you think this is the right fix, go for it :-)  Would be nice to add a test for this too.

Thanks!
Comment 10 Egmont Koblinger 2018-01-01 15:01:01 UTC
(In reply to Christian Persch from comment #9)

> You know the regexes better, so if you think this is the right fix, go for

I was thinking of ways to avoid duplication of pathterm chars, but anything that occurred to me would have just even further overcomplicated it. So I just left it unchanged.

> Would be nice to add a test for this too.

Added comments and unittests of course :)

Submitted.