GNOME Bugzilla – Bug 448044
Problem with apostrophe in URL
Last modified: 2018-01-01 15:01:01 UTC
[ Forwarded from http://bugs.debian.org/426592 ] Put "http://en.wikipedia.org/wiki/Moore's_law" in a terminal without the quotes. Move cursor over URL. "'s_law" is not part of the URL according to gnome-terminal. This is a flaw.
Well, it's a heuristic: if we allow the single quote, then a URL inside a quotation goes wrong. We need to make the cut somewhere, and leaving the single quote out made more sense previously. Times may have changed though, with things like Wikipedia, I agree.
single quote would usually be escaped in an url wouldn't it?
That was what I first thought, but Firefox seems to happily keep it. Same for a bunch of others, like parentheses. It does escape the backslash, for example. So it seems apps have become more relaxed these days.
Obviously it's not possible to match every *intended* URL in unstructured text, but I think all truly valid ones (per the URL RFC) should be matched as a baseline. From my perspective, treating single quotes and parentheses in a special way results in far more URLs being matched incorrectly than the alternative. The older RFC suggested surrounding URLs in unstructured text with < and >, which is exactly what reStructuredText and many IRC bots do.

According to the URL RFC, all of these are valid characters to leave unencoded in a URL (but can be used with a special meaning within the URL scheme):

gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Along with the fully unreserved characters:

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

https://www.ietf.org/rfc/rfc1738.txt
https://tools.ietf.org/html/rfc3986
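As a rough illustration of that baseline (not the terminal's actual pattern), here is a minimal Python sketch that accepts every character RFC 3986 lists as gen-delims, sub-delims or unreserved, plus "%" for percent-encoding. The name BASELINE_URL and the http/https-only scheme are assumptions made for the example.

```python
import re
import string

# Character sets copied from the RFC 3986 grammar quoted above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# Hypothetical baseline matcher: a scheme followed by any run of characters
# that may appear unencoded in a URL, plus "%" for percent-encoded octets.
ALLOWED = re.escape(GEN_DELIMS + SUB_DELIMS + UNRESERVED + "%")
BASELINE_URL = re.compile(r"https?://[" + ALLOWED + r"]+")

def first_url(text):
    """Return the first URL-like span in text, or None."""
    m = BASELINE_URL.search(text)
    return m.group(0) if m else None

# The apostrophe inside the Wikipedia URL is kept ...
print(first_url("see http://en.wikipedia.org/wiki/Moore's_law now"))
# ... but so is the closing quote of a quoted URL, which is the trade-off
# the heuristic was trying to avoid.
print(first_url("'http://example.com'"))
```

The second call shows why the baseline alone is not enough: the match swallows the closing quote of 'http://example.com'.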
Still happening with the regex rewrite from bug 756038.
Apostrophes are not uncommon at the end of URLs either, e.g. https://en.wikipedia.org/wiki/Cryin'

In Chromium, go to wikipedia.org, type Cryin' and press Enter. The URL bar will show Cryin%27 and it copy-pastes accordingly. Go to wikipedia.org again, type Cryin' again, and this time choose the first autocomplete dropdown entry with the mouse. The URL bar will say Cryin' and it'll copy-paste accordingly, with an apostrophe. (You're taken to the same page. I haven't compared the HTTP traffic.)

In Firefox, the URL bar shows an apostrophe but the address copy-pastes as %27. That is as long as you don't tamper with the URL bar. Retype any part of the URL (either the apostrophe or an irrelevant segment) and from then on it's an apostrophe. Then try to undo it: remove the apostrophe from the URL and start typing %27. As soon as you press %, it recognizes that it has the apostrophe version in its history and replaces the % with that ', so you cannot type %27.

Double quotes seem to always get copy-pasted from browsers' URL bar as %22, so hopefully we don't need to worry about them. Wikipedia doesn't like them either, e.g. 12" redirects to 12-inch_single, etc.

Well, I was copying the entire URL in all these examples. Copying a part of one behaves differently: then it's the visible string that's copied.

Such a freaking mess... I'm cryin'...

Anyway, our takeaway is that we should probably add support for apostrophes (and double quotes probably aren't needed). At the same time, it's essential for URLs within single quotes, e.g. 'http://example.com', not to grab the trailing apostrophe. The balanced pair approach (bug 763980) obviously cannot work.

I can see two possible approaches:

- Have a branch at the outermost level, pretty much duplicating the entire big regex: one variant that doesn't allow embedded apostrophes, OR another variant that allows them but begins with a lookbehind asserting that there isn't a preceding apostrophe.
- Regex conditionals, http://www.rexegg.com/regex-conditionals.html. At the beginning, "define a variable" containing whether there's a preceding apostrophe (a named capture group doing lookbehind, or something like that) and then, based on this variable, do small local inner branches to allow/forbid apostrophes.
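To make the first approach concrete, here is a minimal Python sketch of an outermost branch over two near-duplicate bodies. The bodies are hypothetical, heavily simplified stand-ins for the real big regex (square brackets are left out, and only http/https is handled).

```python
import re

# Hypothetical simplified URL bodies; the only difference is the apostrophe.
BODY_WITH_APOS = r"https?://[A-Za-z0-9._~:/?#@!$&'()*+,;=%-]+"
BODY_NO_APOS = r"https?://[A-Za-z0-9._~:/?#@!$&()*+,;=%-]+"

# Outermost branch: allow apostrophes only when the URL is NOT preceded by
# one; otherwise fall back to the variant that stops at the first apostrophe.
URL = re.compile(r"(?<!')" + BODY_WITH_APOS + "|" + BODY_NO_APOS)

def first_url(text):
    """Return the first URL-like span in text, or None."""
    m = URL.search(text)
    return m.group(0) if m else None

print(first_url("http://en.wikipedia.org/wiki/Moore's_law"))  # apostrophe kept
print(first_url("'http://example.com'"))                      # closing quote dropped
```

The cost is visible even in this toy version: the whole body is written twice, and a quoted URL that also contains an embedded apostrophe is cut at that apostrophe.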
A third one, which sounds simpler:

- Using regex conditionals, check at the very beginning whether there's a lookbehind apostrophe, and if so, require at the end that there isn't a lookbehind (i.e. trailing) apostrophe. (If there is one, backtracking will leave it out but match the rest, I hope.)
Created attachment 365673
v0

Here's a draft patch that seems to be working. It's based on the 3rd approach.

At the beginning there's a named capturing group APOS_START with a lookbehind to see whether there's a leading apostrophe. The whole thing is made optional because we mustn't bail out if there isn't one.

At the end, lookbehind + backtracking cannot work because backtracking works on the level of regex blocks, not individual characters. A single backtracking step decides to omit the whole optional URLPATH. Maybe the regex could be reworked so that it works, but that doesn't look easy.

Instead, luckily, we already define a different set of characters that can terminate the path (to exclude dot and comma, and I even sneaked in semicolon here yesterday). The path needs to end in one of these, unless it ends in a closing parenthesis or square bracket (or is empty), in which case we don't care about the apostrophe here at all. So here we branch on whether the optional opening apostrophe's named capturing group matched or not, and depending on that, we forbid or allow the apostrophe.
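The idea in this comment can be sketched in Python roughly as follows. This is not the attached patch: only the group name APOS_START comes from the comment, while the scheme, host and path fragments are hypothetical simplifications, and the closing parenthesis / square bracket exception is left out.

```python
import re

# Set only when an apostrophe immediately precedes the URL. The empty
# alternative keeps the match going when there is none.
APOS_START = r"(?:(?P<APOS_START>(?<='))|)"

# Hypothetical simplified fragments (assumptions, not the real pattern).
SCHEME_HOST = r"https?://[A-Za-z0-9.-]+"
PATH_CHARS = r"[A-Za-z0-9_~:/?#@!$&'()*+,;=%.-]"
# Characters that may END the path: no dot, comma or semicolon.
PATH_TERM_WITH_APOS = r"[A-Za-z0-9_~/#@$&'*+=%-]"
PATH_TERM_NO_APOS = r"[A-Za-z0-9_~/#@$&*+=%-]"

# Conditional: if APOS_START matched, the terminating character must not be
# an apostrophe; otherwise an apostrophe may end the path.
PATH = (
    r"(?:/(?:" + PATH_CHARS + r"*"
    + r"(?(APOS_START)" + PATH_TERM_NO_APOS + r"|" + PATH_TERM_WITH_APOS + r")"
    + r")?)?"
)

URL = re.compile(APOS_START + SCHEME_HOST + PATH)

def first_url(text):
    """Return the first URL-like span in text, or None."""
    m = URL.search(text)
    return m.group(0) if m else None

print(first_url("http://en.wikipedia.org/wiki/Moore's_law"))
print(first_url("https://en.wikipedia.org/wiki/Cryin'"))
print(first_url("'http://example.com/it's'"))
```

In this sketch an embedded apostrophe survives in both cases; only the final character of the path is affected by whether the URL was opened with an apostrophe.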
Comment on attachment 365673
v0

You know the regexes better, so if you think this is the right fix, go for it :-)

Would be nice to add a test for this too. Thanks!
(In reply to Christian Persch from comment #9)
> You know the regexes better, so if you think this is the right fix, go for

I was thinking of ways to avoid duplicating the pathterm chars, but anything that occurred to me would have just overcomplicated it even further. So I left it unchanged.

> Would be nice to add a test for this too.

Added comments and unit tests of course :) Submitted.