GNOME Bugzilla – Bug 667102
It Seems All Non Space Characters Are Considered Valid URL Characters
Last modified: 2015-05-09 20:46:39 UTC
I stumbled upon an NFO file where a ░ character (U+2591) immediately followed a URL and was treated as a valid URL character.
This is true. Lines are split into words on spaces, and a word is considered a URL if it matches the following regular expression:

r"(([0-9a-zA-Z]+://\S+?\.\S+)|(www\.\S+?\.\S+))"

Sometimes, although rarely, there will be errors like the one you have now stumbled on. I'm open to suggestions on how to fix this problem. A strict list of characters allowed in URLs by some specification is not enough, since most browsers are "smart" and often implicitly convert special characters to their %-escaped equivalents, and authors often rely on those conversions being available.
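To make the reported behaviour concrete, here is a minimal sketch that applies the regular expression quoted above to a line containing U+2591 directly after a URL. The helper name `find_urls` and the use of `re.match` on each space-separated word are assumptions for illustration; only the pattern itself comes from this report.

```python
import re

# Pattern quoted in the comment above.
URL_PATTERN = re.compile(r"(([0-9a-zA-Z]+://\S+?\.\S+)|(www\.\S+?\.\S+))")


def find_urls(line):
    """Return the space-separated words of `line` that match URL_PATTERN.

    Hypothetical helper: the splitting and matching details are assumed,
    not taken from the application's source.
    """
    return [word for word in line.split(" ") if URL_PATTERN.match(word)]


# U+2591 (LIGHT SHADE) is not whitespace, so \S+ consumes it as part
# of the URL.
line = "Visit http://example.com/\u2591\u2591 now"
urls = find_urls(line)
print(urls)
```

Because `\S` accepts any non-whitespace character, the shading characters end up inside the detected URL.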
Maybe it's best to provide some kind of a limited list of characters after all, we'll see how it works.

This problem has been fixed in the unstable development version. The fix will be available in the next major software release. You may need to upgrade your Linux distribution to obtain that newer version.

commit 0ef94eb8b0d46fedfe5b224ea40df92315d8040a
Author: Osmo Salomaa <otsaloma@iki.fi>
Date: Sat May 9 23:42:37 2015 +0300

Improve URL detection.

https://bugzilla.gnome.org/show_bug.cgi?id=667102

https://github.com/otsaloma/nfoview/commit/0ef94eb8b0d46fedfe5b224ea40df92315d8040a
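The "limited list of characters" idea could look roughly like the sketch below. This is not the code from the commit referenced above, whose contents are not shown in this report. The character set (ASCII letters, digits, "%", and the reserved and unreserved punctuation of RFC 3986) and the names `URL_CHARS` and `extract_url` are assumptions chosen for illustration.

```python
import re

# Hypothetical allow-list: ASCII letters and digits, "%", and the
# reserved/unreserved punctuation of RFC 3986. An assumption for this
# sketch, not the list used by the actual fix.
URL_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%"

# Same overall shape as the original pattern, with \S replaced by the
# allow-list above.
URL_PATTERN = re.compile(
    r"(?:[0-9a-zA-Z]+://|www\.)[{0}]+?\.[{0}]+".format(URL_CHARS)
)


def extract_url(word):
    """Return the leading URL in `word`, or None if there is none."""
    match = URL_PATTERN.match(word)
    return match.group(0) if match else None


# The match now stops at the first character outside the allow-list.
print(extract_url("http://example.com/\u2591\u2591"))
```

The trade-off raised earlier in this thread still applies: any URL written with characters outside the list, such as unescaped non-ASCII characters, is cut short at the first such character.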