GNOME Bugzilla – Bug 724724
404 on gitweb pages, semicolons in URL?
Last modified: 2014-05-21 12:14:43 UTC
I get a 404 error page when I go to a URL such as:

http://git.savannah.gnu.org/gitweb/?p=grep.git;a=commit;h=97318f5e59a1ef6feb8a378434a00932a3fc1e0b

This works fine with GtkLauncher (and other browsers). For an unknown reason (I tried tracing it without success) the semicolon gets URL-encoded before the request is sent to the server.

--- /tmp/trace-ephy	2014-02-19 13:07:15.047673066 +0100
+++ /tmp/trace-gtklauncher	2014-02-19 13:51:03.364706177 +0100
@@ -1,21 +1,20 @@
-GET /gitweb/?p=grep.git%3Ba%3Dcommit%3Bh%3D97318f5e59a1ef6feb8a378434a00932a3fc1e0b HTTP/1.1
+GET /gitweb/?p=grep.git;a=commit;h=97318f5e59a1ef6feb8a378434a00932a3fc1e0b HTTP/1.1
 Host: git.savannah.gnu.org
 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
-User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.15 (KHTML, like Gecko) Safari/538.15 Epiphany/3.11.4
-DNT: 1
+User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.15 (KHTML, like Gecko) Safari/538.15
 Accept-Encoding: gzip, deflate
-Accept-Language: fr, en;q=0.90
 Connection: Keep-Alive

-HTTP/1.1 404 Not Found
-Date: Wed, 19 Feb 2014 12:06:00 GMT
+HTTP/1.1 200 OK
+Date: Wed, 19 Feb 2014 12:50:38 GMT
 Server: Apache/2.2.16 (Debian)
+Expires: Thu, 20 Feb 2014 12:50:38 GMT
 Keep-Alive: timeout=15, max=100
 Connection: Keep-Alive
 Transfer-Encoding: chunked
 Content-Type: application/xhtml+xml; charset=utf-8
Works for me with 3.11.90 and wkgtk 2.3.90.
Created attachment 269814 [details] [review] do not encode ; while transforming URL to avoid trackers
Created attachment 269816 [details] [review] do not encode valid query characters while transforming URL to avoid trackers
After investigating this further and going through WebKit, libsoup, and other friends, I came back to it, tried it with a different user, and it worked; so that was a config thing. Once that was out of the way it was quick to pin down: this is caused by the removal of trackers from the URI when you have "Do Not Track" enabled. I added a few characters that are valid in query parameters and this is now OK (this also fixes bug 724588).
Review of attachment 269816 [details] [review]:

Good catch!

::: lib/ephy-uri-helpers.c
@@ +82,3 @@
       g_string_append_c (str, '+');
       s++;
+    } else if (*s != ';' && *s != '/' && *s != ':' && !g_ascii_isalnum (*s))

This code is copied from soup, and the expression in soup is:

  else if (!g_ascii_isalnum (*s) && (*s != '-') && (*s != '_') && (*s != '.'))

which makes me also wonder whether we could use the soup API instead of copying the code.
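The two escape predicates being compared can be sketched side by side as plain-C stand-ins (function names are mine, not from the patch or libsoup): the patch additionally passes ';', '/' and ':' through unescaped, while soup's rule instead exempts '-', '_' and '.', so the two disagree on all six of those characters.

```c
#include <assert.h>
#include <ctype.h>

/* Stand-in for the patch's predicate: escape anything non-alphanumeric
 * except ';', '/' and ':'. */
static int ephy_patch_escapes(int c)
{
    return c != ';' && c != '/' && c != ':' && !isalnum(c);
}

/* Stand-in for libsoup's HTML5 form rule: escape anything
 * non-alphanumeric except '-', '_' and '.'. */
static int soup_escapes(int c)
{
    return !isalnum(c) && c != '-' && c != '_' && c != '.';
}
```

Note the asymmetry: the patch would still escape '-', '_' and '.', which the HTML5 form rule keeps literal, and soup would still escape ';', '/' and ':', which the patch keeps literal.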
Probably the code isn't using the libsoup methods because soup_form_decode() returns a GHashTable, and some sites get annoyed if you rearrange the form elements. On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it correctly and in order. So if you made query_decode() return a GData, then you could use soup_form_encode_datalist() on the encoding side. (And that would suggest that we should probably have soup_form_decode_datalist() too...)
(In reply to comment #6)
> Probably the code isn't using the libsoup methods because soup_form_decode()
> returns a GHashTable, and some sites get annoyed if you rearrange the form
> elements.

It's mostly the same code, just adapted a bit. I wonder why we need to ignore ';', '/' and ':' while soup is ignoring '-', '_' and '.'.

> On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it
> correctly and in order. So if you made query_decode() return a GData, then you
> could use soup_form_encode_datalist() on the encoding side. (And that would
> suggest that we should probably have soup_form_decode_datalist() too...)
Created attachment 269931 [details] [review]
do not touch query parameter values when removing trackers

Actually the previous patch didn't work as expected. Thinking it over: since we are not considering the value parts at all, this new patch simply doesn't do any decoding/re-encoding of those elements.
Review of attachment 269931 [details] [review]: Looks good. I'm glad there's a test suite for those ;) I would make a note in the QueryItem struct that name is unescaped in query_decode() with form_decode(), and that value is as-is, still escaped.
(In reply to comment #6)
> Probably the code isn't using the libsoup methods because soup_form_decode()
> returns a GHashTable, and some sites get annoyed if you rearrange the form
> elements.

Yes, that's exactly it.

> On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it
> correctly and in order. So if you made query_decode() return a GData, then you
> could use soup_form_encode_datalist() on the encoding side. (And that would
> suggest that we should probably have soup_form_decode_datalist() too...)

That would be great as well, certainly. To be fair, I probably glanced at the g_datalist API and wondered how to use it...
Created attachment 269941 [details] [review]
do not touch query parameter values when removing trackers

Same patch, with a comment added before QueryItem:

+/* QueryItem holds a query parameter name/value pair. The name is unescaped in
+ * query_decode() with form_decode(), the value is not altered. */
Review of attachment 269941 [details] [review]: Looks good!
Attachment 269941 [details] pushed as 77a7fd4 - do not touch query parameter values when removing trackers
(In reply to comment #7)
> It's mostly the same code, just adapted a bit. I wonder why we need to ignore
> ';', '/' and ':' while soup is ignoring '-', '_' and '.'.

libsoup used to just check !g_ascii_isalnum(); the -/_/. check was added later, when we ported the code from the HTML4 form-encoding rules to the HTML5 form-encoding rules. You should probably add those characters to your exceptions list, because there are sites that will fail if you don't.

Other than that, it may be because your code is trying to preserve the URL mostly unchanged (after decoding, editing, and re-encoding), and it was just determined empirically that some sites include unencoded ';', '/' and ':' in their queries and fail if you encode them while regenerating the query. OTOH there are probably sites that fail in the opposite direction too.

I think that ideally your code would split up the query without doing any %-decoding, decode each parameter name only for the purpose of comparing it against the list of parameters to remove, and then reassemble the query from the remaining original pieces (rather than decoding and re-encoding them).
I've implemented comment 14's approach in https://bugzilla.gnome.org/show_bug.cgi?id=730464#c5