GNOME Bugzilla – Bug 724724
404 on gitweb pages, semicolons in URL?
Last modified: 2014-05-21 12:14:43 UTC
I get a 404 error page when I go to a URL such as:

http://git.savannah.gnu.org/gitweb/?p=grep.git;a=commit;h=97318f5e59a1ef6feb8a378434a00932a3fc1e0b

This works fine with GtkLauncher (and other browsers). For an unknown reason (I tried tracing it without success) the semicolon gets URL-encoded before the request is sent to the server.

--- /tmp/trace-ephy	2014-02-19 13:07:15.047673066 +0100
+++ /tmp/trace-gtklauncher	2014-02-19 13:51:03.364706177 +0100
@@ -1,21 +1,20 @@
-GET /gitweb/?p=grep.git%3Ba%3Dcommit%3Bh%3D97318f5e59a1ef6feb8a378434a00932a3fc1e0b HTTP/1.1
+GET /gitweb/?p=grep.git;a=commit;h=97318f5e59a1ef6feb8a378434a00932a3fc1e0b HTTP/1.1
 Host: git.savannah.gnu.org
 Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
-User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.15 (KHTML, like Gecko) Safari/538.15 Epiphany/3.11.4
-DNT: 1
+User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.15 (KHTML, like Gecko) Safari/538.15
 Accept-Encoding: gzip, deflate
-Accept-Language: fr, en;q=0.90
 Connection: Keep-Alive

-HTTP/1.1 404 Not Found
-Date: Wed, 19 Feb 2014 12:06:00 GMT
+HTTP/1.1 200 OK
+Date: Wed, 19 Feb 2014 12:50:38 GMT
 Server: Apache/2.2.16 (Debian)
+Expires: Thu, 20 Feb 2014 12:50:38 GMT
 Keep-Alive: timeout=15, max=100
 Connection: Keep-Alive
 Transfer-Encoding: chunked
 Content-Type: application/xhtml+xml; charset=utf-8
Works for me with 3.11.90 and wkgtk 2.3.90.
Created attachment 269814 [details] [review] do not encode ; while transforming URL to avoid trackers
Created attachment 269816 [details] [review] do not encode valid query characters while transforming URL to avoid trackers
After investigating this further and going through WebKit, libsoup, and other friends, I came back to it, tried it with a different user, and it worked; so that was a config thing. Once that was out of the way it was quick to pin down: this is caused by the removal of trackers from the URI when you have "Do Not Track" enabled. I added a few characters that are valid in query parameters and this is now OK (this also fixes bug 724588).
Review of attachment 269816 [details] [review]:

Good catch!

::: lib/ephy-uri-helpers.c
@@ +82,3 @@
       g_string_append_c (str, '+');
       s++;
+    } else if (*s != ';' && *s != '/' && *s != ':' && !g_ascii_isalnum (*s))

This code is copied from soup, and the expression in soup is:

  else if (!g_ascii_isalnum (*s) && (*s != '-') && (*s != '_') && (*s != '.'))

which makes me also wonder whether we could use the soup API instead of copying the code.
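The two escape predicates being compared can be sketched side by side as plain-C stand-ins (function names are mine, not from the patch or libsoup): the patch additionally passes ';', '/' and ':' through unescaped, while soup's rule instead exempts '-', '_' and '.', so the two disagree on all six of those characters.

```c
#include <assert.h>
#include <ctype.h>

/* Stand-in for the patch's predicate: escape anything non-alphanumeric
 * except ';', '/' and ':'. */
static int ephy_patch_escapes(int c)
{
    return c != ';' && c != '/' && c != ':' && !isalnum(c);
}

/* Stand-in for libsoup's HTML5 form rule: escape anything
 * non-alphanumeric except '-', '_' and '.'. */
static int soup_escapes(int c)
{
    return !isalnum(c) && c != '-' && c != '_' && c != '.';
}
```

Note the asymmetry: the patch would still escape '-', '_' and '.', which the HTML5 form rule keeps literal, and soup would still escape ';', '/' and ':', which the patch keeps literal.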
Probably the code isn't using the libsoup methods because soup_form_decode() returns a GHashTable, and some sites get annoyed if you rearrange the form elements. On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it correctly and in order. So if you made query_decode() return a GData, then you could use soup_form_encode_datalist() on the encoding side. (And that would suggest that we should probably have soup_form_decode_datalist() too...)
(In reply to comment #6)
> Probably the code isn't using the libsoup methods because soup_form_decode()
> returns a GHashTable, and some sites get annoyed if you rearrange the form
> elements.

It's mostly the same code, just adapted a bit. I wonder why we need to ignore ';', '/' and ':' while soup is ignoring '-', '_' and '.'.

> On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it
> correctly and in order. So if you made query_decode() return a GData, then you
> could use soup_form_encode_datalist() on the encoding side. (And that would
> suggest that we should probably have soup_form_decode_datalist() too...)
Created attachment 269931 [details] [review]
do not touch query parameter values when removing trackers

Actually the previous patch didn't work as expected. Thinking it over: since we are not considering the value parts at all, this new patch simply doesn't do any decoding/re-encoding of those elements.
Review of attachment 269931 [details] [review]: Looks good. I'm glad there's a test suite for those ;) I would make a note in the QueryItem struct that name is unescaped in query_decode() with form_decode(), and that value is as-is, still escaped.
(In reply to comment #6)
> Probably the code isn't using the libsoup methods because soup_form_decode()
> returns a GHashTable, and some sites get annoyed if you rearrange the form
> elements.

Yes, that's exactly it.

> On the encoding side, soup_form_encode_datalist() takes a GData, and encodes it
> correctly and in order. So if you made query_decode() return a GData, then you
> could use soup_form_encode_datalist() on the encoding side. (And that would
> suggest that we should probably have soup_form_decode_datalist() too...)

That would be great as well, certainly. To be fair, I probably glanced at the g_datalist API and wondered how to use it...
Created attachment 269941 [details] [review]
do not touch query parameter values when removing trackers

Same patch, with a comment added before QueryItem:

+/* QueryItem holds a query parameter name/value pair. The name is unescaped in
+ * query_decode() with form_decode(), the value is not altered. */
Review of attachment 269941 [details] [review]: Looks good!
Attachment 269941 [details] pushed as 77a7fd4 - do not touch query parameter values when removing trackers
(In reply to comment #7)
> It's mostly the same code, just adapted a bit. I wonder why we need to ignore
> ';', '/' and ':' while soup is ignoring '-', '_' and '.'.

libsoup used to just check !g_ascii_isalnum(); the -/_/. check was added later, when we ported the code from the HTML4 form-encoding rules to the HTML5 form-encoding rules. You should probably add those characters to your exceptions list, because there are sites that will fail if you don't.

Other than that, it may be because your code is trying to preserve the URL mostly unchanged (after decoding, editing, and re-encoding), and it was just determined empirically that some sites include unencoded ';', '/' and ':' in their queries and fail if you encode them while regenerating the query. OTOH there are probably sites that fail in the opposite direction too.

I think that ideally your code would split up the query without doing any %-decoding, decode each parameter name only for the purpose of comparing it against the list of parameters to remove, and then reassemble the query from the remaining original pieces (rather than decoding and re-encoding them).
I've implemented comment 14's approach in https://bugzilla.gnome.org/show_bug.cgi?id=730464#c5