After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 715126 - The libsoup content sniffer doesn't allow leading whitespace in doctype headers
The libsoup content sniffer doesn't allow leading whitespace in doctype headers
Status: RESOLVED FIXED
Product: libsoup
Classification: Core
Component: Misc
unspecified
Other Linux
: Normal normal
: ---
Assigned To: libsoup-maint@gnome.bugs
libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2013-11-24 23:19 UTC by Alberto Garcia
Modified: 2014-02-17 17:29 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch (1.05 KB, patch)
2013-11-24 23:38 UTC, Alberto Garcia
none Details | Review
Patch + test (2.12 KB, patch)
2013-11-28 21:29 UTC, Alberto Garcia
none Details | Review

Description Alberto Garcia 2013-11-24 23:19:06 UTC
Here's an example:

   http://www.jaguarusa.com/index.html

This document has three empty lines before the <!DOCTYPE html> header

However that seems to be allowed:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing
Comment 1 Alberto Garcia 2013-11-24 23:38:02 UTC
Created attachment 261379 [details] [review]
Patch
Comment 2 Alberto Garcia 2013-11-24 23:52:06 UTC
Note that I enlarged the mask and the pattern but I didn't touch the
pattern_length field.

As far as I can see by the way it is being used, it's not treated as
the length of the string but as the index of the last character.

Here's the condition:

    (index_pattern <= type_row->pattern_length)

In some of the patterns the value is incorrect. It was the case in the
DOCTYPE header, and it also happens in the PDF case.

There the pattern is "%PDF-" and pattern_length is 5. Then pattern[5]
is the null terminator. However, since mask[5] is also null, the
pattern matches.
Comment 3 Sergio Villar 2013-11-25 07:44:31 UTC
Your change definitely needs some tests, check sniffing-test.c in tests/ directory. I guess some of them will be broken by your change.
Comment 4 Alberto Garcia 2013-11-25 08:46:20 UTC
tests/sniffing-test seems to run fine, but I'm not sure if there's any
specific test to cover this case.
Comment 5 Gustavo Noronha (kov) 2013-11-25 22:34:46 UTC
Review of attachment 261379 [details] [review]:

::: libsoup/soup-content-sniffer.c
@@ +90,3 @@
+	{ TRUE,
+	  (const guchar *)"\xFF\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF",
+	  (const guchar *)" \x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C",

I often wonder if it wouldn't make sense to use the actual characters instead of their hex codes. Perhaps adding a test for this case would be a good idea, so it doesn't regress? Should be a matter of adding a file which starts with the doctype preceded by a space to tests/resources/ and then adding something like:

        test_sniffing ("/unknown/space-before-doctype.html", "text/html");

Maybe it makes sense adding another test path that sends no Content-Type header, but the unknown code path should be good too.
Comment 6 Alberto Garcia 2013-11-26 08:48:47 UTC
As far as I can see they were taken from the table here:

http://www.w3.org/TR/2009/WD-html5-20090212/infrastructure.html#content-type-sniffing:-unknown-type

Yes I think they can be replaced by actual characters, maybe we can
have assertions to make sure that the pattern and the mask have the
same length.

Related to that, I think the pattern_length field should be calculated
automatically rather than set manually. But this is probably a task
for a second patch.

About the new test case: yes, it totally makes sense. I can try to
write it this week unless someone else volunteers :)
Comment 7 Gustavo Noronha (kov) 2013-11-26 17:47:20 UTC
Yeah, they were taken from the precursor to that one, actually, the draft by adam barth that lived on ietf =)
Comment 8 Alberto Garcia 2013-11-28 21:29:54 UTC
Created attachment 263082 [details] [review]
Patch + test

Here's the new patch, this one includes a test for this case.

I'd also like to rewrite the sniffing code a bit, but that can go to a separate patch.
Comment 9 Dan Winship 2013-11-30 20:59:10 UTC
I think this is basically a dup of bug 648849, "update content sniffing algorithm to latest specification". I guess the HTML5 spec is the canonical version now, since it looks like the Internet Draft never made it to RFC.
Comment 10 Alberto Garcia 2013-12-03 22:24:00 UTC
(In reply to comment #9)
> I think this is basically a dup of bug 648849, "update content
> sniffing algorithm to latest specification". I guess the HTML5 spec
> is the canonical version now, since it looks like the Internet Draft
> never made it to RFC.

Yes, it looks like that. Should we close this one then or upload this
fix anyway and update the rest of the algorithm later? I'm not sure if
I'll have the time to work on that soon.
Comment 11 Gustavo Noronha (kov) 2013-12-04 10:58:00 UTC
Sounds like a task I could take on during the hackfest in a few days =)
Comment 12 Dan Winship 2013-12-04 12:32:07 UTC
(In reply to comment #11)
> Sounds like a task I could take on during the hackfest in a few days =)

agenda'd!
Comment 13 Gustavo Noronha (kov) 2013-12-10 11:38:10 UTC
Pushing my WIP to https://git.gnome.org/browse/libsoup/log/?h=content-sniffing-update
Comment 14 Gustavo Noronha (kov) 2013-12-11 13:10:40 UTC
OK, I think I'm happy with the branch now, would you like me to publish the branch as patches here? I incorporated the new test by Berto as a commit. What is missing according to the spec now is the "contextual" sniffing - we will need to add APIs to soup to let WebKit tell soup what context the sniffing should be done on, I guess (see http://mimesniff.spec.whatwg.org/#context-specific-sniffing).
Comment 15 Dan Winship 2014-02-17 17:29:50 UTC
merged to master