GNOME Bugzilla – Bug 715126
The libsoup content sniffer doesn't allow leading whitespace in doctype headers
Last modified: 2014-02-17 17:29:50 UTC
Here's an example: http://www.jaguarusa.com/index.html This document has three empty lines before the <!DOCTYPE html> header However that seems to be allowed: http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing
Created attachment 261379 [details] [review] Patch
Note that I enlarged the mask and the pattern but I didn't touch the pattern_length field. As far as I can see by the way it is being used, it's not treated as the length of the string but as the index of the last character. Here's the condition: (index_pattern <= type_row->pattern_length) In some of the patterns the value is incorrect. It was the case in the DOCTYPE header, and it also happens in the PDF case. There the pattern is "%PDF-" and pattern_length is 5. Then pattern[5] is the null terminator. However, since mask[5] is also null, the pattern matches.
Your change definitely needs some tests, check sniffing-test.c in tests/ directory. I guess some of them will be broken by your change.
tests/sniffing-test seems to run fine, but I'm not sure if there's any specific test to cover this case.
Review of attachment 261379 [details] [review]: ::: libsoup/soup-content-sniffer.c @@ +90,3 @@ + { TRUE, + (const guchar *)"\xFF\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF", + (const guchar *)" \x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C", I often wonder if it wouldn't make sense to use the actual characters instead of their hex codes. Perhaps adding a test for this case would be a good idea, so it doesn't regress? Should be a matter of adding a file which starts with the doctype preceded by a space to tests/resources/ and then adding something like: test_sniffing ("/unknown/space-before-doctype.html", "text/html"); Maybe it makes sense adding another test path that sends no Content-Type header, but the unknown code path should be good too.
As far as I can see they were taken from the table here: http://www.w3.org/TR/2009/WD-html5-20090212/infrastructure.html#content-type-sniffing:-unknown-type Yes I think they can be replaced by actual characters, maybe we can have assertions to make sure that the pattern and the mask have the same length. Related to that, I think the pattern_length field should be calculated automatically rather than set manually. But this is probably a task for a second patch. About the new test case: yes, it totally makes sense. I can try to write it this week unless someone else volunteers :)
Yeah, they were taken from the precursor to that one, actually, the draft by adam barth that lived on ietf =)
Created attachment 263082 [details] [review] Patch + test Here's the new patch, this one includes a test for this case. I'd also like to rewrite the sniffing code a bit, but that can go to a separate patch.
I think this is basically a dup of bug 648849, "update content sniffing algorithm to latest specification". I guess the HTML5 spec is the canonical version now, since it looks like the Internet Draft never made it to RFC.
(In reply to comment #9) > I think this is basically a dup of bug 648849, "update content > sniffing algorithm to latest specification". I guess the HTML5 spec > is the canonical version now, since it looks like the Internet Draft > never made it to RFC. Yes, it looks like that. Should we close this one then or upload this fix anyway and update the rest of the algorithm later? I'm not sure if I'll have the time to work on that soon.
Sounds like a task I could take on during the hackfest in a few days =)
(In reply to comment #11) > Sounds like a task I could take on during the hackfest in a few days =) agenda'd!
Pushing my WIP to https://git.gnome.org/browse/libsoup/log/?h=content-sniffing-update
OK, I think I'm happy with the branch now, would you like me to publish the branch as patches here? I incorporated the new test by Berto as a commit. What is missing according to the spec now is the "contextual" sniffing - we will need to add APIs to soup to let WebKit tell soup what context the sniffing should be done on, I guess (see http://mimesniff.spec.whatwg.org/#context-specific-sniffing).
merged to master