Bug 715126 – The libsoup content sniffer doesn't allow leading whitespace in doctype headers

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 715126 - The libsoup content sniffer doesn't allow leading whitespace in doctype headers


Summary:	The libsoup content sniffer doesn't allow leading whitespace in doctype headers


Status:	RESOLVED FIXED

Product:	libsoup
Classification:	Core
Component:	Misc
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	libsoup-maint@gnome.bugs
QA Contact:	libsoup-maint@gnome.bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-11-24 23:19 UTC by Alberto Garcia
Modified:	2014-02-17 17:29 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch (1.05 KB, patch) 2013-11-24 23:38 UTC, Alberto Garcia	none	Details \| Review
Patch + test (2.12 KB, patch) 2013-11-28 21:29 UTC, Alberto Garcia	none	Details \| Review

Description Alberto Garcia 2013-11-24 23:19:06 UTC

Here's an example:

   http://www.jaguarusa.com/index.html

This document has three empty lines before the <!DOCTYPE html> header

However that seems to be allowed:

   http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing

Comment 1 Alberto Garcia 2013-11-24 23:38:02 UTC

Created attachment 261379 [details] [review]
Patch

Comment 2 Alberto Garcia 2013-11-24 23:52:06 UTC

Note that I enlarged the mask and the pattern but I didn't touch the
pattern_length field.

As far as I can see by the way it is being used, it's not treated as
the length of the string but as the index of the last character.

Here's the condition:

    (index_pattern <= type_row->pattern_length)

In some of the patterns the value is incorrect. It was the case in the
DOCTYPE header, and it also happens in the PDF case.

There the pattern is "%PDF-" and pattern_length is 5. Then pattern[5]
is the null terminator. However, since mask[5] is also null, the
pattern matches.

Comment 3 Sergio Villar 2013-11-25 07:44:31 UTC

Your change definitely needs some tests, check sniffing-test.c in tests/ directory. I guess some of them will be broken by your change.

Comment 4 Alberto Garcia 2013-11-25 08:46:20 UTC

tests/sniffing-test seems to run fine, but I'm not sure if there's any
specific test to cover this case.

Comment 5 Gustavo Noronha (kov) 2013-11-25 22:34:46 UTC

Review of attachment 261379 [details] [review]:

::: libsoup/soup-content-sniffer.c
@@ +90,3 @@
+	{ TRUE,
+	  (const guchar *)"\xFF\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF",
+	  (const guchar *)" \x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C",

I often wonder if it wouldn't make sense to use the actual characters instead of their hex codes. Perhaps adding a test for this case would be a good idea, so it doesn't regress? Should be a matter of adding a file which starts with the doctype preceded by a space to tests/resources/ and then adding something like:

        test_sniffing ("/unknown/space-before-doctype.html", "text/html");

Maybe it makes sense adding another test path that sends no Content-Type header, but the unknown code path should be good too.

Comment 6 Alberto Garcia 2013-11-26 08:48:47 UTC

As far as I can see they were taken from the table here:

http://www.w3.org/TR/2009/WD-html5-20090212/infrastructure.html#content-type-sniffing:-unknown-type

Yes I think they can be replaced by actual characters, maybe we can
have assertions to make sure that the pattern and the mask have the
same length.

Related to that, I think the pattern_length field should be calculated
automatically rather than set manually. But this is probably a task
for a second patch.

About the new test case: yes, it totally makes sense. I can try to
write it this week unless someone else volunteers :)

Comment 7 Gustavo Noronha (kov) 2013-11-26 17:47:20 UTC

Yeah, they were taken from the precursor to that one, actually, the draft by adam barth that lived on ietf =)

Comment 8 Alberto Garcia 2013-11-28 21:29:54 UTC

Created attachment 263082 [details] [review]
Patch + test

Here's the new patch, this one includes a test for this case.

I'd also like to rewrite the sniffing code a bit, but that can go to a separate patch.

Comment 9 Dan Winship 2013-11-30 20:59:10 UTC

I think this is basically a dup of bug 648849, "update content sniffing algorithm to latest specification". I guess the HTML5 spec is the canonical version now, since it looks like the Internet Draft never made it to RFC.

Comment 10 Alberto Garcia 2013-12-03 22:24:00 UTC

(In reply to comment #9)
> I think this is basically a dup of bug 648849, "update content
> sniffing algorithm to latest specification". I guess the HTML5 spec
> is the canonical version now, since it looks like the Internet Draft
> never made it to RFC.

Yes, it looks like that. Should we close this one then or upload this
fix anyway and update the rest of the algorithm later? I'm not sure if
I'll have the time to work on that soon.

Comment 11 Gustavo Noronha (kov) 2013-12-04 10:58:00 UTC

Sounds like a task I could take on during the hackfest in a few days =)

Comment 12 Dan Winship 2013-12-04 12:32:07 UTC

(In reply to comment #11)
> Sounds like a task I could take on during the hackfest in a few days =)

agenda'd!

Comment 13 Gustavo Noronha (kov) 2013-12-10 11:38:10 UTC

Pushing my WIP to https://git.gnome.org/browse/libsoup/log/?h=content-sniffing-update

Comment 14 Gustavo Noronha (kov) 2013-12-11 13:10:40 UTC

OK, I think I'm happy with the branch now, would you like me to publish the branch as patches here? I incorporated the new test by Berto as a commit. What is missing according to the spec now is the "contextual" sniffing - we will need to add APIs to soup to let WebKit tell soup what context the sniffing should be done on, I guess (see http://mimesniff.spec.whatwg.org/#context-specific-sniffing).

Comment 15 Dan Winship 2014-02-17 17:29:50 UTC

merged to master