GNOME Bugzilla – Bug 648849
update content sniffing algorithm to latest specification
Last modified: 2014-02-17 17:30:09 UTC
Hi, when content sniffing was written, specification was in its first draft[1]. Currently, specification is in it's 6th draft[2]. There have been some minor changes since version 1. May be libsoup could be updated to match current revision. [1]: http://tools.ietf.org/html/draft-abarth-mime-sniff-01 [2]: http://tools.ietf.org/html/draft-abarth-mime-sniff-06
Created attachment 186894 [details] [review] patch v1 Hi, here is a patch proposal. It applies on top of https://bugzilla.gnome.org/review?bug=648846&attachment=186889 I'll change this patch depending on what's decided for bug 648846. In addition to reflecting specification changes, I've also fixed a few bugs I discovered in current sniffing algorithm. Tell me if I should better split these part into another bug. Here is an explaination of the patch ==== specification modifications ==== * some media types added in types_table[] (such as application/x-rar-compressed or image/webp * handling of "SB" (space-or-bracket") octet: allow a html tag to terminate either with a space or a > * detection of rdf feeds as "application/rss+xml" ==== fixes in current code ==== * some html patterns were not managed in types_table (such as <BODY or <P * test added for <B pattern * in Text or Binary, point 3 of algorithm tells: If n is greater than or equal to 3, while code checks for if (resource_length >= 4) { I'm not sure that's really important * adding a test for special case "image/svg+xml". It's currently handled correctly part of the code handling xml, but in case of refactoring, we want to make sure, that image/svg+xml is still handled correctly. * handling of html comments did not work at all. I fixed it and added comments nodes in tests rss20.xml and atom.xml: with those modifications, test fail without this patch. * handling of ?> was also wrong. It recognize ?X or X> (|| test instead of && test)
Comment on attachment 186894 [details] [review] patch v1 OK, notes based on comparison with the spec: It should probably be noted in a comment that (a) soup_message_headers_get_content_type() does look at the last Content-Type header if there is more than one, as required by Section 2, and (b) it does ignore the header if it can't be parsed correctly, as also required by Section 2. sniffing-test doesn't test the "syntactically-incorrect Content-Type header" case though, and it should. Given that Section 3 is titled "Web Pages", it is perhaps implied that those rules are only to be used for URLs that are believed to refer to web pages, and that, eg, a URL from an IMG tag ought to jump straight to the "Image" rules? (If this was not the case, then there'd be no need for the image/svg+xml clarification at the start of the "Image" section.) OTOH, this would imply that in some cases, copying the URL out of an IMG tag and pasting it into the location bar would result in a non-image being loaded... I think this requires clarification. Section 3 step 2 only calls for going to the "text or binary" section if the complete text of the Content-Type header matches one of those strings, but the code currently calls sniff_text_or_binary() for any text/plain Content-Type. In fact... Section 2 defines official-type as "the media type indicted by the HTTP Content-Type header field, if present". And RFC 2616 defines media-type as 'media-type = type "/" subtype *( ";" parameter )', which seems to suggest that "official-type" includes any parameters in the Content-Type header... I am pretty sure this is a bug in the spec. (Not to mention the typo of "indicated" as "indicted".) Anyway, I'm assuming for the rest of this that official-type is supposed to mean only the type/subtype production, not including parameters. It is also unclear what we are supposed to do with the original parameters in the case where we change the type/subtype... currently we just keep them as-is, which is pretty clearly wrong... The comment about image/svg+xml refers to the wrong section number now, but it's also more-or-less irrelevant and can be removed. (Actually, there are lots of wrong section numbers...) The "images" check is technically wrong, in that you're only supposed to go to the image-sniffing rules if the official-type is "supported by the user agent". But we can't figure that out without gaining a dependency on gdk-pixbuf, so we'll ignore that part. (This rule is also weird, since it means that, say, if you have a PNG file mislabeled as image/bmp, then you should only display it if you support BMP files. Not sure if that's a bug or just part of the general weirdness of sniffing.) (Spec issues: In 5.3, 'If the row has no "WS" octets:' should be 'If the row has no "WS" or "_>" octets:'. '"SB" octet' should be '_>' octet". 'If index-stream-th octet of the stream different' should be '...stream is different') Since the spec doesn't talk about "SB" any more, you should explain what it means if you're still going to use that abbreviation. The pattern on the UTF-16LE BOM (line 291) is wrong, it should be "\xFF\xFE\x00\x00", not "\xFF\xFF\x00\x00". (Spec issue: In the "unknown" table, "application/x-gzip" should probably be "application/gzip". Also, there doesn't seem to be any difference between "Safe" and "n/a".) The "skip insignificant whitespace" step of sniff_feed_or_html() can read off the end of the buffer. The line "pos = pos + 3" in the "Skipping comments" section of sniff_feed_or_html() is mis-indented.
Created attachment 193479 [details] [review] updated patch Hi, here is an updated patch. (In reply to comment #2) > (From update of attachment 186894 [details] [review]) > It should probably be noted in a comment that (a) > soup_message_headers_get_content_type() does look at the last Content-Type > header if there is more than one, as required by Section 2, and (b) it does > ignore the header if it can't be parsed correctly, as also required by Section > 2. comment added > sniffing-test doesn't test the "syntactically-incorrect Content-Type header" > case though, and it should. test added > Section 3 step 2 only calls for going to the "text or binary" section if the > complete text of the Content-Type header matches one of those strings, but the > code currently calls sniff_text_or_binary() for any text/plain Content-Type. > > In fact... Section 2 defines official-type as "the media type indicted by the > HTTP Content-Type header field, if present". And RFC 2616 defines media-type as > 'media-type = type "/" subtype *( ";" parameter )', which seems to suggest that > "official-type" includes any parameters in the Content-Type header... I am > pretty sure this is a bug in the spec. (Not to mention the typo of "indicated" > as "indicted".) Ok. Updated patch tests, as required by spec, HTTP Content-Type header field. To avoid confusion, I changed previous variable name from content_type to official_type. > The comment about image/svg+xml refers to the wrong section number now, but > it's also more-or-less irrelevant and can be removed. (Actually, there are lots > of wrong section numbers...) I've updated section number comments. > Since the spec doesn't talk about "SB" any more, you should explain what it > means if you're still going to use that abbreviation. Comment modified to remove abbreviation. > The pattern on the UTF-16LE BOM (line 291) is wrong, it should be > "\xFF\xFE\x00\x00", not "\xFF\xFF\x00\x00". fixed. > The "skip insignificant whitespace" step of sniff_feed_or_html() can read off > the end of the buffer. I don't understand how. (pos > resource_lenght) check is before each loop start (before entering loop and at the end of the loop). > The line "pos = pos + 3" in the "Skipping comments" section of > sniff_feed_or_html() is mis-indented. fixed
CCing adam barth because comment 2 raises many spec issues.
pulling out just the spec issues/questions, not the code issues: 1) typos: Section 2: 'the media type indicted by the HTTP Content-Type header field, if present' should say 'indicated' Section 5.3: 'If the row has no "WS" octets:' should be 'If the row has no "WS" or "_>" octets:' '"SB" octet' should be '"_>" octet' 'If index-stream-th octet of the stream different' should be 'If the index-stream-th octet of the stream is different' 2) Does the fact that Section 3 is titled "Web Pages" mean that that algorithm is only for URLs that are believed to be web pages, and that, eg, URLs in IMG tags should jump right to the "Image" rules? If yes, then that should be stated explicitly (and this means that the interpretation of URLs can change when you pull them out of a document and put them directly in the address bar). If no, then the rule about image/svg+xml at the start of the Image section is irrelevant and should be removed, since there's no way to get to that step if the official-type is image/svg+xml. 3) Does "official-type" include the parameters from the Content-Type header? Comparing the definition with the definitions in RFC 2616 implies "yes", but I suspect that wasn't the intention (particularly since the text/plain rules in 3.2 could just say "official-type" in that case, but they don't). Assuming the answer is "no", then what are you supposed to do with the parameters from the original Content-Type header? (Probably, if the sniffed type/subtype is the same as the original type/subtype, then keep the params, else drop them? But other than the text/plain case, do they affect anything before that?) 4) Are you really only supposed to sniff images if the declared official-type is supported by the browser? This would imply that, eg, a PNG labeled as "image/gif" should be recognized as a PNG, but a PNG labeled "image/webp" should not. (If this is intentional, it is weird enough that it should probably be stated explicitly.) 5) In the "unknown" table, "application/x-gzip" should probably be "application/gzip", since that's the official MIME type. (Presumably the browser treats them the same anyway.) Also, in the current state of the document there is no difference between "Safe" and "n/a".
Thanks. This is great feedback. I'll let you know when I've updated the spec.
ok, getting back to this... turns out draft-abarth-mime-sniff-06 is actually old, because the draft got renamed to draft-ietf-mime-sniff, and is now at -03. Using the diffs feature on tools.ietf.org, it looks like the changes from -06 are: - added a new step 4 in the Unknown section, for video/mp4 - changed "audio/x-wave" to "audio/wave" in the Unknown table (and fixed the obvious typo in "vidow/webm") - added section 6.1 explaining how to sniff video/mp4 - removed the table from the Images section and just referred to the signatures from the Unknown table (which should be a no-op, specwise) - added Video and Fonts sections
fixed (via bug 715126)