After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 572589 - libsoup should be able to do Content-Type sniffing
libsoup should be able to do Content-Type sniffing
Status: RESOLVED FIXED
Product: libsoup
Classification: Core
Component: API
unspecified
Other All
: Normal normal
: ---
Assigned To: Gustavo Noronha (kov)
libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2009-02-20 19:19 UTC by Gustavo Noronha (kov)
Modified: 2009-07-02 14:23 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
proposed implementation (4.24 KB, patch)
2009-02-20 19:20 UTC, Gustavo Noronha (kov)
needs-work Details | Review
second try at implementing content sniffing (6.87 KB, patch)
2009-02-21 02:37 UTC, Gustavo Noronha (kov)
needs-work Details | Review
small test case (1.31 KB, text/plain)
2009-02-21 02:42 UTC, Gustavo Noronha (kov)
  Details
third try (7.88 KB, patch)
2009-02-23 04:36 UTC, Gustavo Noronha (kov)
none Details | Review
real third try (7.77 KB, patch)
2009-02-23 04:44 UTC, Gustavo Noronha (kov)
none Details | Review
fourth try =) (8.57 KB, patch)
2009-03-01 15:37 UTC, Gustavo Noronha (kov)
none Details | Review
fourth try for real (15.00 KB, patch)
2009-03-05 11:32 UTC, Gustavo Noronha (kov)
none Details | Review
forth try with Christian's comments addressed (14.46 KB, patch)
2009-03-23 14:44 UTC, Gustavo Noronha (kov)
none Details | Review
small test case (1.84 KB, application/octet-stream)
2009-06-08 11:54 UTC, Gustavo Noronha (kov)
  Details
initial implementation of the proposed scheme (18.61 KB, patch)
2009-06-08 12:03 UTC, Gustavo Noronha (kov)
none Details | Review
Second implementation of proposed scheme (15.63 KB, patch)
2009-06-09 02:58 UTC, Gustavo Noronha (kov)
none Details | Review
current state (18.71 KB, patch)
2009-06-12 21:29 UTC, Gustavo Noronha (kov)
none Details | Review
my small test case (18.71 KB, application/octet-stream)
2009-06-12 21:31 UTC, Gustavo Noronha (kov)
  Details
really, my small test case (2.09 KB, text/plain)
2009-06-15 19:41 UTC, Gustavo Noronha (kov)
  Details
first cut of the actual test (6.68 KB, patch)
2009-06-16 01:05 UTC, Gustavo Noronha (kov)
none Details | Review
patch to make WebKitGTK+ use the new API (6.09 KB, patch)
2009-06-21 02:29 UTC, Gustavo Noronha (kov)
none Details | Review

Description Gustavo Noronha (kov) 2009-02-20 19:19:38 UTC
It should be possible for client applications to request that soup try to find out the content type of content when the server sends no Content-Type header, or 'application/octet-stream' as the Content-Type.
Comment 1 Gustavo Noronha (kov) 2009-02-20 19:20:38 UTC
Created attachment 129174 [details] [review]
proposed implementation

Tested with WebKit/GTK+, and it works.
Comment 2 Dan Winship 2009-02-20 20:34:15 UTC
OK, style nitpicks:

use "char", not "gchar" for consistency. ("guchar", "gsize", etc, are fine, it's just "gchar", "gint", and "glong" that I hate.)

> +static
> +void sniff_content (SoupMessage *msg, const gchar* data, gsize length)

should be "static void\n", not "static\nvoid"


Actual code stuff:

> +	if (priv->msg_flags & SOUP_MESSAGE_SNIFF_CONTENT) {

Need to check !priv->server_side as well, because we only want to do content sniffing on the response bodies received by SoupSession, not on the request bodies received by SoupServer.

> +	if (mime_type)
> +		soup_message_headers_set_content_type (msg->response_headers, mime_type, NULL);
> +	else
> +		soup_message_headers_set_content_type (msg->response_headers, "application/octet-stream", NULL);

I think I would prefer to leave the headers untouched if glib can't figure out the MIME type... At any rate, it appears that g_content_type_guess() never returns NULL anyway, and annoyingly, it doesn't even set the "uncertain" flag in this case, which seems like a bug. So... for now I think I'd do:

    if (mime_type && strcmp (mime_type, "application/octet-stream") != 0)

> +	g_signal_emit (msg, signals[GOT_HEADERS], 0);
> +	priv->need_content_sniffing = FALSE;

change the flag before emitting the signal, in case the app requeues the message from the signal handler or something...

The check in got_body() is not really useful; the I/O code will always emit got_chunk at least once if any body data is read, so if we haven't done the sniffing by the time got_body() is invoked, then that means it's a 0-length body, so you're not going to manage to sniff anything anyway.

You should probably clear the priv->need_content_sniffing flag in soup_message_cleanup_response() just to be paranoid.

Finally, I want to take this opportunity to make the declaration of SoupMessageFlags a little less ugly. Right now it's:

  typedef enum {
  #ifndef LIBSOUP_DISABLE_DEPRECATED
	SOUP_MESSAGE_OVERWRITE_CHUNKS = (1 << 3),
  #endif
	SOUP_MESSAGE_NO_REDIRECT      = (1 << 1)
  } SoupMessageFlags;

where the two are out of order because gcc with certain warning flags will complain if there's a trailing ",", and making that work nicely with the ifdef is ugly. But we can fix it now to:

  typedef enum {
	SOUP_MESSAGE_NO_REDIRECT      = (1 << 1),
  #ifndef LIBSOUP_DISABLE_DEPRECATED
	SOUP_MESSAGE_OVERWRITE_CHUNKS = (1 << 3),
  #endif
	SOUP_MESSAGE_SNIFF_CONTENT    = (1 << 4)
  } SoupMessageFlags;

(I don't care that this leaves 1<<2 unused. I'll use it later. :)



All that said, I've realized there's actually another problem: when you call sniff_content() from soup_message_got_chunk(), it's possible that the got-headers signal emission will cause the message to be cancelled, paused, or requeued, in which case emitting got_chunk after calling it is wrong. (A concrete example of when this happens is if the app pauses a message after receiving the session "authenticate" signal, so that it can asynchronously pop up a password dialog before calling soup_auth_authenticate. Which I assume WebKit does, or at least, will after all the current patches land.)

So, sorry, maybe you were right the first time, and this should go in at the soup-message-io layer instead.

<kov> danw: so, to implement the sniffing, I guess we're not messing with the read io states
<kov> danw: but delaying the signal emission to STATE_BODY/STATE_CHUNKED, right?

right, just have a single flag in SoupMessageIOData for "waiting to emit got_headers", which would be set at the end of the io_read STATE_HEADERS case rather than calling soup_message_got_headers() if the sniffing flag was set. Then right before the soup_message_got_chunk() call in read_body_chunk(), you'd call a function which would do the sniffing and call got_headers. Except you again run into problems here, because if the message is paused after got_headers, we need to save the just-read chunk so we have it to emit later... (OTOH, that might also make it easier to handle the case where the first chunk is very small, and so we want to wait until we've read a little further before sniffing.)

What do you think? Would it be easier to just do this in WebKit now? (Or maybe have the delay-calling-didReceiveResponse() bits in WebKit, but still have soup_message_guess_content_type() or something in libsoup?)
Comment 3 Gustavo Noronha (kov) 2009-02-20 22:10:18 UTC
(In reply to comment #2)
> OK, style nitpicks:

ACK =)

> The check in got_body() is not really useful; the I/O code will always emit
> got_chunk at least once if any body data is read, so if we haven't done the
> sniffing by the time got_body() is invoked, then that means it's a 0-length
> body, so you're not going to manage to sniff anything anyway.

Would be useful to add application/octet-stream, though.

> You should probably clear the priv->need_content_sniffing flag in
> soup_message_cleanup_response() just to be paranoid.

OK
 
> <kov> danw: so, to implement the sniffing, I guess we're not messing with the
> read io states
> <kov> danw: but delaying the signal emission to STATE_BODY/STATE_CHUNKED,
> right?
> 
> right, just have a single flag in SoupMessageIOData for "waiting to emit
> got_headers", which would be set at the end of the io_read STATE_HEADERS case
> rather than calling soup_message_got_headers() if the sniffing flag was set.
> Then right before the soup_message_got_chunk() call in read_body_chunk(), you'd
> call a function which would do the sniffing and call got_headers. Except you
> again run into problems here, because if the message is paused after
> got_headers, we need to save the just-read chunk so we have it to emit later...
> (OTOH, that might also make it easier to handle the case where the first chunk
> is very small, and so we want to wait until we've read a little further before
> sniffing.)

I may be able to do that. So, would you suggest adding a SoupBuffer to the iodata structure, and using it to store that data? We will need to add checks to see if the buffer is not empty in STATE_CHUNKED and TRAILERS(, and FINISHING?), correct?
 
> What do you think? Would it be easier to just do this in WebKit now? (Or maybe
> have the delay-calling-didReceiveResponse() bits in WebKit, but still have
> soup_message_guess_content_type() or something in libsoup?)

It might be easier, but I would prefer to have this in libsoup right now if I can get to implement it right. If the download patch goes in I will be in a hurry, though =)

Comment 4 Gustavo Noronha (kov) 2009-02-21 02:37:45 UTC
Created attachment 129197 [details] [review]
second try at implementing content sniffing

ok, this is a second try; I believe I addressed most of the concerns, though I may have added more nit-pickables than I removed =)
Comment 5 Gustavo Noronha (kov) 2009-02-21 02:42:11 UTC
Created attachment 129198 [details]
small test case

The tests tells soup to sniff content, fetches an URI that sends no Content-Type header, pauses the message while handling got_headers, and unpauses it 2 seconds later through a timeout. Not really intended for the test suite, because it depends on a real URL.

kov@abacate ~/s/l/tests> jhbuild run ./st

** (process:25489): WARNING **: headers [text/html]

** (process:25489): WARNING **: pausing msg

** (process:25489): WARNING **: unpausing msg

** (process:25489): WARNING **: got-chunk

** (process:25489): WARNING **: got-chunk
Comment 6 Dan Winship 2009-02-21 16:36:30 UTC
I said:
> At any rate, it appears that g_content_type_guess() never
> returns NULL anyway, and annoyingly, it doesn't even set the "uncertain" flag
> in this case, which seems like a bug.

But I was looking at the win32 version, which is apparently just lame. The UNIX version also never returns NULL (so I guess that's part of the API contract), but it does set "uncertain" correctly. So this:

> +	if (content_type && strcmp (content_type, "application/octet-stream") != 0)

should be just "if (!uncertain)". Oh, and also, you need to keep the "g_content_type_get_mime_type()" part, so that it will still work (at least a little bit) on Windows (where the "content_type" values are file extensions, not MIME types). Good to see that the code is using uri_path now too.

soup-message-io.c: In function ‘sniff_content’:
soup-message-io.c:269: warning: unused variable ‘priv’

The read_body_chunk changes make me a little nervous. I *think* it's safe, but I'd like it better if it just used the same SOUP_MESSAGE_IO_PREPARE_FOR_CALLBACK / SOUP_MESSAGE_IO_RETURN_VAL_IF_CANCELLED_OR_PAUSED macros as everything else does. You ought to be able to do that by just setting up io->delayed_chunk_buffer *before* emitting got_headers rather than after (and then clearing it again if you're still around after the RETURN macro).

> +			if (!priv->server_side && priv->msg_flags & SOUP_MESSAGE_SNIFF_CONTENT &&

please put ()s around the "priv->msg_flags & SOUP_MESSAGE_SNIFF_CONTENT" .

oh, and this is sort of random, but for consistency with the rest of soup-message-io, you should say "if (io->mode == SOUP_MESSAGE_IO_CLIENT", not "if (!priv->server_side"... I should get rid of that redundancy/inconsistency at some point. :-} (The priv->server_side flag used to not exist, so soup-message-io had its own flag for keeping track of it...)

> We will need to add checks
> to see if the buffer is not empty in STATE_CHUNKED and TRAILERS(, and
> FINISHING?), correct?

Yeah, I believe the current patch will get messed up if either there's a 0-length body (in which case read_body_chunk() will never get called, and so it will never emit got_headers), or if the entire body is read in a single chunk and the app pauses the message after the delayed got_headers signal (in which case it won't call read_body_chunk() a second time, and so the delayed chunk will never get emitted). If you add

	priv->msg_flags = SOUP_MESSAGE_SNIFF_CONTENT;

to soup_message_init(), then run "make check", 3 of the 19 tests fail.

Both of those cases would be fixed by some extra checks in FINISHING. (Don't worry about TRAILERS since we already don't handle that correctly, and no one uses it anyway.) Or maybe you could make it work with the checks being just at the top of io_read()?

> Not really intended for the test suite, because it depends on a real URL.

Yeah, but we definitely want something in the real test suite. I can work on that.


Other improvements to make in the future (mostly as notes to myself, these can all happen after the initial commit):

  * If the initial chunk is small, we should read more chunks and concatenate
    them together to get a large enough buffer to sniff effectively. This
    might involve using SoupMessageBody to accumulate the chunks.

  * Rather than just using the URI path as the "filename" arg to
    g_content_type_guess(), look at Content-Disposition too (eg, for when
    the URI path is /show_attachment.cgi, but the Content-Disposition includes
    filename="0001-Implement-content-sniffing.patch"). I want to abstract
    that into a "soup_message_get_filename()" method that uses
    Content-Disposition if available, and the URI if not. (And maybe
    Content-Location?)

  * Figure out if there are web-specific sniffing rules we should be using
    (eg, HTML5's rules) in addition to g_content_type_guess().
Comment 7 Gustavo Noronha (kov) 2009-02-23 04:36:30 UTC
Created attachment 129296 [details] [review]
third try

Addressed your comments, and this version passes all tests (I confess I didn't try the ones that use apache, though =)).
Comment 8 Gustavo Noronha (kov) 2009-02-23 04:44:13 UTC
(In reply to comment #6)
> Both of those cases would be fixed by some extra checks in FINISHING. (Don't
> worry about TRAILERS since we already don't handle that correctly, and no one
> uses it anyway.) Or maybe you could make it work with the checks being just at
> the top of io_read()?

Not enough, it seems.  I actually had to add checks for delayed headers and buffer to the got_body label, but this seems to do it all =).

> Yeah, but we definitely want something in the real test suite. I can work on
> that.

I would like to work on that, actually. Any pointers on how/what you would like to test this?

> Other improvements to make in the future (mostly as notes to myself, these can
> all happen after the initial commit):

Very good TODO list. I will be glad to help on those, too. Specially on the second point =).
Comment 9 Gustavo Noronha (kov) 2009-02-23 04:44:43 UTC
Created attachment 129297 [details] [review]
real third try

Sorry, messed up the last patch.
Comment 10 Dan Winship 2009-02-28 19:08:25 UTC
OK, so if we want this in for 2.26 it really has to happen this weekend.

Here's the thing: I am not convinced that this API is right for all apps that would want to do Content-Type sniffing. Even worse, reading some stuff on the web about web browser content sniffing security issues (eg http://webblaze.cs.berkeley.edu/2009/content-sniffing/mimesniff.pdf), I'm not even sure it's right for WebKit/epiphany.


So here's my suggestion: rather than exposing this as a single SoupMessageFlags value with a fixed behavior, we instead create a SoupSessionFeature called SoupContentTypeSniffer. This would have a constructor that takes an enum value that tells what kind of content-type sniffing you want to do. For 2.26 we'd only define one value, SOUP_CONTENT_TYPE_SNIFF_AUTOMATIC or something, which would be exactly the code you've already written; it would delay emission of got-headers, and replace the existing Content-Type header with a new one. Later on, after we've had more time to look at use cases, we can add more enum/flag values (and/or additional parameters/methods) to tweak the behavior: SOUP_CONTENT_TYPE_SNIFF_BE_PARANOID, SOUP_CONTENT_TYPE_SNIFF_USE_HTML5_RULES, SOUP_CONTENT_TYPE_SNIFF_DONT_OVERWRITE_ORIGINAL_HEADER, etc.

In terms of implementation, for now, this would basically just mean removing the SOUP_MESSAGE_SNIFF_CONTENT flag, and replacing it with a gboolean in SoupMessagePrivate, and then adding SoupContentTypeSniffer and having it set that flag on each message that passes through; the code in soup-message-io would all stay the same (other than checking a different flag). (Later on we might move the flag back to being a public SoupMessageFlags value, or maybe not.) And then the public API for requesting content sniffing would be to just add a SoupContentTypeSniffer feature to your session. (Does that work? Do you want to sniff *all* requests? Or was your code only going to be sniffing some requests?)

If you have time to do this, that would be great. Otherwise I will try to do it tomorrow.
Comment 11 Gustavo Noronha (kov) 2009-03-01 13:16:45 UTC
(In reply to comment #10)
> OK, so if we want this in for 2.26 it really has to happen this weekend.

Woohoo =)
 
> Here's the thing: I am not convinced that this API is right for all apps that
> would want to do Content-Type sniffing. Even worse, reading some stuff on the
> web about web browser content sniffing security issues (eg
> http://webblaze.cs.berkeley.edu/2009/content-sniffing/mimesniff.pdf), I'm not
> even sure it's right for WebKit/epiphany.

I see. I will give more attention to that article this week.

> not.) And then the public API for requesting content sniffing would be to just
> add a SoupContentTypeSniffer feature to your session. (Does that work? Do you
> want to sniff *all* requests? Or was your code only going to be sniffing some
> requests?)

I was turning the flag on to all requests, because soup needs to know if content sniffing is wanted before emitting got-headers, so there is no way webkit would be able to decide which request should be sniffed.

> If you have time to do this, that would be great. Otherwise I will try to do it
> tomorrow.

I didn't see the reply yesterday, but I'm on it right now.
Comment 12 Gustavo Noronha (kov) 2009-03-01 15:37:40 UTC
Created attachment 129784 [details] [review]
fourth try =)

Adds the feature, as suggested by Dan.
Comment 13 Gustavo Noronha (kov) 2009-03-05 11:32:14 UTC
Created attachment 130114 [details] [review]
fourth try for real

cosimoc noticed I forgot to add the feature-related files in my last patch; here they are
Comment 14 Christian Dywan 2009-03-13 09:32:19 UTC
A few comments:

- There are Since tags missing
- soup_content_sniffer_new mentions cookie storage which probably was copied from SoupCookieJar and should be removed
- SOUP_MESSAGE_SNIFF_CONTENT should be added after SOUP_MESSAGE_OVERWRITE_CHUNKS
- The "sniff-type" property has no documentation
- You may want to reconsider the naming of "SOUP_IS_MESSAGE_IO_CANCELLED", this reads like a typical GObject macro, which it is not. It's not a public symbol, but it might still be good to avoid such confusion.
Comment 15 Gustavo Noronha (kov) 2009-03-23 14:30:10 UTC
(In reply to comment #14)
> A few comments:

Thanks =)


> - There are Since tags missing

I'm adding 2.26.1, since I am not sure what to add right now, but I will fix this when landing.

> - soup_content_sniffer_new mentions cookie storage which probably was copied
> from SoupCookieJar and should be removed

bad copy/paste =)

> - SOUP_MESSAGE_SNIFF_CONTENT should be added after
> SOUP_MESSAGE_OVERWRITE_CHUNKS

It should be removed, actually. The flag doesn't exist anymore, since it has been replaced by the feature.

> - The "sniff-type" property has no documentation
> - You may want to reconsider the naming of "SOUP_IS_MESSAGE_IO_CANCELLED", this
> reads like a typical GObject macro, which it is not. It's not a public symbol,
> but it might still be good to avoid such confusion.

Fixed both. Just finishing building and I'll upload the updated patch.

Comment 16 Gustavo Noronha (kov) 2009-03-23 14:44:24 UTC
Created attachment 131188 [details] [review]
forth try with Christian's comments addressed
Comment 17 Gustavo Noronha (kov) 2009-06-05 18:02:10 UTC
So, danw and I have been discussing how this should be implemented. I have some ideas, and thought I'd add them here, so that we can figure out a good API.

A possible implementation is this one:

General Idea

Content Sniffing is a feature, that connects to the got-header and got-chunk signals of messages. When it detects that a message needs sniffing, it will first accumulate the amount of chunks necessary to do the sniffing, sniff, and then emit a 'content-sniffed' signal, with the modified message as parameter. If no sniffing will be done, the 'content-sniffed' signal is emitted just after 'got-headers', with no changes to the message.

This allows for something which is important for users of libsoup: after adding the feature, they may wait for the content-sniffed signal to be emitted, where they would wait for got-headers.

-------
How does the feature decide which messages are going to be sniffed, and whether to use the sniffed type instead of the original one? This depends on its 'operation mode', a construction property of the feature. Initially we'll have 3 operation modes:

GIO, HTML5, and CUSTOM

When using GIO, the sniffer will use whatever g_content_type_guess () guesses.

When using HTML5, the content sniffer will use the algorithm described in the draft HTML5 spec: http://dev.w3.org/html5/spec/Overview.html?#content-type-sniffing, including deciding automatically that some types should not become other types.

When using CUSTOM, two additional signals get emitted for a message: after got-headers, 'should-sniff-content' is emitted, and the user can decide whether sniffing should be done by looking at the message, and returning its decision (TRUE/FALSE). After the sniffing has been done, 'should-replace-content-type' is emitted with the message as-is, and the sniffed content type. The user can decide if the message should be modified by returning TRUE/FALSE. Then, the normal 'content-sniffed' is emitted.

We could provide soup_content_sniffer_sniff(Buffer, OperationMode) API calls for the CUSTOM mode to use, too.

-------

Comments?
Comment 18 Dan Winship 2009-06-05 21:41:30 UTC
(In reply to comment #17)
> Content Sniffing is a feature, that connects to the got-header and got-chunk
> signals of messages. When it detects that a message needs sniffing, it will
> first accumulate the amount of chunks necessary to do the sniffing, sniff, and
> then emit a 'content-sniffed' signal

I'd been thinking that 'content-sniffed' would be a SoupMessage signal, it's just that it would be triggered by the actions of the sniffing feature rather than by soup-message-io like the other signals. Maybe it does make more sense to have it be on the feature though?

Problem: what happens to the chunks received before emitting 'content-sniffed'? WebKit can't process them yet, but since WebKit sets soup_message_body_set_accumulate(FALSE), libsoup is going to throw them away after emitting got-chunk. So you'd have to do something like have the sniffer gather them and then emit them as part of the content-sniffed signal? Bleah. It would probably be better to just have some flag the sniffer can set that tells soup-message-io that it's not allowed to emit got-chunk yet, and so it has to keep concatenating chunks together until it gets enough data, and then when it does, the sniffing feature can look at it, emit content-sniffed, and then clear the flag, and then soup-message-io will emit got-chunk on the whole buffer, and then discard it and continue normally with the rest of the body. (So actually, this is a little like the code you wrote before, it's just that we'd be delaying got-chunk, not got-headers.)

> with the modified message as parameter.

I don't think you should modify the message headers. Just leave them as-is, but have the sniffed content type as another signal param.

> When using GIO, the sniffer will use whatever g_content_type_guess () guesses.

The patch you'd submitted before passed the URI path to g_content_type_guess() as the filename, which means that if there's a filename extension, it generally won't sniff at all. So maybe there needs to be more configurability there? (only use filename, prefer filename, prefer sniffing, only use sniffing?)

(Also, looks like if the filename passed to g_content_type_guess() ends in "/" then it will assume it's a directory, which isn't true on the web. Should probably g_basename() it.)

> When using CUSTOM, two additional signals get emitted for a message

I'm generally more in favor of having the feature type be subclassable, and if you don't like the default behavior, you can subclass and implement the virtual methods differently.

What option would WebKit be using here? If it's going to use HTML5, I don't think there's any reason to implement the GIO version.
Comment 19 Gustavo Noronha (kov) 2009-06-07 01:32:18 UTC
(In reply to comment #18)
> I'd been thinking that 'content-sniffed' would be a SoupMessage signal, it's
> just that it would be triggered by the actions of the sniffing feature rather
> than by soup-message-io like the other signals. Maybe it does make more sense
> to have it be on the feature though?

I think so.

> Problem: what happens to the chunks received before emitting 'content-sniffed'?
> WebKit can't process them yet, but since WebKit sets
> soup_message_body_set_accumulate(FALSE), libsoup is going to throw them away
> after emitting got-chunk. So you'd have to do something like have the sniffer
> gather them and then emit them as part of the content-sniffed signal? Bleah. It
> would probably be better to just have some flag the sniffer can set that tells
> soup-message-io that it's not allowed to emit got-chunk yet, and so it has to
> keep concatenating chunks together until it gets enough data, and then when it
> does, the sniffing feature can look at it, emit content-sniffed, and then clear
> the flag, and then soup-message-io will emit got-chunk on the whole buffer, and
> then discard it and continue normally with the rest of the body. (So actually,
> this is a little like the code you wrote before, it's just that we'd be
> delaying got-chunk, not got-headers.)

Yeah, that's been in my mind ever since I wrote this comment. I believe delaying got-chunk is the best option we have, indeed. I was trying to find a way to not delay emission of signals, but I think delaying got-chunk is less problematic.

> > with the modified message as parameter.
> 
> I don't think you should modify the message headers. Just leave them as-is, but
> have the sniffed content type as another signal param.

Right. That was what my initial version said, before I edited it heh. I thought of doing it this way so that if you use, say, HTML5 sniffing, you could not connect to content-type-sniffed, if you don't need it, and have the default sniffing policy be applied, but I think we could get this using a default handler, instead?

> > When using CUSTOM, two additional signals get emitted for a message
> 
> I'm generally more in favor of having the feature type be subclassable, and if
> you don't like the default behavior, you can subclass and implement the virtual
> methods differently.

OK, sounds good.

> What option would WebKit be using here? If it's going to use HTML5, I don't
> think there's any reason to implement the GIO version.

HTML5. GIO would be for applications wanting to provide some kind of compatibility with nautilus or whatever. I don't really know if there's a use case for this, so I won't defend its implementation for now. Thanks for the comments, I'll follow up with a patch, now =).

> 

Comment 20 Gustavo Noronha (kov) 2009-06-08 11:54:48 UTC
Created attachment 136140 [details]
small test case

This is not a proper test. I plan to write one when we're decided on the API, and perhaps push this work to a branch.
Comment 21 Gustavo Noronha (kov) 2009-06-08 12:03:36 UTC
Created attachment 136141 [details] [review]
initial implementation of the proposed scheme

This is a first iteration. Since I was about to switch from weekend hacking to day job hacking, I thought I'd upload what I already have, so that you can take a look and give some early comments. Please be picky with the style, too =).

So, what I found out during implementation: message-io has no sane way that I could find of accessing a session feature, so I decided to make the sniffer a member of SoupMessage's private structure. I'm not 100% happy with this, even though I think it's not that of a hack, given that we are already using that structure to communicate policy regarding sniffing.

Also, I think emitting the message on the feature makes things less practical, so I decided to go with adding the signal to the message. The signal is not fully implemented (I will add a boolean parameter to let the user know if the sniffing algorythm considers changing the content type from what came on the header to what was sniffed safe (avoiding priviledge escalation image -> script, for instance). I also want to add a message flag to say 'for this message, do not perform sniffing'. In webkit we have a flag telling us whether sniffing should be done for a given request.

The 'html5_sniff' is just our old sniffing code, as a place holder. After rereading our last exchange, I am starting to consider if we really want to have operation mode at all. Perhaps we should implement the HTML5 algorythm, and people who want to override decisions other than if the sniffed type should replace the received one can subclass the feature, and implement the appropriate method?
Comment 22 Dan Winship 2009-06-08 14:27:28 UTC
(In reply to comment #21)
> Since I was about to switch from weekend hacking to
> day job hacking

Yeah, me too, so this is just based on a quick look. I'll look more later.

> Please be picky with the style, too =).

I'd started writing a HACKING file at one point, which I've now committed

> So, what I found out during implementation: message-io has no sane way that I
> could find of accessing a session feature, so I decided to make the sniffer a
> member of SoupMessage's private structure. I'm not 100% happy with this, even
> though I think it's not that of a hack, given that we are already using that
> structure to communicate policy regarding sniffing.

As long as you aren't changing SoupMessage's public API (other than adding the content-sniffed signal), we can move it around later if we find some better way to do it.

> Also, I think emitting the message on the feature makes things less practical,
> so I decided to go with adding the signal to the message.

Sure, that's fine; sniffing changes the behavior of SoupMessage anyway (delaying got-chunk), so it does make sense to have this signal there too.

I am not sure I like the idea that it is emitted even if there is no sniffer...

> I also want to add a message flag to say 'for this
> message, do not perform sniffing'.

Bug 574773 discusses being able to disable individual SoupSessionFeatures on a per-message basis, which would let you do this.

> Perhaps we should implement the HTML5 algorythm,
> and people who want to override decisions other than if the sniffed type should
> replace the received one can subclass the feature, and implement the
> appropriate method?

Yes, that sounds right.


On the code...

I think soup_content_sniffer_sniff should take a SoupBuffer, not a char*+length. (You'll need to have created the SoupBuffer for the got-chunk signal anyway, so you might as well pass it to the sniffer too.)

Likewise, you shouldn't use a GString for delayed_chunk_data, since the data may have 0x00 bytes in it.

+	/* FIXME: we probably want the feature to define this number; the HTML5
+	 * spec says the client may wait for 512 or more bytes.
+	 */
+	if (io->delayed_chunk_data->len > 512) {

Yes, having the sniffer decide sounds right. You may want to have it be "soup_content_sniffer_get_buffer_size (sniffer, msg)" so that it can choose a different buffer size based on the declared content-type or anything else in the headers? (Or it can return 0 right then to say that it's not going to want to sniff that message.)

Oh, and the I/O code needs to handle the case where the total response length turns out to be less than 512.

And actually, these changes tie into another thought I'd had at one point about inefficient use of malloc in the I/O code; in the case where the message has a Content-Length header, and it's something "small", it would be more efficient for us to just malloc a single SoupBuffer large enough for the whole body right at the beginning, store that in the SoupMessageIOData, and then read data into it directly (using soup_buffer_new_subbuffer() as needed to create subbuffers to emit got-chunk with). That would to save us a bunch of malloc'ing and realloc'ing and stuff.

So, keeping that in mind as a future use case, one way to do the soup-message-io sniffing part would be: after emitting got-headers, call the sniffer_get_buffer_size() method, and allocate a SoupBuffer of the size that it wants (stored in the SoupMessageIOData). Then, read directly into that buffer (keeping track of the current offset) until you fill it up, and then do the sniffing and got-chunk.

Alternatively, an easier way to implement this (that doesn't immediately set us up for the memory optimizations, but is less work now) would be to use a SoupMessageBody for delayed_chunk_data, and just append each chunk to it in turn, and then use soup_message_body_flatten at the end to squish it into a single chunk to pass to the sniffer and the got-chunk signal.
Comment 23 Gustavo Noronha (kov) 2009-06-08 15:25:05 UTC
(In reply to comment #22)
> (In reply to comment #21)
> Yeah, me too, so this is just based on a quick look. I'll look more later.

Coolie!
 
> > Please be picky with the style, too =).
> 
> I'd started writing a HACKING file at one point, which I've now committed

Nice, thanks!
 
> I am not sure I like the idea that it is emitted even if there is no sniffer...

Yeah, I'll clean that up.

> Bug 574773 discusses being able to disable individual SoupSessionFeatures on a
> per-message basis, which would let you do this.

Awesome. I won't bother with this in here, for now, then.

> I think soup_content_sniffer_sniff should take a SoupBuffer, not a
> char*+length. (You'll need to have created the SoupBuffer for the got-chunk
> signal anyway, so you might as well pass it to the sniffer too.)
> 
> Likewise, you shouldn't use a GString for delayed_chunk_data, since the data
> may have 0x00 bytes in it.

GString is able to handle that, since it has length as a member, which is why I used it. I like the idea of using SoupMessageBody. I think using a buffer of exactly the size the sniffer wants may complicate matters a bit, since we may end up getting more than that in a read. I may be wrong, but I'll go for the simpler route for now if you're OK with that.

Using SoupMessageBody didn't cross my mind while writting, but sounds like a very good alternative to GString.

> Yes, having the sniffer decide sounds right. You may want to have it be
> "soup_content_sniffer_get_buffer_size (sniffer, msg)" so that it can choose a
> different buffer size based on the declared content-type or anything else in
> the headers? (Or it can return 0 right then to say that it's not going to want
> to sniff that message.)

That's a good idea!

> Oh, and the I/O code needs to handle the case where the total response length
> turns out to be less than 512.

That's handled already, thanks to this also being a problem in my previous patch, I just adapted the code that already existed =).
Comment 24 Gustavo Noronha (kov) 2009-06-09 02:58:48 UTC
Created attachment 136183 [details] [review]
Second implementation of proposed scheme

I hopefully address all the issues. For the number of bytes that we want to wait, I decided to use another member in SoupMessagePrivate instead of using a function call. We can do all that is needed to decide whether, and how a message will be sniffed while handling got-headers, this way.

The signal only gets emitted when the feature is enabled, now, and I replaced GString with SoupMessageBody. I removed the operation mode, but I decided to keep a placeholder private structure for the feature so that we can use it in the future, if need be. Let me know if you want that removed.

If you are OK with the basic approach, I will start implementing the actual HTML5 sniffing algorithm in soup_content_sniffer_got_headers_cb, and soup_content_sniffer_sniff.
Comment 25 Dan Winship 2009-06-09 14:51:59 UTC
(In reply to comment #24)
> For the number of bytes that we want to
> wait, I decided to use another member in SoupMessagePrivate instead of using a
> function call. We can do all that is needed to decide whether, and how a
> message will be sniffed while handling got-headers, this way.

Hm... although a subclass outside the libsoup tree wouldn't be able to override that, since it wouldn't have access to SOUP_MESSAGE_GET_PRIVATE...

Maybe have the virtual method, but just have it be called by soup_content_sniffer_got_headers_cb(), which could then set the SoupMessagePrivate field? (Likewise, please make soup_content_sniffer_sniff() virtual.)

> I removed the operation mode, but I decided to
> keep a placeholder private structure for the feature so that we can use it in
> the future, if need be. Let me know if you want that removed.

None of the other libsoup types assign their private struct to ->priv anyway (they all call SOUP_WHATEVER_GET_PRIVATE in every method where they need it), so for consistency, you should remove the priv field from SoupContentSniffer too, and then there's no reason at all to have it in the .c file if you're not using it.

> + * Since: 2.27.x

should be "Since: 2.27.2"

> +soup_content_sniffer_got_headers_cb(SoupMessage *msg, SoupContentSniffer *sniffer)

missing space before "("

> +	SoupMessagePrivate *priv = SOUP_MESSAGE_GET_PRIVATE (msg);
> +	priv->sniffer = SOUP_CONTENT_SNIFFER(g_object_ref(G_OBJECT(feature)));

blank line between variable declarations and function body. (likewise in request_unqueued). Also, g_object_ref takes and returns gpointer, so no need for the casts. And add a space between "g_object_ref" and "(".

> +			  G_CALLBACK(soup_content_sniffer_got_headers_cb),

missing space before "("

> +void                soup_content_sniffer_sniff    (SoupContentSniffer *sniffer,
> +						   SoupMessage *msg,
> +						   SoupBuffer *buffer);

it seems a bit odd that this operates by side-effect... maybe it should return the content type and soup-message-io should call soup_message_content_sniffed() ?

> +static void io_content_sniff(SoupMessage *msg);

space

> +	soup_content_sniffer_sniff (priv->sniffer, msg, sniffed_buffer);
> +	io->delay_got_chunks = FALSE;

regardless of whether the signal ends up being emitted directly in soup-message-io or indirectly via soup_content_sniffer_sniff(), you need to have soup-message-io do the PREPARE_FOR_CALLBACK dance here, since a signal handler is going to get run, and it might cancel or pause the message.

Making things trickier, this means you may have to return after emitting content-sniffed, and then the next time io_read is called, you have to emit the delayed got-chunk.

Also, if delayed_chunk_data is still set in soup_message_io_cleanup(), you need to free it. (ie, the message got cancelled before the delayed chunk was emitted)

> +				/* We already have enough data to perform sniffing, so do it */
> +				if (io->delayed_chunk_length > priv->bytes_for_sniffing) {

The comment is written as though you already know the conditional is going to be true... either put the comment inside the if statement, or else rewrite the comment to be condition ("if we already ... then do it")

> +			else if (priv->sniffer)
> +				soup_message_content_sniffed(msg, NULL);

space before "(". It's definitely weird that soup-message-io causes the signal to be emitted directly in this case, but does it indirectly through soup-content-sniffer in the normal case, so this is another argument for always having soup-message-io do the signal emission.

Also, the same PREPARE_FOR_CALLBACK issues apply here; you can't emit content-sniffed here because the application may have paused from the got-headers handler. So you need to wait until you reach SOUP_MESSAGE_IO_STATE_BODY / read_body_chunk().

>  	got_body:
> +		if (io->delayed_chunk_data)
> +			io_sniff_content (msg);

and again here, you need to possibly return after io_sniff_content, and then possibly come back later to emit got-body.

> +	if (priv->should_sniff_content)
> +		io->delay_got_chunks = TRUE;

this is no longer needed in new_iostate() because you set it after got-headers, right?


OK, so summing up how I think soup-message-io needs to work:

At the top of read_body_chunk(), before the loop, check if this is the very first read (you'll need to add a new flag for this; i don't think there's currently any way to figure that out reliably), and if so, and priv->sniffer is set but priv->should_sniff_content is FALSE, then call soup_message_sniffed_content(msg, NULL) (and toggle whatever flags we need to toggle to make sure we don't do this again next time). (And then RETURN_IF_PAUSED_OR_CANCELLED.) (This replaces the call you currently have to soup_message_content_sniffed() after the got-headers signal.)

After that (still before the loop), if delayed_chunk_data has already been filled to the right size, then it must be that we previously sniffed it and emitted content-sniffed, but then returned without emitting got-chunk. So emit got-chunk on it now and then free it. (And then RETURN_IF_PAUSED_OR_CANCELLED.)

Then go into the read_body_chunk loop, as in your current patch, except with the possibility of returning between content-sniffed and got-chunk.

In SOUP_MESSAGE_IO_STATE_BODY, after the "got_body" label, check if delayed_chunk_data is still set, and if so, call io_sniff_content() and possibly return. Oh, except before returning, make sure io->read_state is STATE_BODY and io->read_length is 0, so that if you do return, then when io_read() gets called again, it will come back to the same point again, and read_body_chunk() will do the right thing (emitting got-chunk if needed, but not doing any more reading). (You need to set read_state because it might have been STATE_TRAILERS previously if we were doing a chunked read, and you need to set read_length because it might have been non-0 previously if we were doing a read-to-EOF.)
Comment 26 Gustavo Noronha (kov) 2009-06-12 21:19:47 UTC
(In reply to comment #25)
> OK, so summing up how I think soup-message-io needs to work:

The funny thing is I started working on the issues, one by one, as I read your comment, so I had done most of the changes before getting to the summing up. My changes were all very similar to what you described already, except for this:

> been STATE_TRAILERS previously if we were doing a chunked read, and you need to
> set read_length because it might have been non-0 previously if we were doing a
> read-to-EOF.)

OK. I'll post the current state of the patch, and of the small tool I'm using to test with currently. The next thing I'll do is write a proper test, so I wanted your help in double checking I have all the main issues covered for a first round of testing:
 
=== when using the sniffer

for, say, a mbox file that will be sent with text/plain by the server:

-> content-sniffed must be emitted after got-headers, and before the first got-chunk
-> the number of bytes read must match

-> if got-headers pause the message, the sniffing still needs to happen
-> if sniffing pauses the message, all chunks are notified by got-chunk
-> the number of bytes read must match

for an mbox file that is small enough to fit in one got-chunk (is there a way to guarantee this?)

-> if got-headers pause the message, the sniffing still needs to happen
-> if sniffing pauses the message, the only chunk is notified by got-chunk

those tests should be run in both normal and chunked transfer modes to make sure both work

=== when using the sniffer, the server sends a content type that is not sniffed

-> content_sniffed gets emitted with a NULL content type, right after got-headers, before any got-chunk

=== when no using the sniffer

-> content_sniffed does not get emitted, at all

What do you think?
Comment 27 Gustavo Noronha (kov) 2009-06-12 21:29:28 UTC
Created attachment 136476 [details] [review]
current state
Comment 28 Gustavo Noronha (kov) 2009-06-12 21:31:33 UTC
Created attachment 136477 [details]
my small test case
Comment 29 Dan Winship 2009-06-13 15:05:47 UTC
(In reply to comment #26)
> OK. I'll post the current state of the patch, and of the small tool I'm using
> to test with currently.

Oops. You actually posted two copies of the patch and none of the test.

> for an mbox file that is small enough to fit in one got-chunk (is there a way
> to guarantee this?)

make it smaller than 8k. (Actually, you probably have to make sure that the entire response, including headers, is < 8k.)

> those tests should be run in both normal and chunked transfer modes to make
> sure both work

And also, should be run both with set_accumulate(TRUE) and set_accumulate(FALSE).
Comment 30 Gustavo Noronha (kov) 2009-06-15 19:41:28 UTC
Created attachment 136658 [details]
really, my small test case

Now, for real.
Comment 31 Gustavo Noronha (kov) 2009-06-16 01:05:16 UTC
Created attachment 136683 [details] [review]
first cut of the actual test

Here's the first cut for my WIP, for real, test case. Comments? =)
Comment 32 Adam Barth 2009-06-16 19:09:53 UTC
Hi.  I'm new to libsoup, but I do know a thing or two about content sniffing.  It looks like the current patch has the hooks for the algorithm but you haven't started implementing the algorithm yet.  Please let me know how I can be of assistance.
Comment 33 Gustavo Noronha (kov) 2009-06-16 20:28:16 UTC
(In reply to comment #32)
> Hi.  I'm new to libsoup, but I do know a thing or two about content sniffing. 
> It looks like the current patch has the hooks for the algorithm but you haven't
> started implementing the algorithm yet.  Please let me know how I can be of
> assistance.

Hey! That's right. I've been working with Dan on figuring out how to hook sniffing into the code and API. Right now I'm writing regression tests to make sure I don't break stuff going forward. Thanks for showing up!
Comment 34 Dan Winship 2009-06-17 00:19:02 UTC
(In reply to comment #32)
> It looks like the current patch has the hooks for the algorithm but you haven't
> started implementing the algorithm yet.  Please let me know how I can be of
> assistance.

Well, in terms of making sure that we implement the spec correctly, a test suite would be great (eg, a tar or zip file full of sample content with some sort of index/manifest indicating what each file should be sniffed as)...

(In reply to comment #31)
> Created an attachment (id=136683) [edit]
> first cut of the actual test
> 
> Here's the first cut for my WIP, for real, test case. Comments? =)

It's missing resources/mbox. (Any reason you're using an mbox file as the test case btw?)

The use of object data for the booleans but a global for the chunk_data seems a little inconsistent, but I bet you just copied that from some other test case. :-} (Likewise, it's odd that the should_* booleans are passed by &'ing them but the got_* booleans are passed by G_INT_TO_POINTER. [I'd use G_INT_TO_POINTER for all of them.]

Use g_idle_add, not g_timeout_add for the pausing; we don't need to actually *pause*, we just need to test returning-to-the-main-loop, and there's no reason to slow down "make check".

Do a memcmp on the response body in addition to checking its length.

I assume we'll test that the sniffer is actually getting the right answer later on?

Feel free to push this to a branch on git.gnome.org if you want.
Comment 35 Gustavo Noronha (kov) 2009-06-18 22:19:33 UTC
(In reply to comment #32)
> Hi.  I'm new to libsoup, but I do know a thing or two about content sniffing. 
> It looks like the current patch has the hooks for the algorithm but you haven't
> started implementing the algorithm yet.  Please let me know how I can be of
> assistance.
 
Adam, can you point us to the best document to follow while implementing this? I was looking at the published HTML5 spec, but it seems like there is a newer version that was to be proposed as a standalone spec?
Comment 36 Adam Barth 2009-06-18 23:17:02 UTC
> Adam, can you point us to the best document to follow while implementing this?
> I was looking at the published HTML5 spec, but it seems like there is a newer
> version that was to be proposed as a standalone spec?

Yeah, for various political reasons, the algorithm will be published in a standalone spec.  You can see the latest version here:

http://tools.ietf.org/html/draft-abarth-mime-sniff

I've been thinking about the best way to write a test suite.  Maybe something based on netcat would be easiest?  You want really fine control over the Content-Type header (e.g., more than PHP gives you).
Comment 37 Dan Winship 2009-06-19 00:57:22 UTC
For our own tests I think it would end up being easier for us to just use the test cases from your test suite, but use SoupServer to serve the data. (That way I don't have to add yet another package to the list of packages you need to install to be able to run all of "make check".)
Comment 38 Gustavo Noronha (kov) 2009-06-19 19:17:41 UTC
Yeah, the tests I have currently use SoupServer. You can check the work in progress here, by the way:

http://git.gnome.org/cgit/libsoup/log/?h=content-sniffing

You'll notice that the implementation is very simplistic. I am trying to make it as simple and clear as possible, for now. If any of you have comments let me know!
Comment 39 Gustavo Noronha (kov) 2009-06-21 02:29:14 UTC
Created attachment 137097 [details] [review]
patch to make WebKitGTK+ use the new API

I am now using WebKitGTK+ with this patch applied in order to dog-foog soup's content sniffing as I develop it. I'm posting it here in case anyone else wants to use it.
Comment 40 Gustavo Noronha (kov) 2009-06-21 03:05:46 UTC
Dan, while using the WebKitGTK+ patch I already noticed something: the spec has sentences such as 'sniffed type is text/plain'. I followed that to the letter a bit too much - I was droping whatever parameters were given.

Do you think we should make the signal receive a hash table with the parameters, in order to be consistent, or leave it to the API user to fetch the parameters from the message, and use them when setting the new type? Right now I have modified the code to return the whole header, but I am not sure if that is the right approach.
Comment 41 Adam Barth 2009-06-21 05:52:32 UTC
(In reply to comment #40)
> Dan, while using the WebKitGTK+ patch I already noticed something: the spec has
> sentences such as 'sniffed type is text/plain'. I followed that to the letter a
> bit too much - I was droping whatever parameters were given.

Hum...  I can understand that confusion.  How can I make the spec clearer about that?
Comment 42 Gustavo Noronha (kov) 2009-06-21 22:55:53 UTC
(In reply to comment #41)
> (In reply to comment #40)
> > Dan, while using the WebKitGTK+ patch I already noticed something: the spec has
> > sentences such as 'sniffed type is text/plain'. I followed that to the letter a
> > bit too much - I was droping whatever parameters were given.
> 
> Hum...  I can understand that confusion.  How can I make the spec clearer about
> that? 

I must confess I am still unsure about what is expected (I haven't read Chromium's code to see if it makes it clearer). I am under the impression that for text or binary, whenever you say 'sniffed type is text/plain', you want the original parameters to keep being part of the type, and that the same goes for whenever you say 'sniffed type is official type'. Or should the parameters be kept for whatever type?

If the first understanding is correct, a small paragraph at the beginning such as the following would have been enough for me:

"When the /sniffed type/ is text/plain, or /official type/, the original parameters sent by the server must be kept."
Comment 43 Adam Barth 2009-06-22 01:20:11 UTC
The intent is that the algorithm isn't telling you how to figure out the charset, etc.  It's not just text/plain.  For example, you should keep the parameters when you get text/html also.  I'll add something early on about this issue.  Thanks.
Comment 44 Dan Winship 2009-06-22 13:49:40 UTC
(In reply to comment #40)
> Dan, while using the WebKitGTK+ patch I already noticed something: the spec has
> sentences such as 'sniffed type is text/plain'. I followed that to the letter a
> bit too much - I was droping whatever parameters were given.
> 
> Do you think we should make the signal receive a hash table with the
> parameters, in order to be consistent, or leave it to the API user to fetch the
> parameters from the message, and use them when setting the new type? Right now
> I have modified the code to return the whole header, but I am not sure if that
> is the right approach.

Hm... there's no public method to parse the full Content-Type value into MIME type and params... so either you should add the params hash to the signal so people don't have to parse it themselves, or else we should fix up soup-message-headers.c:do_content_foo() a bit, rename it, and move it to soup-headers.h. I vote for adding the params to the signal.

Oh, this indirectly reminds me of another test case: the spec says that if there are multiple Content-Type headers, you only look at the last one. We should check that.
Comment 45 Gustavo Noronha (kov) 2009-06-22 17:05:50 UTC
(In reply to comment #44)
> Hm... there's no public method to parse the full Content-Type value into MIME
> type and params... so either you should add the params hash to the signal so
> people don't have to parse it themselves, or else we should fix up
> soup-message-headers.c:do_content_foo() a bit, rename it, and move it to
> soup-headers.h. I vote for adding the params to the signal.

Me too.

> Oh, this indirectly reminds me of another test case: the spec says that if
> there are multiple Content-Type headers, you only look at the last one. We
> should check that.

Last? I believe it's the first, though for some reason I failed to find the part of the spec regarding this quickly now =(.

Comment 46 Dan Winship 2009-06-22 17:21:44 UTC
> > Oh, this indirectly reminds me of another test case: the spec says that if
> > there are multiple Content-Type headers, you only look at the last one. We
> > should check that.
> 
> Last? I believe it's the first, though for some reason I failed to find the
> part of the spec regarding this quickly now =(.

draft-abarth-mime-sniff-01, section 2:

   For HTTP resources, only the last Content-Type HTTP header, if any,
   contributes any type information; the official type of the resource
   is then the value of that header, interpreted as described by the
   HTTP specifications.  If the Content-Type HTTP header is present but
   the value of the last such header cannot be interpreted as described
   by the HTTP specifications (e.g. because its value doesn't contain a
   U+002F SOLIDUS ('/') character), then the resource has no type
   information (even if there are multiple Content-Type HTTP headers and
   one of the other ones is syntactically correct).
Comment 47 Gustavo Noronha (kov) 2009-06-22 17:26:49 UTC
(In reply to comment #46)
>    For HTTP resources, only the last Content-Type HTTP header, if any,

Nice, I'll add that test! Thanks, Dan!
Comment 48 Adam Barth 2009-06-22 17:40:22 UTC
(In reply to comment #45)
> Last? I believe it's the first, though for some reason I failed to find the
> part of the spec regarding this quickly now =(.

The spec used to say first, but that turned out not to be as compatible as last.
Comment 49 Gustavo Noronha (kov) 2009-06-24 03:00:49 UTC
(In reply to comment #48)
> (In reply to comment #45)
> > Last? I believe it's the first, though for some reason I failed to find the
> > part of the spec regarding this quickly now =(.
> 
> The spec used to say first, but that turned out not to be as compatible as
> last.
> 

Aha =)

One more thing. In the feed or HTML spec, there's this substep:

      2.  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively,
            then increase pos by 1 and jump back to the step labeled
            loop start in the overall algorithm in this section

It seems to me like it should read 'then increase pos by 2 and jump back to the step labeled...'.
Comment 50 Adam Barth 2009-06-28 19:24:14 UTC
(In reply to comment #49)
> One more thing. In the feed or HTML spec, there's this substep:
> 
>       2.  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively,
>             then increase pos by 1 and jump back to the step labeled
>             loop start in the overall algorithm in this section
> 
> It seems to me like it should read 'then increase pos by 2 and jump back to the
> step labeled...'.

Yeah, that step look buggy.  I'll have to figure out what its supposed to be doing.
Comment 51 Gustavo Noronha (kov) 2009-07-02 14:06:27 UTC
This code is now merged into master. Thanks for all the help.