GNOME Bugzilla – Bug 661682
Add support for deflate Content-Encoding
Last modified: 2011-11-17 18:11:10 UTC
Websites like http://roland.antoniovillena.es/ are not working fine because we only claim to support gzip.
See bug 632372. I had started trying to implement it at one point, but it's not pleasant. Note that libsoup explicitly sends "Accept-Encoding: gzip", meaning any server that returns deflate-encoded data is broken. (Oh my god! There are broken web servers on the internet!)
Created attachment 198952 [details] [review] Patch
Comment on attachment 198952 [details] [review] Patch oops, replied before you attached this. anyway, this handles RAW, but not ZLIB, and if we start saying we support deflate in our Accept-Encoding, then we're going to receive both kinds, so this patch would cause some sites to stop working
Created attachment 199021 [details] [review] Patch (Adding the following JFTR) So the issue is that deflate Content-Encoding should mean data compressed with deflate algorithm and zlib headers (RFC 1950). The problem is that some servers decided to return the compressed data (as defined in RFC 1951) without any header if the clients claimed to support deflate. Taking that into account I added some fallback code to soup-message-io.c that tries to detect if the server is returning us raw data when we expect a zlib header. If so, we give the zlib decompressor a dummy zlib header (got it from the Mozilla network code) and try to resume the data decoding. Tried it with this test page http://carsten.codimi.de/gzip.yaws/ and it seems to work fine for gzip, zlib and raw compressed data.
Review of attachment 199021 [details] [review]: Patch works, but has some edge cases... ::: libsoup/soup-message-io.c @@ -360,0 +361,9 @@ + } else if (input_cur == 0 && + !dummy_header_used && + G_IS_ZLIB_DECOMPRESSOR (converter) && ... 6 more ... g_clear_error (error); // moved here from below... @@ -360,0 +361,17 @@ + } else if (input_cur == 0 && + !dummy_header_used && + G_IS_ZLIB_DECOMPRESSOR (converter) && ... 14 more ... Why not: static const char dummy_header[2] = { 0x78, 0x9c }; ? @@ -360,0 +361,26 @@ + } else if (input_cur == 0 && + !dummy_header_used && + G_IS_ZLIB_DECOMPRESSOR (converter) && ... 23 more ... Here you are overwriting non-null error/GError @@ -360,0 +361,29 @@ + } else if (input_cur == 0 && + !dummy_header_used && + G_IS_ZLIB_DECOMPRESSOR (converter) && ... 26 more ... drop the g_clear_error from here.
(In reply to comment #5) > Review of attachment 199021 [details] [review]: > > Patch works, but has some edge cases... Sorry Priit but it's not possible to understand your comments if you always refer to the same lines in my patch.
(In reply to comment #5) > Review of attachment 199021 [details] [review]: > > Patch works, but has some edge cases... Trying to figure out what your comments refer to... > g_clear_error (error); // moved here from below... You cannot do that. The error must be set if it was produced by for example a GZIP decompressor. In that case the fallback will not make any sense and the function will return NULL with the error set. > Why not: static const char dummy_header[2] = { 0x78, 0x9c }; ? I prefer to keep it with the original value it has in the Mozilla sources. It's kind of a way of keeping the authorship... > Here you are overwriting non-null error/GError True, will change that > drop the g_clear_error from here. See first comment
OK, it was the first time I tried the "Review" thingy here. If you click on the review link in comment #5 then you'll probably get better view. Yeah, the textual "comments" provided by the splinter tool are next to useless... If the Review page doesn't work, I'll just add my patch...
> + if (format == G_ZLIB_COMPRESSOR_FORMAT_ZLIB) { We need to clear error here in case the format is not headerless deflate. If a new error occurs we set the new error on-top of the old one. > + /* Some servers (especially Apache with mod_deflate) > + * return RAW compressed data without the zlib headers > + * when the client claims to support deflate. For > + * those cases use a dummy header (stolen from > + * Mozilla's nsHTTPCompressConv.cpp) and try to > + * continue uncompressing data. > + */ > + static char dummy_header[2] = { > + 0x8 + 0x7 * 0x10, > + (((0x8 + 0x7 * 0x10) * 0x100 + 30) / 31 * 31) & 0xFF, > + }; static const char dummy_header[2] = { 0x78, 0x9c}; I looked at these values and wondered why? Then I saw on the Carsten's page example streams and header was always 0x78 and 0x9c. > + > + g_converter_reset (converter); > + result = g_converter_convert (converter, > + dummy_header, sizeof(dummy_header), > + outbuf + outbuf_cur, outbuf_length - outbuf_cur, > + 0, &input_used, &outbuf_used, error); Here we overwrite previously set error (which is always non-NULL) > + dummy_header_used = TRUE; > + if (result == G_CONVERTER_CONVERTED) { > + g_clear_error (error); So clearing the error here makes no sense... > + continue; > + } > + } > + > + g_free (outbuf); > + return NULL; > +
needs additions to coding-test too. Just add deflate and raw-compressed versions of mbox (there's already a gzip version) to tests/resources, and then make sure that they come out right. "But how do I get deflate and raw encoded versions of that file?" I dunno. Write a little program using GZlibCompressor to compress them I guess.
Created attachment 199674 [details] [review] Added tests to coding-test.c Added tests for deflate'd content to coding-test.c. I have refactored the code so all the checks are now performed by two utility functions.
The above code does not include the mbox.zlib and mbox.raw files used by the test but I have both of them in my machine.
Comment on attachment 199021 [details] [review] Patch > > Here you are overwriting non-null error/GError > > True, will change that (You never updated this patch for that.) > > Why not: static const char dummy_header[2] = { 0x78, 0x9c }; ? > > I prefer to keep it with the original value it has in the Mozilla sources. It's > kind of a way of keeping the authorship... Nah, you already mention in the comment that it was stolen from there. I think the simpler version is better. Also, "header" in the context of soup-message-io suggests "HTTP header" to me; when I first saw the "dummy_header_used" variable, I thought you were going to be inserting a dummy "Content-Encoding" header into the message... How about "dummy_zlib_header" and "dummy_zlib_header_used"?
Comment on attachment 199674 [details] [review] Added tests to coding-test.c >Refactored the original code. A couple of utility functions perform now all >the checks. it would be nice if the refactoring was a separate patch (but I'll still take the patch even if you don't do that) The set of new tests should be: GET /mbox, Accept-Encoding: deflate GET /mbox, Accept-Encoding: deflate, plus trailing junk GET /mbox, Accept-Encoding: deflate, with server error (*) GET /mbox, Accept-Encoding: deflate, prefer-deflate-raw GET /mbox, Accept-Encoding: deflate, prefer-deflate-raw, plus trailing junk (*) GET /mbox, Accept-Encoding: deflate, prefer-deflate-raw, with server error (*) This patch doesn't implement the (*)ed ones. ("with server error" is the case where the server adds the Content-Encoding header but then sends the unencoded data, and we need to make sure we recognize that and fall back to not decoding.) You don't need the "GET /mbox.zlib" and "GET /mbox.raw" tests, because deflate and raw-zlib aren't file types the way gzip is, so there'd never be a resource where the deflate/raw encoding was part of the Content-Type rather than part of the Content-Encoding.
(In reply to comment #13) > (From update of attachment 199021 [details] [review]) > > > Here you are overwriting non-null error/GError > > > > True, will change that > > (You never updated this patch for that.) Yeah forgot to do that. > > > Why not: static const char dummy_header[2] = { 0x78, 0x9c }; ? > > > > I prefer to keep it with the original value it has in the Mozilla sources. It's > > kind of a way of keeping the authorship... > > Nah, you already mention in the comment that it was stolen from there. I think > the simpler version is better. Fair enough. > Also, "header" in the context of soup-message-io suggests "HTTP header" to me; > when I first saw the "dummy_header_used" variable, I thought you were going to > be inserting a dummy "Content-Encoding" header into the message... How about > "dummy_zlib_header" and "dummy_zlib_header_used"? Makes perfect sense.
(In reply to comment #14) > The set of new tests should be: [...] > GET /mbox, Accept-Encoding: deflate, prefer-deflate-raw > GET /mbox, Accept-Encoding: deflate, prefer-deflate-raw, plus trailing junk > (*) I'm not sure about this one. I don't think zlib's inflate would be able to successfully deal with that knowing that the raw data does not have any header nor crc check. At least in the tests I run I was always getting data errors when trying to uncompress raw data+junk.
Created attachment 201473 [details] [review] Add support for deflate
Created attachment 201474 [details] [review] Added tests to coding-test.c Note: this does not include mbox.raw and mbox.zlib files to be uploaded to the repos.
Comment on attachment 201474 [details] [review] Added tests to coding-test.c make sure to add the data files before committing
Committed: 6d8ff9731eae9bf57982de273c8a45673996bcb4: actual patch e44809a82be8dcaa97ffdc9788e00e08f874dce9: mbox compressed files dbcd20621c274712460401af58f8aede0027d804: coding-test.c