GNOME Bugzilla – Bug 665884
SoupCache: do not store headers on cache index
Last modified: 2018-09-24 16:54:31 UTC
Currently the message headers are stored in the cache index file. The reason is easy: the code is simpler. There are some drawbacks though:

* memory usage: all the resource headers are loaded in memory _always_, even if they are never used
* cache load/dump times: we need to serialize/deserialize all the message headers to store the cache index file _always_
Created attachment 203165 [details] [review] Patch
Comment on attachment 203165 [details] [review] Patch

Cool. It basically looks good. I'm assuming you've tested. One thought:

>+	/* If the size of the new headers is the same we can write to
>+	 * the original cached file in place. If the new headers size
>+	 * is less than the original one we can also directly write to
>+	 * the original file (leaking the difference on disk).
>	 */

Maybe you should add some padding when you initially write the file, so that, eg, if the page comes with a slightly-larger cookie the next time, you don't need to rewrite the whole thing?
Created attachment 203328 [details] [review] New patch

This new version uses GVariant's "maybe types" to implement a 64-bit padding that is appended to the headers data when they are first stored on disk, to prevent future full rewrites of the cached resources. It also handles the case of SoupCache not correctly generating a conditional request (because the headers cannot be read, for example).
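For illustration, a minimal sketch of the maybe-type trick (the helper name here is hypothetical, not the patch's): packing Just(0) into an "mx" slot costs 8 extra serialized bytes, which a later rewrite can reclaim by packing Nothing if the revalidated headers grew slightly:

	#include <libsoup/soup.h>

	static GVariant *
	pack_headers_with_padding (SoupMessageHeaders *headers, gboolean add_padding)
	{
		GVariantBuilder builder;
		SoupMessageHeadersIter iter;
		const char *name, *value;

		g_variant_builder_init (&builder, G_VARIANT_TYPE ("a(ss)"));
		soup_message_headers_iter_init (&iter, headers);
		while (soup_message_headers_iter_next (&iter, &name, &value))
			g_variant_builder_add (&builder, "(ss)", name, value);

		/* "mx" = maybe-int64: Just(0) on the first write adds 8 bytes
		 * of slack; Nothing on a rewrite frees them for the headers. */
		return g_variant_new ("(a(ss)mx)", &builder, add_padding, (gint64) 0);
	}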
Actually the patch from comment #3 is wrong as it always replaces the whole file instead of overwriting it. I have a new patch working fine, but will wait for Dan's new streaming API to upload it.
(In reply to comment #4)
> Actually the patch from comment #3 is wrong as it always replaces the whole
> file instead of overwriting it. I have a new patch working fine, but will wait
> for Dan's new streaming API to upload it.

done
Created attachment 212765 [details] [review] SoupCache: do not store message headers on cache index

SoupMessageHeaders should not be stored in the cache index file because we do not want all of them permanently in memory. Instead, they're now stored in the resource cached files.

http://bugzilla.gnome.org/show_bug.cgi?id=665884
Created attachment 212768 [details] [review] Forgot to add another fix
Comment on attachment 212768 [details] [review] Forgot to add another fix

>+/*
>+ * Padding to add when writing headers to minimize rewrites
>+ */
>+#define HEADERS_PADDING 64

This doesn't get used

>+	/* Optionally add a padding. We use a 64-bit padding after the
>+	 * headers to try to prevent complete rewrites of cached
>+	 * resources when the response headers slightly change after a
>+	 * resource revalidation.
>+	 */

8 bytes is not really a lot of leeway. 64 sounded a lot better. You don't need to include the padding as part of the variant, if that would help...
Created attachment 220522 [details] [review] Fixed size input stream

New fixed size input stream that reads data from a base stream up to a certain number of bytes.
Created attachment 220523 [details] [review] Do not store headers on index

Do not store headers on index but attached to the resource data.
So, in order to simplify the original patch a little bit and to avoid races, I decided to append the headers to the resource data instead of prepending them. This way, whenever we have to update the headers (after a revalidation) we just need to modify the tail of the file, and we do not have to deal with paddings or complete file rewrites.

This also means that when reading a resource from disk, we cannot just read the whole file until EOF; instead, we have to stop reading where the headers data begins. We achieve that by using a new type of GFilterInputStream (attached in a different patch). I had to code it for the new block-size storage (https://bugzilla.gnome.org/show_bug.cgi?id=666143) and I found it useful for this case too.
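The stream itself is in attachment 220522; as a rough standalone sketch of the core read logic (hypothetical helper, where `pos` tracks how much body has been consumed and `body_len` excludes the trailing headers):

	#include <gio/gio.h>

	/* Clamp reads so the trailing headers are never returned as body data. */
	static gssize
	fixed_size_read (GInputStream *base, goffset *pos, goffset body_len,
			 void *buffer, gsize count,
			 GCancellable *cancellable, GError **error)
	{
		gssize nread;

		if (*pos + (goffset) count > body_len)
			count = body_len - *pos;	/* stop at the headers boundary */
		if (count == 0)
			return 0;			/* logical EOF */

		nread = g_input_stream_read (base, buffer, count, cancellable, error);
		if (nread > 0)
			*pos += nread;
		return nread;
	}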
Comment on attachment 220522 [details] [review] Fixed size input stream

Hm... this is *very* similar to SoupBodyInputStream with encoding=SOUP_ENCODING_CONTENT_LENGTH. Can we get rid of the duplication?

From a quick look, it seems like the only real differences are that

1) SoupFixedSizeInputStream implements GSeekable and requires its base-stream to be a GSeekable
2) SoupBodyInputStream requires its base-stream to be a SoupFilterInputStream

But we should be able to merge those together; move your GSeekable implementation into SoupBodyInputStream, but have can_seek() be

	return G_IS_SEEKABLE (base_stream) &&
		g_seekable_can_seek (G_SEEKABLE (base_stream));

And then change soup_body_input_stream_new() to take just a GInputStream, and only require it to be a SoupFilterInputStream if encoding is SOUP_ENCODING_CHUNKED (which is the only time we use SoupFilterInputStream methods).

>+static void soup_fixed_size_input_stream_set_property (GObject *object,
>+							guint prop_id,
>+							const GValue *value,
>+							GParamSpec *pspec);

Note that I did a bunch of cleanups recently (f465016..d9db072) and in particular, reorganized all the existing files to put class_init and iface_init functions *after* the method implementations so that we don't need forward declarations everywhere. (http://git.gnome.org/browse/libsoup/commit/?id=7b91607)

>+soup_fixed_size_input_stream_skip (GInputStream *stream,

This implementation will only work with SOUP_ENCODING_CONTENT_LENGTH. Making it work with SOUP_ENCODING_EOF too is easy; for CHUNKED, you'd have to just fall back to calling the parent class implementation though.

>+	if (priv->pos + count > priv->len) {
>+		g_set_error_literal (error,
>+				     G_IO_ERROR,
>+				     G_IO_ERROR_INVALID_ARGUMENT,
>+				     _("Invalid skip request"));

g_input_stream_skip() is defined to have the same effect as doing a g_input_stream_read() and throwing away the result. So if you can't skip @count bytes, just skip as many as you can.

>+	skipped = g_input_stream_skip (base_stream, count, cancellable, error);
>+
>+	if (skipped != -1)
>+		priv->pos += count;

should be += skipped, not count.

>+static goffset
>+soup_fixed_size_input_stream_tell (GSeekable *seekable)
>+{
>+	GInputStream *base_stream = G_FILTER_INPUT_STREAM (seekable)->base_stream;
>+
>+	return g_seekable_tell (G_SEEKABLE (base_stream));
>+}

You can just return priv->pos here, right?

>+	default:
>+		g_set_error_literal (error,
>+				     G_IO_ERROR,
>+				     G_IO_ERROR_INVALID_ARGUMENT,
>+				     _("Invalid GSeekType supplied"));

That case is a programmer error. do g_return_val_if_reached (FALSE).

(Ah, I see you copied this from gio... well, it's wrong there.)
Comment on attachment 220523 [details] [review] Do not store headers on index

>+	/* FIXME: we should fix content sniffing because it is not
>+	 * really correct as we are directly modifying the headers
>+	 * that will be stored in the cache.

Another possibility here is to leave the headers unchanged, but store the sniffed content-type as a field in the cache index. (Though, as you note, this will be fixed by the stream filter rewrites later, so maybe it's not worth it.)

>+	headers_variant = pack_headers (headers);
>+	headers_buffer = soup_buffer_new (SOUP_MEMORY_STATIC,
>+					  g_variant_get_data (headers_variant),
>+					  g_variant_get_size (headers_variant));

This leaks headers_variant doesn't it? You want to use soup_buffer_new_with_owner().

>+	bytes_written = g_output_stream_write_finish (os, res, &error);
>+	entry->dirty = FALSE;
>+	entry->being_validated = FALSE;
>+
>+	if (!bytes_written || error) {

This assumes that if bytes_written is non-0 then it's entry->length; but you might have to do multiple writes. Also, g_output_stream_write() will never return 0; you want "if (bytes_written == -1)" there.

Also, you need to close the iostream after writing.
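For reference, the multi-write pattern being asked for could look roughly like this (WriteFixture and its fields are hypothetical, not the patch's actual names):

	#include <gio/gio.h>

	typedef struct {
		const char *data;	/* serialized headers */
		gsize length;
		gsize written;
	} WriteFixture;

	static void
	write_cb (GObject *source, GAsyncResult *res, gpointer user_data)
	{
		GOutputStream *os = G_OUTPUT_STREAM (source);
		WriteFixture *fixture = user_data;
		GError *error = NULL;
		gssize bytes_written = g_output_stream_write_finish (os, res, &error);

		if (bytes_written == -1) {
			g_error_free (error);	/* mark the entry as not cached, etc. */
			return;
		}

		fixture->written += bytes_written;
		if (fixture->written < fixture->length) {
			/* Short write: continue from where we left off. */
			g_output_stream_write_async (os,
						     fixture->data + fixture->written,
						     fixture->length - fixture->written,
						     G_PRIORITY_DEFAULT, NULL,
						     write_cb, fixture);
			return;
		}

		/* All data written; don't forget to close the stream. */
		g_output_stream_close_async (os, G_PRIORITY_DEFAULT, NULL, NULL, NULL);
	}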
(In reply to comment #12)
> (From update of attachment 220522 [details] [review])
> Hm... this is *very* similar to SoupBodyInputStream with
> encoding=SOUP_ENCODING_CONTENT_LENGTH. Can we get rid of the duplication?

Yeah sure I can try. I think I implemented it last year and just decided to use it for this case. Thx for the quick review BTW.

> >+soup_fixed_size_input_stream_skip (GInputStream *stream,
>
> This implementation will only work with SOUP_ENCODING_CONTENT_LENGTH. Making it
> work with SOUP_ENCODING_EOF too is easy; for CHUNKED, you'd have to just fall
> back to calling the parent class implementation though.

Right.

> g_input_stream_skip() is defined to have the same effect as doing a
> g_input_stream_read() and throwing away the result. So if you can't skip @count
> bytes, just skip as many as you can.

Ah OK, I was not sure about that.

> >+	skipped = g_input_stream_skip (base_stream, count, cancellable, error);
> >+
> >+	if (skipped != -1)
> >+		priv->pos += count;
>
> should be += skipped, not count.

Absolutely

> >+static goffset
> >+soup_fixed_size_input_stream_tell (GSeekable *seekable)
> >+{
> >+	GInputStream *base_stream = G_FILTER_INPUT_STREAM (seekable)->base_stream;
> >+
> >+	return g_seekable_tell (G_SEEKABLE (base_stream));
> >+}
>
> You can just return priv->pos here, right?

Yep, makes no sense not to do it

> >+	default:
> >+		g_set_error_literal (error,
> >+				     G_IO_ERROR,
> >+				     G_IO_ERROR_INVALID_ARGUMENT,
> >+				     _("Invalid GSeekType supplied"));
>
> That case is a programmer error. do g_return_val_if_reached (FALSE).
>
> (Ah, I see you copied this from gio... well, it's wrong there.)

Maybe I don't remember the details :)
(In reply to comment #13)
> >+	headers_variant = pack_headers (headers);
> >+	headers_buffer = soup_buffer_new (SOUP_MEMORY_STATIC,
> >+					  g_variant_get_data (headers_variant),
> >+					  g_variant_get_size (headers_variant));
>
> This leaks headers_variant doesn't it? You want to use
> soup_buffer_new_with_owner().

Initially the headers_variant was passed as part of the writing fixture and freed in its free method; that's why STATIC was used here. Will change it.

> >+	bytes_written = g_output_stream_write_finish (os, res, &error);
> >+	entry->dirty = FALSE;
> >+	entry->being_validated = FALSE;
> >+
> >+	if (!bytes_written || error) {
>
> This assumes that if bytes_written is non-0 then it's entry->length; but you
> might have to do multiple writes.

Good point, will change it; don't know why I wrongly assumed that the write was do or die.
(In reply to comment #12)
> (From update of attachment 220522 [details] [review])
> Hm... this is *very* similar to SoupBodyInputStream with
> encoding=SOUP_ENCODING_CONTENT_LENGTH. Can we get rid of the duplication?
>
> From a quick look, it seems like the only real differences are that
>
> 1) SoupFixedSizeInputStream implements GSeekable and requires its
> base-stream to be a GSeekable
> 2) SoupBodyInputStream requires its base-stream to be a SoupFilterInputStream

I haven't taken a detailed look at this, but the body stream is pollable and the GLocalFileInputStream is not; wouldn't it require a lot of changes?
oh right, I always forget files aren't pollable... You'll have to override can_poll() to check the pollability of the base stream, basically just like can_seek()
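Presumably something along these lines inside soup-body-input-stream.c (a sketch mirroring the can_seek() suggestion from comment #12):

	static gboolean
	soup_body_input_stream_can_poll (GPollableInputStream *stream)
	{
		GInputStream *base_stream = SOUP_BODY_INPUT_STREAM (stream)->priv->base_stream;

		return G_IS_POLLABLE_INPUT_STREAM (base_stream) &&
			g_pollable_input_stream_can_poll (G_POLLABLE_INPUT_STREAM (base_stream));
	}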
Created attachment 220719 [details] [review] Make SoupBodyInputStream seekable

With this refactoring we do not need the proposed SoupFixedSizeInputStream anymore
Created attachment 220720 [details] [review] Do not store headers on index (new version)

New version using the SoupBodyInputStream
Comment on attachment 220720 [details] [review] Do not store headers on index (new version)

Removing this from the review queue as it needs a couple of changes.
Comment on attachment 220719 [details] [review] Make SoupBodyInputStream seekable

>+	switch (priv->encoding) {
>+	gsize safe_count;

ew. Is it really valid c90 to put that there? Anyway, you don't need safe_count; just re-use count.

>+static gboolean
>+soup_body_input_stream_can_seek (GSeekable *seekable)
>+{
>+	GInputStream *base_stream = SOUP_BODY_INPUT_STREAM (seekable)->priv->base_stream;
>+
>+	return G_IS_SEEKABLE (base_stream) && g_seekable_can_seek (G_SEEKABLE (base_stream));
>+}

&& priv->encoding != SOUP_ENCODING_CHUNKED

Ah... I see, you did a check for CHUNKED in new()... but still, we might make SoupFilterInputStream seekable in the future, so you should explicitly check priv->encoding here.

>+		end_position = priv->pos + priv->read_length;
>+	case G_SEEK_END:
>+		position = end_position + offset;

That won't work for SOUP_ENCODING_EOF. I guess the easiest thing here is probably to make that non-seekable too. (And "non-seekable" implies "non-tell()able" too, so you can simplify read and skip to only keep track of pos for CONTENT_LENGTH.)

>+	if (!g_seekable_seek (G_SEEKABLE (priv->base_stream), offset, type, cancellable, error))
>+		return FALSE;
>+
>+	priv->pos = g_seekable_tell (G_SEEKABLE (priv->base_stream));

This assumes that the base stream has the same origin and length as the SoupFilterInputStream does, but that's wrong. (Eg, if the base stream is a cache file, it will have header info at the end, and so if the caller seeks to type=G_SEEK_END, offset=-10 of the filter stream, you can't just pass that type and offset to the base_stream, because you'd end up near the end of the headers rather than near the end of the data.)

This should work:

	if (!g_seekable_seek (G_SEEKABLE (priv->base_stream),
			      position - priv->pos, G_SEEK_CUR,
			      cancellable, error))
		return FALSE;
	priv->pos = position;
(In reply to comment #21)
> This assumes that the base stream has the same origin and length as the
> SoupFilterInputStream does, but that's wrong. (Eg, if the base stream is a
> cache file, it will have header info at the end, and so if the caller seeks to
> type=G_SEEK_END, offset=-10 of the filter stream, you can't just pass that type
> and offset to the base_stream, because you'd end up near the end of the headers
> rather than near the end of the data.)
>
> This should work:
>
>	if (!g_seekable_seek (G_SEEKABLE (priv->base_stream),
>			      position - priv->pos, G_SEEK_CUR,
>			      cancellable, error))
>		return FALSE;
>	priv->pos = position;

I am not so sure about this. In the case you mentioned (the cache file), the one creating the SoupBodyInputStream will pass a GInputStream for the local file and just the length of the cached data (without including the headers). So the seek will return the proper position. Maybe I'm forgetting about some other special case...
Created attachment 221044 [details] [review] Store sniffed Content-Type in cache index

I finally decided to add the sniffed Content-Type to the cache index. We'll remove it in the future once the cache becomes part of the stream stack. Original headers will not be modified, but the sniffed content type will still be accessible through soup_request_get_content_type().
Created attachment 221047 [details] [review] Do not store headers in cache index

This new version uses a SoupBodyInputStream to read data instead of the old SoupFixedSizeInputStream.
Say the cache file has 8000 bytes of data followed by 500 bytes of headers. The cache does:

	stream = soup_body_input_stream_new (file_stream, SOUP_ENCODING_CONTENT_LENGTH, 8000);

Now, later, I take that stream and call

	g_seekable_seek (G_SEEKABLE (body_input_stream), -100, G_SEEK_END, ...);

The semantics of that ought to be "seek to byte 7900", right? But the current code in your patch just passes the "offset" and "type" args directly to the base_stream seek implementation, so the file stream will seek to 100 bytes before *its* end, which is byte 8400, which isn't even supposed to be visible in the SoupBodyInputStream.

Similarly, though not relevant in this particular case, if you had a SoupBodyInputStream that was created to represent just bytes 587-2349 of its underlying stream (ie, you'd already read the first 587 bytes of that stream before creating the body stream; maybe because the underlying stream represents a multipart response, and you want to parse each part separately), then G_SEEK_SET would mess up too, because if the user seeked to byte 0 of the SoupBodyInputStream, you'd want to seek to byte 587 of the underlying stream, not byte 0.

But a G_SEEK_CUR will always do the right thing (assuming you made sure that the end position is in-bounds). So to make it work in all cases, just always translate to G_SEEK_CUR.
(In reply to comment #25)
> The semantics of that ought to be "seek to byte 7900", right? But the current
> code in your patch just passes the "offset" and "type" args directly to the
> base_stream seek implementation, so the file stream will seek to 100 bytes
> before *its* end, which is byte 8400, which isn't even supposed to be visible
> in the SoupBodyInputStream.

Beh, true; I was just ignoring that I was passing the type directly to the inner stream. Yeah, you're totally right, will rework it.
Created attachment 221104 [details] [review] Make SoupBodyInputStream seekable

This new approach only allows content-length encoded streams to be seekable.
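A standalone sketch of what the seek logic amounts to under those constraints (`pos`/`length` stand in for the patch's private fields; positions are bounds-checked, and the base stream is always seeked relatively so its own origin and end never leak through):

	#include <gio/gio.h>

	static gboolean
	body_seek (GSeekable *base, goffset *pos, goffset length,
		   goffset offset, GSeekType type,
		   GCancellable *cancellable, GError **error)
	{
		goffset position;

		switch (type) {
		case G_SEEK_CUR:
			position = *pos + offset;
			break;
		case G_SEEK_SET:
			position = offset;
			break;
		case G_SEEK_END:
			position = length + offset;
			break;
		default:
			g_return_val_if_reached (FALSE);
		}

		if (position < 0 || position > length) {
			g_set_error_literal (error, G_IO_ERROR,
					     G_IO_ERROR_INVALID_ARGUMENT,
					     "Invalid seek request");
			return FALSE;
		}

		/* Translate to a relative seek: absolute offsets on the base
		 * stream would land in the wrong place whenever it has extra
		 * data (e.g. headers after the body) or a nonzero origin. */
		if (!g_seekable_seek (base, position - *pos, G_SEEK_CUR,
				      cancellable, error))
			return FALSE;

		*pos = position;
		return TRUE;
	}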
Comment on attachment 221044 [details] [review] Store sniffed Content-Type in cache index

looks good, assuming it's not obsoleted by 682112
Comment on attachment 221104 [details] [review] Make SoupBodyInputStream seekable

great
Comment on attachment 221047 [details] [review] Do not store headers in cache index

looks right... we really need test cases for the cache though
Review of attachment 221104 [details] [review]:

Landed in master: f723ab8
this was accidentally left open...
It was not accidentally left open, I haven't committed changes yet
hm, i somehow read "accepted-commit_now" as "committed"
(In reply to comment #33)
> It was not accidentally left open, I haven't committed changes yet

any reason? Are there other cache file format changes pending and we want to make all the changes at once?
(In reply to comment #35)
> (In reply to comment #33)
> > It was not accidentally left open, I haven't committed changes yet
>
> any reason? Are there other cache file format changes pending and we want to
> make all the changes at once?

I don't remember the exact details, but I think it was a mixture of not being happy with the implementation and some bugs rewriting the headers, IIRC.
Comment on attachment 221044 [details] [review] Store sniffed Content-Type in cache index

un-acn-ing per last comment
Created attachment 237971 [details] [review] Store headers in a separate DB

The old patch had many issues because the headers were stored in the same file as the resource data. The resource data is immutable but the headers can change, so that was leading to full file rewrites, which brought new issues. I decided then to try storing the headers in a separate location (losing the locality advantages) and to use an sqlite3 database (as we already have a dependency on it).

This approach has a performance issue because each INSERT involves a transaction, something that is slow and blocking. That's why I decided to add some intermediate storage (a couple of hash tables) that collects modifications to the database and dumps them in a single transaction once we have enough; see the sketch below.

If we go with this idea, we might even move the headers retrieval/storage logic to a different object (implementing some SoupCacheHeadersStorage interface, for example).
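Roughly, the batched flush amounts to something like this (table and column names are made up for illustration, not taken from the patch):

	#include <sqlite3.h>
	#include <glib.h>

	/* Flush all queued header writes inside a single transaction, so the
	 * slow, blocking commit is paid once per batch instead of per INSERT. */
	static void
	flush_pending_writes (sqlite3 *db, GHashTable *pending)
	{
		GHashTableIter iter;
		gpointer key, value;
		sqlite3_stmt *stmt;

		if (sqlite3_prepare_v2 (db,
					"INSERT OR REPLACE INTO headers (key, data) VALUES (?, ?);",
					-1, &stmt, NULL) != SQLITE_OK)
			return;

		sqlite3_exec (db, "BEGIN TRANSACTION;", NULL, NULL, NULL);
		g_hash_table_iter_init (&iter, pending);
		while (g_hash_table_iter_next (&iter, &key, &value)) {
			GVariant *headers = value;

			sqlite3_bind_int (stmt, 1, GPOINTER_TO_INT (key));
			sqlite3_bind_blob (stmt, 2, g_variant_get_data (headers),
					   (int) g_variant_get_size (headers), SQLITE_STATIC);
			sqlite3_step (stmt);
			sqlite3_reset (stmt);
		}
		sqlite3_exec (db, "COMMIT;", NULL, NULL, NULL);

		sqlite3_finalize (stmt);
		g_hash_table_remove_all (pending);
	}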
Created attachment 238393 [details] [review] Refactored headers access code in SoupCacheHeadersStorage class

I decided to split the changes into two patches. This is the first one, which basically adds a new interface called SoupCacheHeadersStorage and a default implementation using sqlite3. This new interface defines the methods required to access and manipulate headers information (read/write/delete/update). The SoupCache will use that interface instead of having to implement all the headers persistence logic. This allows us to remove a lot of code that is not really part of the core of an HTTP cache.
Created attachment 238394 [details] [review] Do not store headers in cache index

This second patch just contains the required changes to remove the headers from the cache index.
Review of attachment 238393 [details] [review]:

So... I'm not sure the sqlite db ends up being a win overall (as opposed to just storing the body and headers in separate files in the cache dir, for instance). You end up needing a lot of code to work around the problems it creates...

Although, an alternate possibility to deal with the slow/blocking thing would be to do the SQL stuff in a different thread, and write out all changes right away. (You can use a GAsyncQueue to pass data from the main thread to the sqlite thread.) Then you wouldn't need the hash tables and stuff.

::: libsoup/soup-cache-headers-storage-db.c
@@ +165,3 @@
+	filename = g_build_filename (cache_dir, "headers.db", NULL);
+	EXEC_SQL_ON_ERROR_RETURN_WITH_CODE (sqlite3_open (filename, &priv->headers_db), g_free (filename), result);
+	g_free (cache_dir);

leaks cache_dir in the error case; you can just move the g_free(cache_dir) to before the sql

@@ +193,3 @@
+
+	if (!headers_storage->priv->headers_db)
+		if (!init_headers_db (headers_storage))

{}s around the inner if statement

@@ +245,3 @@
+
+	if (!priv->headers_db)
+		if (!init_headers_db (SOUP_CACHE_HEADERS_STORAGE_DB (headers_storage)))

{}s around the inner if statement

@@ +259,3 @@
+	headers_variant = (GVariant *) value;
+
+	if ((SQLITE_OK == sqlite3_bind_int (prepared_stmt, 1, entry_key)) &&

if ((sqlite3_bind_int (prepared_stmt, 1, entry_key) == SQLITE_OK) &&

(likewise throughout)

@@ +273,3 @@
+	}
+	sqlite3_finalize (prepared_stmt);
+	g_hash_table_remove_all (priv->pending_header_writes);

as a minor optimization, you can instead do "g_hash_table_iter_remove (&iter)" at the end of each loop iteration

@@ +312,3 @@
+	/* Trigger the flush if there is no much activity going on (no
+	 * pending caching operations) and there is a reasonable
+	 * amount of pending operations (100 for example).

100 feels like a lot... how long does it take to reach that point in an ordinary browsing session?

(And this would probably need to be configurable. Eg, I eventually want to make the GProxyResolver PAC-running code use a cache, but that would only ever be downloading a single file, which would almost never change. So we'd want to flush changes right away in that case.)

@@ +351,3 @@
+	   in a hash table indexed by the entry key (only the last
+	   version will be inserted in the DB), and flush them once we
+	   have enought data. */

/* comment
 * should be
 * like this */

also, typo at the end ("enought")

::: libsoup/soup-cache-private.h
@@ +29,3 @@
 G_BEGIN_DECLS

+typedef struct _SoupCacheEntry {

you need to remove this from soup-cache.c too
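For illustration, the threaded alternative could look roughly like this (WriteRequest and the file name are hypothetical); the main thread pushes requests with g_async_queue_push() and never blocks on a transaction:

	#include <glib.h>
	#include <sqlite3.h>

	typedef struct {
		int entry_key;
		GVariant *headers;	/* a NULL-headers request is the quit sentinel */
	} WriteRequest;

	/* Runs on a dedicated thread that owns the sqlite handle, writing out
	 * each change as soon as it arrives on the queue. */
	static gpointer
	sql_writer_thread (gpointer data)
	{
		GAsyncQueue *queue = data;
		WriteRequest *req;
		sqlite3 *db;

		if (sqlite3_open ("headers.db", &db) != SQLITE_OK)
			return NULL;

		while ((req = g_async_queue_pop (queue)) && req->headers) {
			/* ... INSERT req->headers for req->entry_key here ... */
			g_variant_unref (req->headers);
			g_free (req);
		}
		g_free (req);	/* the sentinel */

		sqlite3_close (db);
		return NULL;
	}

	/* Started with: g_thread_new ("sqlite-writer", sql_writer_thread, queue); */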
Review of attachment 238394 [details] [review]:

::: libsoup/soup-cache.c
@@ +30,3 @@
 #endif

+#include <sqlite3.h>

you don't need this here any more, right?

@@ +35,2 @@
 #include "soup-cache.h"
+#include "soup-cache-private.h"

you're already including soup-cache-private a few lines down
(In reply to comment #41)
> Review of attachment 238393 [details] [review]:
>
> So... I'm not sure the sqlite db ends up being a win overall (as opposed to
> just storing the body and headers in separate files in the cache dir, for
> instance). You end up needing a lot of code to work around the problems it
> creates...

I am not 100% convinced either; we'll probably need some benchmark in order to know how it behaves. Anyway, that's the main reason why I decided to create the interface: we could think about different implementations to fix the issue.

> Although, an alternate possibility to deal with the slow/blocking thing would
> be to do the SQL stuff in a different thread, and write out all changes right
> away. (You can use a GAsyncQueue to pass data from the main thread to the
> sqlite thread.) Then you wouldn't need the hash tables and stuff.

Looks like a good idea, but I am not sure that it will be a huge win over batching operations, because the SELECTs will still have to wait for all the INSERTs to happen, since they lock the tables in the database.

Another possibility looks to be what sqlite3 calls temporary tables. Transactions on them are not automatically flushed to disk, so writes seem to be faster. The temporary table would then have to be flushed to the main table before sqlite3_close().

> @@ +273,3 @@
> +	}
> +	sqlite3_finalize (prepared_stmt);
> +	g_hash_table_remove_all (priv->pending_header_writes);
>
> as a minor optimization, you can instead do "g_hash_table_iter_remove (&iter)"
> at the end of each loop iteration

I thought that could invalidate the iter; I'll change it.

> @@ +312,3 @@
> +	/* Trigger the flush if there is no much activity going on (no
> +	 * pending caching operations) and there is a reasonable
> +	 * amount of pending operations (100 for example).
>
> 100 feels like a lot... how long does it take to reach that point in an
> ordinary browsing session?

Actually it is not a lot; after 4-5 websites you get a flush. The rationale behind a figure like 100 is that, for a cache, it is better to lose some cached resources (in the event of a crash, for example) than to have to wait too long for disk writes.

> (And this would probably need to be configurable. Eg, I eventually want to make
> the GProxyResolver PAC-running code use a cache, but that would only ever be
> downloading a single file, which would almost never change. So we'd want to
> flush changes right away in that case.)

That's a good point; maybe we could remove that condition and check only for "is not busy".
(In reply to comment #43)
> (In reply to comment #41)
> > Review of attachment 238393 [details] [review] [details]:
> >
> > So... I'm not sure the sqlite db ends up being a win overall (as opposed to
> > just storing the body and headers in separate files in the cache dir, for
> > instance). You end up needing a lot of code to work around the problems it
> > creates...
>
> I am not 100% convinced either; we'll probably need some benchmark in order to
> know how it behaves. Anyway, that's the main reason why I decided to create
> the interface: we could think about different implementations to fix the
> issue.
>
> > Although, an alternate possibility to deal with the slow/blocking thing would
> > be to do the SQL stuff in a different thread, and write out all changes right
> > away. (You can use a GAsyncQueue to pass data from the main thread to the
> > sqlite thread.) Then you wouldn't need the hash tables and stuff.
>
> Looks like a good idea, but I am not sure that it will be a huge win over
> batching operations, because the SELECTs will still have to wait for all the
> INSERTs to happen, since they lock the tables in the database.
>
> Another possibility looks to be what sqlite3 calls temporary tables.
> Transactions on them are not automatically flushed to disk, so writes seem to
> be faster. The temporary table would then have to be flushed to the main table
> before sqlite3_close().

Or we could also use the block files infrastructure, introduced in bug 666143, and use the block files to store headers information. Chromium does something like this (https://sites.google.com/a/chromium.org/dev/developers/design-documents/network-stack/disk-cache) and it looks like it works for them. I'll try something like that.
*** Bug 664891 has been marked as a duplicate of this bug. ***
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/libsoup/issues/40.