GNOME Bugzilla – Bug 649963
SoupCache: use hashes instead of strings for cache entry keys
Last modified: 2011-07-06 10:00:47 UTC
Basically for two reasons: * storage size: hashes normally have a more compact representation * performance: is faster to compare unsigned integers than strings
Created attachment 187626 [details] [review] Patch This patch replaces the string used to store the entry key by a hash stored in an unsigned int. It also removes the filename from the cache entries as we do not need to store it (and keep it in memory) because it's a computed value. This patch changes the GVariant format. Before committing this we have to think about a mechanism to easily change the format used to serialize/deserialize data. Perhaps we might store a file with the format used by the cache. If the format is different to the one hardcoded in the library then we'd just remove the old files of the cache.
Adding dependency
Comment on attachment 187626 [details] [review] Patch >- md5 = g_compute_checksum_for_string (G_CHECKSUM_MD5, entry->key, -1); >- entry->filename = g_build_filename (cache->priv->cache_dir, md5, NULL); >- g_free (md5); OK, this worried me at first, since you're vastly decreasing the amount of entropy in the filename, and thus increasing the chance of a cache collision. But then I saw that Firefox also uses 32-bit cache filenames too, so I guess it's ok. One problem though is that since you don't store the URL in the cache entry, it won't even notice if you do end up in the situation where two URLs hash to the same value; it will just return the same cache result for both URLs without ever noticing that anything is wrong... You could call that a separate bug if you want though, since the code already sort of has that problem. So feel free to commit this as is if you want, or attach a new patch that saves the URL in the cache entry and then checks it when reading it back.
Created attachment 190734 [details] [review] Patch Whenever we find a collision we delete the old entry and insert a new one. I have also a different and a bit more complicated approach in some other patch. Basically I keep track of every single entry with the same key in linked lists. The code works fine but it adds some extra complexity to the code just to handle a not very common scenario IMO. That's why I uploaded a simpler version of the patch that just discards the old entry with the same key.
Comment on attachment 190734 [details] [review] Patch >+ guint key; >+ char *filename = g_strdup_printf ("%s%s%"G_GUINT32_FORMAT, cache->priv->cache_dir, >+ G_DIR_SEPARATOR_S, entry->key); entry->key is a guint, not a guint32. Although based on the previous (version 4) patch, maybe you meant for entry->key to be guint32? Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and "(guint)entry->key". >+soup_cache_entry_free (SoupCacheEntry *entry, GFile *file) > { >- if (purge) { >- GFile *file = g_file_new_for_path (entry->filename); >+ if (file) { > g_file_delete (file, NULL, NULL); > g_object_unref (file); It's slightly odd to unref file here, so you should add a comment before the function noting that it does that. >+ entry->response_time = response_time; hm. why did that get moved?
(In reply to comment #5) > (From update of attachment 190734 [details] [review]) > >+ guint key; > > >+ char *filename = g_strdup_printf ("%s%s%"G_GUINT32_FORMAT, cache->priv->cache_dir, > >+ G_DIR_SEPARATOR_S, entry->key); > > entry->key is a guint, not a guint32. Although based on the previous (version > 4) patch, maybe you meant for entry->key to be guint32? Right we should force it to be 32bit. > Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and > "(guint)entry->key". > I agree that looks ugly but what would happen for architectures were the "%u" does not mean a 32bit usigned int, but 64 or 16 for example ? > >+soup_cache_entry_free (SoupCacheEntry *entry, GFile *file) > > { > >- if (purge) { > >- GFile *file = g_file_new_for_path (entry->filename); > >+ if (file) { > > g_file_delete (file, NULL, NULL); > > g_object_unref (file); > > It's slightly odd to unref file here, so you should add a comment before the > function noting that it does that. Yeah I also think that is a good idea to free the memory in the same block where it is reserved, but knowing that it's a private function I deciced to move it there. I'll add the comment as it's good to explain it. > >+ entry->response_time = response_time; > > hm. why did that get moved? Actually it didn't get removed I'm just moving it up from a few lines bellow. It was a bug in the old code I think, because we were setting the response_time only if we found the "Date" header and we have to do it always.
(In reply to comment #6) > > Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and > > "(guint)entry->key". > > > > I agree that looks ugly but what would happen for architectures were the "%u" > does not mean a 32bit usigned int, but 64 or 16 for example ? right, that's why I said the "(guint)entry->key" part. (ie, cast entry->key to guint when passing it to printf, so that it is a %u when printf sees it.) Also, FWIW, glib won't work on systems where int is 16 bits.
Created attachment 191215 [details] [review] Patch
Committed d6c9aaac9c6c156391f8ee73e1cae353af9d2bed.