Bug 649963 – SoupCache: use hashes instead of strings for cache entry keys

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 649963 - SoupCache: use hashes instead of strings for cache entry keys


Summary:	SoupCache: use hashes instead of strings for cache entry keys


Status:	RESOLVED FIXED

Product:	libsoup
Classification:	Core
Component:	Misc
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	libsoup-maint@gnome.bugs
QA Contact:	libsoup-maint@gnome.bugs

URL:
Whiteboard:

Depends on:	652554
Blocks:

Reported:	2011-05-11 13:29 UTC by Sergio Villar
Modified:	2011-07-06 10:00 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch (14.42 KB, patch) 2011-05-11 13:34 UTC, Sergio Villar	reviewed	Details \| Review
Patch (19.60 KB, patch) 2011-06-27 08:43 UTC, Sergio Villar	reviewed	Details \| Review
Patch (19.79 KB, patch) 2011-07-04 08:39 UTC, Sergio Villar	accepted-commit_now	Details \| Review

Description Sergio Villar 2011-05-11 13:29:38 UTC

Basically for two reasons:
   * storage size: hashes normally have a more compact representation
   * performance: is faster to compare unsigned integers than strings

Comment 1 Sergio Villar 2011-05-11 13:34:42 UTC

Created attachment 187626 [details] [review]
Patch

This patch replaces the string used to store the entry key by a hash stored in an unsigned int. It also removes the filename from the cache entries as we do not need to store it (and keep it in memory) because it's a computed value.

This patch changes the GVariant format. Before committing this we have to think about a mechanism to easily change the format used to serialize/deserialize data. Perhaps we might store a file with the format used by the cache. If the format is different to the one hardcoded in the library then we'd just remove the old files of the cache.

Comment 2 Sergio Villar 2011-05-11 13:36:17 UTC

Adding dependency

Comment 3 Dan Winship 2011-05-16 08:52:50 UTC

Comment on attachment 187626 [details] [review]
Patch

>-	md5 = g_compute_checksum_for_string (G_CHECKSUM_MD5, entry->key, -1);
>-	entry->filename = g_build_filename (cache->priv->cache_dir, md5, NULL);
>-	g_free (md5);

OK, this worried me at first, since you're vastly decreasing the amount of entropy in the filename, and thus increasing the chance of a cache collision. But then I saw that Firefox also uses 32-bit cache filenames too, so I guess it's ok.

One problem though is that since you don't store the URL in the cache entry, it won't even notice if you do end up in the situation where two URLs hash to the same value; it will just return the same cache result for both URLs without ever noticing that anything is wrong...

You could call that a separate bug if you want though, since the code already sort of has that problem. So feel free to commit this as is if you want, or attach a new patch that saves the URL in the cache entry and then checks it when reading it back.

Comment 4 Sergio Villar 2011-06-27 08:43:07 UTC

Created attachment 190734 [details] [review]
Patch

Whenever we find a collision we delete the old entry and insert a new one.

I have also a different and a bit more complicated approach in some other patch. Basically I keep track of every single entry with the same key in linked lists. The code works fine but it adds some extra complexity to the code just to handle a not very common scenario IMO. That's why I uploaded a simpler version of the patch that just discards the old entry with the same key.

Comment 5 Dan Winship 2011-06-29 15:26:28 UTC

Comment on attachment 190734 [details] [review]
Patch

>+	guint key;

>+	char *filename = g_strdup_printf ("%s%s%"G_GUINT32_FORMAT, cache->priv->cache_dir,
>+					  G_DIR_SEPARATOR_S, entry->key);

entry->key is a guint, not a guint32. Although based on the previous (version 4) patch, maybe you meant for entry->key to be guint32?

Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and "(guint)entry->key".

>+soup_cache_entry_free (SoupCacheEntry *entry, GFile *file)
> {
>-	if (purge) {
>-		GFile *file = g_file_new_for_path (entry->filename);
>+	if (file) {
> 		g_file_delete (file, NULL, NULL);
> 		g_object_unref (file);

It's slightly odd to unref file here, so you should add a comment before the function noting that it does that.

>+	entry->response_time = response_time;

hm. why did that get moved?

Comment 6 Sergio Villar 2011-06-29 15:52:00 UTC

(In reply to comment #5)
> (From update of attachment 190734 [details] [review])
> >+	guint key;
> 
> >+	char *filename = g_strdup_printf ("%s%s%"G_GUINT32_FORMAT, cache->priv->cache_dir,
> >+					  G_DIR_SEPARATOR_S, entry->key);
> 
> entry->key is a guint, not a guint32. Although based on the previous (version
> 4) patch, maybe you meant for entry->key to be guint32?

Right we should force it to be 32bit. 

> Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and
> "(guint)entry->key".
> 

I agree that looks ugly but what would happen for architectures were the "%u" does not mean a 32bit usigned int, but 64 or 16 for example ?

> >+soup_cache_entry_free (SoupCacheEntry *entry, GFile *file)
> > {
> >-	if (purge) {
> >-		GFile *file = g_file_new_for_path (entry->filename);
> >+	if (file) {
> > 		g_file_delete (file, NULL, NULL);
> > 		g_object_unref (file);
> 
> It's slightly odd to unref file here, so you should add a comment before the
> function noting that it does that.

Yeah I also think that is a good idea to free the memory in the same block where it is reserved, but knowing that it's a private function I deciced to move it there. I'll add the comment as it's good to explain it.

> >+	entry->response_time = response_time;
> 
> hm. why did that get moved?

Actually it didn't get removed I'm just moving it up from a few lines bellow. It was a bug in the old code I think, because we were setting the response_time only if we found the "Date" header and we have to do it always.

Comment 7 Dan Winship 2011-06-29 16:04:24 UTC

(In reply to comment #6)
> > Either way, G_GUINT32_FORMAT is just ugly. I'd use "%u" and
> > "(guint)entry->key".
> > 
> 
> I agree that looks ugly but what would happen for architectures were the "%u"
> does not mean a 32bit usigned int, but 64 or 16 for example ?

right, that's why I said the "(guint)entry->key" part. (ie, cast entry->key to guint when passing it to printf, so that it is a %u when printf sees it.)

Also, FWIW, glib won't work on systems where int is 16 bits.

Comment 8 Sergio Villar 2011-07-04 08:39:54 UTC

Created attachment 191215 [details] [review]
Patch

Comment 9 Sergio Villar 2011-07-06 10:00:47 UTC

Committed d6c9aaac9c6c156391f8ee73e1cae353af9d2bed.