After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 667682 - Cache size dramatically exceeds limit
Cache size dramatically exceeds limit
Status: RESOLVED FIXED
Product: libsoup
Classification: Core
Component: API
2.46.x
Other Linux
: Normal major
: ---
Assigned To: libsoup-maint@gnome.bugs
libsoup-maint@gnome.bugs
Depends on:
Blocks:
 
 
Reported: 2012-01-11 11:04 UTC by xav795
Modified: 2014-12-07 13:48 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
cache: Clean up leaked resources when loading the cache (8.19 KB, patch)
2014-09-29 16:01 UTC, Carlos Garcia Campos
none Details | Review
Updated patch (8.45 KB, patch)
2014-10-02 11:26 UTC, Carlos Garcia Campos
reviewed Details | Review
Updated patch (7.77 KB, patch)
2014-12-05 09:47 UTC, Carlos Garcia Campos
committed Details | Review

Description xav795 2012-01-11 11:04:13 UTC
Hi, 

I recently discover that despite i was using the inside functionality to manage cookies, temporary downladed files, etc.. from epiphany; the cache folder of the application was never cleaned up
In fact that folder at least 190Mo and over that 10000 elements

What could be the reason?
Is the managing personal data functionality normally able to manage this folder?
Should i delete this folder manually to make it work normally?

Xavier
Comment 1 Diego Escalante Urrelo (not reading bugmail) 2012-03-02 01:25:08 UTC
I just tried this with 3.3.x

I had a small cache, recently cleaned, which was 9Mb. The Clean button in Preferences → Privacy reduced it to 3Mb.

Perhaps it retained the cache files relevant to currently opened tabs.

Would you check again with a newer Epiphany? At least 3.2.
Comment 2 xav795 2012-03-07 18:50:53 UTC
In my distribution (Fedora 15) which i follow in order to keep easy my using, my version is only 3.03.
Can i help you on that version anyway?
Comment 3 Michael Gratton 2012-07-12 02:38:59 UTC
I can confirm this.

Ephy is leaking disk cache, perhaps when it crashes. This has been the case at least since the cache directory was moved out of .gnome2/epiphany.

Just now, "du -hs .cache/epiphany" reported 4.5G, hitting the Clean button reduced that to 4.3G. My cache size is set to 256MB in the prefs.

Currently, the most common way Ephy exits for me is either because of a crash, I accidentally hit Ctrl-Q, or I have to kill it. This happens at least once a week, so is perhaps a candidate reason for this unbounded disk cache growth.

Epiphany 3.4.1
WebkitGtk 1.8.0
Ubuntu Precise
Comment 4 Bastien Nocera 2014-08-30 11:08:41 UTC
On my machine, the Epiphany cache ends up taking 12GB. That's far too much, and is going to be a problem on machines with small amounts of disk (netbooks, tablets with 16GB or 32GB of SSD).
Comment 5 Bastien Nocera 2014-08-30 11:09:36 UTC
This is especially a problem when the "disk cache" option in the Preferences is set to "50 MB".
Comment 6 Michael Catanzaro 2014-08-30 19:50:54 UTC
(In reply to comment #5)
> This is especially a problem when the "disk cache" option in the Preferences is
> set to "50 MB".

This preference isn't hooked up to anything, and WebKit does not provide any way to control the size of the cache. :/
Comment 7 Bastien Nocera 2014-08-30 20:14:00 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > This is especially a problem when the "disk cache" option in the Preferences is
> > set to "50 MB".
> 
> This preference isn't hooked up to anything, and WebKit does not provide any
> way to control the size of the cache. :/

I think that would be:
NetworkProcess::platformSetCacheModel()
in Source/WebKit2/NetworkProcess/soup/NetworkProcessSoup.cpp

It only uses the free disk space and the memory size to calculate the max cache size.
Comment 8 Carlos Garcia Campos 2014-08-31 11:08:40 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > This is especially a problem when the "disk cache" option in the Preferences is
> > > set to "50 MB".
> > 
> > This preference isn't hooked up to anything, and WebKit does not provide any
> > way to control the size of the cache. :/
> 
> I think that would be:
> NetworkProcess::platformSetCacheModel()
> in Source/WebKit2/NetworkProcess/soup/NetworkProcessSoup.cpp
> 
> It only uses the free disk space and the memory size to calculate the max cache
> size.

Yes, it's calculated automatically depending on the cache model. But we can provide API to override that.
Comment 9 Michael Catanzaro 2014-08-31 23:41:20 UTC
Well either I am seriously misreading calculateCacheSizes (in Source/WebKit2/Shared/CacheModel.cpp), or else 12 GB is way over the max limit regardless (175 MiB).  So we have two bugs:

1) Setting exists in preferences, but isn't hooked up
2) Cache size wildly exceeds the computed limit

I think we do not need or want a preference for cache size in Epiphany. If the cache size was really limited to 175 MiB, I doubt anyone would care to change it.
Comment 10 Carlos Garcia Campos 2014-09-03 07:10:37 UTC
Could it be that, after a crash, or if web/network process exits without calling soup_cache_dump() for whatever reason, we end up with files in the cache dir not referenced from the soup cache file?
Comment 11 Sergio Villar 2014-09-03 07:50:49 UTC
(In reply to comment #10)
> Could it be that, after a crash, or if web/network process exits without
> calling soup_cache_dump() for whatever reason, we end up with files in the
> cache dir not referenced from the soup cache file?

Carlos is right. Due to the way the cache works, it's mandatory to call soup_cache_dump() in order to dump the cache index to disk. The advantage is that we reduce disk writes but a big disadvantage which is that we loose the info of the cache (and "leak" the stored resources) if not called.

A "quick" fix that comes to my mind would be to mark the index of the cache as dirty just after loading it. That dirty flag would be unset by soup_cache_dump(). Should we find the dirty bit when reading the index, means that we must clear the contents of the cache and the index. WDYT?

The "good" fix would be to have the index stored in some writable mmap'ed portion of memory so the OS would take care of the updates to disk.
Comment 12 Carlos Garcia Campos 2014-09-03 10:02:58 UTC
Wrote a simple patch for libsoup to get the leaked entries in my cache dir and got this:

14670 entries (958966651 bytes total) leaked in /home/cgarcia/.cache/epiphany

$ du -b /home/cgarcia/.cache/epiphany
1059909993	/home/cgarcia/.cache/epiphany

So the real cache size here is 100943342 bytes (~96MB), and there's almost 1G leaked.
Comment 13 Michael Catanzaro 2014-09-03 12:52:52 UTC
(In reply to comment #11)
> A "quick" fix that comes to my mind would be to mark the index of the cache as
> dirty just after loading it. That dirty flag would be unset by
> soup_cache_dump(). Should we find the dirty bit when reading the index, means
> that we must clear the contents of the cache and the index. WDYT?
> 
> The "good" fix would be to have the index stored in some writable mmap'ed
> portion of memory so the OS would take care of the updates to disk.

I think any user of SoupCache would need to do one or the other, or else never ever crash. Maybe libsoup should automatically take care of cleaning a dirty cache, or have an option for doing so? Dan?
Comment 14 Michael Catanzaro 2014-09-11 21:32:20 UTC
Well as far as I can tell, this will have to be fixed in libsoup....
Comment 15 Carlos Garcia Campos 2014-09-29 16:01:07 UTC
Created attachment 287367 [details] [review]
cache: Clean up leaked resources when loading the cache
Comment 16 Sergio Villar 2014-09-30 09:46:05 UTC
(In reply to comment #15)
> Created an attachment (id=287367) [details] [review]
> cache: Clean up leaked resources when loading the cache

The idea is good but the problem I see with the patch is that it's adding a g_unlink for every single _wrap_input() call, something that is likely adding a context switch plus an I/O operation which will likely degrade performance. We should try to minimize the amount of I/O as much as possible.

BTW awesome test case, the cache really needs more testing!
Comment 17 Carlos Garcia Campos 2014-09-30 13:14:38 UTC
Yes, actually I used the inverted logic, using dumped file, instead of a locked file for several reasons:

  a) To make sure the first time load is called again all existing leaks will be removed for everybody. 
  b) In case soup_load is not called for whatever reason
  c) To handle also the case when the cache is loaded, nothing is updated and not dumped. There aren't leaks in this case even if dump hasn't been called.

a) could be solved by bumping the index version even if the index format hasn't changed, to force a clean cache. b) is an api misuse according to sergio and c) is probably a situation that hardly ever happens. So, what do you think about creating a lock file in load, deleted by dump and bump the index version to force a clean cache?

Regarding b), if it's a programmer error, shouldn't we ensure it somehow? making other api methods fail if load hasn't be called or something like that? Note that non of the existing tests use soup_cache_load at all, that's why I assumed it was not mandatory.
Comment 18 Carlos Garcia Campos 2014-10-02 11:25:58 UTC
I've noticed another problem of using the lock approach. There isn't a 1:1 relationship between load and dump, because dump could be called multiple times. So, for example the case I'm testing in the unit test wouldn't work:

1.- load()
<some resources are cached>
2.- dump()
<more resources are cached>
<crash>

The first dump removed the lock, so the resources cached after the dump are crashed. Since the only concern of the current approach is the unlink call, I'll use a variable to ensure that unlink is only called for the first cached resource after dump
Comment 19 Carlos Garcia Campos 2014-10-02 11:26:53 UTC
Created attachment 287575 [details] [review]
Updated patch

Make sure we don't call unlink for every cached resource, but only for the first one cached after dump().
Comment 20 Dan Winship 2014-11-02 14:46:51 UTC
Comment on attachment 287575 [details] [review]
Updated patch

Looks good to me.

>+	if (!priv->dumped_file_exists) {
>+		fd = g_open (priv->dumped_filename, O_CREAT | O_EXCL, 0600);
>+		if (fd != -1) {
>+			priv->dumped_file_exists = TRUE;
>+			close (fd);
>+		}
>+	}

another possibility would be

    if (!priv->dumped_file_exists)
        priv->dumped_file_exists = g_file_set_contents (priv->dumped_filename, "", 0, NULL);

(which would do more work internally, but is simpler here...)

>+		else if (leaked_entries) {
>+			char *key_str = g_strdup_printf ("%u", entry->key);
>+
>+			/* Not a leaked entry */
>+			g_hash_table_remove (leaked_entries, key_str);

Hm... it would be easier to use the guint32 as the hash table key directly so you didn't have to create a temporary string here...
Comment 21 Carlos Garcia Campos 2014-12-05 08:23:04 UTC
Thinking again about this, I wonder if it wouldn't be easier to simply check for leaks on every cache load unconditionally. Note that it's a read dir, we don't actually need to read every file, only the directory to get the filenames, and it only happens once when the cache is loaded. It shouldn't be slow, and creating and checking the dumped file also adds IO and makes the code a lot more complex and racy too.
Comment 22 Sergio Villar 2014-12-05 08:58:09 UTC
(In reply to comment #21)
> Thinking again about this, I wonder if it wouldn't be easier to simply check
> for leaks on every cache load unconditionally. Note that it's a read dir, we
> don't actually need to read every file, only the directory to get the
> filenames, and it only happens once when the cache is loaded. It shouldn't be
> slow, and creating and checking the dumped file also adds IO and makes the code
> a lot more complex and racy too.

I think it's worth trying.
Comment 23 Carlos Garcia Campos 2014-12-05 09:47:14 UTC
Created attachment 292173 [details] [review]
Updated patch

Removed all the dumped file mess, and simplified everything. Also added a helper soup_cache_file_foreach function to use it also by clear_cache_files().
Comment 24 Dan Winship 2014-12-06 15:57:55 UTC
Comment on attachment 292173 [details] [review]
Updated patch

Sure, makes sense.
Comment 25 Carlos Garcia Campos 2014-12-07 13:48:30 UTC
Comment on attachment 292173 [details] [review]
Updated patch

Pushed