Bug 355549 - Thunderbird indexer leaks memory
Status: RESOLVED OBSOLETE
Product: beagle
Classification: Other
Component: General
Version: 0.2.9
Hardware: Other Linux
Importance: High critical
Target Milestone: ---
Assigned To: Beagle Bugs
QA Contact: Beagle Bugs
Depends on:
Blocks:
Reported: 2006-09-12 07:59 UTC by Alexander Larsson
Modified: 2007-08-11 08:18 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Backtraces of memory usage (46.50 KB, text/plain) - 2006-09-16 04:50 UTC, Kevin Kubasik
Start Mork.cs cleanup (2.65 KB, patch) - 2006-11-13 18:13 UTC, Kevin Kubasik
An (unfinished) implementation of the Mork parser in C++ (7.80 KB, application/x-compressed-tar) - 2006-11-17 11:18 UTC, Florian Hackenberger
Fancy optimization (11.52 KB, patch) - 2006-12-01 06:59 UTC, Debajyoti Bera
Correct patch (against HEAD) (12.66 KB, patch) - 2006-12-01 07:06 UTC, Debajyoti Bera

Description Alexander Larsson 2006-09-12 07:59:40 UTC
This is a bug report from the Red Hat Bugzilla at:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=205985

Bug report contents:
--------------------
Hi,

Beagled is running on startup. For reasons I don't understand, beagle grows enormously in memory while indexing the large Thunderbird IMAP cache files. I usually kill it when it reaches 750m, and my 1gb machine starts thrashing.

[hanwen@haring ~]$ beagle-status

Scheduler:
Count: 14
Status: Waiting for next task.

Pending Tasks:
1 Delayed 0 (9/11/2006 10:30:25 AM)
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/shop.msf

2 Delayed 0 (9/11/2006 10:30:25 AM)
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/Lists.msf

3 Delayed 0 (9/11/2006 10:30:25 AM)
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/sent.sbd/VU-Orkest.msf

4 Delayed 0 (9/11/2006 10:30:25 AM)
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/INBOX.sbd/paypal.msf

5 Delayed 0 (9/11/2006 10:30:26 AM)
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/INBOX.sbd/logfiles.msf

[hanwen@haring ~]$ ls -l
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/INBOX.sbd/logfiles.msf 
-rw-rw-r-- 1 hanwen hanwen 180149 Sep 11 10:34
/home/hanwen/.thunderbird/nf78k6gi.default/ImapMail/maagd/INBOX.sbd/logfiles.msf


[hanwen@haring ~]$ rpm -q beagle
beagle-0.2.9-1.fc6
Comment 1 Kevin Kubasik 2006-09-12 20:18:49 UTC
Any possibility someone with a large IMAP e-mail account could run the indexing with heap-buddy and post the .out files for us? The Thunderbird code is very large and very complex; without some good profiling/logs it's going to be very difficult to track down....
Comment 2 Alexander Larsson 2006-09-13 10:40:21 UTC
I just started Thunderbird and connected to a medium-sized IMAP folder. Beagled seems to take about 750 megs or so if I let it index stuff for a while.

I only let it go to about 500 megs when running heap-buddy, and here is the output. It seems to use a lot of memory for the mork databases, mostly strings and regexp matches. Is it really supposed to store all this in the memory of the beagle daemon?

Here is the heap-buddy output:
http://www.gnome.org/~alexl/outfile.beagled.with_thunderbird.gz

Also, looking quickly at the code I saw this issue:
diff -u -p -r1.1 Mork.cs
--- Mork.cs     4 Aug 2006 03:22:35 -0000       1.1
+++ Mork.cs     13 Sep 2006 10:36:47 -0000
@@ -95,7 +95,7 @@ namespace Beagle.Util
 
                        while (++position != content.Length) {
 
-                               if (content [position].Equals ('/') && content [position].Equals ('/'))
+                               if (content [position].Equals ('/') && content [position+1].Equals ('/'))
                                        // Ignore comments
                                        position = content.IndexOf ('\n', position);
                                else if (content [position].Equals ('<') && content [position+2].Equals ('<'))
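
For illustration, a bounds-checked version of that comment skip could look like the sketch below. The helper name SkipComment and the harness are hypothetical, not Beagle code; the point is guarding position+1 so the last character of the buffer cannot cause an out-of-range access.

using System;

class CommentSkipSketch
{
	// Returns the index of the newline ending a "//" comment starting
	// at 'position', or 'position' unchanged if no comment starts there.
	static int SkipComment (string content, int position)
	{
		if (position + 1 < content.Length
		    && content [position] == '/'
		    && content [position + 1] == '/') {
			int eol = content.IndexOf ('\n', position);
			return eol < 0 ? content.Length : eol;
		}
		return position;
	}

	static void Main ()
	{
		Console.WriteLine (SkipComment ("// a comment\ndata", 0)); // prints 12
	}
}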
Comment 3 Alexander Larsson 2006-09-13 10:59:18 UTC
This just doesn't look ready. I'm gonna disable it for now in Fedora Core.
Comment 4 Joe Shaw 2006-09-13 17:08:29 UTC
Yeah, there are a handful of scary statistics in there.

From the summary:

           SUMMARY

         Filename: outfile.beagled.with_thunderbird
  Allocated Bytes: 1404.8M
Allocated Objects: 23663280
              GCs: 54
          Resizes: 28
  Final heap size: 269.4M

   Distinct Types: 510
       Backtraces: 35080

A final heap size of 270 megs is crazy.  It should be roughly 1% of that.

An enormous number of objects is being created, causing the heap to grow rapidly:

09:40:06 | GC 29  | Collected 3 of 67707 objects (0.0%)
         |        | Collected 3.5M of 11.8M (29.6%)
         |        | Heap went from 44.6% to 31.4% capacity
         |        |
09:40:06 | Resize | Grew heap from 26.5M to 35.4M
         |        | 8.3M in live objects
         |        | Heap went from 31.4% to 23.5% capacity
         |        |
09:40:06 | GC 30  | Collected 1 of 67705 objects (0.0%)
         |        | Collected 4.0M of 16.3M (24.5%)
         |        | Heap went from 46.1% to 34.8% capacity
         |        |
09:40:06 | Resize | Grew heap from 35.4M to 52.4M
         |        | 12.3M in live objects
         |        | Heap went from 34.8% to 23.5% capacity
         |        |
09:40:07 | GC 31  | Collected 98 of 67801 objects (0.1%)
         |        | Collected 8.0M of 28.3M (28.2%)
         |        | Heap went from 54.0% to 38.8% capacity
         |        |
09:40:07 | Resize | Grew heap from 52.4M to 85.6M
         |        | 20.3M in live objects
         |        | Heap went from 38.8% to 23.8% capacity
         |        |
09:40:07 | GC 32  | Collected 1 of 67704 objects (0.0%)
         |        | Collected 16.0M of 52.3M (30.6%)
         |        | Heap went from 61.2% to 42.5% capacity
         |        |
09:40:07 | Resize | Grew heap from 85.6M to 150.3M
         |        | 36.3M in live objects
         |        | Heap went from 42.5% to 24.2% capacity

Collecting one 16-meg object out of 67704 is not good. :(
Comment 5 Kevin Kubasik 2006-09-16 04:07:48 UTC
Ok, I've been looking at this and we do seem to throttle somewhat correctly; I'm just missing a reference somewhere that is keeping all these objects alive. I feel that the GC should be collecting the regular expressions/enumerators after they are turned into indexables.

I dunno, maybe I'm just not experienced enough with the Mono GC to know, but we never seem to let the objects go for a GC while loading the whole thing....
Comment 6 Kevin Kubasik 2006-09-16 04:45:13 UTC
Ok, here's the killer:

public void Read ()
{
	string content;
	StreamReader reader = new StreamReader (mork_file);

	// Check if this is a mork file and save the database version if it is.
	// We assume the first line will tell us this.
	if (!IsValid (reader.ReadLine (), out mork_version)) {
		reader.Close ();
		throw new InvalidMorkDatabaseException ("This file is missing a valid mork header");
	}

	content = reader.ReadToEnd ();
	reader.Close ();

	Reset ();
	Read (content);
}

We do read the entire thing in; this is about 200 megs in just string allocations (just to hold the mork db).

So here's where to start for all of you who need some fun...
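
For illustration, reading the file in fixed-size chunks rather than via ReadToEnd() would keep the working set bounded. This is a hypothetical sketch, not Beagle code: ProcessChunk and the class name are invented, and a real replacement would need an incremental parser that carries state across chunk boundaries.

using System;
using System.IO;

class ChunkedReadSketch
{
	const int ChunkSize = 64 * 1024; // 64 KB working buffer instead of a ~200 MB string

	static void Read (string mork_file)
	{
		using (StreamReader reader = new StreamReader (mork_file)) {
			char [] buffer = new char [ChunkSize];
			int read;
			while ((read = reader.Read (buffer, 0, buffer.Length)) > 0) {
				// Hand each chunk to an incremental parser instead of
				// materializing the whole database in memory at once.
				ProcessChunk (buffer, read);
			}
		}
	}

	static void ProcessChunk (char [] buffer, int count)
	{
		// Placeholder: a real parser must carry state across chunk boundaries.
	}
}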
Comment 7 Kevin Kubasik 2006-09-16 04:50:42 UTC
Created attachment 72889 [details]
Backtraces of memory usage

Here are just some backtraces; we can see exactly where in the code it gets so crazy. I'm not going to lie and pretend that I have anywhere near the free time to fix this; it's going to be a somewhat serious change to how the mork parsing works, and school is too much for me to consider this at the moment.

Sorry, but I wanted to do my best to help diagnose this...
Comment 8 Kevin Kubasik 2006-09-16 04:56:57 UTC
Confirming and setting to urgent; this is a definite must-fix before the next release.
Comment 9 Alexander Larsson 2006-09-18 07:21:24 UTC
I have a .Net question about this:

protected string Read (string content, ref int position, string start, string end)
{
	int tmp = position, start_position = position;
	
	do {
		position = content.IndexOf (end, position+1);
		if ((tmp = content.IndexOf (start, tmp+1)) < 0)
			break;
	} while (tmp < position);
	
	return content.Substring (start_position, position-start_position+1);
}

Does the content.Substring() call copy that part of the full content string, or does it keep a reference to the string it's a substring of? (I assume strings are immutable.)

If it keeps the parent string alive, we only need a single leak of the read string, or of a further substring of it, for the main app to keep 200 megs allocated.
Comment 10 Joe Shaw 2006-09-19 15:19:43 UTC
Strings are immutable, and you have to assume that any operation on them does a copy, although the runtime probably optimizes operations like Substring(), I'm not sure.
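
For reference, Substring in Mono and .NET allocates a new string rather than keeping a reference into the parent's character buffer, so a retained substring does not by itself pin the 200-meg parent. A small demonstration sketch (class name and values are purely illustrative):

using System;

class SubstringDemo
{
	static void Main ()
	{
		string big = new string ('x', 1000);
		string small = big.Substring (0, 10);

		// Substring returns a distinct object, not a view into 'big'.
		Console.WriteLine (Object.ReferenceEquals (big, small)); // False

		big = null;
		// With no other references, the 1000-char string is now eligible
		// for collection even though 'small' was carved out of it.
		Console.WriteLine (small.Length); // 10
	}
}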
Comment 11 Pierre Östlund 2006-09-22 14:58:53 UTC
This is obviously a very big issue and I need some help to create a much better implementation. The current implementation works very well for me, but it has lots of flaws/bugs and a very bad "base design"; a full re-write would be preferred. Lack of experience and time keeps me pushing this "project" further and further into the future, and if someone could help me, I would gladly help out by answering questions, writing code or whatever has to be done, just to create a working implementation. If someone is up for a real challenge, just let me know and we'll probably work something out.
Comment 12 Kevin Kubasik 2006-11-13 18:13:20 UTC
Created attachment 76503 [details] [review]
Start Mork.cs cleanup

This patch drops about 7-9% of the total string overhead (for me at least); try it out and let me know where we stand.
Comment 13 Florian Hackenberger 2006-11-17 11:18:46 UTC
Created attachment 76750 [details]
An (unfinished) implementation of the Mork parser in C++

This implementation is unfinished, but I'm posting it here because I currently don't have the time to work on it (busy with my master's thesis). This work was the result of a discussion on dashboard-hackers [1].

[1] http://www.mail-archive.com/dashboard-hackers%40gnome.org/msg02491.html
Comment 14 Kevin Kubasik 2006-11-19 07:38:27 UTC
How incomplete is it? Also, are you looking for someone to pick this up, or will you be finishing it yourself?
Comment 15 Debajyoti Bera 2006-12-01 06:59:52 UTC
Created attachment 77468 [details] [review]
Fancy optimization

Attaching a patch which optimizes the Mork parser quite a bit (on my teeny-weeny test mork file, roughly a 50% reduction in heap size and allocations).

Except I don't know if the end code is correct. The fact is, I don't use Thunderbird, nor do I understand or like the mork file format. What I did was heavily optimize the functions in the original Mork.cs while keeping their goals intact. So the functions do the same as before - recursively building a huge mork tree from a huuuuge string - except, with this patch, they do it a bit more slowly but smartly.

Folks, please check the patch to cross-verify that all the functions do the same thing as before. People who are familiar with Mork-fu and regexes, I request your attention.
Comment 16 Debajyoti Bera 2006-12-01 07:06:09 UTC
Created attachment 77470 [details] [review]
Correct patch (against HEAD)

Argh... the last patch wasn't against CVS HEAD. This one is.
Comment 17 Kevin Kubasik 2006-12-05 05:50:38 UTC
Alright, some testing here is showing promising results. We need to catch errors on things like empty folders more gracefully, but it seems to be working quite well overall.
Comment 18 Joe Shaw 2006-12-06 22:58:44 UTC
(In reply to comment #17)
> Alright, some testing here is showing promising results. We need to catch
> errors on things like empty folders more gracefully, but it seems to be
> working quite well overall.

Is this something we can get into CVS for 0.2.14?
Comment 19 Debajyoti Bera 2006-12-06 23:06:14 UTC
> Alright, some testing here is showing promising results.

Kevin, were you referring to my patch?

I don't think this patch should be committed. I don't know if it does the correct thing. Unless someone manually compares the two versions, I am not ready to believe that this even works :) Peter told me he would be able to review the patch and work on the Mork parser in about 3 weeks. So, till then ... au revoir to the patch.

Comment 20 Kevin Kubasik 2006-12-17 23:07:02 UTC
Ok, 2 things:

1) To dBera's comment 19: My bad, I just applied it and let beagle run, and it was working pretty well. It threw an exception on empty mork files, so that should be cleaned up, but it seemed to be functioning pretty well, and my mails were being indexed. I'll do a closer comparison of the actual code to see if the methods still do/return what they should sometime this holiday break.

2) I saw Joe's fix for the evolution backend that stores summary information in a sqlite database, as opposed to a hash table. Maybe that's the best way to go here? Mork is a relational db implementation, etc. I don't know enough about either sqlite or mork to know if this is a horrible idea or not, but assuming we can make a rational sqlite representation of the mork information, we could step through the mork file and just check each entry against a sqlite table. Theoretically we could drastically reduce mem usage; I'm sure this would be significantly slower than the current implementation, but I'm guessing that isn't a huge issue here.

I dunno about the second part, maybe I'm completely off-base, but it seems doable to me.
Comment 21 Joe Shaw 2006-12-18 18:45:14 UTC
(In reply to comment #20)
> 2) I saw Joe's fix for the evolution backend that stores summary information in
> a sqlite database, as opposed to  a hash table. 

If the Thunderbird backend is doing a similar hack, then this is definitely a hotspot that can be fixed.

With Evolution, the problem is that the summary files aren't seekable, so we have to walk the whole file looking for changes.  To notice when flags change, or even if a message is removed, we track that state.  Previously it was in a hash table, which was bad because it kept references around to boxed uints and strings in memory, one each per message.  On my system that is millions of allocations.  Another option was just to get that information from the Lucene index -- which I thought was how the Thunderbird backend did it -- but I thought that was a little bit of overkill for this task.

What I really recommend you do is use heap-shot for this.  That is how I've found all the leaks and hotspots over the past three weeks.  It's an incredibly useful tool.  You can get it from Mono SVN.  It requires a fairly new Mono; I know that it doesn't work on 1.1.18, but works great on 1.2.2.
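
A rough sketch of the sqlite approach under discussion (an illustration, not Beagle's actual code; it assumes the Mono.Data.SqliteClient binding, and real code should use parameterized queries rather than string concatenation): keep per-message state in an on-disk table so that only the row being checked is in memory at any time.

using System;
using System.Data;
using Mono.Data.SqliteClient;

class MessageStateStore
{
	IDbConnection conn;

	public MessageStateStore (string path)
	{
		conn = new SqliteConnection ("URI=file:" + path);
		conn.Open ();
		IDbCommand cmd = conn.CreateCommand ();
		cmd.CommandText = "CREATE TABLE IF NOT EXISTS messages (uid TEXT PRIMARY KEY, flags INTEGER)";
		cmd.ExecuteNonQuery ();
	}

	// True if the stored flags differ from the current ones,
	// i.e. the message needs (re)indexing.
	public bool FlagsChanged (string uid, int flags)
	{
		IDbCommand cmd = conn.CreateCommand ();
		cmd.CommandText = "SELECT flags FROM messages WHERE uid = '" + uid + "'";
		object result = cmd.ExecuteScalar ();
		return result == null || Convert.ToInt32 (result) != flags;
	}

	public void Update (string uid, int flags)
	{
		IDbCommand cmd = conn.CreateCommand ();
		cmd.CommandText = "INSERT OR REPLACE INTO messages (uid, flags) VALUES ('" + uid + "', " + flags + ")";
		cmd.ExecuteNonQuery ();
	}
}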
Comment 22 Kevin Kubasik 2006-12-18 18:54:44 UTC
Ok, well, we determine changes by going through the mork db and querying lucene, but the issue is that our mork representation is a GIANT hashtable with multiple strings in each entry, etc. I have used heap-shot to see where the memory usage is; it's all from the Mork.cs file. It's how we create a working in-memory representation of our mork file. I was thinking we could make a sqlite representation of the mork db and then use that for all mork-like queries.

Basically, we would still see memory spikes upon reading in the mork db, but once it was read, it could be dumped and we could just work from the sqlite db. It's not ideal, but it's hard to come up with a solid answer.

On that note, I have gotten in touch with a Thunderbird developer who is working on a sane means for search engines to access tbird e-mail and whatnot, since pretty much everyone else has the same problems. If we're lucky, we're looking at something like a socket connection to tbird where we can request information etc. The problem, obviously, is that it only works when tbird is running.

dBera's solution is looking pretty good; if we can stabilize it a bit and test it, it will do wonders.
Comment 23 Joe Shaw 2006-12-18 19:05:59 UTC
Is the mork file something which can be processed incrementally?  The way we handle that aspect on the Evolution side is to read and parse the evolution summary files message by message.  (There is a header to the summary that's decoded and stored for the life of the object initially.)  Take a look at Util/camel.cs to see how that's done.

I don't think the memory spike is acceptable in any case, because Mono doesn't have a compacting garbage collector.  Once those huge allocations are made and the heap is resized, there's no way to shrink it again.  We'll continue to use all that heap space, only triggering GCs when it gets filled up.  Eventually some of that space might be paged out, but the right way to do it is to not allocate it all up front in the first place.
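
The message-by-message pattern described above might look roughly like the sketch below. The record format here (a count header plus length-prefixed records) is invented for illustration; Util/camel.cs is the real reference.

using System;
using System.IO;

class SummaryReaderSketch
{
	// Decode a small header, then process records one at a time, so peak
	// memory is a single record rather than the whole summary file.
	static void Parse (string path)
	{
		using (BinaryReader reader = new BinaryReader (File.OpenRead (path))) {
			int count = reader.ReadInt32 (); // hypothetical record-count field
			for (int i = 0; i < count; i++) {
				int length = reader.ReadInt32 ();           // hypothetical length prefix
				byte [] record = reader.ReadBytes (length); // only this record is live
				// ... decode and index the single message here ...
			}
		}
	}
}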
Comment 24 Debajyoti Bera 2006-12-18 20:40:24 UTC
We are talking about two different optimizations here, right? One is parsing the Mork file, and the other is storing the information from the Mork file in memory for future processing.

From what little I figured out from the Mork parser, I don't think it is possible to incrementally parse a mork file. It's kind of a huge table with cross references. I might be wrong though. I started having a headache after I looked at its grammar.

The parsing is tricky. I believe my optimization reduces the number of allocations to a huge extent. Peter personally told me that he will start working on a streaming parser (a sort of state machine) in the near future, so we can wait for that. About my patch above, I know of at least one place where it differs from the original regex-heavy one - my patch ignores the presence of whitespace in rows. It could be that t-bird does not use whitespace in rows, which is why my patch works. But until that is verified, there could be some mork file which would wreak havoc with my patched parser. I am speaking about the Clean(...) method and its usage in ParseRows(...). The other methods I think are invariant in terms of their working, but additional eyes should verify it.

Or, we could just wait for Peter to start working on an improved parser :)
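
For illustration, the "streaming parser as a state machine" idea might look like the minimal sketch below. It is entirely hypothetical: real mork needs many more states, and the single-character '/' handling here is deliberately crude.

using System;

class StreamingParserSketch
{
	enum State { Normal, Comment }
	State state = State.Normal;

	// Parser state survives across Feed() calls, so input can arrive
	// in chunks of any size without the whole file being in memory.
	public void Feed (char [] buffer, int count)
	{
		for (int i = 0; i < count; i++) {
			char c = buffer [i];
			switch (state) {
			case State.Normal:
				if (c == '/') // crude: real mork comments need "//" lookahead
					state = State.Comment;
				else
					Emit (c);
				break;
			case State.Comment:
				if (c == '\n')
					state = State.Normal;
				break;
			}
		}
	}

	void Emit (char c)
	{
		// Hand token characters to the table builder here.
	}
}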
Comment 25 Pierre Östlund 2007-08-11 08:18:56 UTC
This issue should be fixed in trunk as a result of my Summer of Code project. Marking this bug as obsolete since this code is no longer used.