GNOME Bugzilla – Bug 307612
Beagle uses large amounts of memory when indexing Blam
Last modified: 2005-11-04 18:22:54 UTC
Please describe the problem:
When indexing Blam blog entries, beagled slowly takes all the memory it can get. Running for 5 minutes takes about 500 MB of memory. It does not get killed or restarted, as its memory usage monitor apparently does not work.

Steps to reproduce:
1. Subscribe to some blogs in Blam
2. Start beagled
3. Wait some minutes

Actual results:
mono-beagled takes > 500 MB of memory, and rising.

Expected results:
I'd expect Beagle not to consume that much memory, or at least to shut itself down and restart when using more than, say, 50 MB.

Does this happen every time?
Yes.

Other information:
Can you run beagled with "--debug --allow-backend blam" and attach the output of the logs? If you can narrow down what file/blog entry is causing it, it'll be much easier to fix.
Created attachment 47791 [details] IndexHelper Logfile
Created attachment 47792 [details] Beagle Logfile

Memory usage rises mainly during the "INFO: Scanning Weblogs / INFO: Found 714 items in 40 weblogs in ,10s" loop, e.g. at 11.45.29.95 (these 15 seconds took about 200 MB of memory).
inotify support, or no?
also, when you are running beagle, are you running/using Blam? All those rescans imply that the ~/.gnome2/blam/collection.xml file is being constantly changed.
Ok, this is the situation: the Blam backend is a total bunch of crap. Since Blam uses a single file for all feeds, an event is triggered every time any feed is updated, and we reindex all of the feeds every time anything happens. We should probably either keep track of when we've indexed certain feeds, or move to an IndexingService-based solution inside Blam.
It would also be very good to move to an IndexableGenerator-based backend.
I just checked in code which switches it to an IndexableGenerator, which will keep it from beating on the scheduler quite so much, but I'm not sure it'll lower the memory usage. We probably need to be smarter about what we index. Nico, how many channels and items do you have in your ~/.gnome2/blam/collection.xml file? Can you try the code in CVS to see if it helps things at all?
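For anyone following along, here is roughly what an IndexableGenerator-based backend looks like. This is a minimal sketch, assuming the IIndexableGenerator interface as it appears in CVS (HasNextIndexable / GetNextIndexable / StatusName); the FeedItem class, the feed: URI scheme, and the property names are illustrative, not the actual checked-in code:

  using System;
  using System.Collections;
  using Beagle;          // Indexable, Property (assumed namespaces)
  using Beagle.Daemon;   // IIndexableGenerator

  // Illustrative stand-in for a parsed feed item.
  class FeedItem {
      public string Id;
      public string Title;
      public DateTime PubDate;
  }

  class BlamFeedGenerator : IIndexableGenerator {

      private IEnumerator items;

      public BlamFeedGenerator (ICollection parsed_items)
      {
          items = parsed_items.GetEnumerator ();
      }

      // The scheduler pulls items one at a time, so only one
      // Indexable needs to be alive between index flushes.
      public bool HasNextIndexable ()
      {
          return items.MoveNext ();
      }

      public Indexable GetNextIndexable ()
      {
          FeedItem item = (FeedItem) items.Current;
          Indexable indexable = new Indexable (new Uri ("feed:///" + item.Id));
          indexable.MimeType = "text/html";
          indexable.Timestamp = item.PubDate;
          indexable.AddProperty (Property.New ("dc:title", item.Title));
          return indexable;
      }

      public string StatusName {
          get { return "Blam feed items"; }
      }
  }

The point of the switch is that the scheduler pulls one Indexable at a time instead of getting a separate task queued for every item up front, which is what was beating on it before.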
Beagle's method of accessing the collection.xml file isn't great; it can use a huge amount of memory if your collection.xml file is huge, but it's basically exactly the same as what Blam itself does. Does Blam use up a tremendous amount of memory as well?
I can confirm this bug; I have the same problems. My collection.xml is 1.3M, and this is Blam's memory usage:

PID   USER   PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
10203 martin 16  0 117m 40m 22m S  0.0  4.0 0:11.36 mono

which is rather sane. I have about 30 feeds in Blam. I also notice that Beagle's memory consumption does not get that high immediately (probably not before indexing the feeds?), but once it is up at 500M it stays there, making the system more or less unusable.
This is Beagle's memory usage while indexing:

PID   USER   PR NI VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
11688 martin 15  0 563m 486m 10m S  0.0 48.0 0:42.58 mono

I noticed that beagle-info --status shows roughly 1000 single items, one for each entry in each feed. Is that intended?
The 1000 single items should be fixed in CVS as of about a week ago, but the fix is not in the 0.0.11.1 release. It would be good if you could compare the memory usage between the release and CVS.
What is the current status of this problem? Did anybody try the comparison Joe suggested? (I have small XML files, so I am not getting useful results.) On a different note, I wrote a stream-based reader/parser for Akregator (which has an XML file format similar to Blam's). Technically, with the new reader beagled should take less memory (which doesn't show up on my system, since I have so few feeds). If anybody can post a Blam collection.xml file here (please post a small one), I can port BlamQueryable to use the new stream-based parser. If nothing else, this will fix two "FIXME"s in the blam and liferea backends :-). Finally, attaching the new AkregatorQueryable files (the bulk of the work happens in the indexable generator); a sketch of the stream-parsing approach follows the attachments.
Created attachment 52198 [details] [review] + beagled/AkregatorQueryable/AkregatorQueryable.cs
Created attachment 52199 [details] [review] + beagled/AkregatorQueryable/FeedIndexableGenerator.cs
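To make the stream-vs-DOM difference concrete: XmlDocument.Load() builds the entire collection.xml as an in-memory object tree, while a pull parser only holds the current node. Here is a minimal sketch using System.Xml.XmlTextReader; the "Item" element name matches what grep finds in Blam's collection.xml, but the Id/Title attribute names are guesses, not verified against Blam's format:

  using System;
  using System.Xml;

  class CollectionStreamParser {
      static void Main (string[] args)
      {
          // Pull parser: reads the file node by node instead of
          // materializing the whole document like XmlDocument.Load().
          XmlTextReader reader = new XmlTextReader (args [0]);
          int items = 0;

          while (reader.Read ()) {
              if (reader.NodeType != XmlNodeType.Element)
                  continue;

              if (reader.Name == "Item") {
                  items++;
                  // Attributes of the current element are available
                  // without keeping the rest of the file in memory.
                  Console.WriteLine ("{0}: {1}",
                      reader.GetAttribute ("Id"),
                      reader.GetAttribute ("Title"));
              }
          }

          reader.Close ();
          Console.WriteLine ("{0} items total", items);
      }
  }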
Created attachment 52295 [details] [review] Modified LifereaQueryable.cs

Found some Liferea feeds on my computer, so here is the modified LifereaQueryable with stream parsing. Technically, this one shouldn't create tons of objects while indexing.
Created attachment 52867 [details] blam 1.8 feeds (~251)

It's a Blam 1.8 XML file (from ~/.gnome2/blam); it has around 251 items in it. (dBera requested it.)
Edit: Beagle is now reporting it as having 503 items.
sham, the attached file has exactly 254 items in it (check using grep "<Item" /path/to/collection.xml). If beagled is reporting anything more or less, it's time to file another bug :)
Created attachment 52919 [details] [review] patch for BlamQueryable.cs to use stream parser

Using stream parsing with the collection.xml from sham didn't give me much benefit (only about a 400k reduction in memory usage).
The stream-parsing patch is in CVS. Though stream parsing is now used, beagled nevertheless parses the whole file whenever anything changes. The behaviour could be somewhat improved by using a cache file (something similar to the evo-mail backend) to keep track of which items have not changed since the last parse.
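A hypothetical sketch of that cache-file idea: persist a checksum per item id from the previous parse, then skip items whose checksum is unchanged instead of re-submitting them to the indexer. None of these names come from the Beagle tree; this is just to illustrate the approach the evo-mail comparison is pointing at:

  using System;
  using System.Collections;
  using System.IO;

  // Hypothetical item cache: maps item id -> checksum of its serialized
  // form, persisted as tab-separated lines. On the next parse, items
  // whose checksum is unchanged can be skipped.
  class ItemCache {

      private Hashtable checksums = new Hashtable ();
      private string path;

      public ItemCache (string path)
      {
          this.path = path;
          if (!File.Exists (path))
              return;
          using (StreamReader sr = new StreamReader (path)) {
              string line;
              while ((line = sr.ReadLine ()) != null) {
                  string[] parts = line.Split ('\t');
                  if (parts.Length == 2)
                      checksums [parts [0]] = parts [1];
              }
          }
      }

      // True if the item is new or changed since the last run;
      // also records the new checksum for the next Save().
      public bool IsDirty (string id, string checksum)
      {
          string old = (string) checksums [id];
          checksums [id] = checksum;
          return old != checksum;
      }

      public void Save ()
      {
          using (StreamWriter sw = new StreamWriter (path)) {
              foreach (DictionaryEntry e in checksums)
                  sw.WriteLine ("{0}\t{1}", e.Key, e.Value);
          }
      }
  }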
I'm going to close this as FIXED.