Bug 424423 - [jamendo] use less memory
Status: RESOLVED FIXED
Product: rhythmbox
Classification: Other
Component: Plugins (other)
Version: unspecified
OS: Other Linux
Priority: Normal  Severity: normal
Assigned To: RhythmBox Maintainers
Reported: 2007-03-30 10:04 UTC by Vincent Untz
Modified: 2009-03-20 11:32 UTC



Attachments
some more interesting changes (6.40 KB, patch)
2007-06-09 12:04 UTC, Jonathan Matthew
committed
reduce magnatune memory usage too (4.46 KB, patch)
2007-06-12 10:44 UTC, Jonathan Matthew
none
less effective, less nasty patch (1.36 KB, patch)
2007-06-12 11:45 UTC, Jonathan Matthew
committed
Reworked parsing to save tracks into DB immediately - first draft (13.42 KB, patch)
2009-03-14 18:57 UTC, Kim Sullivan
needs-work
Save tracks into DB during parsing - second iteration (10.99 KB, patch)
2009-03-15 12:37 UTC, Kim Sullivan
committed

Description Vincent Untz 2007-03-30 10:04:42 UTC
Running rhythmbox 0.9.8.

Rb was using around 50MB of memory before I switched to the jamendo source. And it's now using ~250MB. Which is quite a lot :-)

I tried to disable the jamendo plugin and I was surprised to see the memory usage didn't change: it was still using ~250MB.
Comment 1 Luis Menina 2007-06-06 13:07:28 UTC
Confirming this one. The plugin is really in need of some optimization. It consumes almost all the memory of my 256MB laptop at work.
Comment 2 Jonathan Matthew 2007-06-09 08:49:54 UTC
Is there a useful python memory usage profiler?  Using massif is no fun at all.

Anyway, here, with an empty library, I see a peak somewhere above 120MB when the catalog loading finishes, which then drops off to somewhere around 70MB.  I'm guessing that the freed memory is not released back to the OS, so the resident size of the process doesn't go down.
Comment 3 Jonathan Matthew 2007-06-09 09:05:49 UTC
I've just committed a change to reduce the peak memory usage by ~20MB by ignoring the 'lyrics' and 'description' tags.  There may be other tags we can usefully ignore.
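A minimal sketch of how a SAX handler can skip the text content of tags it never uses, so large strings like lyrics are never buffered at all (hypothetical handler and input, not the actual plugin code):

```python
import io
import xml.sax

# Tags whose character data we never use; their text is dropped on the fly.
IGNORED_TAGS = frozenset(["lyrics", "description"])

class CatalogueHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.__skip = False
        self.__text = []
        self.tracks = []

    def startElement(self, name, attrs):
        self.__skip = name in IGNORED_TAGS
        self.__text = []

    def characters(self, content):
        if not self.__skip:     # discard ignored text instead of accumulating it
            self.__text.append(content)

    def endElement(self, name):
        if name == "name":
            self.tracks.append("".join(self.__text))
        self.__skip = False

handler = CatalogueHandler()
xml.sax.parse(io.BytesIO(
    b"<track><name>Song</name><lyrics>very long text here</lyrics></track>"),
    handler)
print(handler.tracks)  # ['Song']
```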
Comment 4 Jonathan Matthew 2007-06-09 12:04:28 UTC
Created attachment 89640 [details] [review]
some more interesting changes

- ignore the unused 'licenseURL' and 'link' attributes
- store track album IDs in the musicbrainz album ID field, rather than in a python dict (tiny bit of a hack..)
- instead of storing a reference to all the P2PLink data for each track, only store the bittorrent URLs for each album ID
- emit cover art URI in an idle handler (helps it actually work)

With all these changes, again with an empty library, the peak is about 70MB, and the ongoing memory usage is around 30MB.
Comment 5 Jonathan Matthew 2007-06-12 10:44:04 UTC
Created attachment 89798 [details] [review]
reduce magnatune memory usage too

This abuses the musicbrainz fields in RhythmDBEntry to store the data the magnatune plugin stores in four python dicts.  Saves about 12MB.  I think most of this is due to refstrings, as much of the information is duplicated across all the tracks in each album.
Comment 6 Jonathan Matthew 2007-06-12 11:45:19 UTC
Created attachment 89803 [details] [review]
less effective, less nasty patch

This just interns the strings, still storing them in python dicts.  Saves about 10MB instead of 12, but isn't quite so ugly.
Comment 7 James "Doc" Livingston 2007-06-16 08:20:45 UTC
Interning the strings sounds like a good idea. It would be great if we exposed RBRefString to python, but if we're interning in python it wouldn't save a huge amount more space.
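The interning idea can be sketched as follows (a hypothetical example using sys.intern, the Python 3 spelling of the intern() builtin that the Python 2 code of the era would have used):

```python
import sys

albums = {}

def add_track(track_id, album_name, artist_name):
    # Interning makes every track of the same album share one canonical
    # string object, rather than each dict entry holding its own copy.
    albums[track_id] = (sys.intern(album_name), sys.intern(artist_name))

add_track(1, "One Album", "Some Artist")
# Build the same album name at runtime, so it starts as a distinct object:
add_track(2, "".join(["One ", "Album"]), "Some Artist")

# After interning, both entries point at the same string object.
print(albums[1][0] is albums[2][0])  # True
```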
Comment 8 Jonathan Matthew 2007-06-17 22:36:21 UTC
Some further experiments showed that using per-album (rather than per-track) dicts for the cover art, artist info, and purchase URLs (keyed on the SKU field) works a bit better, so I've committed a patch that does that.  We can squeeze a bit more out of it by using the musicbrainz-albumid RhythmDBEntry field to store the SKU, but I haven't done that, since it's a bit ugly.
Comment 9 Jonathan Matthew 2007-07-10 12:49:02 UTC
I've committed the jamendo patch now.  We can probably still reduce the memory usage of the jamendo and magnatune plugins further, so I'll leave the bug open.
Comment 10 Kim Sullivan 2009-01-25 20:17:58 UTC
I placed some memory status calls (http://code.activestate.com/recipes/286222/) into the jamendo loading code (before parsing, during parsing, and after loading), and got the following results. Would it be possible to put the track data into the database right away in the sax parser (like magnatune does), instead of first constructing a large in-memory data structure and then processing it? My system usually runs out of physical memory during loading and starts swapping.
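For reference, a minimal Linux-only probe in the spirit of that recipe might look like this (a sketch, not the actual instrumented code):

```python
def memory_usage():
    """Return (VmSize, VmRSS) of the current process in bytes (Linux-only)."""
    vm_size = vm_rss = 0
    with open("/proc/self/status") as f:
        for line in f:
            fields = line.split()
            if fields[0] == "VmSize:":
                vm_size = int(fields[1]) * 1024   # /proc reports kB
            elif fields[0] == "VmRSS:":
                vm_rss = int(fields[1]) * 1024
    return vm_size, vm_rss

size, rss = memory_usage()
print("VmSize: %d VmRSS: %d" % (size, rss))
```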

(20:54:59) [0x8137408] [JamendoSource.__load_catalogue_open_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:427: Parsing start: 117428224.0
(20:54:59) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 118210560.0
(20:54:59) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 118689792.0
(20:54:59) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 119017472.0
(20:54:59) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 119345152.0
...
(20:55:21) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 232378368.0
(20:55:21) [0x8137408] [JamendoSource.__load_catalogue_open_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:427: Parsing start: 232378368.0
(20:55:21) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 232378368.0


Here, parsing apparently started (or worse, re-started?) for a second time (this roughly corresponds to a moment where the progress bar stops updating).

...
(20:56:12) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 517931008.0
(20:56:12) [0x8137408] [JamendoSource.finish_loadscreen] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:337: Parsing done, memory usage: 517931008.0
(20:56:12) [0x8137408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:618: Loading start, memory usage: 517931008.0
(20:56:19) [0x8137408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:666: Nb artistes : 4999
(20:56:19) [0x8137408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:667: Nb albums : 8838
(20:56:19) [0x8137408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:668: Nb tracks : 62807
(20:56:19) [0x8137408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:671: Done loading, memory: 517931008.0
(20:56:19) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 517931008.0
...
(20:56:46) [0x8137408] [JamendoSource.__load_catalogue_read_cb] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:370: Parsing progress: 517931008.0


After it peaks at about 500MB, the memory that's allocated in the parsing structure doesn't seem to be released (even though self.__saxHandler that contains the data is set to None after loading has finished).
 
Another thing that might benefit memory consumption (without having to completely rewrite the parsing and loading code) would be to ignore more elements (currently, the sax parser places a lot of the data from the XML file into memory, like the tags and their weights). I'll try this and report my findings.
Comment 11 Kim Sullivan 2009-01-25 20:30:20 UTC
Adding "Tags","tag","idstr" and "weight" to ignore in JamendoSaxHandler.py didn't help with memory consumption.
Comment 12 Jonathan Matthew 2009-02-19 12:48:54 UTC
(In reply to comment #10)
> I placed some memory status calls
> (http://code.activestate.com/recipes/286222/)into the jamendo loading code
> (before parsing, during parsing and after loading), and got the following
> results.

Is this the virtual memory size or the resident set size?

> Would it be possible to put the track data into the database right
> away in the sax parser (like magnatune does), instead of first constructing a
> large in-memory data structure and then processing it? 

Probably.  Why not try it and see?

> After it peaks at about 500MB, the memory that's allocated in the parsing
> structure doesn't seem to be released (even though self.__saxHandler that
> contains the data is set to None after loading has finished).

As I understand it, python does not release freed memory back to the OS.

> Another thing that might benefit memory consumption (without having to
> completely rewrite the parsing and loading code) would be to ignore more
> elements (currently, the sax parser places a lot of the data from the XML file
> into memory, like the tags and their weights). I'll try this and report my
> findings.

I doubt it.  There isn't all that much useless stuff in the jamendo song dump these days.
 

Comment 13 Kim Sullivan 2009-02-20 22:24:00 UTC
> Is this the virtual memory size or the resident set size?

VmSize (I know there's a difference between VM and actual physical memory used, but since my computer starts swapping madly, it looks like a large part of the memory is actually used, not just allocated). I re-ran the test, this time getting both VmSize and VmRSS, and VmRSS climbs too:
(22:58:12) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing start, VmSize: 169041920.0 VmRSS: 45338624.0
(22:58:12) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 178208768.0 VmRSS: 45973504.0
...
(22:58:59) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 424497152.0 VmRSS: 292139008.0
(22:58:59) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 424497152.0 VmRSS: 292143104.0
(22:58:59) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing done, VmSize: 424497152.0 VmRSS: 292147200.0
(22:58:59) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Loading start, VmSize: 424497152.0 VmRSS: 292147200.0
(22:59:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:666: Nb artistes : 9093
(22:59:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:667: Nb albums : 16103
(22:59:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:668: Nb tracks : 114091
(22:59:12) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Done loading, VmSize: 480497664.0 VmRSS: 348168192.0

And THEN the whole parsing starts again (this time, it started again after the first run was finished; last time, the second run started somewhere in between):

(23:00:18) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing start, VmSize: 488493056.0 VmRSS: 347533312.0
(23:00:18) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 488493056.0 VmRSS: 347533312.0
...
(23:01:04) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 503599104.0 VmRSS: 363368448.0
(23:01:04) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing progress, VmSize: 503599104.0 VmRSS: 363368448.0
(23:01:04) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Parsing done, VmSize: 503599104.0 VmRSS: 363368448.0
(23:01:04) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Loading start, VmSize: 503599104.0 VmRSS: 363368448.0
(23:01:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:666: Nb artistes : 9093
(23:01:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:667: Nb albums : 16103
(23:01:12) [0x9f56408] [JamendoSource.__load_db] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:668: Nb tracks : 114091
(23:01:12) [0x9f56408] [printMem] /usr/lib/rhythmbox/plugins/jamendo/JamendoSource.py:995: Done loading, VmSize: 503599104.0 VmRSS: 363372544.0


(On one instance, I even got VmSize 720MB, VmRSS 550MB after the second parse)

> Probably.  Why not try it and see?

Sorry, didn't have time to try it yet. You're saying that putting the stuff into the database during loading shouldn't be a problem? Also, since I'm going to have to rewrite it anyway, any idea how to fix the double load? At first I thought this was some weird threading output buffering hiccup, but now I'm almost certain that the parsing function sometimes actually gets called twice for some reason (I don't know much Python, much less PyGTK; I only understood that the parsing is done via callbacks that should be called once everything is ready).

It might have something to do with how long it takes to download the file (first parsing starts immediately, second parsing seems to start after the file is downloaded, so it depends on how fast your line is).
Comment 14 Jonathan Matthew 2009-02-20 23:40:40 UTC
(In reply to comment #13)
> (On one instance, I even got VmSize 720B, VmRSS 550MB after the second parse)
> > Probably.  Why not try it and see?
> 
> Sorry, didn't have time to try it yet. You're saying that putting the stuff
> into the database during loading shouldn't be a problem? Also, since I'm going
> to have to rewrite it anyway, any idea how to fix the double load?

All that's happening here is that we load the current version of the catalogue, then download the updated version, then load that.  When we know the catalogue is out of date, we should wait until the download finishes to load it.  I'm in the middle of rewriting the downloading code, so I'll fix this myself.
Comment 15 Jonathan Matthew 2009-02-26 03:07:21 UTC
That's fixed now.
Comment 16 Kim Sullivan 2009-03-14 16:39:56 UTC
Finally, I started working on a patch for parsing. The parsing logic is done (memory overhead of parsing is now 10M VmSize and 1M VmRSS), now I want to start adding the stuff to the DB. While working on it, I thought about some further things to do. I already changed the way id3 genres are loaded, but there is at least one other thing (storing the ids in location and overriding get_playback_uri so the album id doesn't have to be stored in the musicbrainz column).

Is it OK to submit everything in one big patch, or should I first submit the changed loading code (bug for bug compatibility), then the genre patch, then the custom get_playback_uri patch (maybe as part of a different bug)?
Comment 17 Christophe Fergeau 2009-03-14 17:10:09 UTC
Split patches make reviewing much easier since the reviewer will know that everything in the patch is related to doing "this" while if there are several different changes in the same patch, he has to guess and that sometimes can be tricky. If you already have some working code for part of this bug that you are happy with, you should attach it now :)
Comment 18 Kim Sullivan 2009-03-14 18:57:42 UTC
Created attachment 130676 [details] [review]
Reworked parsing to save tracks into DB immediately - first draft

This is my first patch, please bear with me.

The patch contains new loading code that stores tracks immediately into the database, resulting in the following memory characteristics:
(19:00:31) [0x92e4408] [printMem] /home/kim/gnome-svn/rhythmbox/lib/rhythmbox/plugins/jamendo/JamendoSource.py:763: __load_catalogue Parsing Start, VmSize: 115314688.0 VmRSS: 42803200.0
...
(19:01:44) [0x92e4408] [printMem] /home/kim/gnome-svn/rhythmbox/lib/rhythmbox/plugins/jamendo/JamendoSource.py:763: __catalogue_chunk_cb Parsing done, VmSize: 197636096.0 VmRSS: 116473856.0
(19:01:44) [0x92e4408] [printMem] /home/kim/gnome-svn/rhythmbox/lib/rhythmbox/plugins/jamendo/JamendoSource.py:763: __load_db Load start, VmSize: 197636096.0 VmRSS: 116473856.0
(19:01:44) [0x92e4408] [printMem] /home/kim/gnome-svn/rhythmbox/lib/rhythmbox/plugins/jamendo/JamendoSource.py:763: __load_db Load done, VmSize: 197636096.0 VmRSS: 116473856.0

A few notes:
- I generated the patch inside the rhythmbox/plugins/jamendo/jamendo directory (I don't know if this is OK)
- The memory printing code is still in place
- Downloading is short-circuited (see bug 575379) so it's easier to test (but it requires a valid catalogue already downloaded and unpacked)
- Genres are loaded differently (if there are inconsistencies, the genre is Album genre / Track genre); I'll separate this into a patch for bug 575380.
Comment 19 Jonathan Matthew 2009-03-15 06:23:17 UTC
This generally looks OK.  Please don't include debug or testing code in patches you submit to bugzilla.  It's harder to review a patch when I have to figure out which bits I'm supposed to ignore.  Same goes for changes for other bugs, such as the genre parsing stuff.  It doesn't matter which directory you generate patches from.

I don't like the way track and album data is stored in the same dict.  The code would be more readable if they were kept separate, and this part in particular would be much better:

+		#clean up data
+		if name in sections:
+			for key in self.__track.keys():
+				if key.startswith(name):
+					del self.__track[key]

if it was just
                self.__data[name] = {}

or something like that.
Comment 20 Jonathan Matthew 2009-03-15 06:32:00 UTC
Also, it's better to call db.commit() every 1000 entries or so.
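The batching idea can be sketched like this (using sqlite3 as a stand-in database; the plugin would call commit() on the RhythmDB object instead):

```python
import sqlite3

COMMIT_INTERVAL = 1000  # commit once per 1000 entries, not once per entry

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tracks (id INTEGER, name TEXT)")

pending = 0
for track_id in range(2500):
    db.execute("INSERT INTO tracks VALUES (?, ?)",
               (track_id, "track-%d" % track_id))
    pending += 1
    if pending >= COMMIT_INTERVAL:
        db.commit()
        pending = 0

db.commit()  # flush whatever is left after the last entry
count = db.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]
print(count)  # 2500
```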
Comment 21 Kim Sullivan 2009-03-15 09:05:13 UTC
I'm in the process of cleaning up the patch, I made a mistake and didn't use patch queues right away so I have to manually separate the code first.

(In reply to comment #19)
> if it was just
>                 self.__data[name] = {}
> 
> or something like that.
> 

You mean using nested dictionaries? Like self.__data['track']['name']? No problem, I just wanted to get something working quickly, and I wasn't sure how you work with those in python. I'll look it up and get back with an updated patch.

(In reply to comment #20)

> Also, it's better to call db.commit() every 1000 entries or so.

Ok.

While cleaning up, I found that I forgot to add the album-id-in-musicbrainz hack, which I'll fix also.
Comment 22 Kim Sullivan 2009-03-15 12:37:40 UTC
Created attachment 130696 [details] [review]
Save tracks into DB during parsing - second iteration

- Removed the genre and memory code
- Store album id in musicbrainz tag
- Commit to DB every 1000 songs + after the last tag
- Use nested dictionaries to store data, instead of a flat dict. I still explicitly clear them using clear(): just overwriting with a new empty dict should be fine in a garbage-collected language, but it seems to have a somewhat larger memory footprint. (In general, nested dictionaries seem to have a larger memory footprint than the single flat dict I used before; I don't know what I'm doing wrong.)
Comment 23 Kim Sullivan 2009-03-15 20:27:51 UTC
What's the significance of the __db_load_finished variable from JamendoSource? I'm not sure if I'm setting/resetting it correctly in the new parser (I haven't touched it yet, the logic still corresponds to the old parsing and loading).
Initially it is False
After downloading has finished but before parsing starts it is set to True (__download_catalogue_chunk_cb)
Before parsing starts, it is set to False again (__load_catalogue)
After parsing finishes it's set to True

The only method that checks this variable is __update_catalogue, which calls the __load_catalogue method if the xml file doesn't need downloading and __db_load_finished is False (which is currently any time during downloading and parsing, with the exception of a short interval after downloading and before parsing).
Comment 24 Jonathan Matthew 2009-03-16 07:18:42 UTC
This looks much better.  Reading it again, I'm not sure the nested dicts are really helping much.  The first level is only really used here:

 	def endElement(self, name):
		if self.__parse_content:
			self.__data[self.__section][name] = self.__text

and here, at the end of endElement:

		if name in data:
			self.__data[name].clear ()

and everywhere else it just slightly gets in the way.  Maybe it'd be easier to have separate artist, album, and track dicts, and a reference to the 'current' one, which would be set in startElement().  I don't know.
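A rough sketch of that suggestion, with one dict per section and a 'current' reference switched in startElement() (hypothetical handler and input, not the patch code):

```python
import io
import xml.sax

class Handler(xml.sax.ContentHandler):
    SECTIONS = ("artist", "album", "track")

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.__artist = {}
        self.__album = {}
        self.__track = {}
        self.__current = None       # dict for the section we're inside
        self.__text = []
        self.finished_tracks = []

    def startElement(self, name, attrs):
        if name == "artist":
            self.__current = self.__artist
        elif name == "album":
            self.__current = self.__album
        elif name == "track":
            self.__current = self.__track
        self.__text = []

    def characters(self, content):
        self.__text.append(content)

    def endElement(self, name):
        if name == "track":
            # combine current artist/album/track data into one entry
            self.finished_tracks.append(
                (self.__artist.get("name"), self.__album.get("name"),
                 self.__track.get("name")))
            self.__track.clear()
        elif name not in self.SECTIONS and self.__current is not None:
            self.__current[name] = "".join(self.__text)

handler = Handler()
xml.sax.parse(io.BytesIO(
    b"<artist><name>A</name>"
    b"<album><name>B</name>"
    b"<track><name>T</name></track>"
    b"</album></artist>"), handler)
print(handler.finished_tracks)  # [('A', 'B', 'T')]
```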

Anyway, this appears to work, and it makes a significant difference to memory usage, so I'm going to commit it after the 0.12.0 release (in two days).  If you feel like trying out other ways to store the parser state, then great, but otherwise this is fine as is.

(In reply to comment #23)
> What's the significance of the __db_load_finished variable from JamendoSource?
> I'm not sure if I'm setting/resetting it correctly in the new parser (I haven't
> touched it yet, the logic still corresponds to the old parsing and loading).

It's there to ensure that the catalogue gets loaded the first time through if it hasn't been updated.  Your patch doesn't appear to have touched it in any way, so don't worry about it.
Comment 25 Jonathan Matthew 2009-03-20 02:43:19 UTC
This seems to be triggering the pygobject crash I reported in bug 575781, but the current code doesn't.  I can't really see why.
Comment 26 Jonathan Matthew 2009-03-20 03:03:13 UTC
OK, got it.  I had to add a small hack to work around the crash.

2009-03-20  Jonathan Matthew  <jonathan@d14n.org>

        patch by:  Kim Sullivan  <alicebot@seznam.cz>

        * plugins/jamendo/jamendo/JamendoSaxHandler.py:
        * plugins/jamendo/jamendo/JamendoSource.py:
        Rework the jamendo xml parser to create database entries in a single 
        pass, rather than creating an intermediate structure and converting 
        that to database entries.  Speeds up catalogue loading and reduces 
        memory consumption.  From #424423.

Maybe it's time to close this bug?  I don't know.
Comment 27 Kim Sullivan 2009-03-20 09:08:30 UTC
I was thinking about storing the album, artist and track IDs in the database, instead of the URL. It was more for reasons of functionality than space savings, although some bytes could probably be shaved off (the stream URL is quite a long string of text; replacing it with three numbers, even encoded as text, would make it shorter).

I'd say close this, I'll file a new bug once I have the time to play with the code again.
Comment 28 Jonathan Matthew 2009-03-20 11:32:44 UTC
There are a few more rhythmdb entry fields that could be misused for storing artist ID and anything else that might be useful.

Using 'jamendo:<track-id>' instead of the stream URL as the entry location would probably save a reasonable amount of space.  You'd then have to provide a 'get_playback_uri' method for the entry type, which would extract the track ID and construct the stream URL.
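A sketch of that scheme (the stream URL pattern below is a made-up placeholder, not the real Jamendo API, and the function names are hypothetical):

```python
# Placeholder stream URL pattern -- NOT the real Jamendo endpoint.
STREAM_URL = "http://example.invalid/stream?track=%s"

def make_location(track_id):
    # Compact location stored in the database instead of the full stream URL.
    return "jamendo:%s" % track_id

def get_playback_uri(location):
    # Expand the compact location back into a stream URL at playback time.
    assert location.startswith("jamendo:")
    track_id = location[len("jamendo:"):]
    return STREAM_URL % track_id

loc = make_location(12345)
print(loc)                    # jamendo:12345
print(get_playback_uri(loc))
```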

Anyway, I'm going to close this bug now, and we can handle any other improvements in separate bugs.  Thanks for working on this.
