GNOME Bugzilla – Bug 358077
save memory by stripping out common text in multipart message-ids
Last modified: 2006-09-30 01:43:12 UTC
Message-Ids in multipart articles are usually nearly identical, like this:

<JIudnQRwg-iopJbYnZ2dnUVZ_v-dnZ2d@giganews.com>
<JIudnQdwg-ihpJbYnZ2dnUVZ_v-dnZ2d@giganews.com>
<JIudnQZwg-jepJbYnZ2dnUVZ_v-dnZ2d@giganews.com>
<JIudnQFwg-jXpJbYnZ2dnUVZ_v-dnZ2d@giganews.com>
<JIudnQBwg-jMpJbYnZ2dnUVZ_v-dnZ2d@giganews.com>
<JIudnQNwg-jFpJbYnZ2dnUVZ_v-dnZ2d@giganews.com>

In large newsgroups, _many_ megs can be saved by stripping out common text. There are lots of ways to do this, but the implementation in the following attachment uses this scheme:

We assign Article::Part's Message-Id by passing in its real Message-Id and a reference key (which currently is always the owner Article's message_id). The identical chars at the beginning (b) and end (e) of the two are counted. b and e have an upper bound of UCHAR_MAX (255).

Article::Part::folded_message_id's first byte holds 'b'. The unique middle characters follow, then the last byte holds 'e'.

As a special case, when the Part's Message-Id is equal to the key, part.folded_message_id is set to "=".
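The scheme described above can be sketched roughly as follows. This is only an illustration, not the code from the attached patch: the function names fold_message_id and unfold_message_id are hypothetical, and std::string is used for simplicity so that a prefix count of 0 can be stored as an ordinary byte. The sketch also assumes the common suffix must not overlap the common prefix, which the description does not spell out.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: fold `mid` against `key` as
// [b][unique middle chars][e], or "=" when the two are identical.
std::string fold_message_id(const std::string& mid, const std::string& key)
{
    if (mid == key)
        return "=";

    const std::size_t shortest = std::min(mid.size(), key.size());

    // b: length of the common prefix, capped at UCHAR_MAX.
    const std::size_t b_max = std::min<std::size_t>(shortest, UCHAR_MAX);
    std::size_t b = 0;
    while (b < b_max && mid[b] == key[b])
        ++b;

    // e: length of the common suffix, capped at UCHAR_MAX and
    // (assumption) never overlapping the prefix already counted.
    const std::size_t e_max = std::min<std::size_t>(shortest - b, UCHAR_MAX);
    std::size_t e = 0;
    while (e < e_max && mid[mid.size() - 1 - e] == key[key.size() - 1 - e])
        ++e;

    std::string folded;
    folded.reserve(mid.size() - b - e + 2);
    folded.push_back(static_cast<char>(static_cast<unsigned char>(b)));
    folded.append(mid, b, mid.size() - b - e);
    folded.push_back(static_cast<char>(static_cast<unsigned char>(e)));
    return folded;
}

// Hypothetical inverse: rebuild the full Message-Id from the folded
// form and the same key that was used for folding.
std::string unfold_message_id(const std::string& folded, const std::string& key)
{
    if (folded == "=")
        return key;

    const std::size_t b = static_cast<unsigned char>(folded.front());
    const std::size_t e = static_cast<unsigned char>(folded.back());

    std::string mid;
    mid.reserve(b + folded.size() - 2 + e);
    mid.append(key, 0, b);                     // shared prefix from the key
    mid.append(folded, 1, folded.size() - 2);  // unique middle chars
    mid.append(key, key.size() - e, e);        // shared suffix from the key
    return mid;
}
```

With the first two Message-Ids listed above, using the first as the key, the second folds down to 8 bytes: the prefix count, the six differing characters "dwg-ih", and the suffix count. A folded value other than "=" is always at least two bytes long, so the one-byte "=" marker cannot be confused with it.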
Created attachment 73528
0.114 patch

First draft.
From a 30 day sampling of a.b.drwho:

0.114: 109 meg
0.114 + patch: 91 meg

Given the large memory win, I'd like to get this into 1.0 if the patch proves to be stable enough.
Created attachment 73577
0.114 patch

Comment on attachment 73577
0.114 patch

Second draft.
Created attachment 73599
0.114 patch

Third draft.

* save more memory (cost of a.b.drwho goes from 130M to 101M) by having Part use char* instead of std::strings
* faster Part loading from disk.
* avoid unnecessary string cloning during xover's load_part.

This draft looks good in valgrind & sysprof.
BTW, that's 130M in the second draft, not 130M in 0.114. We've now cut the footprint by over half in large groups.

Here's top looking at 0.114 vs 0.114 + third draft. This was taken after starting up each and loading a 30 day snapshot of a.b.dvd:

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
10331 charles  15   0  917m 910m 9.8m S    0 25.8  0:21.99 pan-old
10319 charles  16   0  400m 392m 9992 S    0 11.1  0:18.80 pan-new