GNOME Bugzilla – Bug 673112
Add ArraySet and ArrayMap
Last modified: 2013-02-06 00:55:52 UTC
For situations where Sets and Maps are expected to contain very few entries, and not change much, a data structure with lower memory usage than HashSet/HashMap would be nice. Patch coming with an implementation of ArraySet and ArrayMap, which store their elements/entries in compact arrays and have linear complexity in basically all their operations. The use case is for folks: lots of objects in folks have Sets and Maps which are generally empty, or have 1 or 2 elements. In all cases, using a HashSet/HashMap allocates space for a minimum of 11 elements. Iteration over HashSets/HashMaps is O(size of allocated space), whereas iteration over ArraySets/ArrayMaps is O(number of elements in Set/Map). A small difference, but lots of the Sets/Maps in folks are iterated over frequently. From some measurements on folks, using ArraySet/ArrayMap in various places reduces memory consumption (for ~160 contacts) by ~300 KiB while making no measurable difference to startup time of the test program.
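To make the idea concrete, here is a minimal Python sketch of an array-backed set of the kind described (the actual patch is Vala against libgee; this is only an illustration of the data-structure idea, not the patch's code):

```python
class ArraySet:
    """A set backed by a compact list; all operations are O(n).

    For very small n the linear scan is cheap, and no bucket array
    is preallocated, so an empty set costs almost nothing.
    """

    def __init__(self):
        self._items = []  # grows only as elements are added

    def add(self, item):
        if item in self._items:       # O(n) membership scan
            return False
        self._items.append(item)
        return True

    def remove(self, item):
        try:
            self._items.remove(item)  # O(n) scan plus shift
            return True
        except ValueError:
            return False

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        # Iteration touches exactly len(self._items) slots, unlike a
        # hash set, which walks every allocated bucket.
        return iter(self._items)

    def __len__(self):
        return len(self._items)
```

The trade-off is exactly the one debated below: memory and iteration are as tight as possible, while lookup, add and remove degrade linearly with size.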
Created attachment 210900 [details] [review] Add ArraySet and ArrayMap

Here we go. The implementation isn't incredibly polished, but it does include tests for the new classes (and they both pass).

It's probably possible to optimise lookup behaviour in both by using a binary tree instead of a linear array. I haven't done this for two reasons:
• For the given use case (very few items in the data structure), the difference between linear and logarithmic time is probably negligible.
• I'd had enough of trying to get arrays of structs to cooperate with me in Vala.

The implementation of ArrayMap is particularly icky in places, but I see no way to improve it. This is probably a failing in my knowledge of Vala, so suggestions are welcome on ways to improve it.

The patch is based on libgee's 0.6 branch, but I can rebase it if necessary.
Thank you for submitting the patches. I am not sure exactly what the benefits are, so I will take the liberty of clarifying.

(In reply to comment #0)
> The use case is for folks: lots of objects in folks have Sets and Maps which
> are generally empty, or have 1 or 2 elements. In all cases, using a
> HashSet/HashMap allocates space for a minimum of 11 elements. Iteration over
> HashSets/HashMaps is O(size of allocated space), whereas iteration over
> ArraySets/ArrayMaps is O(number of elements in Set/Map). A small difference,
> but lots of the Sets/Maps in folks are iterated over frequently.

I would suspect that any benefit would be negligible compared to the overhead of the interfaces (2 indirect calls); in benchmarks this had a major impact. Big-O analysis is, by its very nature, not the best tool for analysing such cases. If you iterate a lot you may want to use/try the foreach function (https://bugzilla.gnome.org/show_bug.cgi?id=673112). Currently there is no specialisation for non-iterators, but it should be even faster (avoiding data allocation).

> From some measurements on folks, using ArraySet/ArrayMap in various places
> reduces memory consumption (for ~160 contacts) by ~300 KiB while making no
> measurable difference to startup time of the test program.

Is 300 KiB a large amount for you? On most modern systems it is less than 0.03% (assuming 1 GiB of memory). Even on 'embedded' modern systems like mobile phones it can be 0.1% (assuming 300 MiB of memory). As a side note, /bin/cat seems to have a larger resident size on my system (456 KiB), and gnome-contacts has more than 100 MiB of resident size. If it is, please explain your use-case.

(In reply to comment #1)
> It's probably possible to optimise lookup behaviour in both by using a binary
> tree instead of a linear array. I haven't done this for two reasons:

Do you mean a sorted array and binary search?

> The implementation of ArrayMap is particularly icky in places, but I see no
> way to improve it. This is probably a failing in my knowledge of Vala, so
> suggestions are welcome on ways to improve it.

It looks like the rest of libgee.

> The patch is based on libgee's 0.6 branch, but I can rebase it if necessary.

0.6 is currently in bug-fixes mode, while 0.7 is open for new features.
(In reply to comment #2)
> I would suspect that any benefit would be negligible compared to the overhead
> of the interfaces (2 indirect calls); in benchmarks this had a major impact.
> Big-O analysis is, by its very nature, not the best tool for analysing such
> cases.

Yes, the main point of these classes is to lower memory usage. As I said, the iteration difference is small, and you're right that it'll be negligible compared to the indirect calls.

> If you iterate a lot you may want to use/try the foreach function
> (https://bugzilla.gnome.org/show_bug.cgi?id=673112). Currently there is no
> specialisation for non-iterators, but it should be even faster (avoiding data
> allocation).

Did you mean to give a different link? That link's for this bug.

> Is 300 KiB a large amount for you? On most modern systems it is less than
> 0.03% (assuming 1 GiB of memory). [...] If it is, please explain your
> use-case.

That's a 300 KiB reduction in writeable memory usage by a program which normally uses 9.6 MiB (using Array[Set|Map], it's reduced to 9.3 MiB). That's a ~3% reduction. I did a few test runs, and the numbers were fairly consistent, but the tests weren't particularly rigorous.

(The test I did was to run folks-inspect, wait until it reached quiescence and then look at its writeable memory usage.)

> Do you mean a sorted array and binary search?

I was thinking about a flattened binary tree (http://en.wikipedia.org/wiki/Binary_tree#Arrays), but a sorted array and binary search would work similarly. I really haven't put much thought into it.

> It looks like the rest of libgee.

:-(

> 0.6 is currently in bug-fixes mode, while 0.7 is open for new features.

OK, I'll rebase it to 0.7 if you decide the classes are worth pursuing.
(In reply to comment #3)
> Did you mean to give a different link? That link's for this bug.

Oops, sorry - https://bugzilla.gnome.org/show_bug.cgi?id=645850#c4

> That's a 300 KiB reduction in writeable memory usage by a program which
> normally uses 9.6 MiB (using Array[Set|Map], it's reduced to 9.3 MiB). That's
> a ~3% reduction.

I meant more in the grand scheme of things - in short: does the reduction in memory footprint outweigh the maintenance cost? I don't think that 300 KiB would affect the user in any visible way (I'm less interested in memory consumption relative to program size - a 1% reduction in a program that takes 4 GiB would be much more visible).

> :-(

Thanks. In any case, I hope to do some cleanup of the code.

> OK, I'll rebase it to 0.7 if you decide the classes are worth pursuing.

I don't think that the use case is worth pursuing unless there is a memory-constrained platform we want libgee to run on, or there is some benchmark showing better cache usage. (In many cases the whole program will fit into the shared cache on x86.) I am marking it as WONTFIX, but please feel free to reopen if you have some new information.
(In reply to comment #4)
> I don't think that the use case is worth pursuing unless there is a
> memory-constrained platform we want libgee to run on, or there is some
> benchmark showing better cache usage. (In many cases the whole program will
> fit into the shared cache on x86.)

I don't suppose Array[Set|Map] will improve cache utilisation for the number of elements they're targeted at.

I do think that 300 KiB is a decent reduction in memory consumption (in absolute terms), given that it will increase linearly with the number of contacts handled by libfolks, and will apply to every instance of libfolks on a desktop (which, for GNOME 3.4, will be 3 or 4 instances, giving ~1 MiB of savings in total).

Still, as you say, the classes will have a maintenance cost. I guess they're not particularly general either.

Perhaps the same reductions could be achieved by changing MIN_SIZE in Hash[Set|Map] to 0 or 1, instead of the current value of 11? Is there any particular reason it's currently 11?
Sorry for the delay - I have been travelling, and after 11 hours of travel I was in no shape to reply.

(In reply to comment #5)
> I do think that 300 KiB is a decent reduction in memory consumption (in
> absolute terms), given that it will increase linearly with the number of
> contacts handled by libfolks, and will apply to every instance of libfolks on
> a desktop (which, for GNOME 3.4, will be 3 or 4 instances, giving ~1 MiB of
> savings in total).

On the other hand, the cost of looking up/adding/removing would 'scale' linearly as well. However, it looks like partially unrolled trees (B-trees, IIRC) might be an answer (possibly not the answer). I have not looked into the details.

> Still, as you say, the classes will have a maintenance cost. I guess they're
> not particularly general either.
>
> Perhaps the same reductions could be achieved by changing MIN_SIZE in
> Hash[Set|Map] to 0 or 1, instead of the current value of 11? Is there any
> particular reason it's currently 11?

Hmm. What about using ArrayList directly (you would need to check for existence 'by hand')? I don't like ArraySet/ArrayMap, as they have a very bad average case for every operation except iteration. I guess we can somehow find a compromise between the memory saving/iteration and adding/removing. I was planning to do it with linked lists, but I might start with sets.
(In reply to comment #6)
> On the other hand, the cost of looking up/adding/removing would 'scale'
> linearly as well.

Well, the intention was that Array[Set|Map] would not be used for data structures expected to contain more than a few elements. For small numbers of elements, the cost of looking up/adding/removing is basically the same for all data structures, since it's dominated by overheads.

> However, it looks like partially unrolled trees (B-trees, IIRC) might be an
> answer (possibly not the answer). I have not looked into the details.

B-trees are moderately wasteful of memory (about a 50% overhead per node due to maintaining child pointers), though better than linked lists. I guess they're about equal to hash tables in terms of memory waste.

> Hmm. What about using ArrayList directly (you would need to check for
> existence 'by hand')?

That's an option, but I'd like the code in folks to keep the Set and Map method calls/semantics it has already.

What is your opinion on reducing Hash[Set|Map].MIN_SIZE?
(In reply to comment #7)
> Well, the intention was that Array[Set|Map] would not be used for data
> structures expected to contain more than a few elements. For small numbers of
> elements, the cost of looking up/adding/removing is basically the same for
> all data structures, since it's dominated by overheads.

But then the memory savings are small - or I don't understand the use-case.

> B-trees are moderately wasteful of memory (about a 50% overhead per node due
> to maintaining child pointers), though better than linked lists. I guess
> they're about equal to hash tables in terms of memory waste.

Oops, sorry - you are right; I didn't mean B-trees. I have an idea about storing arrays of elements inside the tree - I may look into this in detail.

> What is your opinion on reducing Hash[Set|Map].MIN_SIZE?

1. The problem probably is that, in the 'normal' case, it results in a large number of reallocations at startup (2->3->5->7->11).
2. For a small number of elements, we could have something more similar to a linked list. My intuition is that there is a small chance of clashes even for a large number of elements. However, for a small number of elements the hash table degenerates into a list (for 2 elements there is a 50% chance of the worst case).

Possibly add a .sized constructor and do some twisting of the hash, instead of the very simple one used in GLib?
(In reply to comment #8)
> But then the memory savings are small - or I don't understand the use-case.

In relative terms (per Set instance, which we'll assume has 1 element for the sake of argument), they're not: compare 11 hash set bins (1 of which is occupied) plus a constant overhead of a few pointers, against 1 array element plus a constant overhead of a few pointers. That's ~3 times more memory being used by the hash set than by the array set to store the same data.

> Possibly add a .sized constructor and do some twisting of the hash, instead
> of the very simple one used in GLib?

Adding a .sized constructor would work for me.
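The per-instance comparison above can be checked with a quick back-of-envelope calculation (the 4-pointer constant overhead and 8-byte pointers are assumed illustrative figures, not measured from libgee):

```python
POINTER = 8  # bytes, assuming a 64-bit system

def hash_set_bytes(n_bins=11, overhead_ptrs=4):
    # n_bins bucket slots are allocated regardless of occupancy,
    # plus a few bookkeeping pointers (size, hash func, ...).
    return (n_bins + overhead_ptrs) * POINTER

def array_set_bytes(n_elements=1, overhead_ptrs=4):
    # Only the occupied slots are allocated.
    return (n_elements + overhead_ptrs) * POINTER

# For a 1-element set: (11 + 4) / (1 + 4) = 3.0, the "~3 times" figure.
print(hash_set_bytes() / array_set_bytes())
```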
> Adding a .sized constructor would work for me.

I've checked a bit, and it seems that arrays are what you want - hash tables have bad performance for a small number of elements. The number 11 is reasonable, and it comes from the underlying GLib function.

Could you show me the code for your use case, or describe it in more detail?
(In reply to comment #10)
> Could you show me the code for your use case, or describe it in more detail?

Basically, folks has several hundred Individual instances when it's run (for an average-sized contact list — this number changes linearly with the number of contacts you have). Each Individual is an aggregation of one or more Personas, but typically an Individual will only have 1 Persona (perhaps 2 or 3, but almost certainly no more than 5 or so).

http://git.gnome.org/browse/folks/tree/folks/individual.vala#n102

We currently store the Personas for each Individual in a HashSet. We need the Set semantics, fast iteration and relatively fast lookup. Once an Individual is constructed, the set is basically immutable, so modifications to it don't need to be fast.

In addition to this, each Individual aggregates sets of properties about its Personas. There are several such Sets, each of which will have perhaps up to 10 elements, though could quite frequently have 0 elements (it depends entirely on how much information your address book/contact list has about the Individual):

http://git.gnome.org/browse/folks/tree/folks/individual.vala#n471

These Sets behave similarly: basically immutable after creation, with fast iteration and relatively fast lookup needed.

Does that help?
(In reply to comment #11)
> These Sets behave similarly: basically immutable after creation, with fast
> iteration and relatively fast lookup needed.
>
> Does that help?

Yes, thanks - the situation is much clearer now. Let me get back to you in a few days - I will think about how to solve the problem.
As another data point, I've been running valgrind's massif against folks (same test conditions; about 120 Individual instances). Here's a snippet of the ms_print output, showing that HashSets use ~500 KiB of heap in folks. Note that this doesn't consider whether the sets are empty, however. (For my own reference, this is from massif.out.11555.)

84.71% (4,545,900B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->22.93% (1,230,644B) 0x59703CE: g_object_constructor (gobject.c:1849)
| ->20.76% (1,114,024B) 0x596FB9B: g_object_newv (gobject.c:1713)
| | ->20.66% (1,108,440B) 0x596F597: g_object_new (gobject.c:1542)
| | | ->09.27% (497,352B) 0x56E1AE1: gee_abstract_collection_construct (abstractcollection.c:1610)
| | | | ->09.27% (497,352B) 0x56E8E9E: gee_abstract_set_construct (abstractset.c:213)
| | | | | ->09.27% (497,352B) 0x57013B6: gee_hash_set_construct (hashset.c:319)
| | | | | | ->09.27% (497,352B) 0x5701567: gee_hash_set_new (hashset.c:351)
(In reply to comment #12)
> Yes, thanks - the situation is much clearer now. Let me get back to you in a
> few days - I will think about how to solve the problem.

The only thing that came to my mind is, for size < 11 (or some other limit), to just allocate an array and use the HashSet transparently as an ArraySet. I would prefer to run a few benchmarks first.

Re. valgrind - sorry, I'm not familiar with this tool. Does it mean you have 500 KiB of HashSet objects?
(In reply to comment #14)
> The only thing that came to my mind is, for size < 11 (or some other limit),
> to just allocate an array and use the HashSet transparently as an ArraySet. I
> would prefer to run a few benchmarks first.

Yeah, you could have a flag in HashSet which, if set, means that _nodes is actually a densely packed array and should be treated as in ArraySet. If the flag isn't set, then _nodes could behave as normal.

> Re. valgrind - sorry, I'm not familiar with this tool. Does it mean you have
> 500 KiB of HashSet objects?

Yeah, although a small percentage of that might be heap overheads. See: http://valgrind.org/info/tools.html#massif and http://valgrind.org/docs/manual/ms-manual.html.
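A rough Python sketch of this hybrid flag-based idea (the name _nodes and the threshold of 11 follow the discussion; everything else here is assumed for illustration and is not libgee's actual implementation — a dict stands in for the hash table):

```python
ARRAY_THRESHOLD = 11  # stay in dense-array mode below this size

class HybridSet:
    """Stores few elements in a packed list; migrates to hash-based
    storage once the set grows past the threshold."""

    def __init__(self):
        self._is_array = True
        self._nodes = []  # densely packed array while small

    def add(self, item):
        if self._is_array:
            if item in self._nodes:            # linear scan
                return False
            if len(self._nodes) < ARRAY_THRESHOLD:
                self._nodes.append(item)
                return True
            # Grown past the threshold: migrate to hash storage.
            self._nodes = dict.fromkeys(self._nodes)
            self._is_array = False
        if item in self._nodes:                # hash lookup
            return False
        self._nodes[item] = None
        return True

    def __contains__(self, item):
        # Works in both modes: linear scan or hash lookup.
        return item in self._nodes

    def __len__(self):
        return len(self._nodes)
```

The point of the design is that an empty or tiny set never pays for bucket preallocation, while a large one keeps the usual O(1) hash behaviour.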
I started implementing it, and it looks like the code might get more complicated in a few important places. I guess that it would have a negligible impact on performance, but I want to confirm that with benchmarks. Unfortunately, the request for ideas (https://mail.gnome.org/archives/libgee-list/2012-April/msg00000.html) did not attract a lot of attention. If anyone could suggest a way to easily "macro"-benchmark sets, I would be grateful.
Could you check the changes I've made to the hash-array branch (sets only so far)? https://gitorious.org/~uzytkownik/libgee/mpiechotkas-libgee/commits/hash-array
Coalesced hashing is an interesting idea. Unless I'm mistaken, though, it wouldn't significantly reduce memory consumption for folks, since it still allocates ~9 nodes for each empty hash table, all with chaining pointers. It is only more memory-efficient in cases where several items have been inserted into the same bucket of the hash table.
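For readers unfamiliar with the technique being discussed: in coalesced hashing, all entries live in one node array, and a colliding key is placed in any free slot of that same array and linked in by index, so chains share the bucket array instead of allocating separate chain nodes. A minimal sketch of the cellar-less variant (illustrative only; not the code on the hash-array branch — the capacity of 9 echoes the node count mentioned above):

```python
EMPTY = None  # sentinel: keys may not be None in this sketch

class CoalescedHashSet:
    """All nodes live in one fixed array; collision chains are built
    by linking slot indices, so no separate chain nodes exist."""

    def __init__(self, capacity=9):
        self._keys = [EMPTY] * capacity
        self._next = [-1] * capacity  # -1 terminates a chain
        self._capacity = capacity

    def add(self, key):
        i = hash(key) % self._capacity
        if self._keys[i] is EMPTY:
            self._keys[i] = key
            return True
        # Walk the chain, looking for the key or the chain's end.
        while True:
            if self._keys[i] == key:
                return False
            if self._next[i] == -1:
                break
            i = self._next[i]
        # Place the key in any free slot and link it into the chain.
        for j in range(self._capacity):
            if self._keys[j] is EMPTY:
                self._keys[j] = key
                self._next[i] = j
                return True
        raise OverflowError("table full")

    def __contains__(self, key):
        i = hash(key) % self._capacity
        if self._keys[i] is EMPTY:
            return False
        while i != -1:
            if self._keys[i] == key:
                return True
            i = self._next[i]
        return False
```

This makes the objection above concrete: the node array (and its chain-index array) is fully allocated even when the set is empty, so the saving only appears once buckets start colliding.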
After playing with it over the past few months, I have become convinced that this is not worth the effort. An 8-element array produced no measurable difference in memory consumption (I'm not sure why - I confirmed that the appropriate versions of the library were loaded). I found that coalesced hashing performed worse in my test cases. I don't believe that 1 MiB (~0.75% of G1 RAM) of data saved is relevant enough to investigate further.