GNOME Bugzilla – Bug 644437
GHashTable is memory hungry
Last modified: 2011-05-07 18:43:01 UTC
When the 7895161st element is added to a GHashTable, the memory usage goes up to ~400M. Temporarily, during the resize that happens then, it goes up to ~600M, all of which is accessed. That's about 75 bytes per element. (Someone threw an 8M-cell spreadsheet at Gnumeric. I hadn't quite expected that much memory to be used for the cell hash.)
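(Rough arithmetic, assuming the 24-byte buckets described below and table sizes around 8M and 16M buckets: the new ~16.8M-bucket array is 16.8M x 24 bytes ≈ 400M, and during the resize the old ~8.4M-bucket array (~200M) is still live alongside it, giving ~600M total, or ~75 bytes per element.)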
Currently, GHashTable uses 12 bytes per bucket on 32-bit platforms, or 24 bytes on 64-bit platforms. Open hash tables also maintain a variable number of unused buckets in addition to the actual elements.

I can think of one change that would reduce the memory footprint to 9 bytes per bucket (or 17 bytes on 64-bit platforms) without changing the API or scaling properties of GHashTable. This also applies to empty buckets and tombstones, so in effect the 600M figure would be reduced to 425M.

The bucket structure currently looks like this:

struct _GHashNode
{
  gpointer   key;
  gpointer   value;

  /* If key_hash == 0, node is not in use
   * If key_hash == 1, node is a tombstone
   * If key_hash >= 2, node contains data */
  guint      key_hash;
};

key_hash is 4 bytes, with another 4 bytes of padding on 64-bit platforms. We could take it out of the structure and maintain key_hash in a separate array. Additionally, we could store it as a single byte by xor'ing together the component bytes of the integer generated by the hash function. Subtracting the values for unused buckets and tombstones, there's still only a 1/254 chance that different values produce the same hash.

The only downside would be that we'd need to access at least two cache lines for each lookup. I don't know if that'd matter much in practice - it's worth benchmarking.

So:

struct _GHashNode
{
  gpointer   key;
  gpointer   value;
};

struct _GHashTable
{
  gint             size;
  gint             mod;
  guint            mask;
  gint             nnodes;
  gint             noccupied;  /* nnodes + tombstones */
  GHashNode       *nodes;

  /* If key_hash == 0, node is not in use
   * If key_hash == 1, node is a tombstone
   * If key_hash >= 2, node contains data */
  guchar          *key_hashes;

  GHashFunc        hash_func;
  GEqualFunc       key_equal_func;
  volatile gint    ref_count;

#ifndef G_DISABLE_ASSERT
  /*
   * Tracks the structure of the hash table, not its contents: is only
   * incremented when a node is added or removed (is not incremented
   * when the key or data of a node is modified).
   */
  int              version;
#endif

  GDestroyNotify   key_destroy_func;
  GDestroyNotify   value_destroy_func;
};

An added benefit is that the hash table's memory would be allocated in two chunks, reducing the need for contiguous memory slightly.

If we were to provide a different API, we could do even better. Maybe it's time for a GHashDict where key == value?
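To make the byte-folding concrete, a minimal sketch (fold_hash_to_byte is just an illustrative name, not proposed API):

#include <glib.h>

/* Fold a full 32-bit hash into one byte by xor'ing its component
 * bytes, then remap the two reserved marker values (0 = unused,
 * 1 = tombstone) so stored hashes are always >= 2.  254 usable
 * values remain, hence the 1/254 collision figure above. */
static guchar
fold_hash_to_byte (guint full_hash)
{
  guchar h = (full_hash ^ (full_hash >> 8) ^
              (full_hash >> 16) ^ (full_hash >> 24)) & 0xff;

  return h >= 2 ? h : h + 2;
}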
> If we were to provide a different API, we could do even better. Maybe it's
> time for a GHashDict where key == value?

I don't think a new API is needed for that. One could call

  g_hash_table_set_flags (h, ..._KEY_ONLY /* value */, ..._KEY_ONLY /* mask */);

at a time before the first element is inserted. That would require that the keys and values be stored in separate arrays, and only one of them would be allocated if the above flag was set. GHashNode would be down to single elements, i.e., pointless.

In my prototype I also got rid of the need to store the hash at all, see bug 644197. That's pretty brutal and is tricky unless the user can supply two magic key values.

I think I like your 1-byte version.
Having flags that can only be set prior to the first insertion feels a bit ad hoc to me, though admittedly it might be better than a whole new data structure. We could also silently convert between key-value and key-only fairly easily, avoiding the requirement that the flags be set prior to first insertion, though that would slow us down a little with extra checks on every operation.

The 1-byte trick would likely slow us down more, since the full hash would need to be recalculated for every key when the table is resized - potentially a couple of million calls to a user-provided function that might reference extra memory and cause a lot of page faults/cache misses.

It'd be nice to have benchmarks for this so we can see what the tradeoff is like, preferably from an app that uses GHashTable heavily (like Gnumeric). I don't trust synthetic benchmarks that much, since it's easy to over-emphasize a use case or get misleading results from a too-"clean" environment that way.
I don't think optimizing for millions of elements is really what GHashTable is for... If you can come up with non-intrusive ways to curb memory consumption, sure. But complicated restructuring and new flags don't sound attractive to me.
"I don't think optimizing for millions of elements is really what GHashTable is for..." That seems to mean "we shouldn't make GHashTable optimal for the case of millions of elements disregarding any effects of other cases". I don't think anyone will disagree with that. (And no-one suggested it.) I think the following is feasible without major changes: 1. Split the GHashNode array into three separate arrays: keys/values/hashes. [Saves 1/6] 2. Make the values array optional, in practice by pointing it the same place as the keys array so most code wouldn't notice. [Saves 1/3] Step 2 requires some kind of flag setting, but if you don't like making that explicit, a separate constructor, say g_hash_table_new_set (...), could do it. I think the old code that needs to know about step 2 is constructor, destructor, and resize. And, perhaps, a key==value check for insert.
would be interesting to see patches for 1 and 2, indeed.
Created attachment 186747 [details] [review] 0001-GHash-use-macros-to-check-for-magic-hash-values.patch First of 4 patches.
Created attachment 186748 [details] [review] 0002-GHash-split-nodes-array-into-seperate-arrays.patch
Created attachment 186749 [details] [review] 0003-GHash-eliminate-one-of-the-lookup_nodes-functions.patch
Created attachment 186750 [details] [review] 0004-GHash-introduce-set-mode.patch
Patch 2 is roughly step 1. Patch 4 is roughly step 2. Simple test case included. Patches 1 and 3 are for simplifying this to make the other two easier.
Unrelatedly, someone ought to decide whether to use "int" or "guint" for node indices.
Thanks! Now we need some performance tests...
Review of attachment 186750 [details] [review]:

::: glib/ghash.c
@@ +604,3 @@
+ *
+ * Creates a new #GHashTable like g_hash_table_new_full(), but for use as a
+ * set where key and value are always the same. Only the key will be stores

s/stores/stored/

@@ +955,2 @@
   g_return_if_fail (hash_table != NULL);
   g_return_if_fail (hash_table->ref_count > 0);

We should probably mention in the docs that it is an error to call g_hash_table_insert (h, k, v) with k != v on a hash set. And maybe have a g_warning?
We do? HP, do you have something from when you changed the algorithm? That would be less suspect than anything I can cook up.
> We should probably mention in the docs that it is an error to call
> g_hash_table_insert (h, k, v) with k != v on a hash set. And maybe have a
> g_warning ?

I was a bit more brutal, see g_hash_table_insert_internal:

  if (hash_table->is_set)
    value = key;

In other words, we ignore the value. It might be better to document that.
I'm very much in favor of adding a

  g_return_if_fail (!is_set || key == value)

to the insertion functions. The current approach of silently doing something else is not good: it makes people not find their bugs.

Also, I'd be interested in whether there are cache issues for large hashes when using 3 separate arrays. I don't think it's worth optimizing for that case, but if someone does perf measurements, it'd be nice to know.

Last but not least, do we need to think about bindability for this, so that bindings can allow set.insert (value) instead of needing set.insert (key, value)?
Bindability can be solved like this:

void
g_hash_table_set_insert (GHashTable *hash_table, gpointer key)
{
  g_hash_table_insert (hash_table, key, key);
}

(Explicitly a valid and meaningful call even for non-sets.)

I'm holding off on the little stuff until we figure out if HP has some performance tests around.
I added the following test function to tests/hash.c and timed the resulting code.

static void
performance_hash_test (void)
{
  GHashTable *hash_table = g_hash_table_new (NULL, NULL);
  int i;

  for (i = 0; i < (1 << 22); i++)
    {
      int ri = g_test_rand_int_range (0, 1 << 24);
      g_hash_table_insert (hash_table, GINT_TO_POINTER (ri), NULL);
    }

  for (i = 0; i < (1 << 22); i++)
    {
      int ri = g_test_rand_int_range (0, 1 << 24);
      if (ri & 16)
        g_hash_table_insert (hash_table, GINT_TO_POINTER (ri), NULL);
      else
        g_hash_table_remove (hash_table, GINT_TO_POINTER (ri));
    }

  g_hash_table_destroy (hash_table);
}

On this machine and with this test sample it appears that the new and old code are about the same speed. If anything, the new code appears to beat master. (That makes a bit of sense: the cpu's addressing unit can multiply by 4 and 8; multiplying by 24 takes two extra instructions.)

new:
real 0m3.977s  user 0m3.440s  sys 0m0.348s
real 0m3.883s  user 0m3.448s  sys 0m0.340s
real 0m3.956s  user 0m3.436s  sys 0m0.408s
real 0m3.947s  user 0m3.384s  sys 0m0.380s
real 0m3.990s  user 0m3.468s  sys 0m0.376s
real 0m3.968s  user 0m3.388s  sys 0m0.364s

old:
real 0m3.957s  user 0m3.500s  sys 0m0.416s
real 0m3.986s  user 0m3.412s  sys 0m0.440s
real 0m4.108s  user 0m3.556s  sys 0m0.416s
real 0m4.056s  user 0m3.560s  sys 0m0.352s
real 0m4.037s  user 0m3.452s  sys 0m0.460s
real 0m4.112s  user 0m3.528s  sys 0m0.404s
(In reply to comment #18)
> Bindability can be solved like this:
>
> void
> g_hash_table_set_insert (GHashTable *hash_table, gpointer key)
> {
>   g_hash_table_insert (hash_table, key, key);
> }
>
> (Explicitly a valid and meaningful call even for non-sets.)

If we go down that route, we probably also want

gboolean
g_hash_table_set_contains (GHashTable *ht, gpointer key)

My quick-and-dirty performance comparisons also point at the new code being slightly faster.
I don't like going that route. If we were to add set-specific APIs, we should have a GHashSet and give it a completely custom API, kinda like we do with all the different array types. But having hash table API that works on just some hash tables seems like a bad idea to me.
Both will work fine:

g_hash_table_set_insert (ht, key) will work just fine on non-set hash tables (assuming that g_hash_table_insert (ht, key, key) does), and

g_hash_table_set_contains (ht, key) will work just fine on any hash table, set or not...
Created attachment 186880 [details] [review] 0005-GHash-typo.patch
Created attachment 186881 [details] [review] 0006-GHash-don-t-ignore-values-for-sets.-Test-that-they-m.patch
Created attachment 186882 [details] [review] 0007-GHash-introduce-g_hash_table_exists-convenience-API.patch
Created attachment 186883 [details] [review] 0008-GHash-introduce-g_hash_table_set_insert.patch
Review of attachment 186880 [details] [review]: Just commit stuff like this IMO.
There are still things that can be improved memory-wise.

For example, if g_direct_hash is used -- explicitly or via NULL -- then there is no point in storing the hash value. We would still need the unused/tombstone/used marker, but a much smaller data item would suffice. I do *not* think that would be worth the complications of the code.

Unrelatedly, if the sorting of names in glib.symbols is important, then g_hash_table_unref and g_hash_table_ref need to be moved.
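To illustrate the g_direct_hash point (a sketch only -- per the above this is not actually proposed for inclusion, and the names are made up):

#include <glib.h>

/* With g_direct_hash the full hash is just the pointer bits, so
 * recomputing it on lookup/resize is free and only a 1-byte
 * unused/tombstone/used marker would need to be stored. */
enum { MARKER_UNUSED = 0, MARKER_TOMBSTONE = 1, MARKER_USED = 2 };

static guint
bucket_hash (GHashFunc hash_func, gconstpointer key)
{
  /* NULL means g_direct_hash, as in g_hash_table_new (NULL, NULL). */
  if (hash_func == NULL || hash_func == g_direct_hash)
    return GPOINTER_TO_UINT (key);   /* recomputed, never stored */

  return hash_func (key);
}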
Colin: attachment 186880 [details] [review] fixes a typo introduced in attachment 186750 [details] [review], so I can't commit it now.
(In reply to comment #28)
> There are still things that can be improved memory-wise.
>
> For example, if g_direct_hash is used -- explicitly or via NULL -- then
> there is no point in storing the hash value. We would still need the
> unused/tombstone/used marker, but a much smaller data item would suffice.
> I do *not* think that would be worth the complications of the code.

Yeah, let's hold off on that.

> Unrelatedly, if the sorting of names in glib.symbols is important, then
> g_hash_table_unref and g_hash_table_ref need to be moved.

It's not important, except for the fact that it makes stuff easier to find.

From my perspective, things are looking good; just waiting for a quick nod from hpj.
(In reply to comment #25)
> Created an attachment (id=186882) [details] [review]
> 0007-GHash-introduce-g_hash_table_exists-convenience-API.patch

I strongly prefer g_hash_table_contains() as a name for that function.
2 cents: Questions of bindability aside, I think it would make sense to have this happen 'automatically' for every hash table instead of introducing a new constructor. First, we decide a convention -- either 'value' is always (void*)1 or equal to key. Let's say key == value, since that's easier. Then we just need to create all hash tables in 'set' mode and add a check like this on inserts:

if G_UNLIKELY (key != value && key_array == value_array)
  value_array = memdup (value_array);

(G_UNLIKELY since it would only happen once per table, and sometimes never).

I agree with 'contains' being a better name than 'exists'. 'has' or 'has_key' would also be OK in my opinion.

I mildly prefer storing (void*)1 instead of key == value for a couple of reasons:

- it's less bizarre from a language-binding standpoint, since you just have HashTable<thing, bool>. Then you could have a function for insert_true or something like that.

- it lets you store NULL in your set in a way that's easy to distinguish (ie: without using the 'extended' lookup API)

but with the drawback:

- implementing lookup becomes very slightly more complicated (and therefore slightly slower). I wouldn't care if it wasn't such a common operation.
Ah. Of course, the contains() function would address my point about NULL.
At some point we need to stop talking and start doing.

(void*)1 is not a good option. It requires a lot of extra changes (== run-time tests) that are currently avoided by pointing the two arrays to the same data.
(In reply to comment #32)
> if G_UNLIKELY (key != value && key_array == value_array)
>   value_array = memdup (value_array);

I guess that last line was supposed to read "memdup (key_array)" instead? (Apart from the actual memdup implementation, which would be a malloc+memcpy I presume.)
Read again. key_array == value_array. :) But ya... obviously it was pseudo-code, since we'd use g_memdup and last I checked, memdup takes more arguments. :)
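For completeness, a self-contained sketch of that copy-on-divergence check (using g_memdup; the parameter names are illustrative, and as noted, memdup'ing either array is equivalent since they alias):

#include <glib.h>

/* On the first insert where key != value, split the aliased values
 * array off from the keys array; until then a 'set mode' table
 * carries no separate values allocation. */
static void
maybe_split_values (gpointer **value_array_p, gpointer *key_array,
                    gint size, gpointer key, gpointer value)
{
  if (G_UNLIKELY (key != value && *value_array_p == key_array))
    *value_array_p = g_memdup (key_array, size * sizeof (gpointer));
}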
(In reply to comment #30)
> From my perspective, things are looking good; just waiting for a quick nod
> from hpj.

Performance-wise, it looks good to me. I trust you on the API changes. I like Ryan's "automatic conversion" approach, but I worry about hard-to-notice memory inefficiencies where buggy app code accidentally turns a set into a map.
(In reply to comment #34)
> At some point we need to stop talking and start doing.

Indeed. So, since I quite like the 'automatic conversion' idea myself, here is what I have done now:

- I have taken the first 3 patches, which do the splitting of the nodes array into three separate arrays.

- The patch that introduces 'set mode' has been tweaked to do so transparently, without an explicit is_set boolean.

- The documentation now contains an example that shows how to use a hash table for storing a set (something along the lines sketched below).

- No new API has been added; we can always do that later if we decide that we need bindable API or a completely separate GHashSet.
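Presumably something like this (my sketch, not necessarily the exact example that landed in the docs):

#include <glib.h>

int
main (void)
{
  /* Use a GHashTable as a set: insert each key as its own value.
   * With transparent 'set mode', no separate values array is
   * allocated until some insert passes a value != its key. */
  GHashTable *set = g_hash_table_new (g_str_hash, g_str_equal);

  g_hash_table_insert (set, "red", "red");
  g_hash_table_insert (set, "green", "green");

  if (g_hash_table_lookup (set, "red"))
    g_print ("red is in the set\n");

  g_hash_table_destroy (set);
  return 0;
}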
Thinking about it, it should work, but are we sure that g_hash_table_new_full () with 2 DestroyNotifys works properly? A test with a refcounted structure would be nice.
Commit 6e45153 broke "make check":

** ERROR **: failed to reallocate slice from magazine (after g_thread_init): size=81
/bin/sh: line 5: 14743 Trace/breakpoint trap srcdir=. LIBCHARSET_ALIAS_DIR=../glib/libcharset MALLOC_CHECK_=2 MALLOC_PERTURB_=$((${RANDOM:-256} % 256)) ${dir}$tst
FAIL: slice-threadinit
(In reply to comment #39)
> Thinking about it, it should work, but are we sure that g_hash_table_new_full
> () with 2 DestroyNotifys works properly? A test with a refcounted structure
> would be nice.

Thinking about it, it doesn't make any sense to call

  g_hash_table_insert (h, key, key)

on a hash table with 2 destroy notifys
(In reply to comment #41)
> (In reply to comment #39)
> > Thinking about it, it should work, but are we sure that
> > g_hash_table_new_full () with 2 DestroyNotifys works properly? A test
> > with a refcounted structure would be nice.
>
> Thinking about it, it doesn't make any sense to call
>
>   g_hash_table_insert (h, key, key)
>
> on a hash table with 2 destroy notifys

But g_hash_table_insert (h, key, ref(key)) might make sense. I've added a test and fix for that now.
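The pattern in question, sketched with a simple refcounted structure (my example; the actual test added by the commit may differ):

#include <glib.h>

typedef struct { gint refs; } Thing;

static Thing *
thing_ref (Thing *t) { t->refs++; return t; }

static void
thing_unref (gpointer data)
{
  Thing *t = data;
  if (--t->refs == 0)
    g_free (t);
}

int
main (void)
{
  GHashTable *h = g_hash_table_new_full (g_direct_hash, g_direct_equal,
                                         thing_unref, thing_unref);
  Thing *t = g_new0 (Thing, 1);
  t->refs = 1;

  /* key and value are the same object; each destroy notify drops
   * one ref, so the object dies exactly when its entry does. */
  g_hash_table_insert (h, t, thing_ref (t));

  g_hash_table_destroy (h);   /* fires both notifies: refs 2 -> 0 */
  return 0;
}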
(In reply to comment #40)
> Commit 6e45153 broke "make check":
>
> ** ERROR **: failed to reallocate slice from magazine (after g_thread_init):
> size=81
> FAIL: slice-threadinit

Should be fixed now.
(In reply to comment #42)
> (In reply to comment #41)
> > Thinking about it, it doesn't make any sense to call
> >
> >   g_hash_table_insert (h, key, key)
> >
> > on a hash table with 2 destroy notifys
>
> But g_hash_table_insert (h, key, ref(key)) might make sense. I've added a
> test and fix for that now.

How do you propose that this might work from a g_hash_table_add_to_set() type API, should we want to add one? We don't know the ref function...

Also: even if we did know it, what if it is something like g_strdup() (not too unlikely, since I think that strings are the most commonly hashed items)?
One approach would be to only allow the use of add_to_set if value_destroy is NULL
i'm mildly surprised that i don't entirely hate that idea. :)
> One approach would be to only allow the use of add_to_set if value_destroy
> is NULL

There's no point. Using g_free as key and value destructor and inserting with key == value has always been a Bad[tm] idea. That has nothing to do with set functionality. Just don't do it.
I thought about this a bit more and figured out a way that we could dodge the issue somewhat.

I don't think it could be seen as a substantial API break if we failed to call the value destroy notify on NULL pointers. Probably there are two situations here anyway:

- we fail to call a function that would do nothing anyway, or

- we fail to call a function that would have crashed due to NULL

Both are fine, and it's hard to imagine a case where destroy(NULL) really needs to be called.

Then we could store NULL as the value for sets (and adjust the code accordingly). This would never cause problems of any kind, since we wouldn't call the destroy notify.

My original reason for favouring (void*)1 instead of NULL was so that you could use g_hash_table_lookup() to query the existence of an item in the set, but this can clearly be solved with the contains() function. The *only* problem that can be imagined here is that the user assumes that the return value of g_hash_table_lookup() is non-NULL when they call it on a key that they added with add_to_set(). That's really impossible to protect against (particularly when you consider that NULL is also used to simply signal that the key is not in the hash table in the first place).

I'll write a patch implementing my suggestion (sketched below). You can take it or leave it.
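A sketch of the removal path under that proposal (hypothetical internals; the real function names differ):

#include <glib.h>

/* Hypothetical removal helper: the value destroy notify is skipped
 * for NULL values, so set entries (stored with value == NULL) never
 * trigger the value destroy function. */
static void
remove_entry (gpointer key, gpointer value,
              GDestroyNotify key_destroy, GDestroyNotify value_destroy)
{
  if (key_destroy)
    key_destroy (key);

  /* The proposed tweak: only notify for non-NULL values. */
  if (value_destroy && value != NULL)
    value_destroy (value);
}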
Apart from the fact that I don't think we need separate set API on GHashTable, I do agree with Morten: if you create a set hash table with two destroy notifies, you deserve what you get. In particular, adding magic NULL special cases to the general code to break a little less when people do stupid things is not a good idea IMO.
Created attachment 187301 [details] [review] GHashTable: tweak sets
(In reply to comment #49)
> If you create a set hash table with two destroy notifies, you deserve what
> you get.

Your language binding may do this for you.
If anything, this is more a reason for separating hash tables and sets than it is for messing up destroy notifies.
> Your language binding may do this for you.

I don't get it. They did not do so yesterday, so why would they suddenly start doing it now?

Note that item->item mappings are far more useful than just item->1 mappings. With the former you can still search for an entry by supplying just enough information to satisfy equal+hash; the lookup function will then give you the full entry.

For example, in my case -- a set of cells in a spreadsheet -- I simply supply a cell structure with only the coordinates filled in to the lookup function. That gives me the cell that is actually in the hash.
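Illustrating that lookup idiom (my sketch; the Cell fields and hash/equal functions are made up to show the pattern, not Gnumeric's actual code):

#include <glib.h>

/* Hypothetical spreadsheet cell: only row/col participate in hashing
 * and equality, so a stack-allocated "probe" cell with just the
 * coordinates filled in is enough to find the real, stored cell. */
typedef struct
{
  int row, col;
  char *contents;   /* ignored by hash/equal */
} Cell;

static guint
cell_hash (gconstpointer p)
{
  const Cell *c = p;
  return c->row * 257u + c->col;
}

static gboolean
cell_equal (gconstpointer a, gconstpointer b)
{
  const Cell *ca = a, *cb = b;
  return ca->row == cb->row && ca->col == cb->col;
}

static Cell *
lookup_cell (GHashTable *cells, int row, int col)
{
  Cell probe = { row, col, NULL };

  /* Returns the full cell stored in the hash, contents and all. */
  return g_hash_table_lookup (cells, &probe);
}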