GNOME Bugzilla – Bug 151615
Use linear probing for GHashTable?
Last modified: 2011-02-18 16:14:05 UTC
Create a hash from words to integers, with about 4 million entries: it wastes about 200MB, while a straight linear probing hash takes only about 80MB. The reason is that GLib currently uses linked lists in its hash implementation.
I am working on a project where we create some big string->uint hash tables (> 4 million elements). One thing I've noticed is the large memory overhead of glib's hash table: it uses up to 200MB to store the data, while a straight linear probing hash implementation, with the same hash function, takes about 90MB with the same performance. As far as I can see, the main reason is that glib's hash implementation uses linked lists. So my questions are:

1. Is there some reason to use linked lists instead of linear probing (putting collided elements in the next bucket)?
2. Would the glib/gtk team be interested in a linear/custom probing hash implementation?

One more thing I would like to have in the API is the ability to arbitrarily change the equality function of a hash table (between calls to g_hash_table_lookup). There are some dirty tricks that can be done with this, like looking up keys in the hash table with or without case sensitivity, without having to keep two different hash tables. Do you think a g_hash_table_set_equal_func() function could be part of the API?

Another advantage of the linear probing implementation is that g_hash_table_destroy() could be made faster (the current implementation is quite slow).
First of all, thank you for all the feedback. I think most of you could go on for a long time about the advantages and disadvantages of different hash implementations, but probably everybody agrees that no single approach is good for every case. Let me discuss some of the options and give my opinion on how we could get this done. There are several options:

1. Current linked list implementation: simple, readable, and gives good performance, but has a huge memory penalty.
2. Dynamic arrays: probably better than linked lists, but I don't think it is worth it. First, you still pay a memory penalty for the size of the arrays or for an array terminator. Furthermore, allocating memory blocks at will can leave memory very fragmented and slow to free.
3. Linear probing: this is what I want. It is simpler than double hashing and could easily be changed to quadratic probing. The memory overhead is very small, and if you can tune your hash function and expose the g_hash_resize() call to "close" the hash, you can get very good results. By "close" I mean doing mostly (or exclusively) lookup operations.
4. Double hashing: another good option that someone might want to try.

Since there are too many options and too many tradeoffs/scenarios, my suggestion is to go with a polymorphic implementation using a union. You create your hash with g_hash_table_new() (linked list), g_hash_table_linear_probing_new(), g_hash_double_hashing_new(), g_hash_array_new(), and so on. Then you simply feed the returned pointers to the common methods (e.g. g_hash_lookup()) and let the first byte of the union demultiplex the correct call. What I want to implement is the linear probing approach, but I can leave the hooks for those who want other approaches. The advantage is that running code will not be affected at all. Do you like this idea?
> The general rule we've followed in GLib is that exposing minor
> data structure differences or implementation details to the user
> is wrong. We've consistently, for example, rejected requests to
> add GRBTree to go along with GTree. (Which is an AVL tree.)
>
> If you can't give a clear and convincing description of why you
> should use container A or container B in 10 seconds, then the difference
> between the containers isn't worth worrying about for virtually all
> application programmers.

Well, I can point to two: it uses too much memory, and it is too slow to destroy. But I already got your point when Keith Packard said that glib should not be a data structures textbook.

> If a linear probing hash table is actually better for most applications,
> then we should just change GHashTable to use it. The way to demonstrate
> this is to look around at how GHashTable is used in applications,
> GLib, GTK+, write some benchmarks that resemble those use cases,
> and show comparisons of speed / memory usage.

Well, I don't think I am capable of doing that, because, although I use glib extensively, I am not a GTK+ programmer. Even if I could, I am wary of changing the implementation of a fundamental structure that has been around for so long. Some applications might benefit, others not. So, if the union solution is not welcome (and I do understand the reasons), everything should stay as it is now.

> If a linear probing hash table is only better for a million entry
> hash table, then probably you should just include your own hash
> table in your application.
Since the discussion led to the conclusion that the implementation should not be changed unless someone proves that open addressing (linear probing) yields better performance in most glib use cases, and, at this point, that is probably a lot of work for insignificant benefit, things should stay as they are. But glib's hash _documentation_ should be changed to say that it is not suitable for big (1 million+) hash tables.
For future reference, please *link* to mailing list threads in the archives rather than pasting parts into the bug. http://mail.gnome.org/archives/gtk-devel-list/2004-August/msg00136.html Continued in: http://mail.gnome.org/archives/gtk-devel-list/2004-September/msg00000.html
Created attachment 32290 [details] Beta implementation of linear probing hash

This is an initial implementation of a linear probing hash. It seems to be working nicely, although quite slowly. The main problem is that the removal operation is non-trivial to implement with linear probing. I used an array of flags to differentiate removed, empty, and used entries, which results in a memory overhead of two bits per entry. The other reason the code is too slow is that I used quite naive policies for rehashing (hash table resizing). I need to spend a bit more time on it; tips are welcome. Finally, the automated test, hash-test.c, is a bit unfair, since it creates a full hash, removes all elements, and fills the hash again. AFAIK, this is not a common use case.
Anyone implementing another hash must take care with the automated test. Line 387 calls g_hash_table_foreach (hash_table, my_hash_callback, NULL), creating a subtle bug: my_hash_callback changes the values of the elements in the hash, but the values are the same pointers as the keys (line 351: g_hash_table_insert (hash_table, &array[i], &array[i])). Therefore, the keys will also be changed, which might corrupt the hash table. Both the current implementation and the linear probing hash pass the test, but purely by luck.
Comment on attachment 32291 [details] Beta implementation of linear probing hash Dup of #32290
We're planning to switch GHashTable to open addressing in 2.19.x; see bug 526456.
With open addressing being committed, I guess this can be closed now.