GNOME Bugzilla – Bug 151615
Use linear probing for GHashTable?
Last modified: 2011-02-18 16:14:05 UTC
Create a hash from words to integers, with about 4 million entries: it wastes about 200MB, while a straight linear probing hash takes only about 80MB. The reason is that GLib currently uses linked lists in its hash implementation.
I am working on a project where we create some big string->uint hash tables (> 4 million elements). One thing I've noticed is the large memory overhead of glib's hash table: it uses up to 200MB to store the data, while a straight linear probing hash implementation, with the same hash function, takes about 90MB with the same performance. As far as I can see, the main reason is that glib's hash implementation uses linked lists. So my questions are:

1. Is there some reason to use linked lists instead of linear probing (putting collided elements in the next bucket)?
2. Would the glib/gtk team be interested in a linear/custom probing hash implementation?

One more thing I would like to have in the API is the ability to arbitrarily change the equality function of a hash table (between calls to g_hash_table_lookup). There are some dirty tricks that can be done with this, like looking up keys in the hash table with or without case sensitivity, without having to keep two different hash tables. Do you think a g_hash_table_set_equal_func() function could be part of the API?

Another advantage of the linear probing implementation is that g_hash_table_destroy() could be made faster (the current implementation is quite slow).
First of all, thank you for all the feedback. I think most of you could go on for a long time about the advantages and disadvantages of different hash implementations, but probably everybody agrees that no single approach is good for every case. Let me discuss some of the options and give my opinion on how we could get this done. There are several options:

1. Current linked list implementation: simple, readable, and gives good performance, but has a huge memory penalty.
2. Dynamic arrays: probably better than linked lists, but I don't think it is worth it. First, you still pay a memory penalty for the size of the arrays or for an array terminator. Furthermore, allocating memory blocks at will can leave memory very fragmented and slow to free.
3. Linear probing: this is what I want. It is simpler than double hashing and could easily be changed to quadratic probing. The memory overhead is very small, and if you can tune your hash function and expose the g_hash_resize() call to "close" the hash, you can get very good results. By "close" I mean doing mostly (or exclusively) lookup operations.
4. Double hashing: another good option that someone might want to try.

Since there are too many options and too many tradeoffs/scenarios, my suggestion is to go with a polymorphic implementation using a union. You create your hash with g_hash_table_new() (linked list), g_hash_table_linear_probing_new(), g_hash_double_hashing_new(), g_hash_array_new(), and so on. Then you simply feed the returned pointers to the common methods (e.g. g_hash_lookup()) and let the first byte of the union demultiplex the correct call. What I want to implement is the linear probing approach, but I can leave the hooks for those who want other approaches. The advantage is that running code will not be affected at all. Do you like this idea?
> The general rule we've followed in GLib is that exposing minor
> data structure differences or implementation details to the user
> is wrong. We've consistently, for example, rejected requests to
> add GRBTree to go along with GTree. (Which is an AVL tree.)
>
> If you can't give a clear and convincing description of why you
> should use container A or container B in 10 seconds, then the difference
> between the containers isn't worth worrying about for virtually all
> application programmers.

Well, I can point to two: it uses too much memory, and it is too slow to destroy. But I already got your point when Keith Packard said that glib should not be a data structures textbook.

> If a linear probing hash table is actually better for most applications,
> then we should just change GHashTable to use it. The way to demonstrate
> this is to look around at how GHashTable is used in applications,
> GLib, GTK+, write some benchmarks that resemble those use cases,
> and show comparisons of speed / memory usage.

Well, I don't think I am capable of doing that, because, although I use glib extensively, I am not a GTK+ programmer. Even if I could, I am wary of changing the implementation of a fundamental structure that has been around for so long. Some applications might benefit, others not. So, if the union solution is not welcome (and I do understand the reasons), everything should stay as it is now.

> If a linear probing hash table is only better for a million entry
> hash table, then probably you should just include your own hash
> table in your application.
Since the discussion led to the conclusion that the implementation should not be changed unless someone proves that open addressing (linear probing) yields better performance in most glib use cases, and, at this point, that is probably a lot of work for insignificant benefit, things should stay as they are. But glib's hash _documentation_ should be changed to say that it is not suitable for big (1 million+) hash tables.
For future reference, please *link* to mailing list threads in the archives rather than pasting parts into the bug. http://mail.gnome.org/archives/gtk-devel-list/2004-August/msg00136.html Continued in: http://mail.gnome.org/archives/gtk-devel-list/2004-September/msg00000.html
Created attachment 32290 [details] Beta implementation of linear probing hash

This is an initial implementation of a linear probing hash. It seems to be working nicely, although quite slowly. The main problem is that the removal operation is non-trivial to implement with linear probing. I used an array of flags to differentiate removed, empty, and used entries, which results in a memory overhead of two bits per entry. The other reason the code is too slow is that I used quite naive policies for rehashing (hash table resizing). I need to spend a bit more time on it; tips are welcome. Finally, the automated test, hash-test.c, is a bit unfair, since it creates a full hash, removes all elements, and fills the hash again. AFAIK, this is not a common use case.
Anyone implementing another hash must take care with the automated test. Line 387 calls g_hash_table_foreach (hash_table, my_hash_callback, NULL), creating a subtle bug: my_hash_callback changes the values of the elements in the hash, but the values are the same pointers as the keys (line 351: g_hash_table_insert (hash_table, &array[i], &array[i])). Therefore, the keys will also be changed, which might corrupt the hash table. Both the current implementation and the linear probing hash pass the test, but purely by luck.
Comment on attachment 32291 [details] Beta implementation of linear probing hash Dup of #32290
We're planning to switch GHashTable to open addressing in 2.19.x; see bug 526456.
With open addressing being committed, I guess this can be closed now.