GNOME Bugzilla – Bug 751610
g_str_hash produces collisions with strings of length 2
Last modified: 2015-07-27 10:54:12 UTC
Created attachment 306239 [details] [review]
patch to fix the bug
djb can have a lot of collisions when using strings of length 2. For example:
djb("Sc") = 5862747
djb("TB") = 5862747
djb("zP") = 5864015
djb("yq") = 5864015
sdbm provides a much better distribution and has a very low collision rate.
Using a large amount randomly generated strings of length 2-50, I couldn't
produce a single collision with it (while djb produced those examples and
a lot more).
Created attachment 306240 [details]
simple reproduction of the problem
A simple python script to show the problem happening with the 2 examples I mentioned on the description.
This is file I used to test both algorithms: https://www.async.com.br/~hackedbellini/tmp/data.txt (could not attach it as it is larger than the maximum allowed attachments size)
I created that file using this python script:
with open('data.txt') as f:
for i in xrange(100000):
w = ''.join(random.choice(string.ascii_letters) for x in range(1, random.choice(range(2, 50)))) + random.choice([' ', '\n'])
I was doing that to test a Lempel-Ziv-Vesh compressor for a university assignment. In the assignment, the ' ' and '\n' where the word separators.
I based my hash algorithm on glib, and after running it against that file, I noticed a lot of collisions happening on strings of length 2.
After changing the algorithm to use sdbm, no collisions happened anymore, and I could compress/decompress that file lossless.
Created attachment 306363 [details]
reproduction in c
This is the problem reproduced in c, just to have something that is not python as an example.
As we discussed on irc, the algorithm used by g_str_hash is publicly documented and fixed. We can't just replace it. What we _can_ do, however, is to document your findings - a patch to add this information to the docs for g_str_hash would be welcome.
Review of attachment 306239 [details] [review]: