GNOME Bugzilla – Bug 751610
g_str_hash produces collisions with strings of length 2
Last modified: 2015-07-27 10:54:12 UTC
Created attachment 306239 [details] [review] patch to fix the bug djb can have a lot of collisions when using strings of length 2. For example: djb("Sc") = 5862747 djb("TB") = 5862747 djb("zP") = 5864015 djb("yq") = 5864015 sdbm provides a much better distribution and has a very low collision rate. Using a large amount randomly generated strings of length 2-50, I couldn't produce a single collision with it (while djb produced those examples and a lot more).
Created attachment 306240 [details] simple reproduction of the problem A simple python script to show the problem happening with the 2 examples I mentioned on the description.
This is file I used to test both algorithms: https://www.async.com.br/~hackedbellini/tmp/data.txt (could not attach it as it is larger than the maximum allowed attachments size) I created that file using this python script: with open('data.txt') as f: for i in xrange(100000): w = ''.join(random.choice(string.ascii_letters) for x in range(1, random.choice(range(2, 50)))) + random.choice([' ', '\n']) f.write(w) I was doing that to test a Lempel-Ziv-Vesh compressor for a university assignment. In the assignment, the ' ' and '\n' where the word separators. I based my hash algorithm on glib, and after running it against that file, I noticed a lot of collisions happening on strings of length 2. After changing the algorithm to use sdbm, no collisions happened anymore, and I could compress/decompress that file lossless.
Created attachment 306363 [details] reproduction in c This is the problem reproduced in c, just to have something that is not python as an example.
As we discussed on irc, the algorithm used by g_str_hash is publicly documented and fixed. We can't just replace it. What we _can_ do, however, is to document your findings - a patch to add this information to the docs for g_str_hash would be welcome.
Review of attachment 306239 [details] [review]: .