After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 751610 - g_str_hash produces collisions with strings of length 2
g_str_hash produces collisions with strings of length 2
Product: glib
Classification: Platform
Component: general
Other Linux
: Normal normal
: ---
Assigned To: gtkdev
Depends on:
Reported: 2015-06-28 14:39 UTC by Thiago Bellini
Modified: 2015-07-27 10:54 UTC
See Also:
GNOME target: ---
GNOME version: ---

patch to fix the bug (1.86 KB, patch)
2015-06-28 14:39 UTC, Thiago Bellini
rejected Details | Review
simple reproduction of the problem (443 bytes, text/x-python)
2015-06-28 14:40 UTC, Thiago Bellini
reproduction in c (264 bytes, text/x-csrc)
2015-06-30 03:39 UTC, Thiago Bellini

Description Thiago Bellini 2015-06-28 14:39:14 UTC
Created attachment 306239 [details] [review]
patch to fix the bug

djb can have a lot of collisions when using strings of length 2. For example:

  djb("Sc") = 5862747
  djb("TB") = 5862747

  djb("zP") = 5864015
  djb("yq") = 5864015

sdbm provides a much better distribution and has a very low collision rate.
Using a large amount randomly generated strings of length 2-50, I couldn't
produce a single collision with it (while djb produced those examples and
a lot more).
Comment 1 Thiago Bellini 2015-06-28 14:40:59 UTC
Created attachment 306240 [details]
simple reproduction of the problem

A simple python script to show the problem happening with the 2 examples I mentioned on the description.
Comment 2 Thiago Bellini 2015-06-28 14:48:28 UTC
This is file I used to test both algorithms: (could not attach it as it is larger than the maximum allowed attachments size)

I created that file using this python script:

with open('data.txt') as f:
    for i in xrange(100000):
        w = ''.join(random.choice(string.ascii_letters) for x in range(1, random.choice(range(2, 50)))) + random.choice([' ', '\n'])

I was doing that to test a Lempel-Ziv-Vesh compressor for a university assignment. In the assignment, the ' ' and '\n' where the word separators.

I based my hash algorithm on glib, and after running it against that file, I noticed a lot of collisions happening on strings of length 2.

After changing the algorithm to use sdbm, no collisions happened anymore, and I could compress/decompress that file lossless.
Comment 3 Thiago Bellini 2015-06-30 03:39:52 UTC
Created attachment 306363 [details]
reproduction in c

This is the problem reproduced in c, just to have something that is not python as an example.
Comment 4 Matthias Clasen 2015-07-26 19:22:41 UTC
As we discussed on irc, the algorithm used by g_str_hash is publicly documented and fixed. We can't just replace it. What we _can_ do, however, is to document your findings - a patch to add this information to the docs for g_str_hash would be welcome.
Comment 5 Matthias Clasen 2015-07-26 19:23:16 UTC
Review of attachment 306239 [details] [review]: