Bug 751610 – g_str_hash produces collisions with strings of length 2

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 751610 - g_str_hash produces collisions with strings of length 2


Summary:	g_str_hash produces collisions with strings of length 2


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2015-06-28 14:39 UTC by Thiago Bellini
Modified:	2015-07-27 10:54 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch to fix the bug (1.86 KB, patch) 2015-06-28 14:39 UTC, Thiago Bellini	rejected	Details \| Review
simple reproduction of the problem (443 bytes, text/x-python) 2015-06-28 14:40 UTC, Thiago Bellini		Details
reproduction in c (264 bytes, text/x-csrc) 2015-06-30 03:39 UTC, Thiago Bellini		Details

Description Thiago Bellini 2015-06-28 14:39:14 UTC

Created attachment 306239 [details] [review]
patch to fix the bug

djb can have a lot of collisions when using strings of length 2. For example:

  djb("Sc") = 5862747
  djb("TB") = 5862747

  djb("zP") = 5864015
  djb("yq") = 5864015

sdbm provides a much better distribution and has a very low collision rate.
Using a large amount randomly generated strings of length 2-50, I couldn't
produce a single collision with it (while djb produced those examples and
a lot more).

Comment 1 Thiago Bellini 2015-06-28 14:40:59 UTC

Created attachment 306240 [details]
simple reproduction of the problem

A simple python script to show the problem happening with the 2 examples I mentioned on the description.

Comment 2 Thiago Bellini 2015-06-28 14:48:28 UTC

This is file I used to test both algorithms: https://www.async.com.br/~hackedbellini/tmp/data.txt (could not attach it as it is larger than the maximum allowed attachments size)

I created that file using this python script:


with open('data.txt') as f:
    for i in xrange(100000):
        w = ''.join(random.choice(string.ascii_letters) for x in range(1, random.choice(range(2, 50)))) + random.choice([' ', '\n'])
        f.write(w)


I was doing that to test a Lempel-Ziv-Vesh compressor for a university assignment. In the assignment, the ' ' and '\n' where the word separators.

I based my hash algorithm on glib, and after running it against that file, I noticed a lot of collisions happening on strings of length 2.

After changing the algorithm to use sdbm, no collisions happened anymore, and I could compress/decompress that file lossless.

Comment 3 Thiago Bellini 2015-06-30 03:39:52 UTC

Created attachment 306363 [details]
reproduction in c

This is the problem reproduced in c, just to have something that is not python as an example.

Comment 4 Matthias Clasen 2015-07-26 19:22:41 UTC

As we discussed on irc, the algorithm used by g_str_hash is publicly documented and fixed. We can't just replace it. What we _can_ do, however, is to document your findings - a patch to add this information to the docs for g_str_hash would be welcome.

Comment 5 Matthias Clasen 2015-07-26 19:23:16 UTC

Review of attachment 306239 [details] [review]:

.