After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 554943 - need index/sorting for typelib names
need index/sorting for typelib names
Status: RESOLVED FIXED
Product: gobject-introspection
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: gobject-introspection Maintainer(s)
gobject-introspection Maintainer(s)
Depends on:
Blocks: 571373
 
 
Reported: 2008-10-04 00:30 UTC by Colin Walters
Modified: 2015-02-07 16:53 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Bug 554943 - Sort typelib directory entries, use bsearch for lookup (3.73 KB, patch)
2009-02-13 22:58 UTC, Colin Walters
none Details | Review
Bug 554943 - Sort typelib directory entries, use bsearch for lookup (3.99 KB, patch)
2009-03-05 23:13 UTC, Colin Walters
none Details | Review
gthash: String hashing for typelib (22.49 KB, patch)
2010-10-18 21:59 UTC, Colin Walters
reviewed Details | Review
Remove COPYING from .gitignore (597 bytes, patch)
2010-10-25 19:19 UTC, Colin Walters
committed Details | Review
Import cmph-1.0.tar.gz (391.73 KB, patch)
2010-10-25 19:19 UTC, Colin Walters
committed Details | Review
Add a test for CMPH-BDZ (4.80 KB, patch)
2010-10-25 19:19 UTC, Colin Walters
committed Details | Review
Add internal hashing API designed for the typelib (12.81 KB, patch)
2010-10-25 19:19 UTC, Colin Walters
committed Details | Review
Add directory index section (8.41 KB, patch)
2010-10-25 19:19 UTC, Colin Walters
committed Details | Review
[REBASE] Account for both sections (929 bytes, patch)
2010-10-25 19:50 UTC, Colin Walters
committed Details | Review

Description Colin Walters 2008-10-04 00:30:00 UTC
Right now the typelib format has no indices for names, and it's not sorted, so every lookup is a full linear scan.

We should investigate either sorting the entries at typelib compilation time so we can use bsearch or adding an index.
Comment 1 Dan Winship 2009-02-12 14:21:37 UTC
It would be nice to use the same stable sort order for .gir files as well; that way, when you make changes to the scanner (or to the Makefile rule used to invoke the scanner for a particular library) it's very easy to diff the old and new .girs to make sure that the only changes are the ones you were expecting.
Comment 2 Colin Walters 2009-02-12 15:13:38 UTC
Yeah, makes sense though I'd like to do it in a separate bug; the typelib is far more involved.  The new .gir bug is bug 571483, initial patch attached.
Comment 3 Colin Walters 2009-02-13 22:58:49 UTC
Created attachment 128692 [details] [review]
Bug 554943 - Sort typelib directory entries, use bsearch for lookup

Sort the directory entires by name; all local first, then nonlocal.

We can then use a binary search for lookups.
Comment 4 Colin Walters 2009-02-13 22:59:27 UTC
^ Not tested extensively, but filing now before I leave this coffee shop so it doesn't get lost and we can discuss the approach.
Comment 5 Colin Walters 2009-03-05 23:13:27 UTC
Created attachment 130162 [details] [review]
Bug 554943 - Sort typelib directory entries, use bsearch for lookup

Sort the directory entires by name; all local first, then nonlocal.

We can then use a binary search for lookups.
Comment 6 Colin Walters 2009-03-05 23:26:00 UTC
It would be possible to land this after typelib 1.0 - we could use the minor version to determine whether the entries are sorted, but it'd be nicer to get this in before if we can.
Comment 7 Johan (not receiving bugmail) Dahlin 2009-03-18 12:23:34 UTC
(In reply to comment #5)
> Created an attachment (id=130162) [edit]
> Bug 554943 - Sort typelib directory entries, use bsearch for lookup
> 
> Sort the directory entires by name; all local first, then nonlocal.
> 
> We can then use a binary search for lookups.
> 

Do you have any numbers for common use cases?
Looks good to me, as there are no API or typelib changes and I assume 
that everything keeps on working as it did before.
Comment 8 Colin Walters 2009-03-19 23:19:26 UTC
(In reply to comment #7)
> (In reply to comment #5)
> > Created an attachment (id=130162) [edit]
> > Bug 554943 - Sort typelib directory entries, use bsearch for lookup
> > 
> > Sort the directory entires by name; all local first, then nonlocal.
> > 
> > We can then use a binary search for lookups.
> > 
> 
> Do you have any numbers for common use cases?

I don't have good timings at the moment, but I can say that to start up gnome-shell, before this patch we were doing ~90000 string comparisons.  With it we're doing 4000.  How much it impacts your app speed depends on the app and the binding, but it's a worthwhile speedup I think.

> Looks good to me, as there are no API or typelib changes and I assume 
> that everything keeps on working as it did before.

Well it is a binary format incompatible change; you'll need to rebuild your typelibs.  But hopefully one of the last.
Comment 9 Colin Walters 2009-08-24 19:18:40 UTC
Tried applying this now but it fails in tests/repository:

ake[3]: Entering directory `/src/jhbuild/checkout/gobject-introspection/tests/repository'
==17891== Memcheck, a memory error detector.
==17891== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==17891== Using LibVEX rev 1884, a library for dynamic binary translation.
==17891== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==17891== Using valgrind-3.4.1, a dynamic binary instrumentation framework.
==17891== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==17891== For more details, rerun with: -v
==17891== 
Successfully found GCancellable
==17891== Invalid read of size 4
==17891==    at 0x4018B14: find_method (ginfo.c:1206)
==17891==    by 0x401928A: g_object_info_find_method (ginfo.c:1468)
==17891==    by 0x8048CB0: test_constructor_return_type (gitestrepo.c:22)
==17891==    by 0x804904D: main (gitestrepo.c:71)
==17891==  Address 0x410e3ac is not stack'd, malloc'd or (recently) free'd
**
ERROR:./gitestrepo.c:23:test_constructor_return_type: assertion failed: (constructor)


Punting to next release.
Comment 10 Colin Walters 2010-10-18 21:59:16 UTC
Created attachment 172655 [details] [review]
gthash: String hashing for typelib

First, add the notion of additional "sections" to the typelib.
A section index is inserted between the header and the directory.

Use this to add an optional "directory index" section which contains a
serialized hash table.  Originally, we were planning to do a binary
search on the directory, but that doesn't work because things are
actually written out recursively.

The serialized hash table approach should be even faster than binary
searching too.
Comment 11 Colin Walters 2010-10-18 22:00:35 UTC
This patch drops g_irepository_find_by_name off the radar in gnome-shell performance.
Comment 12 Owen Taylor 2010-10-22 15:42:19 UTC
Review of attachment 172655 [details] [review]:

The basic approach of extension with the Sections blob seems sound to me. 

The approach of the hash table is a little surprising. it's not implausible to introspect a library with thousands of functions or constants defined.

$ grep 'define' gdkkeysyms.h | wc -l
2175

So using a fixed 8-bit hash means that we are doing chains of around ~10 strcmps. Obviously 256-times better is way better, but if we are going to go to the effort of implementing a hash table, why not actually implement a scheme that gives us O(1) lookups for any number of directory indices?

The GTK+ icon cache is pretty much a literal serialization of GHashTable:

 - The GLib string hash function is used (hardcoded since the exact behavior of g_str_hash() is not part of the GLib api)
 - A prime size is picked from g_spaced_primes_closest
 - Each hash bucket leads to a linked list of entries (in the icon cache, the linked list entries are actually combined with the icons instead of pointing to the icons [without the combining, an array would work better than a linear list]

Open addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing) would lead to a simpler format and fits well with the fact that all we need to store per entry is a single offset; however, there's been no validation of the GLib hash function with open addressing and I think it falls down with (+ 1) linear probing - that is, g_str_hash("GDK_b") == g_str_hash("GDK_a") + 1, so large ranges of the table will be consecutively occupied. Linear probing using an offset larger than the table size probably works fine, or double hashing, though that would require a picking second string hash function. Might be more work to figure out the right open addressing scheme than to adopt a more literal GHashTable serialization....

(And again, the 8-bit approach is a big improvement from current, so if you want me to review that as is, let me know... I just wouldn't have picked that scheme.)

::: girepository/girmodule.c
@@ +407,3 @@
+  /* Right now, just the GI_SECTION_DIRECTORY_INDEX */
+  offset2 = ALIGN_VALUE (header_size, 4);
+  header->sections = offset2;

Hasn't this offset already been used for:

  header->directory = ALIGN_VALUE (header_size, 4);

@@ +409,3 @@
+  header->sections = offset2;
+
+  section = (Section*) &data[offset2];

Doesn't seem like you've accounted for Section in the size that data is allocated

@@ +412,3 @@
+  section->id = GI_SECTION_DIRECTORY_INDEX;
+  section->offset = 0; /* Will fill in later */
+  offset2 += sizeof(Section);

Not sure if this is supposed to be accounting for the Section added before this or the Section added after this, but don't you need to account for both?

::: girepository/gitypelib-internal.h
@@ +287,2 @@
   /* <private> */
+  guint16 padding[5];

The odd padding count here is peculiar - sizeof(Header) will be a multiple of it's natural alignment, which is , so my assumption was that the odd-sized padding meant that there was an odd number of guint16's above (and so you were taking up 3 padding slots) but it seems to be even. I'm unclear why there is padding at all - sizeof(Header) doesn't to be part of the format.
Comment 13 Colin Walters 2010-10-22 17:03:19 UTC
(I
> So using a fixed 8-bit hash means that we are doing chains of around ~10
> strcmps. Obviously 256-times better is way better, but if we are going to go to
> the effort of implementing a hash table, why not actually implement a scheme
> that gives us O(1) lookups for any number of directory indices?

In other words, perfect hashing?  http://en.wikipedia.org/wiki/Perfect_hash

This indeed was my first thought, and I spent maybe 5-6 hours investigating this when researching this patch.  There are libraries out there to do it (http://cmph.sourceforge.net/), but they're oriented towards different use cases (*huge* data sets, code generation), whereas we have read-only blobs and relatively small data sets.

It wasn't obvious to me how to apply this because we can't code generate (the same algorithm needs to apply to multiple typelibs).

> The GTK+ icon cache is pretty much a literal serialization of GHashTable:

Which is not O(1).  Granted the probing is 

> (And again, the 8-bit approach is a big improvement from current, so if you
> want me to review that as is, let me know... I just wouldn't have picked that
> scheme.)

A literal ghash serialization would probably bring us down to ~2 average strcmps on GTK, I'd guess?  Which would be better for sure, but on the other hand, it's not clear to me it's better enough to justify the time/code complexity.

If we had a plan to make it *perfect* I think that'd definitely be interesting.

What I plan to do is change introspection so that we dynamically cache the GIBaseInfo for the implemented interfaces in the GIBaseInfo for the object.  That would be even better, and would obviate the need for better hashing here.
Comment 14 Colin Walters 2010-10-22 17:07:15 UTC
(In reply to comment #13)
>
> Which is not O(1).  Granted the probing is 

...expected better than linear.  So right now the typelib right now is hard-limited to 65536 entries.  Which with this hash, implies 256 strcmps.  A GHash serialization would probably still be expected ~4 strcmps.  So maybe it is worth doing.
Comment 15 Colin Walters 2010-10-22 23:23:41 UTC
I'm investigating an import of some CMPH code now.  It looks pretty straightforward.
Comment 16 Colin Walters 2010-10-25 19:19:11 UTC
Created attachment 173205 [details] [review]
Remove COPYING from .gitignore

Why the heck was it in there?
Comment 17 Colin Walters 2010-10-25 19:19:21 UTC
Created attachment 173206 [details] [review]
Import cmph-1.0.tar.gz

The only modification really was to change cmph_types.h to use the GLib
types, rather than having two random ifdefs for x86_64 and ia64.
Comment 18 Colin Walters 2010-10-25 19:19:27 UTC
Created attachment 173207 [details] [review]
Add a test for CMPH-BDZ
Comment 19 Colin Walters 2010-10-25 19:19:33 UTC
Created attachment 173208 [details] [review]
Add internal hashing API designed for the typelib

In multiple places in the typelib, but most importantly the directory,
we need some fast indexing.  Perfect hashing, as implemented by CMPH
(previous commit), is an exact fit for the problem domain.

Add an API built on top of CMPH which maps strings->guint16 (we just
need a guint16 for the typelib index).
Comment 20 Colin Walters 2010-10-25 19:19:41 UTC
Created attachment 173209 [details] [review]
Add directory index section

Use the internal perfect hashing API to add an index to the directory.

To support this, add the notion of additional "sections" to the
typelib.  A section index is inserted between the header and the
directory.
Comment 21 Colin Walters 2010-10-25 19:39:27 UTC
(In reply to comment #12)
>
> ::: girepository/girmodule.c
> @@ +407,3 @@
> +  /* Right now, just the GI_SECTION_DIRECTORY_INDEX */
> +  offset2 = ALIGN_VALUE (header_size, 4);
> +  header->sections = offset2;
> 
> Hasn't this offset already been used for:
> 
>   header->directory = ALIGN_VALUE (header_size, 4);

Fixed.
> 
> @@ +409,3 @@
> +  header->sections = offset2;
> +
> +  section = (Section*) &data[offset2];
> 
> Doesn't seem like you've accounted for Section in the size that data is
> allocated

Fixed.
> 
> @@ +412,3 @@
> +  section->id = GI_SECTION_DIRECTORY_INDEX;
> +  section->offset = 0; /* Will fill in later */
> +  offset2 += sizeof(Section);
> 
> Not sure if this is supposed to be accounting for the Section added before this
> or the Section added after this, but don't you need to account for both?

Yeah, we are now with the above fix.

> ::: girepository/gitypelib-internal.h
> @@ +287,2 @@
>    /* <private> */
> +  guint16 padding[5];
> 
> The odd padding count here is peculiar - sizeof(Header) will be a multiple of
> it's natural alignment, which is , so my assumption was that the odd-sized
> padding meant that there was an odd number of guint16's above (and so you were
> taking up 3 padding slots) but it seems to be even. I'm unclear why there is
> padding at all - sizeof(Header) doesn't to be part of the format.

It is actually; we put the header size in header->size, and validate it.
Comment 22 Colin Walters 2010-10-25 19:50:13 UTC
Created attachment 173214 [details] [review]
[REBASE] Account for both sections
Comment 23 Owen Taylor 2010-11-01 15:37:34 UTC
Sorry for not seeing this earlier - because gobject-introspection bugmail goes into my glib bugmail folder, I typically miss it.

(In reply to comment #13)
> (I
> > So using a fixed 8-bit hash means that we are doing chains of around ~10
> > strcmps. Obviously 256-times better is way better, but if we are going to go to
> > the effort of implementing a hash table, why not actually implement a scheme
> > that gives us O(1) lookups for any number of directory indices?
> 
> In other words, perfect hashing?  http://en.wikipedia.org/wiki/Perfect_hash

No, I meant standard hash tables. Leaving aside big-o-little-o technicalities, hash function breakdown, lookups are O(1). As you know, O(1) doesn't mean "1", it means "average time doesn't depend on the number of items in the hash table". If perfect hashing works out well as a strategy, we can use it, of course.
 
> This indeed was my first thought, and I spent maybe 5-6 hours investigating
> this when researching this patch.  There are libraries out there to do it
> (http://cmph.sourceforge.net/), but they're oriented towards different use
> cases (*huge* data sets, code generation), whereas we have read-only blobs and
> relatively small data sets.
> 
> It wasn't obvious to me how to apply this because we can't code generate (the
> same algorithm needs to apply to multiple typelibs).
> 
> > The GTK+ icon cache is pretty much a literal serialization of GHashTable:
> 
> Which is not O(1).  Granted the probing is 

See above.

> > (And again, the 8-bit approach is a big improvement from current, so if you
> > want me to review that as is, let me know... I just wouldn't have picked that
> > scheme.)
> 
> A literal ghash serialization would probably bring us down to ~2 average
> strcmps on GTK, I'd guess?  Which would be better for sure, but on the other
> hand, it's not clear to me it's better enough to justify the time/code
> complexity.
> 
> If we had a plan to make it *perfect* I think that'd definitely be interesting.

Serialization of GHashTables seems no more complex than your first attempt. And I think "independent of the size" is an important characteristic. If perfect hashing is the best way to get there, sure.
 
> What I plan to do is change introspection so that we dynamically cache the
> GIBaseInfo for the implemented interfaces in the GIBaseInfo for the object. 
> That would be even better, and would obviate the need for better hashing here.

Certainly anything that makes lookups one-for-runtime rather than one-per-object or one-per-call reduces the need for better hashing here, but it's appealing to do a good job ... come up with something that works however anybody is using a typelib.... if someone wants to have a library with 10,000 functions and doesn't want to add a caching layer on top when invoking those functions, then if with a similar amount of code we can make that work, all the better.
Comment 24 Owen Taylor 2010-11-01 17:14:57 UTC
Review of attachment 173205 [details] [review]:

Looks good
Comment 25 Owen Taylor 2010-11-01 18:04:46 UTC
OK, some evaluation of the use of CMPH.

Performance
===========
One thing that you have to keep in mind when looking at the performance as compared to using a simple string hash is that there is a performance cost from the lookup tables that CMPH uses to build the hash .... we have to read these in off disk and into processor caches. (For "off disk" the main question is probably whether the total size of the hash table changes.)

  GI:              keys bytes
  GObject-2.0       232 112
  GModule-2.0         7 32
  Gio-2.0           491 200
  GIRepository-2.0. 145 80

So it looks like we use around half a byte per key. (The figure that the CMPH page gives for minimal perfect hashes using this algorithm is 2.6 bytes per key.)

It appears that the hashing approach is something like:

 - Hash the string producing 96 bytes of hash data using a "Jenkins Hash"
 - Read three values from the table based the hash data
 - Combine them to get the hash location

Performance win or loss compared to a non-perfect hash table approach ?- space usage is clearly similar or a bit better. Time: the hash function might be a bit slower and we need to read 12 bytes out of the perfect hash, but we save 1 or maybe 2 strcmps per lookup. Not really predictable without timing. Certainly competitive, might be 50% faster or 50% slower.

I think it's certainly going to be OK for our usage.

Code size
=========
Seems to increase text size of libgirepository from ~100k to ~190k. Probably a lot of that will never be paged in?

Other considerations
====================
The current version leaks all the CMPH symbols (which aren't properly namespaced) into the global namepace. Not good.

Adding to the libgirepositor_1_0_LDFLAGS '-export-symbols-regex '^gi?_' looks like it should basically work. There used to be a problem where -export-symbols-regex didn't actually remove the symbols from the dynamic linking table but 'nm -D' on the result looks right with current libtool. The above addition does break your test program, since the _-prefixed symbols are removed, but you can figure out something for that ... either include under-score-prefixed in the result, don't under-score prefix, or link the test program separately against what it is testing instead of linking it against libgirepository.

Conclusion
==========
I think my conclusion is that this approach is solid and should work. I might not have gone this way, just because it's a huge chunk of hard-to-understand code, but I can't see us needing much maintenance on it.

It does mean that that the typelib format becomes an internal detail of libgirepository - not something that you could reasonably reimplement a reader for, but I don't think there's any downside to that.

Will review the patches in detail.
Comment 26 Owen Taylor 2010-11-01 18:13:48 UTC
Review of attachment 173206 [details] [review]:

Is CMPH LGPL2+ or LGPL2-only? If it's 2+, is there documentation of it being 2+? (LGPL2-only is compatible with GPL3, since the LGPL2 has explicit upgrade-to-any-GPL language, but LGPL2-only seems undesirable for a GNOME library)

I would put a README in the cmph directory with a pointer to the upstream project and as much information as available on license, copyright, and authorship.

I think this should be broken into three commits:

 - Split out Makefile-girepository.am
 - Add cmph without any changes (no glib modifications)
 - Add Makefile-cmph.am and change cmph however you were going to change cmph

::: configure.ac
@@ +119,3 @@
 AC_DEFINE_UNQUOTED(GIR_DIR, "$GIR_DIR", [Director prefix for gir installation])
 
+PKG_CHECK_MODULES(GLIB, [glib-2.0])

I think using PKG_CHECK_MODULES like this is pretty sketchy. PKG_CHECK_MODULES is supposed to get all the flags for particular target, elinminating duplicates. This should be PKG_CHECK_MODULES(CMPH, [glib-2.0])

::: girepository/Makefile-cmph.am
@@ +66,3 @@
+	cmph/vstack.c \
+	cmph/vstack.h
+	$(NULL)

$(NULL) here without a trailing \ on the previous item makes little sense. If you aren't using $(NULL) hack in other Makefile.am's then you shouldn't use here, I'd think.
Comment 27 Owen Taylor 2010-11-01 18:17:23 UTC
Review of attachment 173207 [details] [review]:

seems OK

::: girepository/Makefile-cmph.am
@@ +3,2 @@
 libcmph_la_CPPFLAGS = -Icmph $(GLIB_CFLAGS)
+libcmph_la_LIBADD = -lm $(GLIB_LIBS)

Looks like it should be part of something else?
Comment 28 Owen Taylor 2010-11-01 18:38:17 UTC
Review of attachment 173208 [details] [review]:

Mostly cosmetic stuff

::: girepository/Makefile-cmph.am
@@ +71,1 @@
+GTESTER_PROGS += cmph-bdz-test

Would sort of be nice of the TEST => GTESTER rename was done in the first patch, rather than patch A does one thing, patch B something else. Very minor.

::: girepository/Makefile-girepository.am
@@ +54,3 @@
+	givfuncinfo.c				\
+	gthash.c				\
+	$(NULL)

See previous comment $(NULL). (I personally dislike $(NULL): weird illegible clutter to save a tiny bit of maintenance work, but it's your project, I'll just comment on consistency.)

@@ +59,2 @@
 libgirepository_1_0_la_CPPFLAGS = $(GIREPO_CFLAGS) -DG_IREPOSITORY_COMPILATION
+libgirepository_1_0_la_LIBADD = libcmph.la $(GIREPO_LIBS)

See comment on bug about -export-symbols-regex

::: girepository/cmph-bdz-test.c
@@ +133,3 @@
 
+  g_test_add_func ("/cmph-bdz/search", test_search);
+  g_test_add_func ("/cmph-bdz/search-packed", test_search_packed);

More stuff that would ideally be moved between patches

::: girepository/gitypelib-internal.h
@@ +1154,3 @@
+guint32 _gi_typelib_hash_builder_get_buffer_size (GITypelibHashBuilder *builder);
+
+void _gi_typelib_hash_builder_pack (GITypelibHashBuilder *builder, guint8* mem, guint32 size);

Again, it's your project, you pick whether you want to follow glib style for prototypes or not, but don't do one thing, then three lines later do something different. (If you don't have egtk.el, pick up a copy. It's code that I haven't understood or maintained in 10 years, but C-c C-p to align function headers in the region is way handy.)

::: girepository/gthash-test.c
@@ +62,3 @@
+  ret = g_test_run ();
+
+  return ret;

Weird temporary variable

::: girepository/gthash.c
@@ +55,3 @@
+  char **strs;
+  cmph_t *c;
+  GHashTable *strings;

strs and strings as two members? (There's no actual need to have strs as a member as far as I can tell)

@@ +94,3 @@
+
+  num_elts = g_hash_table_size (builder->strings);
+  g_assert (num_elts < 65536);

You would actually be fine if num_elts == 65536 too, right?

@@ +121,3 @@
+  builder->dirmap_offset = packed_size;
+
+  builder->packed_size = builder->dirmap_offset + num_elts * sizeof(guint16);

would be more natural for me to read here 'packed_size' rather than builder->dirmap_offset. Or even better, add to packed_size and then do builder->packed_size = packed_size

@@ +124,3 @@
+}
+
+guint32 _gi_typelib_hash_builder_get_buffer_size (GITypelibHashBuilder *builder)

type on the same line

@@ +174,3 @@
+      builder->c = NULL;
+    }
+  if (builder->strings)

builder->strings can't be NULL

@@ +191,3 @@
+
+  g_assert ((((unsigned long)memory) & 0x3) == 0);
+  mph_size = *((guint32*)memory);

This is documented as an offset in the code above, and doesn't make sense as a size , since it's one more word more the size of the mph
Comment 29 Owen Taylor 2010-11-01 18:43:47 UTC
Review of attachment 173214 [details] [review]:

looks fine, should be squashed
Comment 30 Owen Taylor 2010-11-01 18:51:53 UTC
Review of attachment 173208 [details] [review]:

Followup - the type is GITypelibHashBuilder and the source file is gthash.[ch]. Not a fan of having to grep to find code.
Comment 31 Owen Taylor 2010-11-01 18:59:50 UTC
Review of attachment 173209 [details] [review]:

A few comments, and one thought from the CMPQ faq - "The algorithms do not guarantee that a minimal perfect hash function can be created. In practice, it will always work if your input is big enough (>100 keys)". Do we need to handle that - certainly the input isnt' always > 100 keys? Fall back to a non-minimal perfect hash?

::: girepository/girmodule.c
@@ +262,3 @@
+  required_size = _gi_typelib_hash_builder_get_buffer_size (dirindex_builder);
+
+  new_offset = *offset2 + ALIGN_VALUE (required_size, 4);

Do we know that offset2 is aligned? If we do, should we still align it for safety? It seems better to make the person needs something aligned align it rather than always trusting the last person and aligning for the next person. (Note that if you align it, you need to do it before filling section->offset above)

::: girepository/gitypelib.c
@@ +190,1 @@
+      index = _gi_typelib_hash_search (hash, name);

What's the guarantee of CMPH on not-in-table data? If you have a minimal perfect  hash, does all input always give you an index in the range? (Couldn't actually find any docs for CMPH at that level of detail)
Comment 32 Colin Walters 2010-11-16 22:56:34 UTC
(In reply to comment #31)
> Review of attachment 173209 [details] [review]:
> 
> A few comments, and one thought from the CMPQ faq - "The algorithms do not
> guarantee that a minimal perfect hash function can be created. In practice, it
> will always work if your input is big enough (>100 keys)". Do we need to handle
> that - certainly the input isnt' always > 100 keys? Fall back to a non-minimal
> perfect hash?

In this case with the latest code, we just don't add a directory index.

> What's the guarantee of CMPH on not-in-table data? If you have a minimal
> perfect  hash, does all input always give you an index in the range? (Couldn't
> actually find any docs for CMPH at that level of detail)

Yeah, it does.
Comment 33 Colin Walters 2010-12-03 21:02:46 UTC
Attachment 173208 [details] pushed as 6911038 - Add internal hashing API designed for the typelib
Attachment 173209 [details] pushed as 24a2b4c - Add directory index section
Comment 34 André Klapper 2015-02-07 16:53:53 UTC
[Mass-moving gobject-introspection tickets to its own Bugzilla product - see bug 708029. Mass-filter your bugmail for this message: introspection20150207 ]