GNOME Bugzilla – Bug 500507
GHashTableIter API
Last modified: 2007-12-15 05:01:18 UTC
g_hash_table_foreach() is painful to use since C does not support closures. The attached patch implements a GHashTableIter API, which can be used as follows: GHashTableIter i; gpointer key; gpointer value; g_hash_table_iter_init(table, &i); while (g_hash_table_iter_next(&i, &key, &value)) { /* do something with key and value */ }
Created attachment 99865 [details] [review] GHashTableIter
Comment on attachment 99865 [details] [review] GHashTableIter >--- glib/ghash.c.orig 2007-09-16 18:28:33.000000000 +0200 >+++ glib/ghash.c 2007-11-29 21:27:45.000000000 +0100 >@@ -877,6 +877,38 @@ > } > } > >+void >+g_hash_table_iter_init (GHashTable *hash_table, >+ GHashTableIter *iter) >+{ >+ iter->hash_table = hash_table; >+ iter->node = NULL; >+ iter->position = 0; >+} >+ >+gboolean >+g_hash_table_iter_next (GHashTableIter *iter, >+ gpointer *key, >+ gpointer *value) >+{ >+ g_return_val_if_fail (iter != NULL, NULL); >+ >+ if (iter->node != NULL) >+ iter->node = iter->node->next; >+ >+ while (iter->node == NULL) >+ { >+ if (iter->position >= iter->hash_table->size) >+ return FALSE; >+ iter->node = iter->hash_table->nodes[iter->position++]; >+ } >+ >+ if (key != NULL) >+ *key = iter->node->key; >+ if (value != NULL) >+ *value = iter->node->value; >+ >+ return TRUE; >+} >+ > > #define __G_HASH_C__ > #include "galiasdef.c" >--- glib/ghash.h.orig 2007-09-16 18:28:33.000000000 +0200 >+++ glib/ghash.h 2007-11-29 21:24:45.000000000 +0100 >@@ -38,6 +38,16 @@ > gpointer value, > gpointer user_data); > >+typedef struct _GHashTableIter GHashTableIter; >+ >+struct _GHashTableIter >+{ >+ /* private */ >+ GHashTable *hash_table; >+ GHashNode *node; >+ int position; >+}; >+ > /* Hash tables > */ > GHashTable* g_hash_table_new (GHashFunc hash_func, >@@ -81,6 +91,12 @@ > GList * g_hash_table_get_keys (GHashTable *hash_table); > GList * g_hash_table_get_values (GHashTable *hash_table); > >+void g_hash_table_iter_init (GHashTable *hash_table, >+ GHashTableIter *iter); >+gboolean g_hash_table_iter_next (GHashTableIter *iter, >+ gpointer *key, >+ gpointer *value); >+ > /* keeping hash tables alive */ > GHashTable* g_hash_table_ref (GHashTable *hash_table); > void g_hash_table_unref (GHashTable *hash_table);
Not a huge fan. Hash tables are not designed primarily for iteration, and there are so many other callbacks in glib that can use an iterator too. In any case, the iter should be made private. No more opaque types please.
I think this is a nice feature... lists don't need iters since you can just do the "for (l = list; l; l = l->next)" loop, and trees are rarely used, so this is the only commonly-used place in GLib that I end up repeatedly using a foreach func. Some suggestions: - it needs to be documented when the iter becomes valid; e.g. is it OK to remove items while iterating (in the current implementation it looks like it is not) - optionally, perhaps you could do a change serial to detect and warn about changes to the hash table while iterating; but this may add too much overhead I suppose - I would do the iter like GtkTextIter, with "dummy1" or "__dummy1" field names. I don't think it can be truly opaque, since having to malloc it would be too inconvenient. But giving nice names and types to the public fields means that people will try to use them.
(In reply to comment #4) > - it needs to be documented when the iter becomes valid; e.g. is it OK > to remove items while iterating (in the current implementation > it looks like it is not) Modifying the hash table while iterating yields undefined behaviour, like g_hash_table_foreach(). > - optionally, perhaps you could do a change serial to detect and > warn about changes to the hash table while iterating; but this may > add too much overhead I suppose The overhead is small, but I decided not to add this feature since g_hash_table_foreach() does not bother with it either. > - I would do the iter like GtkTextIter, with "dummy1" or "__dummy1" > field names. I don't think it can be truly opaque, > since having to malloc it would be too inconvenient. But > giving nice names and types to the public fields means that people > will try to use them. Good point. I did this (and also added the customary entry assertions that I had forgotten in g_hash_table_iter_init()) in the updated patch.
Created attachment 99892 [details] [review] updated patch
A) Unlike Behdad, I'd be a huge fan of this feature; I've written a *lot* of little closure structs over the years. B) g_hash_table_iter_init should either be init_iter() or have it's arguments reversed. (I'd probably do the second to keep it grouped with iter_next() C) + ri->node = ri->hash_table->nodes[ri->position++]; Would be distinctly clearer with ri->position++ on a separate line. Also clearer if 'position' was 'next_position' D) Would it make sense to add g_hash_table_iter_remove/steal() that remove/steal the *last* node? (Means adding an extra field to the iter to store the last value of ri->node) foreach_remove() is certainly less common than foreach(), but does come up reasonably often. One thing you couldn't do with that is the final G_HASH_TABLE_RESIZE (hash_table) but that isn't really a big deal.... there's not a whole lot of gain from resizing hash tables smaller.
Created attachment 99925 [details] [review] updated following Owen Taylor's comments
(In reply to comment #7) > A) Unlike Behdad, I'd be a huge fan of this feature; I've written a *lot* of > little closure structs over the years. Who hasn't! > B) g_hash_table_iter_init should either be init_iter() or have it's arguments > reversed. (I'd probably do the second to keep it grouped with iter_next() Fixed. > + ri->node = ri->hash_table->nodes[ri->position++]; > > Would be distinctly clearer with ri->position++ on a separate line. > Also clearer if 'position' was 'next_position' Fixed. > D) Would it make sense to add > > g_hash_table_iter_remove/steal() > > that remove/steal the *last* node? (Means adding an extra field to the > iter to store the last value of ri->node) foreach_remove() is certainly > less common than foreach(), but does come up reasonably often. I think you're right. I checked in my projects and found 3 occurrences in one. I've added these functions. Note that it also required another extra bool in the iter, but it's not like it matters a lot. > One thing you couldn't do with that is the final > G_HASH_TABLE_RESIZE (hash_table) > but that isn't really a big deal.... there's not a whole lot of gain > from resizing hash tables smaller. I could actually do it when _iter_next() returns FALSE (with yet another extra field "node_removed" in the iter), but that would assume that the user finishes walking the hash table after removing an entry. I'm not sure about the usefulness.
Created attachment 99926 [details] [review] Oops. Forgot updating hash_table->nnodes after removing a node.
I'd welcome this feature, too. Things that are still missing in the patch: - Docs. We could probably handle this after the initial commit. - What happens if you call remove() twice in a row ? Considering the pre-advancing, I guess it removes the next node. The docs should make that clear, too. - Symbols. Need to add the new functions to glib.symbols and docs/reference/glib/glib-symbols.txt
Created attachment 100109 [details] [review] fixed remove() and steal(); added docs, tests and updated glib.symbols
(In reply to comment #11) > Things that are still missing in the patch: > > - Docs. We could probably handle this after the initial commit. > > - What happens if you call remove() twice in a row ? Considering the > pre-advancing, I guess it removes the next node. The docs should make > that clear, too. > > - Symbols. Need to add the new functions to glib.symbols and > docs/reference/glib/glib-symbols.txt I've addressed these points (I cannot find glib-symbols.txt though) and also added regression tests. They pointed out a bug in the remove code, which corrupted the hash table when the removed node was a head node (I did not update hash_table->nodes).
Looks great, please commit. Cool that you added tests too. The file I meant is actually called docs/reference/glib/glib-sections.txt, but I can handle that myself, too.
+struct _GHashTableIter +{ + /*< private >*/ + gpointer dummy1; + gpointer dummy2; + gpointer dummy3; + int dummy4; + gboolean dummy5; +}; I still prefer an opaque struct and gslice, but if you're doing this, at least add one or two paddings for future use.
Created attachment 100136 [details] [review] added one pointer of padding; added new funcs to glib-sections.txt
(In reply to comment #15) > +struct _GHashTableIter > +{ > + /*< private >*/ > + gpointer dummy1; > + gpointer dummy2; > + gpointer dummy3; > + int dummy4; > + gboolean dummy5; > +}; > > I still prefer an opaque struct and gslice, but if you're doing this, at least > add one or two paddings for future use. I've added one pointer. Future extension can also use several bits of the position int, since I see that HASH_TABLE_MAX_SIZE is rather small. And the pre_advanced bool can also be moved into a bit of the position int.
(In reply to comment #17) > (In reply to comment #15) > > +struct _GHashTableIter > > +{ > > + /*< private >*/ > > + gpointer dummy1; > > + gpointer dummy2; > > + gpointer dummy3; > > + int dummy4; > > + gboolean dummy5; > > +}; > > > > I still prefer an opaque struct and gslice, but if you're doing this, at least > > add one or two paddings for future use. > > I've added one pointer. Future extension can also use several bits of the > position int, since I see that HASH_TABLE_MAX_SIZE is rather small. And the > pre_advanced bool can also be moved into a bit of the position int. Sure, there are many hacks one can use to work around such issues. My question is still unanswered though: why should we introduce a new public struct when its publicness is not needed? It's even easier to use a pointer as you don't have to make pointer (&iter) all the time. The allocation is shortlived and small, perfect for gslice.
(In reply to comment #18) > Sure, there are many hacks one can use to work around such issues. My question > is still unanswered though: why should we introduce a new public struct when > its publicness is not needed? It's even easier to use a pointer as you don't > have to make pointer (&iter) all the time. The allocation is shortlived and > small, perfect for gslice. It is needed in order to avoid the burden (need to free after use) and performance penalty of heap allocation. Iterators have value semantics and allocating them on the stack is a commonly accepted idiom (see for instance GTK+ or C#).
(In reply to comment #19) > (In reply to comment #18) > > Sure, there are many hacks one can use to work around such issues. My question > > is still unanswered though: why should we introduce a new public struct when > > its publicness is not needed? It's even easier to use a pointer as you don't > > have to make pointer (&iter) all the time. The allocation is shortlived and > > small, perfect for gslice. > > It is needed in order to avoid the burden (need to free after use) and On the other hand, _new/_free is a very common glib idiom. > performance penalty of heap allocation. Nonsense. gslice is faster than you can pick on it. > Iterators have value semantics and allocating them on the stack is a > commonly accepted idiom (see for instance GTK+ or C#). I'm ok if it was a documented public struct like GList that users could use directly instead of calling functions on to access key/value. But I see no point in what you are doing. Is new in glib for the least. What C# does is irrelevant too btw. Abstract talking doesn't help.
The thing that makes me prefer stack allocation here is because of the usage pattern .. you are always going to use this scoped to a single function and a loop. Having to free() is just annoying and easy to get wrong. This API is being added to be convenient. I see no reason for it to be annoying, even if the annoyance is slight. While something like PangoLayoutIter there's a big win to opacity, GHashTable and iteration over it is really simple and straightforward, and unlikely to ever change. (Sure, GHashTable *could* be reimplemented in a different way, but that hasn't happened in a decade...)
(In reply to comment #20) > (In reply to comment #19) > > (In reply to comment #18) > > > Sure, there are many hacks one can use to work around such issues. My question > > > is still unanswered though: why should we introduce a new public struct when > > > its publicness is not needed? It's even easier to use a pointer as you don't > > > have to make pointer (&iter) all the time. The allocation is shortlived and > > > small, perfect for gslice. > > > > It is needed in order to avoid the burden (need to free after use) and > > On the other hand, _new/_free is a very common glib idiom. Sure, they are often needed. But in this case they are not. > > performance penalty of heap allocation. > > Nonsense. gslice is faster than you can pick on it. It obviously cannot be faster than static stack allocation. It wouldn't even beat dynamic stack allocation. I don't see where the nonsense is. > > Iterators have value semantics and allocating them on the stack is a > > commonly accepted idiom (see for instance GTK+ or C#). > > I'm ok if it was a documented public struct like GList that users could use > directly instead of calling functions on to access key/value. But I see no > point in what you are doing. Is new in glib for the least. What C# does is > irrelevant too btw. Abstract talking doesn't help. I have already stated the two main reasons (no burden, no performance penalty) for choosing stack allocation in my previous message. Regarding C#, I mentioned it and GTK+ to suggest that I did not wake up one morning with the odd idea of allocating iterators on the stack. Well-designed systems do it.
> > > performance penalty of heap allocation. > > > > Nonsense. gslice is faster than you can pick on it. > > It obviously cannot be faster than static stack allocation. It wouldn't even > beat dynamic stack allocation. I don't see where the nonsense is. If you can't measure it, it's not optimization/penalty. > > > Iterators have value semantics and allocating them on the stack is a > > > commonly accepted idiom (see for instance GTK+ or C#). > > > > I'm ok if it was a documented public struct like GList that users could use > > directly instead of calling functions on to access key/value. But I see no > > point in what you are doing. Is new in glib for the least. What C# does is > > irrelevant too btw. Abstract talking doesn't help. > > I have already stated the two main reasons (no burden, no performance penalty) > for choosing stack allocation in my previous message. > > Regarding C#, I mentioned it and GTK+ to suggest that I did not wake up one > morning with the odd idea of allocating iterators on the stack. Well-designed > systems do it. Last time I had to add a member to a public struct in Pango (PangoAnalysis), I ended up relying on the padding the compiler added in to align a bitfield. Worse, we then found that gcc and Visual Studio pack bitfields differently, so had to get rid of the bitfield completely. This is the kind of issues I'm trying to avoid. Not all well-designed systems have the constraints we do. My job is to be critical. If Owen says it won't be a problem, who am I to disagree. Just don't act like it doesn't exist.
I have to agree with Owen here. Freeing iterators is just annoying. Do we need to bring this up for vote at tomorrow team meeting, Behdad ?
Nah, go for it if you are comfortable. As for freeing being annoying, well, this is not the first case :).
(In reply to comment #23) > > > > performance penalty of heap allocation. > > > > > > Nonsense. gslice is faster than you can pick on it. > > > > It obviously cannot be faster than static stack allocation. It wouldn't even > > beat dynamic stack allocation. I don't see where the nonsense is. > > If you can't measure it, it's not optimization/penalty. This mentality is not good. If you push it too much you end up rewriting glib in python or some other toy. Seriously, I know that the performance penalty is perfectly neglectible in the common case (just one alloc), but what about iterating over a small hash table inside a few large loops? > > > > Iterators have value semantics and allocating them on the stack is a > > > > commonly accepted idiom (see for instance GTK+ or C#). > > > > > > I'm ok if it was a documented public struct like GList that users could use > > > directly instead of calling functions on to access key/value. But I see no > > > point in what you are doing. Is new in glib for the least. What C# does is > > > irrelevant too btw. Abstract talking doesn't help. > > > > I have already stated the two main reasons (no burden, no performance penalty) > > for choosing stack allocation in my previous message. > > > > Regarding C#, I mentioned it and GTK+ to suggest that I did not wake up one > > morning with the odd idea of allocating iterators on the stack. Well-designed > > systems do it. > > Last time I had to add a member to a public struct in Pango (PangoAnalysis), I > ended up relying on the padding the compiler added in to align a bitfield. > Worse, we then found that gcc and Visual Studio pack bitfields differently, so > had to get rid of the bitfield completely. This is the kind of issues I'm > trying to avoid. Not all well-designed systems have the constraints we do. > > My job is to be critical. If Owen says it won't be a problem, who am I to > disagree. Just don't act like it doesn't exist. We all got bitten by compiler layout issues one day or another. It doesn't mean that public structs should be considered obscene regardless of the context. My little public struct is well-behaved and deserves to live.
Discussed this some more with behdad on irc, and he pointed at gsequence for an example of iterators in glib that are not stack-allocated, yet don't require freeing. I believe we can do pretty much the same for hash tables. In fact, Behdad may have a patch almost ready.
While it's technically possible, it's very hard to come up with maintainable code for that without enlarging GHashNode. The hardest part is implementing remove/steal, but even without that it involves black pointer magic that spreads all over the code (not only iter). Also, the hash table needs to be passed to all iter functions. And iter_next returns the new iterator that should be saved (ala g_list_append, etc). Seems like it's not worth the extra effort and inconvenience. I'm giving up.
Created attachment 100181 [details] [review] added g_hash_table_iter_get_hash_table() I saw g_sequence_iter_get_sequence() and realized I should also add a getter here. Hopefully there will be no objections.
> it involves black pointer magic that spreads all over the code hmm, didn't look that hard to me at first sight. I'll give it a try tonight. If it doesn't work out I'll concur and go with the stack-allocated iter approach.
I think you guys might be overthinking it. There is just no way we'd need to extend this struct, and if there ever is, having an extra pointer of padding should allow it. Proof: the existing GHashTable API is fine, this is just convenience API, and so by definition there can't be a truly _necessary_ extension to it, at best an extension would be some kind of small additional convenience. So, your worst case for the non-opaque struct is you want to add some small additional convenience and can't. There's no way that contingency is worth jumping through hoops to plan for. Especially when nobody has even a speculative example of a possible extension... I mean, it's an iterator. This is pretty well-understood. I say keep it simple.
Probably :-)
Yeah, lets go with the current approach. It might be worthwhile to add a small not to the g_hash_table_foreach docs that points out the iter api.
I don't see the rules of the game spelled out anywhere. Can you delete hash elements without invalidating the iterator? Insert? Have two iterators for the same hash? Also, a large fraction of hash iterations are for the special cases of getting a list of the domain or range. We really ought to have special functions for that.
> Also, a large fraction of hash iterations are for the special cases of > getting a list of the domain or range. We really ought to have special > functions for that. You mean like ? GList * g_hash_table_get_keys (GHashTable *hash_table); GList * g_hash_table_get_values (GHashTable *hash_table); I believe deletion of hash elements is covered reasonably well with the g_hash_table_iter_remove()/steal() functions. We should probably document somewhere that calling g_hash_table_insert/_remove() makes you lose any guarantees about ongoing iterators. And we should probably have some timestamp mechanism to prevent hard-to-find bugs due to this. Has this been discussed before ?
> GList * g_hash_table_get_keys (GHashTable *hash_table); > GList * g_hash_table_get_values (GHashTable *hash_table); Hmm... I am so behind the times, it seems. But GList was the wrong choice -- there is no ordering so going backwards makes no sense. Oh well, too late. The above have some impact on bug 60198.
(In reply to comment #35) > I believe deletion of hash elements is covered reasonably well with > the g_hash_table_iter_remove()/steal() functions. We should probably > document somewhere that calling g_hash_table_insert/_remove() makes you > lose any guarantees about ongoing iterators. Maybe, yes. I just thought it was so obvious that it did not need to be mentioned. > And we should probably have some > timestamp mechanism to prevent hard-to-find bugs due to this. Has this > been discussed before ? I haven't added timestamping because g_hash_table_foreach() does not have it either. I imagine that if deemed desirable, it should be added inside !G_DISABLE_ASSERT sections?
Obvious? Let's check the C++ equivalent: "Map has the important property that inserting a new element into a map does not invalidate iterators that point to existing elements. Erasing an element from a map also does not invalidate any iterators, except, of course, for iterators that actually point to the element that is being erased." (http://www.sgi.com/tech/stl/Map.html) So if you want to be obvious, you should perhaps make it work. At the very least it should be documented.
Morton: I seriously doubt, that C++ is useful for measuruing obviousness... ;-) Also Jean-Yves is very right in claiming obviousness, as g_hash_table_foreach also forbids modifications to the iterated hash table. Additionally in Java, C# and Python for instance, modifications to hash tables invalidate iterators. Java (http://java.sun.com/j2se/1.4.2/docs/api/java/util/HashMap.html): »The iterators returned by all of this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future. Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.« C# (http://msdn2.microsoft.com/en-us/library/ms132090.aspx): Dictionary.System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<TKey,TValue>>.GetEnumerator Method »An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and its behavior is undefined. The enumerator does not have exclusive access to the collection; therefore, enumerating through a collection is intrinsically not a thread-safe procedure. To guarantee thread safety during enumeration, you can lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.« Python (http://www.python.org/dev/peps/pep-0234/): »Dictionaries implement a tp_iter slot that returns an efficient iterator that iterates over the keys of the dictionary. During such an iteration, the dictionary should not be modified, except that setting the value for an existing key is allowed (deletions or additions are not, nor is the update() method). This means that we can write«
Right. So I guess my question is: do we want to have similar best-effort warnings ? I think we probably do.
(In reply to comment #37) > > And we should probably have some > > timestamp mechanism to prevent hard-to-find bugs due to this. Has this > > been discussed before ? > > I haven't added timestamping because g_hash_table_foreach() does not have it > either. I imagine that if deemed desirable, it should be added inside > !G_DISABLE_ASSERT sections? g_hash_table_foreach() is much harder to misuse.
Will someone commit this as-is, or should I add timestamping? I don't know when 2.16 will be released, but I'd like to make sure this will be part of it.
If you are willing to work on the timestamping, that would be great.
Created attachment 100866 [details] [review] update to SVN trunk; add versioning; mention that modifying the hash table invalidates the iter
Looks fine to me; I somewhat dislike the amount of #ifndef G_DISABLE_ASSERT but I'm not sure there is much we can do about it. Is it really worth bothering with the combined flags_and_version field ? It would probably simplify the code a bit to just alway allocate the version field.
Created attachment 100940 [details] [review] removed the flags business You're right, sometimes I just like to toy too much. For now the version will live in the padding pointer. If more space is needed later on, it'll always be time to compact things a bit to recover the pointer.
2007-12-14 Matthias Clasen <mclasen@redhat.com> * glib/glib.symbols: * glib/ghash.[hc]: Add hash table iterators. (#500507, Jean-Yves Lefort) * tests/hash-test.c: Test iterators.