GNOME Bugzilla – Bug 627483
Use a proper ordered set datatype for IMable.im_addresses
Last modified: 2011-01-21 21:29:59 UTC
The current datatype for IMable.im_addresses is HashTable<string, GenericArray<string>>; a mapping from protocol name to an ordered array of IM addresses. We do not want duplicates in this list, so there is currently some code in IndividualAggregator and Kf.PersonaStore to remove duplicates before setting the property. Ideally, the datatype used for IMable.im_addresses should be changed to one which uses a proper ordered set datatype: HashTable<string, OrderedSet<string>>. This would have the nice side-effect that we don't have to use the ugly GenericArray datatype any more (or a GLib.List, which would be even worse). Gee doesn't seem to contain a suitable datatype for OrderedSet, from a quick look, so we'll probably have to write it ourselves, taking care to make the C API for it nice.
(Mass changing milestones; please search for this phrase to delete the bug spam.)
In a branch that makes Individual implement IMable, this issue comes up again (I've added a FIXME comment pointing to this bug).
The ideal solution for this is implementing an ordered set type in GLib, getting it bound in Vala, then using that. Alternately, we could just create our own type and punt a more generic type to the distant future (last-minute breaks before stabilizing for 1.0 or indefinitely (ie, 2.x)) Ryan, any thoughts on this?
I'll take this on. As we discussed on IRC, this will be written in Vala as part of folks. GLib would be nice, but we could get this done faster directly in folks.
Ok. Implemented an ordered set with a linked list. Similar to Java's LinkedHashSet, I think. If anything, this should be put in libgee, not glib since glib does not even have a vanilla hash set (not counting null value hash tables). Up for review! http://git.collabora.co.uk/?p=user/eitan/folks.git;a=shortlog;h=refs/heads/linked-hash-set
In future, could you please attach a squashed patch of the branch to the bug report, so that it can be reviewed more easily? Thanks. + public override void set (int index, G item) + { + /* FIXME: What should happen here? */ + } How about making this add the item at the given index, checking beforehand that the item doesn't already exist in the set. If it does exist, it should be moved from its old index to the new index. + public override void insert (int index, G item) + { + /* FIXME: What should happen here? */ + } This could then complement LinkedHashSet.set() by doing nothing if the item already exists in the set (and inserting it at the given index otherwise). + public override bool add_all (Collection<G> collection) + { + bool modified = false; + + foreach (G item in collection) + modified |= add(item); + + return modified; + } Could this not be made more efficient in the case that collection is an instance of a LinkedList, by appending the list for collection to the set's list, rather than appending each element separately? That said, having taken a quick look at the API for LinkedList, I can't immediately see a method which allows appending one linked list to another. :-( + * If not provided, the functions parameters are requested to the + * {@link Functions} function factory methods. That's not very clear. * @since 0.1.13 */ - public abstract HashTable<string, GenericArray<string>> im_addresses + public abstract HashTable<string, LinkedHashSet<string>> im_addresses Might be an idea to bump the @since tag on any API which is broken. - this._im_addresses.insert (cur_protocol, im_array); + if (old_im_set != null) + cur_addresses.add_all (old_im_set); Shouldn't the mutation of this._im_addresses be kept (in some form) here? The rest looks good to me. I presume `make check` still passes? Please list the API breaks made by each commit in a change to NEWS in each commit which breaks API.
Created attachment 178481 [details] [review] Made IMable.im_addresses a LinkedHashSet.
Comment on attachment 178481 [details] [review] Made IMable.im_addresses a LinkedHashSet. Oops, premature.
Created attachment 178482 [details] [review] Made IMable.im_addresses a LinkedHashSet. As requested, here is a squashed version of all changes to master.
Created attachment 178483 [details] [review] Made IMable.im_addresses a LinkedHashSet. As requested, here is a squashed version of all changes to master.
Comment on attachment 178483 [details] [review] Made IMable.im_addresses a LinkedHashSet. *sigh* bugzilla is acting up.
I'll be squashing the commits of the various tweaks I did below to the original few. (In reply to comment #6) > In future, could you please attach a squashed patch of the branch to the bug > report, so that it can be reviewed more easily? Thanks. > > + public override void set (int index, G item) > + { > + /* FIXME: What should happen here? */ > + } > > How about making this add the item at the given index, checking beforehand that > the item doesn't already exist in the set. If it does exist, it should be moved > from its old index to the new index. > > + public override void insert (int index, G item) > + { > + /* FIXME: What should happen here? */ > + } > > This could then complement LinkedHashSet.set() by doing nothing if the item > already exists in the set (and inserting it at the given index otherwise). Implemented in 0409ed4. I am not entirely sure this is the best idea. When one does a set() they expect the collection to retain it's size, and when they do an insert() they expect it to grow by one. I am tempted to simply put in a assert_not_reached for these methods. But it's not too important. > > + public override bool add_all (Collection<G> collection) > + { > + bool modified = false; > +0409ed4 > + foreach (G item in collection) > + modified |= add(item); > + > + return modified; > + } > > Could this not be made more efficient in the case that collection is an > instance of a LinkedList, by appending the list for collection to the set's > list, rather than appending each element separately? That said, having taken a > quick look at the API for LinkedList, I can't immediately see a method which > allows appending one linked list to another. :-( Well, currently, the Gee version of a linked list does a separate add for each one (it does not have it's own implementation, it inherits from AbstractCollection.add_all). We don't have access to the LinkedList implementation, so we can't optimize it, the alternative would be to use the GLib one I guess. Second, the given argument is not guaranteed to be a linked list, but simply an iterable Collection. Third, each element is conditionally added, so we can't just blindly link the last element of one list with the first of another. We would end up iterating through the extended list removing duplicates. No optimization there. > > + * If not provided, the functions parameters are requested to the > + * {@link Functions} function factory methods. > > That's not very clear. Hope this is clearer: 589d8ce > > * @since 0.1.13 > */ > - public abstract HashTable<string, GenericArray<string>> im_addresses > + public abstract HashTable<string, LinkedHashSet<string>> im_addresses > > Might be an idea to bump the @since tag on any API which is broken. 9764269 > > - this._im_addresses.insert (cur_protocol, im_array); > + if (old_im_set != null) > + cur_addresses.add_all (old_im_set); > > Shouldn't the mutation of this._im_addresses be kept (in some form) here? > Not sure what you mean, cur_addresses is a reference to the object in the im_addresses mapping. So it is implicitly set. > > The rest looks good to me. I presume `make check` still passes? Yes! > Please list the > API breaks made by each commit in a change to NEWS in each commit which breaks > API. 9764269 I'll squash it into the proper commit once your r+ this.
(In reply to comment #12) > I'll be squashing the commits of the various tweaks I did below to the original > few. > > (In reply to comment #6) > > In future, could you please attach a squashed patch of the branch to the bug > > report, so that it can be reviewed more easily? Thanks. > > > > + public override void set (int index, G item) > > + { > > + /* FIXME: What should happen here? */ > > + } > > > > How about making this add the item at the given index, checking beforehand that > > the item doesn't already exist in the set. If it does exist, it should be moved > > from its old index to the new index. > > > > + public override void insert (int index, G item) > > + { > > + /* FIXME: What should happen here? */ > > + } > > > > This could then complement LinkedHashSet.set() by doing nothing if the item > > already exists in the set (and inserting it at the given index otherwise). > > Implemented in 0409ed4. I am not entirely sure this is the best idea. When one > does a set() they expect the collection to retain it's size, and when they do > an insert() they expect it to grow by one. I am tempted to simply put in a > assert_not_reached for these methods. But it's not too important. You make a good point. I think it would be better to implement set() and insert() now, rather than at some point in the future when this discussion has been forgotten. Could you please re-implement them with the semantics you describe? > > > > + public override bool add_all (Collection<G> collection) > > + { > > + bool modified = false; > > +0409ed4 > > + foreach (G item in collection) > > + modified |= add(item); > > + > > + return modified; > > + } > > > > Could this not be made more efficient in the case that collection is an > > instance of a LinkedList, by appending the list for collection to the set's > > list, rather than appending each element separately? That said, having taken a > > quick look at the API for LinkedList, I can't immediately see a method which > > allows appending one linked list to another. :-( > > Well, currently, the Gee version of a linked list does a separate add for each > one (it does not have it's own implementation, it inherits from > AbstractCollection.add_all). We don't have access to the LinkedList > implementation, so we can't optimize it, the alternative would be to use the > GLib one I guess. > Second, the given argument is not guaranteed to be a linked list, but simply an > iterable Collection. > Third, each element is conditionally added, so we can't just blindly link the > last element of one list with the first of another. We would end up iterating > through the extended list removing duplicates. No optimization there. 1. Using the GLib linked list implementation is a trap. folks has been plagued by problems caused by the fact that Vala can't properly wrap GList and get the reference counting right. 2. I should've been more explicit: I meant adding a special case "if (collection is LinkedList)" (see http://live.gnome.org/Vala/Tutorial#Run-Time_Type_Information), and falling back to the current code if that test fails. 3. I forgot about that, though the way I was thinking about it, running once through the appended list and checking each element against the hash table would still be O(n), compared to the O(n^2) cost of traversing the whole list for each element we append. Having just thought about it some more, I remembered that libgee's LinkedList keeps a pointer to the end of the list, so the way you've implemented it is O(n) in the size of the collection anyway, so I'll just shut up and apologise for wasting your time. :-( > > + * If not provided, the functions parameters are requested to the > > + * {@link Functions} function factory methods. > > > > That's not very clear. > > Hope this is clearer: 589d8ce Great. > > * @since 0.1.13 > > */ > > - public abstract HashTable<string, GenericArray<string>> im_addresses > > + public abstract HashTable<string, LinkedHashSet<string>> im_addresses > > > > Might be an idea to bump the @since tag on any API which is broken. > > 9764269 Also great. > > - this._im_addresses.insert (cur_protocol, im_array); > > + if (old_im_set != null) > > + cur_addresses.add_all (old_im_set); > > > > Shouldn't the mutation of this._im_addresses be kept (in some form) here? > > > > Not sure what you mean, cur_addresses is a reference to the object in the > im_addresses mapping. So it is implicitly set. The Individual._update_im_addresses() method is supposed to be updating Individual._im_addresses, ensuring it's the (up to date) union of the sets of IM addresses of the Individual's Personas. cur_addresses, however, is the IM address set belonging to a Persona, so updating it is not what we want to do. We want to update Individual._im_addresses. The code should probably also be clearing Individual._im_addresses before it loops through the Individual's Personas, which is something it's currently failing to do. :-( > > The rest looks good to me. I presume `make check` still passes? > > Yes! Yay! > > Please list the > > API breaks made by each commit in a change to NEWS in each commit which breaks > > API. > > 9764269 I'll squash it into the proper commit once your r+ this. You should probably also list the addition of LinkedHashSet as a new public API in NEWS. Thanks!
Note that once this has been committed, there are various places in the IndividualAggregator which could be ported to use the new LinkedHashSet. From a quick grep for HashSet, there's at least the following: • candidate_inds and candidate_ind_set in _add_personas() • relinked_personas and relinked_personas_set in _personas_changed_cb()
Created attachment 178606 [details] [review] Made IMable.im_addresses a LinkedHashSet. Latest review round
(In reply to comment #13) > (In reply to comment #12) > > I'll be squashing the commits of the various tweaks I did below to the original > > few. > > > > (In reply to comment #6) > > > In future, could you please attach a squashed patch of the branch to the bug > > > report, so that it can be reviewed more easily? Thanks. > > > > > > + public override void set (int index, G item) > > > + { > > > + /* FIXME: What should happen here? */ > > > + } > > > > > > How about making this add the item at the given index, checking beforehand that > > > the item doesn't already exist in the set. If it does exist, it should be moved > > > from its old index to the new index. > > > > > > + public override void insert (int index, G item) > > > + { > > > + /* FIXME: What should happen here? */ > > > + } > > > > > > This could then complement LinkedHashSet.set() by doing nothing if the item > > > already exists in the set (and inserting it at the given index otherwise). > > > > Implemented in 0409ed4. I am not entirely sure this is the best idea. When one > > does a set() they expect the collection to retain it's size, and when they do > > an insert() they expect it to grow by one. I am tempted to simply put in a > > assert_not_reached for these methods. But it's not too important. > > You make a good point. I think it would be better to implement set() and > insert() now, rather than at some point in the future when this discussion has > been forgotten. Could you please re-implement them with the semantics you > describe? 261765e > > > > > > > + public override bool add_all (Collection<G> collection) > > > + { > > > + bool modified = false; > > > +0409ed4 > > > + foreach (G item in collection) > > > + modified |= add(item); > > > + > > > + return modified; > > > + } > > > > > > Could this not be made more efficient in the case that collection is an > > > instance of a LinkedList, by appending the list for collection to the set's > > > list, rather than appending each element separately? That said, having taken a > > > quick look at the API for LinkedList, I can't immediately see a method which > > > allows appending one linked list to another. :-( > > > > Well, currently, the Gee version of a linked list does a separate add for each > > one (it does not have it's own implementation, it inherits from > > AbstractCollection.add_all). We don't have access to the LinkedList > > implementation, so we can't optimize it, the alternative would be to use the > > GLib one I guess. > > Second, the given argument is not guaranteed to be a linked list, but simply an > > iterable Collection. > > Third, each element is conditionally added, so we can't just blindly link the > > last element of one list with the first of another. We would end up iterating > > through the extended list removing duplicates. No optimization there. > > 1. Using the GLib linked list implementation is a trap. folks has been plagued > by problems caused by the fact that Vala can't properly wrap GList and get the > reference counting right. > > 2. I should've been more explicit: I meant adding a special case "if > (collection is LinkedList)" (see > http://live.gnome.org/Vala/Tutorial#Run-Time_Type_Information), and falling > back to the current code if that test fails. > > 3. I forgot about that, though the way I was thinking about it, running once > through the appended list and checking each element against the hash table > would still be O(n), compared to the O(n^2) cost of traversing the whole list > for each element we append. Having just thought about it some more, I > remembered that libgee's LinkedList keeps a pointer to the end of the list, so > the way you've implemented it is O(n) in the size of the collection anyway, so > I'll just shut up and apologise for wasting your time. :-( > > > > + * If not provided, the functions parameters are requested to the > > > + * {@link Functions} function factory methods. > > > > > > That's not very clear. > > > > Hope this is clearer: 589d8ce > > Great. > > > > * @since 0.1.13 > > > */ > > > - public abstract HashTable<string, GenericArray<string>> im_addresses > > > + public abstract HashTable<string, LinkedHashSet<string>> im_addresses > > > > > > Might be an idea to bump the @since tag on any API which is broken. > > > > 9764269 > > Also great. > > > > - this._im_addresses.insert (cur_protocol, im_array); > > > + if (old_im_set != null) > > > + cur_addresses.add_all (old_im_set); > > > > > > Shouldn't the mutation of this._im_addresses be kept (in some form) here? > > > > > > > Not sure what you mean, cur_addresses is a reference to the object in the > > im_addresses mapping. So it is implicitly set. > > The Individual._update_im_addresses() method is supposed to be updating > Individual._im_addresses, ensuring it's the (up to date) union of the sets of > IM addresses of the Individual's Personas. > > cur_addresses, however, is the IM address set belonging to a Persona, so > updating it is not what we want to do. We want to update > Individual._im_addresses. > > The code should probably also be clearing Individual._im_addresses before it > loops through the Individual's Personas, which is something it's currently > failing to do. :-( Oops, it was confusing to read. I hope it does the right thing now. ad6eefa > > > > The rest looks good to me. I presume `make check` still passes? > > > > Yes! > > Yay! > > > > Please list the > > > API breaks made by each commit in a change to NEWS in each commit which breaks > > > API. > > > > 9764269 I'll squash it into the proper commit once your r+ this. > > You should probably also list the addition of LinkedHashSet as a new public API > in NEWS. Thanks! 99a6ce1
Squash commit ad6eefa139 onto 30a50288e6, c689427c1a onto 30a50288e6, etc., then push and make a closing comment here with the final 'git log --stat' output for the final commits.
Merged to master! commit 14b8da134abffcf0272c06c72d5f6df6036e06e7 Author: Eitan Isaacson <eitan@monotonous.org> Date: Sun Jan 16 11:48:24 2011 +0200 Modified backends to use LinkedHashSet in IMable.im_addresses. backends/key-file/kf-persona-store.vala | 4 +- backends/key-file/kf-persona.vala | 75 ++++++++++++------------------- backends/telepathy/lib/tpf-persona.vala | 12 +++--- 3 files changed, 37 insertions(+), 54 deletions(-) commit f1f826688cfec808d98bf3e588735812cde6160c Author: Eitan Isaacson <eitan@monotonous.org> Date: Sun Jan 16 11:47:50 2011 +0200 Modified libfolks to use LinkedHashSet in im_addresses. NEWS | 1 + folks/imable.vala | 4 ++-- folks/individual-aggregator.vala | 34 ++++++++-------------------------- folks/individual.vala | 29 +++++++++-------------------- 4 files changed, 20 insertions(+), 48 deletions(-) commit da5e0d35d5800bb03ff1e0818c1b51f4284e7c76 Author: Eitan Isaacson <eitan@monotonous.org> Date: Sun Jan 16 11:46:27 2011 +0200 Added a LinkedHashSet ordered set. NEWS | 3 + folks/Makefile.am | 1 + folks/linked-hash-set.vala | 203 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 207 insertions(+), 0 deletions(-)
I started writing some tests for LinkedHashSet and noticed a couple things: * LinkedHashSet.set() and .insert() just call assert_not_reached(). I thought they were going to be implemented as suggested above * we're making libgee a requirement for our clients due to the types LinkedHashSet exposes in its API (which I guess is unavoidable if we want LinkedHashSet to implement Gee.AbstractList and Gee.Set). I think we should make sure Empathy is OK with adding libgee as a requirement. Guillaume, what do you think? libgee is already a blessed external dependency.
(In reply to comment #19) > I started writing some tests for LinkedHashSet and noticed a couple things: > > * LinkedHashSet.set() and .insert() just call assert_not_reached(). I thought > they were going to be implemented as suggested above > Implementing set() and insert() seemed like an antipattern. The few other ordered set implementations out there don't have similar methods either, afaik. The reason being, when you set(), you expect the collection to remain the same size, but in a LinkedHashSet it will actually shrink by one if the item exists somewhere else in the set. When you insert you have the opposite problem, the collection is expected to grow by one, but if the item is somewhere else in the set, it instead stays the same size. I guess we could document that for each of those methods, I just thought it would be good to leave out we don't depend on it. Phillip gave his nod after I actually implemented it already in one of the revisions above. > * we're making libgee a requirement for our clients due to the types > LinkedHashSet exposes in its API (which I guess is unavoidable if we want > LinkedHashSet to implement Gee.AbstractList and Gee.Set). I think we should > make sure Empathy is OK with adding libgee as a requirement. > > Guillaume, what do you think? libgee is already a blessed external dependency. From a quick look, the external API usually exposes an immutable list in signals and such. We could possibly export it to an array through Gee.Collection.to_array(), or make up another method that gives us a smarter container, like a GenericArray or a *gasp* GLib.List.
(In reply to comment #20) > (In reply to comment #19) > > I started writing some tests for LinkedHashSet and noticed a couple things: > > > > * LinkedHashSet.set() and .insert() just call assert_not_reached(). I thought > > they were going to be implemented as suggested above > > > > Implementing set() and insert() seemed like an antipattern. The few other > ordered set implementations out there don't have similar methods either, afaik. > > The reason being, when you set(), you expect the collection to remain the same > size, but in a LinkedHashSet it will actually shrink by one if the item exists > somewhere else in the set. Why would it shrink in that case? I'd think it'd stay the same size? It's fair enough that it doesn't quite make sense to "set" in a list, though. > When you insert you have the opposite problem, the collection is expected to > grow by one, but if the item is somewhere else in the set, it instead stays the > same size. Right. > I guess we could document that for each of those methods, I just thought it > would be good to leave out we don't depend on it. Phillip gave his nod after I > actually implemented it already in one of the revisions above. Yeah, that was my only confusion - I was double-checking that you didn't accidentally merge the wrong commits. > > * we're making libgee a requirement for our clients due to the types > > LinkedHashSet exposes in its API (which I guess is unavoidable if we want > > LinkedHashSet to implement Gee.AbstractList and Gee.Set). I think we should > > make sure Empathy is OK with adding libgee as a requirement. > > > > Guillaume, what do you think? libgee is already a blessed external dependency. > > From a quick look, the external API usually exposes an immutable list in > signals and such. We could possibly export it to an array through > Gee.Collection.to_array(), or make up another method that gives us a smarter > container, like a GenericArray or a *gasp* GLib.List. Those are both decent options. I'll look into them. The upshot of using GenericArray is that it's a fairly reasonable type in C (GPtrArray) yet shouldn't have GLib.List's memory management issues where we use the return value from Vala.