GNOME Bugzilla – Bug 646948
Support for potential matches
Last modified: 2011-04-09 00:48:46 UTC
I've been working on adding an API to infer the likeliness of two individuals actually being the same Person (as in human being - not as in Folks.Persona). The following branch: http://git.collabora.co.uk/?p=user/rgs/folks/.git;a=shortlog;h=refs/heads/potential-matches adds a PotentialMatch class which provides a public method to explore a potential match between two given individuals. The rules currently used judge a potential match are: - same IM addresses - same or similar e-mail addresses (some heuristics to detect e-mail aliases is used) - same or similar phone numbers (I normalise phone numbers, as per Marco Barisione's approach, before comparing) - fuzzy comparison of individual's name (full-name, nickname, etc) using Jaro's distance [1] This could further be extended to provide some convenience methods (to the IndividualAggregator) to explore all the potential matches among the Individuals known at the time. [1] http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Most of the new tests fail for me. See the attached test output. folks/potential-match.vala ========================== + * Likely-ness of a potential match. Likeliness +public enum Folks.MatchResult +{ + VERY_UNLIKELY, + LOW, + MEDIUM, + HIGH, + VERY_HIGH, +} Change VERY_UNLIKELY to VERY_LOW for consistency. +public enum Folks.MatchCriteria A type name should always be singular (MatchCriterion) +public class Folks.PotentialMatchDecision : Object I'm really not convinced we need this class. As far as I can tell, it's mainly used to print out debugging information on demand (whereas we can just add debug() statements that print as comparisons are made) and complicates the code quite a bit. As I said in my original outline of this API, I don't think it's worth trying to hang on to the exact reasoning why a potential match got the rating it did. The only time that's usable is when you're debugging/tuning this code, in which case you can set FOLKS_DEBUG=folks to get detailed information and/or add debug statements as necessary when you're working on it. (And, honestly, I think the amount of extra code this class adds would make it harder to debug in general). Please remove this class and have the remaining code print debug() statements as it runs through the algorithm instead of tracking the details at this level. +{ ... + public GLib.Value value; Rename this for clarity. Is it the likeliness of the match in a range of [0,1]? + public void set_ignore_reason (string reason) + { + this.ignored = true; + this.ignore_reason = reason; + } This would probably be better as the setter for the property ignore-reason + c = "Name similitude"; Technically this is a word, but I had to spellcheck it to be sure :) "Similar names" would be better here. + * Whether two individuals are likely to be the same one. Replace "one" with "person". + if (this._individual_a.gender != this._individual_b.gender) + { + return this._result; + } Be sure to add in (this._individual_a_.gender != Gender.UNKNOWN && this._individual_b_.gender != Gender.UNKNOWN) to the conditional. This field is frequently unset on most services, so you don't want to rule out a match based on A.gender == UNSPECIFIED, B.gender == FEMALE. + int num_im_addrs = this.num_of_equal_im_addrs (a, b); Is this number ever used? We're probably better off using a boolean result so we can exit early as soon as we get the first match, since calculating potential matches probably gets CPU and I/O intensive when we've got a large-enough set of Individuals (so every shortcut helps). Same goes for num_of_equal_phone_numbers(), etc. This algorithm should do the same comparisons for Individual A's (aliases, nicknames, full_names) (all thrown into the same pool) and Individual B's (aliases, nicknames, full_names), since they're often good indicators of a match. It's probably fair to have this comparison increase the result as well (ie, if two Individuals have closely-matching structured names and closely-matching aliases/nicknames/full_names, result should get incremented twice). + public void debug_result () + { + /* + * For some reason FOLKS_DEBUG=tracker is not triggering + * debug () here so I am using printf as a fallback. + */ + GLib.stdout.printf ("Decision path: %s\n", this.decision_path ()); + GLib.stdout.printf ("Result: %s\n", this.match_to_str (this._result)); + } Please replace all these printf() calls with debug(). To get these as output, include "folks" in yoru FOLKS_DEBUG (eg, FOLKS_DEBUG=tracker,folks). + /* Approach: + * - taking in account family, given, prefix, suffix and additional names + * we give some points for each non-empty match + * + * TODO: + * - normalise strings (drop accents, extra spaces, etc) + * - for the time being we are doing strict matches, we should do + * fuzzy matching + * + * @since UNRELEASE + */ Add a D to the end of UNRELEASE + public double name_equalness (Individual a_ind, Individual b_ind) Make this private -- we can make it public later if it's really useful. Rename it _name_similarity(). + /* If individuals share common e-mails */ + this._inspect_emails (); It's a little odd that some of these function calls modify this._result directly, and some return the value instead. + private void _increment_result () + { + this._result = this._inc_match_level (this._result, 1); + } Just inline this. + private void _inspect_emails () + { + const string ignore_reason = "We already have a high score so this " + + "alias might mean something after all."; Extra space after "alias". This should probably also be a translatable string. + string[] info_a = fd_a.value.split ("@"); + string[] info_b = fd_b.value.split ("@"); Rename these to something like "email_split_a" and "email_split_b". + string[] tokens_a = info_a[0].split_set ("._-"); + string[] tokens_b = info_b[0].split_set ("._-"); Add + to this set of tokens. + /* Do we have: first.middle.last@ ~= fml@ ? */ I'm skeptical that this one is worth the processing. Looking at my contact lists, this never occurs. However, <first_name[0]><last_name>.*@.* is very common, as is <first_name>[.]?<last_name>.*@.*. Let's leave this out for now and add it back in if we get some evidence that it's worthwhile (eg, maybe it's common in some locales I'm not familiar with). + if (this._check_monogram_expansion (tokens_a, tokens_b)) Hmm, "monogram" might technically be the right word, but I'd use "initials" (since "monogram" usually refers to initials added to something you own, like a towel). + { + this._result = MatchResult.MEDIUM; Only do this if this._result was < MEDIUM (it could be HIGH if IM addresses already matched. In other words, matching in some ways should never reduce the total likeliness of a match. + var decision = Folks.PotentialMatchDecision.from_int + (Folks.MatchCriteria.EMAIL_ADDRESSES, 1); + decision.description = "Monogram expansion found."; + this.decisions.add ((owned) decision); + } + /* Is a set of tokens a subset of the other ? Whitespace between "other" and "?" + /* As in: first.middle.last@ ~= [first,middle,..]@ ? */ You're missing a closing */ -- probably best to remove the / before "* As" + else if (this._check_subset (tokens_a, tokens_b)) + { + this._result = MatchResult.MEDIUM; Same as above. + var decision = Folks.PotentialMatchDecision.from_int + (Folks.MatchCriteria.EMAIL_ADDRESSES, 1); + decision.description = "Subset of tokens found."; + this.decisions.add ((owned) decision); + } + /* If this are looking good we might want to give some extra + * points away for: + * - same domain + * - example.org ~= example.org.tld + */ I think this this will give us way too many false positives. Most of my friends have gmail accounts and I'm sure most people have a similar situation between some of the top web mail providers. It's also very common for people to have a lot of work contacts, which are all different people and have the same domain (or small set of domains). So, please cut this. + private MatchResult _inc_match_level ( + MatchResult current_level, int times = 1) + { + if (times == 0) + return current_level; + + MatchResult ret = current_level; + switch (current_level) + { + case MatchResult.VERY_UNLIKELY: + ret = MatchResult.LOW; + break; + case MatchResult.LOW: + ret = MatchResult.MEDIUM; + break; + case MatchResult.MEDIUM: + ret = MatchResult.HIGH; + break; + case MatchResult.HIGH: + ret = MatchResult.VERY_HIGH; + break; + } + + if (times > 1) + ret = this._inc_match_level (ret, times - 1); + + return ret; + } Simplify this function to: ret = current_level + times; if(ret > MatchResult.MAX) ret = MatchResult.MAX; return ret; And define MatchResult.MAX = MatchResult.HIGH under the enum (or something to that effect). Add a note above the enum that it needs to stay in ascending order. + private bool _check_monogram_expansion (string[] tokens_a, string[] tokens_b) + private bool _do_check_monogram_expansion (string[] expanded_name, + string monogram) + private bool _check_subset (string[] tokens_a, string[] tokens_b) + private bool _do_check_subset (string[] superset, string[] subset) These names should be more distinct/clear in what they do. + if (tokens_a.length > tokens_b.length) + return this._do_check_subset (tokens_a, tokens_b); + else + return this._do_check_subset (tokens_b, tokens_a); Why does the ordering matter? You're not using the same index for both of them, so it's not a matter of iterating past the end. + for (var i=0; i<subset.length; i++) + { + for (var j=0; j<superset.length; j++) These names are misleading -- we only know that one has size <= size of the other, not that it's a subset. + private void _dump_tokens (string title, string[] tokens) + { + GLib.stdout.printf ("Dumping: %s\n\n", title); + + for (int i=0; i<tokens.length; i++) + { + GLib.stdout.printf ("%d: %s\n", i, tokens[i]); + } + } Use debug() for these. + /* + * We should probably discriminate how much of subset one set + * is from the other (i.e.: subset / superset) ? + */ Why? There's nothing positional about the two sets of tokens we're comparing. + /* Watch out for common shared aliases: + * admin, webmaster, abuse */ + private bool _common_email_alias (string email) + { + return this.known_email_aliases.contains (email); + } No need for this to be its own function. Just inline the body. + private bool _are_equal (string a, string b) Rename to _str_equal_safe and move to a common source file. Please add a couple utility functions (and at least one test case for each): /* to get all matches for a given Individual */ /* use case: in an overview of a single individual, offer suggestions of related individuals */ HashMap<Individual, MatchResult> get_potential_matches (Individual matchee, MatchResult min_threshold = MatchResult.VERY_HIGH); /* to get all combinations between all Individuals (should be more efficient when calculating many matches) */ /* use case: utility program to quickly link up potentially-matching Individuals */ HashMap<Individual, HashMap<Individual, MatchResult>> get_all_potential_matches (MatchResult min_threshold = MatchResult.VERY_HIGH); ================================= ================================= tests/tracker/match-im-addresses.vala ===================================== + /* Since we are matching two individuals with the + * same e-mail address we are expecting the match + * to be HIGH. Maybe it would be better to check + * for >= HIGH since this might add up when we + * start doing fuzzy matching with the names. */ + assert (this._match == Folks.MatchResult.HIGH); Agreed -- please adjust the code as suggested and remove/adjust the comment. In fact, all assertions of the match should be a range (>= HIGH, < MEDIUM, etc.) so it's more future-proof as we adjust our matching algorithms. tests/tracker/match-known-emails.vala ===================================== + assert (this._match == Folks.MatchResult.LOW); Change to <= LOW tests/tracker/match-email-monogram.vala ======================================= If removing the monogram portion of the matching algorithm breaks this test, just remove it. tests/tracker/match-domain-alias.vala ===================================== + /* someuser@example.org =~ someuser@example.org.uk */ I think it's extremely rare that a single person would have this type of a match. Remove this test along with the portion of the algorithm that supports it. folks/potential-match.vala ========================== + if ((int)m > 0) Space between cast and variable + for (int i=0; i<s1.length; i++) + for (int i=pos-max_dist; i<=pos+max_dist; i++) Space on both sides of binary operators tests/tracker/match-name.vala ============================= Please add some more examples of names that should or shouldn't match and check that we get the results we expect.
Created attachment 185384 [details] Failed Potential Match test failures
Created attachment 185556 [details] [review] Add Folks.Utils
Created attachment 185557 [details] [review] Add helper methods to deal with phone numbers
Created attachment 185558 [details] [review] Support for Potential Matches
Created attachment 185559 [details] [review] Test matches via IM Addresses
Created attachment 185560 [details] [review] Test matches via e-mail addresses
Created attachment 185561 [details] [review] Avoid match via known emails
Created attachment 185562 [details] [review] Match via phone numbers
Created attachment 185563 [details] [review] Match via name similarity
Created attachment 185564 [details] [review] Test IndividualAggregator's methods to for matching individuals
Review of attachment 185556 [details] [review]: + internal static bool str_equal_safe (string a, string b) + { + return (a != "" && b != "" && a.down () == b.down ()); + } +} Name should start with a _, since it's internal. After that, apply.
Review of attachment 185557 [details] [review]: Looks good.
Review of attachment 185558 [details] [review]: + get_all_potential_matches() ... + for (int i=0; i < individuals.length () - 1; i++) + { + var a = individuals.nth_data (i); ... + var b = individuals.nth_data (j); These are inefficient, since it walks down the list (which we're already doing). Please use the standard for (unowned var l = list; l != null; l = l.next) { ... } pattern for GLib.List. + /* If individuals share common im-addresses */ + this._inspect_im_addresses (); + if (this._result == MatchResult.HIGH) + return this._result; + + /* If individuals share common e-mails */ + this._inspect_emails (); + if (this._result == MatchResult.HIGH) + return this._result; + + /* If individuals share common phone numbers */ + this._inspect_phone_numbers (); + if (this._result == MatchResult.HIGH) + return this._result; + + /* they have the same (normalised) name? */ + this._name_similarity (); + if (this._result == MatchResult.HIGH) + return this._result; These should definitely be >=, not == (assuming we're comparing with something < MatchResult.MAX; see the following). I appreciate the idea of exiting as soon as we find a strong match, but it makes all the values > MatchResult.HIGH less meaningful if the algorithm stops before it has a chance to settle on a result of MAX. It also has the undesirable property that re-ordering these sections could change the results of the function (though they should all be fairly independent). + public string match_to_str (MatchResult result) Rename "result_to_string" Otherwise, this is good.
Review of attachment 185559 [details] [review]: Looks good.
Review of attachment 185560 [details] [review]: + /* Since we are matching two individuals with the + * same e-mail address we are expecting the match + * to be HIGH. Maybe it would be better to check + * for >= HIGH since this might add up when we + * start doing fuzzy matching with the names. */ + assert (this._match == Folks.MatchResult.HIGH); Cut the comment, make this >=, then commit
Review of attachment 185561 [details] [review]: Looks good.
Review of attachment 185562 [details] [review]: Looks good.
Review of attachment 185563 [details] [review]: Looks good.
Review of attachment 185564 [details] [review]: Looks good
Merged: commit 3ac6dbb0dad5878efc31d92fd9d960447e9ebca4 Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:59:46 2011 +0100 Add test for get_{all}_potential_matches APIs tests/tracker/Makefile.am | 5 + tests/tracker/match-all.vala | 248 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 253 insertions(+), 0 deletions(-) commit 679a8dfb8fb8bd0abf77131abaf832c72dd67466 Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:59:00 2011 +0100 Add test to match via name similarity tests/tracker/Makefile.am | 6 + tests/tracker/match-name.vala | 268 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 274 insertions(+), 0 deletions(-) commit 6aa5289cfc57cc183edc7d9c99632f2201a61a9f Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:58:18 2011 +0100 Add test to match via phone numbers tests/tracker/Makefile.am | 7 + tests/tracker/match-phone-number.vala | 225 +++++++++++++++++++++++++++++++++ 2 files changed, 232 insertions(+), 0 deletions(-) commit 53edbc0b63ef45629cdda7ea2478448efadce2eb Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:57:10 2011 +0100 Add test to avoid matching known e-mail addresses tests/tracker/Makefile.am | 6 + tests/tracker/match-known-emails.vala | 231 +++++++++++++++++++++++++++++++++ 2 files changed, 237 insertions(+), 0 deletions(-) commit 1815d9300d2b4d2dc1b145148f003d34d0e04642 Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:56:09 2011 +0100 Add test to match via e-mail addresses tests/tracker/Makefile.am | 6 + tests/tracker/match-email-addresses.vala | 223 ++++++++++++++++++++++++++++++ 2 files changed, 229 insertions(+), 0 deletions(-) commit 5cd51be20a9a0bdd318bb9d123afa9e50f8fb794 Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:55:19 2011 +0100 Add test to match via IM Addresses tests/tracker/Makefile.am | 6 + tests/tracker/match-im-addresses.vala | 230 +++++++++++++++++++++++++++++++++ 2 files changed, 236 insertions(+), 0 deletions(-) commit 03a84c32ade9d303caaa438dc8f9572421a73245 Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 23:52:54 2011 +0100 Add support to perform potential matches among Individuals folks/Makefile.am | 1 + folks/individual-aggregator.vala | 75 ++++++ folks/potential-match.vala | 462 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 538 insertions(+), 0 deletions(-) commit a21a5408c5718e9e95342fced2791ca5fa77206d Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Mon Apr 4 15:51:49 2011 +0100 Add helper method to normalise and compare phone numbers folks/phone-details.vala | 109 ++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 109 insertions(+), 0 deletions(-) commit ff67bfcf4250952c18138344ebb888b6e2b4ee9c Author: Raul Gutierrez Segales <raul.gutierrez.segales@collabora.co.uk> Date: Fri Apr 8 20:51:32 2011 +0100 Add Folks.Utils to group helper methods folks/Makefile.am | 1 + folks/utils.vala | 28 ++++++++++++++++++++++++++++ 2 files changed, 29 insertions(+), 0 deletions(-)