GNOME Bugzilla – Bug 678474
potential-match should be smarter with accents
Last modified: 2012-06-26 08:01:39 UTC
Let's say I have an individual called "Frederic Peters" and another one "Frédéric Peters". They are obviously the same person but gnome-contacts does not suggest me to link them (MatchResult.HIGH): (gnome-contacts:13363): folks-DEBUG: potential-match.vala:593: Frédéric Peters and Frederic Peters have 16 matches and 5,000000 / 2 transpositions (gnome-contacts:13363): folks-DEBUG: potential-match.vala:494: [jaro_dist] Distance for Frédéric Peters and Frederic Peters: 0,928309 (gnome-contacts:13363): folks-DEBUG: potential-match.vala:250: [name_similarity] Got 0,700000 (numbers can be wrong, this example is not actually from Fred, I just used his name because he's public enough to not care having his name in this bug :) Looks like Folks uses the Jaro distance to decide if 2 names are similar enough. I think it should first pre-process the strings to 'reduce' them to their non accentuated form like Empathy does in live search to ignore accents.
Here is the Empathy code, most of it can probably be reused. http://git.gnome.org/browse/empathy/tree/libempathy-gtk/empathy-live-search.c#n134
This was already implemented in http://git.gnome.org/browse/folks/diff/?id=f8e59f0d0a051ed9dcfeab2467b78ca1aabec741 (bug #670872) using the code from Empathy’s live search. There are two possibilities: • The code is broken. • The code is successfully stripping the accents, but that still doesn’t give a high enough matching score to get MatchResult.HIGH. IIRC this could happen if the pair of names are really short. Could you provide an actual pair of names which are a problem?
Created attachment 217066 [details] [review] Fix character stripping code in PotentialMatch https://www.gitorious.org/folks/folks/commits/678474-name-matching Turns out there were indexing problems in the character stripping code, which this patch fixes. It also adds a test case.
Review of attachment 217066 [details] [review]: Looks good to me and fixed my issue; thanks!
May be another bug, but Contacts didn't suggest me to link my contact "Alice Badger" and "alice.badger".
(In reply to comment #5) > May be another bug, but Contacts didn't suggest me to link my contact "Alice > Badger" and "alice.badger". Yeah, I think that's a different issue. We should probably add a step where we replace all non-letters with spaces to find more related names.
Comment on attachment 217066 [details] [review] Fix character stripping code in PotentialMatch commit 697b611adfbfffdb37a817feb5482f43bb715e1e Author: Philip Withnall <philip@tecnocode.co.uk> Date: Sat Jun 23 11:24:53 2012 +0100 core: Fix indexing for unichars in PotentialMatch’s Jaro distance function The indexing was getting out of sync between the two strings if one contained a different number of non-ASCII characters to the other (as the indexing was done in terms of bytes, rather than characters). This re-works the Jaro distance function to operate on unichar arrays, and index in terms of characters. This also means that exact matches now work over stripped non-ASCII characters, which they didn’t before. This adds a new test case. Yay. Closes: https://bugzilla.gnome.org/show_bug.cgi?id=678474 NEWS | 1 + folks/potential-match.vala | 89 +++++++++++++++++++++++++---------------- tests/tracker/match-name.vala | 9 ++++ 3 files changed, 64 insertions(+), 35 deletions(-) commit 6dee26ca93460f1518f13c3af09cf36fff88b8f3 Author: Philip Withnall <philip@tecnocode.co.uk> Date: Sat Jun 23 10:09:14 2012 +0100 core: Tidy up MatchResult usage in PotentialMatch a little There’s no need for the result to be an instance variable. folks/potential-match.vala | 74 +++++++++++++++++++++++--------------------- 1 files changed, 39 insertions(+), 35 deletions(-)
Created attachment 217234 [details] [review] Replace symbols with spaces in PotentialMatch https://www.gitorious.org/folks/folks/trees/678474-name-characters Have a patch.
Review of attachment 217234 [details] [review]: Without reading the Unicode specification to understand all these types, it looks good. My only suggestion would be adding this to a different backend's test suite, if possible, since I still do plan to move the Tracker backend to its own module.
Comment on attachment 217234 [details] [review] Replace symbols with spaces in PotentialMatch commit 2341b9772bfba76ff153d3b36aa88cb3409e8681 Author: Philip Withnall <philip@tecnocode.co.uk> Date: Mon Jun 25 20:07:21 2012 +0100 core: Replace symbols with spaces when matching names in PotentialMatch This allows, for example, “Alice Badger” and “alice.badger” to have a HIGH match. This is a common situation when trying to pair up IM personas with other personas. This also includes a test case. Closes: https://bugzilla.gnome.org/show_bug.cgi?id=678474 folks/potential-match.vala | 17 ++++++++++------- tests/tracker/match-name.vala | 9 +++++++++ 2 files changed, 19 insertions(+), 7 deletions(-)
(In reply to comment #9) > My only suggestion would be adding this to a different backend's test suite, if > possible, since I still do plan to move the Tracker backend to its own module. I don’t think that’s possible until we’ve got a dummy backend which we can use for tests. Once we’ve got that, I expect a load of the Tracker tests can be moved into a core test suite. Thanks for the fast review.
That fixed it, thanks!