After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 678474 - potential-match should be smarter with accents
potential-match should be smarter with accents
Status: RESOLVED FIXED
Product: folks
Classification: Platform
Component: general
git master
Other Linux
: Normal enhancement
: Unset
Assigned To: folks-maint
folks-maint
Depends on:
Blocks:
 
 
Reported: 2012-06-20 12:41 UTC by Guillaume Desmottes
Modified: 2012-06-26 08:01 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Fix character stripping code in PotentialMatch (13.98 KB, patch)
2012-06-23 10:39 UTC, Philip Withnall
committed Details | Review
Replace symbols with spaces in PotentialMatch (3.15 KB, patch)
2012-06-25 19:09 UTC, Philip Withnall
committed Details | Review

Description Guillaume Desmottes 2012-06-20 12:41:15 UTC
Let's say I have an individual called "Frederic Peters" and another one "Frédéric Peters". They are obviously the same person but gnome-contacts does not suggest me to link them (MatchResult.HIGH):

(gnome-contacts:13363): folks-DEBUG: potential-match.vala:593: Frédéric Peters and Frederic Peters have 16 matches and 5,000000 / 2 transpositions
(gnome-contacts:13363): folks-DEBUG: potential-match.vala:494: [jaro_dist] Distance for Frédéric Peters and Frederic Peters: 0,928309
(gnome-contacts:13363): folks-DEBUG: potential-match.vala:250: [name_similarity] Got 0,700000

(numbers can be wrong, this example is not actually from Fred, I just used his name because he's public enough to not care having his name in this bug :)

Looks like Folks uses the Jaro distance to decide if 2 names are similar enough. I think it should first pre-process the strings to 'reduce' them to their non accentuated form like Empathy does in live search to ignore accents.
Comment 1 Guillaume Desmottes 2012-06-20 12:43:35 UTC
Here is the Empathy code, most of it can probably be reused.
http://git.gnome.org/browse/empathy/tree/libempathy-gtk/empathy-live-search.c#n134
Comment 2 Philip Withnall 2012-06-22 11:40:51 UTC
This was already implemented in http://git.gnome.org/browse/folks/diff/?id=f8e59f0d0a051ed9dcfeab2467b78ca1aabec741 (bug #670872) using the code from Empathy’s live search.

There are two possibilities:
 • The code is broken.
 • The code is successfully stripping the accents, but that still doesn’t give a high enough matching score to get MatchResult.HIGH. IIRC this could happen if the pair of names are really short.

Could you provide an actual pair of names which are a problem?
Comment 3 Philip Withnall 2012-06-23 10:39:57 UTC
Created attachment 217066 [details] [review]
Fix character stripping code in PotentialMatch

https://www.gitorious.org/folks/folks/commits/678474-name-matching

Turns out there were indexing problems in the character stripping code, which this patch fixes. It also adds a test case.
Comment 4 Guillaume Desmottes 2012-06-25 08:26:11 UTC
Review of attachment 217066 [details] [review]:

Looks good to me and fixed my issue; thanks!
Comment 5 Guillaume Desmottes 2012-06-25 08:31:19 UTC
May be another bug, but Contacts didn't suggest me to link my contact "Alice Badger" and "alice.badger".
Comment 6 Travis Reitter 2012-06-25 16:16:59 UTC
(In reply to comment #5)
> May be another bug, but Contacts didn't suggest me to link my contact "Alice
> Badger" and "alice.badger".

Yeah, I think that's a different issue. We should probably add a step where we replace all non-letters with spaces to find more related names.
Comment 7 Philip Withnall 2012-06-25 18:47:53 UTC
Comment on attachment 217066 [details] [review]
Fix character stripping code in PotentialMatch

commit 697b611adfbfffdb37a817feb5482f43bb715e1e
Author: Philip Withnall <philip@tecnocode.co.uk>
Date:   Sat Jun 23 11:24:53 2012 +0100

    core: Fix indexing for unichars in PotentialMatch’s Jaro distance function
    
    The indexing was getting out of sync between the two strings if one contained
    a different number of non-ASCII characters to the other (as the indexing was
    done in terms of bytes, rather than characters). This re-works the Jaro
    distance function to operate on unichar arrays, and index in terms of
    characters. This also means that exact matches now work over stripped non-ASCII
    characters, which they didn’t before.
    
    This adds a new test case. Yay.
    
    Closes: https://bugzilla.gnome.org/show_bug.cgi?id=678474

 NEWS                          |    1 +
 folks/potential-match.vala    |   89 +++++++++++++++++++++++++----------------
 tests/tracker/match-name.vala |    9 ++++
 3 files changed, 64 insertions(+), 35 deletions(-)

commit 6dee26ca93460f1518f13c3af09cf36fff88b8f3
Author: Philip Withnall <philip@tecnocode.co.uk>
Date:   Sat Jun 23 10:09:14 2012 +0100

    core: Tidy up MatchResult usage in PotentialMatch a little
    
    There’s no need for the result to be an instance variable.

 folks/potential-match.vala |   74 +++++++++++++++++++++++---------------------
 1 files changed, 39 insertions(+), 35 deletions(-)
Comment 8 Philip Withnall 2012-06-25 19:09:54 UTC
Created attachment 217234 [details] [review]
Replace symbols with spaces in PotentialMatch

https://www.gitorious.org/folks/folks/trees/678474-name-characters

Have a patch.
Comment 9 Travis Reitter 2012-06-25 21:45:47 UTC
Review of attachment 217234 [details] [review]:

Without reading the Unicode specification to understand all these types, it looks good.

My only suggestion would be adding this to a different backend's test suite, if possible, since I still do plan to move the Tracker backend to its own module.
Comment 10 Philip Withnall 2012-06-25 21:50:09 UTC
Comment on attachment 217234 [details] [review]
Replace symbols with spaces in PotentialMatch

commit 2341b9772bfba76ff153d3b36aa88cb3409e8681
Author: Philip Withnall <philip@tecnocode.co.uk>
Date:   Mon Jun 25 20:07:21 2012 +0100

    core: Replace symbols with spaces when matching names in PotentialMatch
    
    This allows, for example, “Alice Badger” and “alice.badger” to have a HIGH
    match. This is a common situation when trying to pair up IM personas with
    other personas.
    
    This also includes a test case.
    
    Closes: https://bugzilla.gnome.org/show_bug.cgi?id=678474

 folks/potential-match.vala    |   17 ++++++++++-------
 tests/tracker/match-name.vala |    9 +++++++++
 2 files changed, 19 insertions(+), 7 deletions(-)
Comment 11 Philip Withnall 2012-06-25 21:51:45 UTC
(In reply to comment #9)
> My only suggestion would be adding this to a different backend's test suite, if
> possible, since I still do plan to move the Tracker backend to its own module.

I don’t think that’s possible until we’ve got a dummy backend which we can use for tests. Once we’ve got that, I expect a load of the Tracker tests can be moved into a core test suite.

Thanks for the fast review.
Comment 12 Guillaume Desmottes 2012-06-26 08:01:39 UTC
That fixed it, thanks!