Bug 528498 – Searching and sorting should ignore accents/use UTF equivalents




Summary:	Searching and sorting should ignore accents/use UTF equivalents


Status:	RESOLVED FIXED

Product:	banshee
Classification:	Other
Component:	User Interface
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	1.2
Assigned To:	Banshee Maintainers
QA Contact:	Banshee Maintainers

URL:
Whiteboard:

Duplicates:	533871 534915 535216 583328 583379
Depends on:	499650 568787
Blocks:

Reported:	2008-04-17 00:47 UTC by Sandy Armstrong
Modified:	2009-05-21 02:52 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Proof of concept for a function to safely strip accents from Latin text (891 bytes, patch) 2009-02-04 21:22 UTC, John Millikin	none
Ignore accents when searching (v1) (10.73 KB, patch) 2009-02-07 00:25 UTC, John Millikin	none
Ignore accents when searching (v1) (12.52 KB, patch) 2009-02-07 00:40 UTC, John Millikin	none
Ignore accents when searching (v2) (12.13 KB, patch) 2009-02-07 05:22 UTC, John Millikin	needs-work
Ignore accents when searching (v3) (15.78 KB, patch) 2009-02-10 21:38 UTC, John Millikin	committed

Description Sandy Armstrong 2008-04-17 00:47:14 UTC

Searching takes accented letters into account, but I don't know how to type them and shouldn't have to learn.

Steps to reproduce:

1. Download awesome Bézèd'h album (free) here: http://www.jamendo.com/en/album/135
2. Import said album into Banshee
3. Attempt to find your awesome new album by typing "bez" in the search field

Expected results:

Awesome new Bézèd'h album appears, for your listening pleasure.

Actual results:

No music appears. Baby Jesus cries at lack of French Celtic rock.

Let's just ignore those pesky accents, m'kay?

Comment 1 Gabriel Burt 2008-05-22 00:10:44 UTC

*** Bug 533871 has been marked as a duplicate of this bug. ***

Comment 2 Benjamín Valero Espinosa 2008-05-26 14:44:44 UTC

The solution could be that the 'lowername' in the table of tracks in the database contained the name lowered and without accents.

Comment 3 Gabriel Burt 2008-05-26 18:10:47 UTC

*** Bug 534915 has been marked as a duplicate of this bug. ***

Comment 4 Gabriel Burt 2008-05-28 17:37:35 UTC

*** Bug 535216 has been marked as a duplicate of this bug. ***

Comment 5 Mikayla Hutchinson 2008-06-16 21:15:59 UTC

Related to this is the possibility that users may have more complex transliterations stored in metadata, e.g. Japanese->English such as (Yoko Kanno, Kanno Yōko, 菅野よう子).

Comment 6 Michael Martin-Smucker 2008-12-27 17:12:00 UTC

Bug 499650 and Bug 561380 both mention using "sortartist" to change how artists are grouped, but sortartist may be able to fix this bug as well.  If there was a separate field for sortartist, that field could use only UTF characters, and search could look at that field instead of the regular artist field.  Then Bézèd'h would be grouped with "be" and you could find it when you search for "bez"

Comment 7 André Klapper 2009-02-02 23:22:08 UTC

Shouldn't this have been fixed by bug 458941?

For the record, also see bug 343505 about the same issue in Evolution's search (written in C).

Comment 8 Sandy Armstrong 2009-02-02 23:26:24 UTC

I am still experiencing this problem (which was reported *after* bug 458941 was fixed).

Comment 9 Gabriel Burt 2009-02-02 23:45:35 UTC

John Millikin is working on getting us custom SQLite functions, so we can hook in proper, fully Unicode-aware collation and case methods.

Comment 10 John Millikin 2009-02-04 20:13:24 UTC

It is probably easy to write a normalization function that will strip all diacritics from two strings and compare them. Perhaps it should be limited to Latin base characters, otherwise it might be broken on other languages with combining diacritics. For example, か and が would match.
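To illustrate the concern, here is a minimal Python sketch (Banshee itself is C#, and this is not the code from any attached patch). It contrasts stripping every combining mark with stripping marks only when they follow an ASCII Latin letter. The function names and the ASCII-only definition of "Latin" are assumptions made for the example.

```python
import unicodedata


def strip_all_marks(text: str) -> str:
    """Naive approach: decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_latin_marks(text: str) -> str:
    """Drop combining marks only when they follow an ASCII Latin letter."""
    out = []
    previous_latin = False
    for c in unicodedata.normalize("NFD", text):
        if unicodedata.combining(c):
            if previous_latin:
                continue  # accent on a Latin letter: drop it
            out.append(c)  # mark on a non-Latin base: keep it
        else:
            previous_latin = c.isascii() and c.isalpha()
            out.append(c)
    # Recompose so that untouched characters come back in their usual form.
    return unicodedata.normalize("NFC", "".join(out))


print(strip_all_marks("が") == "か")    # the naive version conflates the two kana
print(strip_latin_marks("が") == "が")  # the Latin-limited version does not
print(strip_latin_marks("Bézèd'h"))
```

The naive version turns が into か because the voiced sound mark is a combining character after decomposition, which is exactly the false match described above.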

> sortartist may be able to fix this bug as well.  If there was
> a separate field for sortartist, that field could use only UTF characters, and
> search could look at that field instead of the regular artist field.  Then
> Bézèd'h would be grouped with "be" and you could find it when you search for
> "bez"

This is an improper use of the sortnames, since
1) Ordering may be dependent on diacritics in many locales.
2) Sortnames and real names can differ significantly, so searching "the beatles" won't match the sortname "Beatles, The".

Comment 11 John Millikin 2009-02-04 21:22:04 UTC

Created attachment 127956
Proof of concept for a function to safely strip accents from Latin text

Strips diacritical marks from Latin letters, and some kinds of punctuation. To perform a search, check (strip(search_term) == strip(track.Title)).

It might be best to have variants of this function, depending on the user's current locale.

Comment 12 Jakub 'Livio' Rusinek 2009-02-04 21:33:32 UTC

I don't understand what input this function needs, so I decided to give you the letters with diacritics and their "clean ASCII" equivalents.

You'll probably need this.

For pl_PL(.UTF8):

ą → a
Ą → A

ć → c
Ć → C

ę → e
Ę → E

ł → l
Ł → L

ń → n
Ń → N

ó → o
Ó → O

ś → s
Ś → S

ź → z
Ź → Z

ż → z
Ż → Z
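As a rough Python sketch of why a list like this matters: most of these letters decompose under Unicode NFD into a base letter plus a combining mark, but ł and Ł have no decomposition, so they need an explicit special case. The table name `SPECIAL_CASES` and the function name are hypothetical, and this is not the code from the attached patch.

```python
import unicodedata

# Hypothetical special-case table: letters that NFD does not decompose.
SPECIAL_CASES = str.maketrans({"ł": "l", "Ł": "L"})


def to_ascii_polish(text: str) -> str:
    """Map Polish letters to their plain ASCII equivalents."""
    # Handle the special cases first, then decompose and drop combining marks.
    decomposed = unicodedata.normalize("NFD", text.translate(SPECIAL_CASES))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# NFD leaves the stroked l untouched, which is why the table is needed.
print(unicodedata.normalize("NFD", "ł") == "ł")
print(to_ascii_polish("Zażółć gęślą jaźń"))
```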

Comment 13 Benjamín Valero Espinosa 2009-02-04 22:11:25 UTC

Well, in Spanish accented vowels (á, é, í, ó, ú, ü) are sorted like the simple vowels, but 'ñ' is sorted after 'n'.

That is simply an example of how difficult such a function can be if we have to take every Latin-script language into account. Isn't there an existing method that already does that?

Comment 14 John Millikin 2009-02-04 22:18:43 UTC

Sorting of non-Latin characters will probably be fixed as a side effect of bug 499650, using Mono's collation algorithms. Such algorithms handle sorting based on the current locale, and should provide proper handling of accented and non-Latin characters.

Ignoring accents while searching is a separate issue, and Mono doesn't have any built-in support for it. If a list of conflated characters can be compiled, searching can be modified to ignore them.

Comment 15 Jakub 'Livio' Rusinek 2009-02-04 22:31:17 UTC

Oh, I talked about searching in Polish but said nothing about sorting.

All these letters with diacritics in Polish sort after their "clean ASCII" equivalents.

So the Polish alphabet (and sorting schema) looks like this: a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż.
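A small Python sketch of that ordering, assuming a hand-written alphabet table rather than a real locale-aware collator (a production implementation would more likely rely on the platform's collation support). The names `POLISH_ALPHABET` and `polish_sort_key` are made up for the example, and placing unknown characters after all known letters is a simplification.

```python
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {letter: index for index, letter in enumerate(POLISH_ALPHABET)}


def polish_sort_key(text: str) -> list[int]:
    """Rank each letter by its position in the Polish alphabet."""
    # Characters outside the table sort after every known letter, by code point.
    return [RANK.get(c, len(POLISH_ALPHABET) + ord(c)) for c in text.lower()]


words = ["żaba", "zebra", "łódź", "lody", "ćma", "cena"]
print(sorted(words))                       # plain code-point order
print(sorted(words, key=polish_sort_key))  # alphabet order from the list above
```

Plain code-point sorting pushes every accented initial to the end of the list, while the keyed sort places each one directly after its unaccented counterpart.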

Comment 16 John Millikin 2009-02-07 00:25:16 UTC

Created attachment 128130
Ignore accents when searching (v1)

Let searches on the text-based fields be calculated based on a normalized string, free of accents and some types of punctuation. This patch completely screws up the current sorting system, so bug 499650 has to be fixed first.

Comment 17 John Millikin 2009-02-07 00:40:50 UTC

Created attachment 128132
Ignore accents when searching (v1)

Forgot to add FuzzyStringQueryValue.cs to the diff

Comment 18 Gabriel Burt 2009-02-07 00:58:17 UTC

John, thanks for the patch.  Comments:

1. please follow HACKING with your if/foreach statements.

2. Instead of "if (ignored_special_cases == null) { BuildSpecialCases (); }" you can just make BuildSpecialCases the static constructor.

3. We have the normalized/stripped values for artist/album/title cached (in *Lowered columns) but for Genre, Composer etc we need to call LowerAndStripIgnored (which I'd rather we called Normalize) in the actual SQL, no?

4. Instead of bumping the MetadataVersion, let's just add a migration and call "UPDATE CoreTracks SET TitleLowered = BANSHEE_NORMALIZE(Title)", no?

5. I'd rather not add the FuzzyStringQueryValue class - instead we should just change the existing ToLower/LOWER code in src/Libraries/Hyena/Hyena.Query/QueryField.cs's ToSql method.  This *will* normalize all strings, but...maybe we can add a Normalize property to StringQueryValue that defaults to false to be able to turn it off.  Are there any strings for which we don't want normalization done?

Comment 19 Gabriel Burt 2009-02-07 01:01:02 UTC

Ok, made a couple typos/thinkos in that comment.

#4 we need to call it for all *Lowered columns, of course, and it should be a db migration.

#5 I mean the Normalize value would default to true
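The migration idea from #4 can be sketched in Python with the standard `sqlite3` module (Banshee does this from C#, so this only illustrates the mechanism). The two-column `CoreTracks` schema is a simplification, and the `search_key` body is a stand-in that lower-cases and drops combining marks, not the logic from the patch.

```python
import sqlite3
import unicodedata


def search_key(text):
    """Stand-in normalizer: lower-case and drop combining marks."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("BANSHEE_NORMALIZE", 1, search_key)

conn.execute("CREATE TABLE CoreTracks (Title TEXT, TitleLowered TEXT)")
conn.executemany(
    "INSERT INTO CoreTracks (Title, TitleLowered) VALUES (?, ?)",
    [("Bézèd'h", "bézèd'h"), ("Für Elise", "für elise")],
)

# The migration step: refresh the cached column in a single statement.
conn.execute("UPDATE CoreTracks SET TitleLowered = BANSHEE_NORMALIZE(Title)")

rows = conn.execute("SELECT TitleLowered FROM CoreTracks ORDER BY rowid").fetchall()
print(rows)
```

Running the update once as a migration means existing libraries get accent-free cached values without re-reading metadata from every file.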

Comment 20 Gabriel Burt 2009-02-07 01:06:50 UTC

has_ascii = (c >= 'a' && c <= 'z') should be |=, no?

Can you write some unit tests for LowerAndStripIgnored in src/Libraries/Hyena/Hyena/Tests/StringUtilTests.cs please?

Comment 21 Sandy Armstrong 2009-02-07 01:07:28 UTC

Woo hoo! I haven't tested this thoroughly, but it solves the initial problem I described.  Thanks John!

I'll continue using this and let you know if I run into any problems.

Somehow I missed your comment about screwing up sorting.  I'll keep an eye out for that, too...

Comment 22 John Millikin 2009-02-07 01:41:04 UTC

> We have the normalized/stripped values for artist/album/title cached (in
> *Lowered columns) but for Genre, Composer etc we need to call
> LowerAndStripIgnored (which I'd rather we called Normalize) on in the actual
> SQL, no?

Yes, that'll need to become a custom SQLite function. Unless we want to cache those fields also.

I don't want to call the function `Normalize`, because what it's doing isn't normalization -- it's irreversibly mangling the text to remove certain characters. Maybe `SearchKey`?

> I'd rather not add the FuzzyStringQueryValue class - instead we should just
> change the existing ToLower/LOWER code in
> src/Libraries/Hyena/Hyena.Query/QueryField.cs's ToSql method.  This *will*
> normalize all strings, but...maybe we can add a Normalize property to
> StringQueryValue that default to false to be able to turn it off.  Are there
> any strings for which we don't want normalization done?

Probably shouldn't strip anything from MIME-type or file location. Could change `StringQueryValue` into an abstract class with two subclasses, `FuzzyStringQueryValue` (default) and `StrictStringQueryValue`, then set the location and MIME-type fields to use strict query values.
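A minimal Python sketch of that class split, for illustration only (the real classes would be C# in Hyena.Query). The method name `comparison_value` is hypothetical, and the fuzzy variant uses a simple lower-case-and-strip stand-in rather than the patch's logic.

```python
import unicodedata
from abc import ABC, abstractmethod


class StringQueryValue(ABC):
    """Base class: holds the raw value typed by the user."""

    def __init__(self, value: str) -> None:
        self.value = value

    @abstractmethod
    def comparison_value(self) -> str:
        """Return the form of the value that is compared against the database."""


class FuzzyStringQueryValue(StringQueryValue):
    """Default: compare on a lower-cased, accent-free key."""

    def comparison_value(self) -> str:
        decomposed = unicodedata.normalize("NFD", self.value.lower())
        return "".join(c for c in decomposed if not unicodedata.combining(c))


class StrictStringQueryValue(StringQueryValue):
    """For MIME types and file locations: compare the value unchanged."""

    def comparison_value(self) -> str:
        return self.value


print(FuzzyStringQueryValue("Café").comparison_value())
print(StrictStringQueryValue("file:///music/Café.ogg").comparison_value())
```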

> has_ascii = (c >= 'a' && c <= 'z') should be |=, no?

No, straight assignment is correct. Combining characters should be stripped only if *the previous* character was Latin. Perhaps `has_ascii` is a poor name, how's `previous_latin`?
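The difference between the two operators can be shown with a short Python sketch. It is a reconstruction of the idea under stated assumptions (input decomposed with NFD, "Latin" meaning a to z after lower-casing), not the code from the patch; the `sticky` parameter exists only to switch between the two behaviours.

```python
import unicodedata


def strip_marks(text: str, sticky: bool) -> str:
    """Strip combining marks that follow a Latin letter.

    sticky=False mimics straight assignment, sticky=True mimics `|=`.
    """
    out = []
    previous_latin = False
    for c in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(c):
            if not previous_latin:
                out.append(c)  # keep marks on non-Latin base characters
            continue
        is_latin = "a" <= c <= "z"
        if sticky:
            previous_latin |= is_latin  # stays True after the first Latin letter
        else:
            previous_latin = is_latin   # reflects only the previous character
        out.append(c)
    return unicodedata.normalize("NFC", "".join(out))


print(strip_marks("aが", sticky=False) == "aが")  # kana mark preserved
print(strip_marks("aが", sticky=True) == "aか")   # kana mark wrongly stripped
```

With `|=`, a single Latin letter anywhere earlier in the string is enough to strip marks from every later character, which reintroduces the か/が conflation.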

Comment 23 John Millikin 2009-02-07 05:22:15 UTC

Created attachment 128135
Ignore accents when searching (v2)

* Name the accent-stripping function `SearchKey`.
* Add the custom SQLite function `HYENA_SEARCH_KEY`, and use it for searching in string columns.
* Add a new query value type, `ExactStringQueryValue`, used for URIs and MIME-types.
* Use a DB migration, rather than metadata version bump, to refresh search keys.

Comment 24 Gabriel Burt 2009-02-08 22:01:54 UTC

John, this is nearly perfect - great job!  And thanks for explaining the logic in SearchKey.

Two things:

1) In the change to QueryField.cs, the AppendFormat that you modify used to produce SQL like this:

  (Foo = [value] AND/OR lower(Foo) = [value.ToLower])

but it should now produce just

  HYENA_SEARCH_KEY(Foo) = [SearchKey(value)]

and the AppendFormat right before this one (for 'pre-lowered' columns) should produce

  Foo = [SearchKey(value)]

where things in brackets are done in managed code once and String.Format'd in.

2) Unit tests please :)
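The two query shapes from point 1 can be sketched in Python with the standard `sqlite3` module. This is an illustration under assumptions, not the QueryField.cs code: the table layout is simplified, `search_key` is a stand-in normalizer, and values are bound as parameters instead of being formatted into the SQL string.

```python
import sqlite3
import unicodedata


def search_key(text):
    """Stand-in for SearchKey: lower-case and drop combining marks."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
conn.create_function("HYENA_SEARCH_KEY", 1, search_key)
conn.execute("CREATE TABLE CoreTracks (Title TEXT, TitleLowered TEXT, Genre TEXT)")
conn.execute(
    "INSERT INTO CoreTracks VALUES (?, ?, ?)",
    ("Bézèd'h", search_key("Bézèd'h"), "Rock Français"),
)

# Uncached column: the database computes the key for each row,
# while the search term is keyed once on the application side.
by_genre = conn.execute(
    "SELECT Title FROM CoreTracks WHERE HYENA_SEARCH_KEY(Genre) = ?",
    (search_key("rock francais"),),
).fetchall()

# Pre-keyed column: only the search term needs keying.
by_title = conn.execute(
    "SELECT Title FROM CoreTracks WHERE TitleLowered = ?",
    (search_key("BEZED'H"),),
).fetchall()

print(by_genre, by_title)
```

Keying the search term once in application code and comparing against a cached column avoids calling the custom function for every row on the most frequently searched fields.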

Comment 25 John Millikin 2009-02-10 21:38:05 UTC

Created attachment 128416
Ignore accents when searching (v3)

Adds unit tests, and cleans up `QueryField.ToSql()` a bit. Since `StringQueryValue.ToSql()` now returns lower-case values, much of the LOWER-related code can be removed.

Comment 26 Jakub 'Livio' Rusinek 2009-02-10 21:40:38 UTC

BTW, [sorry for bugspam!], what about transliteration? For example, when you have Banshee in Polish, your keyboard is set to Polish, and you're searching for a Russian item.

It would be very nice.

Comment 27 John Millikin 2009-02-10 22:00:17 UTC

> what about transliteration? For example when you have Banshee in Polish and
> keyboard's set to Polish and you're searching a Russian item.

Here be dragons.

Automatic transliteration is difficult, even between such similar scripts as Latin and Cyrillic. Take the name Чайковский, for example, which can be reasonably transliterated to any of:

Chaikovski
Chaikovskii
Ciaikovsky
Csajkovszkij
Tchaikovski
Tchaikovsky
Tchaikowsky
Tsaikovsky
Tschaikowsky

depending on local spelling conventions, language, and which transliteration system is used.

Comment 28 Gabriel Burt 2009-02-11 02:36:02 UTC

Alright, committed!  Thanks John for your perseverance!

I added a few more unit tests to the patch.

Comment 29 Alexander Kojevnikov 2009-05-20 13:16:31 UTC

*** Bug 583328 has been marked as a duplicate of this bug. ***

Comment 30 Alexander Kojevnikov 2009-05-21 02:52:44 UTC

*** Bug 583379 has been marked as a duplicate of this bug. ***