GNOME Bugzilla – Bug 528498
Searching and sorting should ignore accents/use UTF equivalents
Last modified: 2009-05-21 02:52:44 UTC
Searching takes accented letters into account, but I don't know how to type them and shouldn't have to learn.
Steps to reproduce:
1. Download awesome Bézèd'h album (free) here: http://www.jamendo.com/en/album/135
2. Import said album into Banshee
3. Attempt to find your awesome new album by typing "bez" in the search field
Expected results: Awesome new Bézèd'h album appears, for your listening pleasure.
Actual results: No music appears. Baby Jesus cries at lack of French Celtic rock.
Let's just ignore those pesky accents, m'kay?
*** Bug 533871 has been marked as a duplicate of this bug. ***
The solution could be that the 'lowername' in the table of tracks in the database contained the name lowered and without accents.
*** Bug 534915 has been marked as a duplicate of this bug. ***
*** Bug 535216 has been marked as a duplicate of this bug. ***
Related to this is the possibility that users may have more complex transliterations stored in metadata, e.g. Japanese->English such as (Yoko Kanno, Kanno Yōko, 菅野よう子).
Bug 499650 and Bug 561380 both mention using "sortartist" to change how artists are grouped, but sortartist may be able to fix this bug as well. If there was a separate field for sortartist, that field could use only UTF characters, and search could look at that field instead of the regular artist field. Then Bézèd'h would be grouped with "be" and you could find it when you search for "bez"
Shouldn't this have been fixed by bug 458941? For the record, also see bug 343505 about the same issue in Evolution's search (written in C).
I am still experiencing this problem (which was reported *after* bug 458941 was fixed).
John Millikin is working on getting us custom SQLite functions, so we can hook in proper, fully Unicode-aware collation and case methods.
It is probably easy to write a normalization function that will strip all diacritics from two strings and compare them. Perhaps it should be limited to Latin base characters, otherwise it might be broken on other languages with combining diacritics. For example, か and が would match.
> sortartist may be able to fix this bug as well. If there was
> a separate field for sortartist, that field could use only UTF characters, and
> search could look at that field instead of the regular artist field. Then
> Bézèd'h would be grouped with "be" and you could find it when you search for
> "bez"
This is an improper use of the sortnames, since:
1) Ordering may be dependent on diacritics in many locales.
2) Sortnames and real names can differ significantly, so searching "the beatles" won't match the sortname "Beatles, The".
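A minimal Python sketch of the concern above, using only the standard `unicodedata` module. The function name `strip_all_marks` is hypothetical and this is not the code from any attachment on this bug; it only illustrates why stripping every combining mark is too aggressive outside Latin text.

```python
import unicodedata


def strip_all_marks(text: str) -> str:
    """Naive approach: decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Works as hoped for Latin text.
latin = strip_all_marks("Bézèd'h")

# But が decomposes to か plus a combining voiced-sound mark, so the two
# distinct kana collapse into one after naive stripping.
ga = strip_all_marks("が")
ka = strip_all_marks("か")
```

Here `latin` is `"Bezed'h"`, while `ga` and `ka` both come out as `か`, which is the false match described above.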
Created attachment 127956
Proof of concept for a function to safely strip accents from Latin text
Strips diacritical marks from Latin letters, and some kinds of punctuation. To perform a search, check (strip(search_term) == strip(track.Title)). It might be best to have variants of this function, depending on the user's current locale.
I don't understand what this function needs as input, so I decided to give you the letters with diacritics and their "clean ASCII" equivalents. You'll probably need this. For pl_PL(.UTF8):
ą → a, Ą → A
ć → c, Ć → C
ę → e, Ę → E
ł → l, Ł → L
ń → n, Ń → N
ó → o, Ó → O
ś → s, Ś → S
ź → z, Ź → Z
ż → z, Ż → Z
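One detail worth noting about this list: most of these letters decompose into a base letter plus a combining mark under Unicode NFD, but ł and Ł have no canonical decomposition, so a mark-stripping function needs an explicit entry for them. A rough Python sketch (the names `SPECIAL_CASES` and `to_ascii_polish` are hypothetical, not taken from the patch):

```python
import unicodedata

# Letters that NFD leaves untouched and therefore need an explicit mapping.
SPECIAL_CASES = {"ł": "l", "Ł": "L"}


def to_ascii_polish(text: str) -> str:
    """Map Polish letters to their unaccented equivalents."""
    # Handle the non-decomposable letters first.
    replaced = "".join(SPECIAL_CASES.get(c, c) for c in text)
    # The remaining letters split into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


lower = to_ascii_polish("ąćęłńóśźż")
upper = to_ascii_polish("ĄĆĘŁŃÓŚŹŻ")
```

With the table above, `lower` is `"acelnoszz"` and `upper` is `"ACELNOSZZ"`.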
Well, in Spanish the accented vowels (á, é, í, ó, ú, ü) are sorted like the simple vowels, but 'ñ' is sorted after 'n'. That is just one example of how difficult such a function can be if we have to take every Latin-script language into account. Isn't there an already-implemented method that does that?
Sorting of non-Latin characters will probably be fixed as a side effect of bug 499650, using Mono's collation algorithms. Such algorithms handle sorting based on the current locale, and should provide proper handling of accented and non-Latin characters. Ignoring accents while searching is a separate issue, and Mono doesn't have any built-in support for it. If a list of conflated characters can be compiled, searching can be modified to ignore them.
Oh, I wrote about searching in Polish but said nothing about sorting. All these letters with diacritics in Polish come after their "clean ASCII" equivalents. So the Polish alphabet (and sort order) looks like this: a, ą, b, c, ć, d, e, ę, f, g, h, i, j, k, l, ł, m, n, ń, o, ó, p, r, s, ś, t, u, w, y, z, ź, ż.
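To make the ordering above concrete, here is a small Python sketch that sorts by position in that alphabet. The names `POLISH_ALPHABET` and `polish_sort_key` and the sample words are illustrative only; a real application would more likely rely on locale-aware collation than on a hand-written table.

```python
# The alphabet exactly as listed in the comment above.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {letter: index for index, letter in enumerate(POLISH_ALPHABET)}


def polish_sort_key(word: str):
    """Rank each letter by its position in the Polish alphabet."""
    # Characters outside the alphabet sort after all listed letters,
    # falling back to code point order among themselves.
    return [(RANK.get(c, len(RANK)), c) for c in word.lower()]


words = ["żaba", "zebra", "ćma", "cel", "łan", "las"]
ordered = sorted(words, key=polish_sort_key)
by_code_point = sorted(words)
```

`ordered` places ćma directly after cel and łan directly after las, whereas the plain code point sort in `by_code_point` pushes every accented initial to the end of the list.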
Created attachment 128130
Ignore accents when searching (v1)
Let searches on the text-based fields be calculated based on a normalized string, free of accents and some types of punctuation. This patch completely screws up the current sorting system, so bug 499650 has to be fixed first.
Created attachment 128132
Ignore accents when searching (v1)
Forgot to add FuzzyStringQueryValue.cs to the diff
John, thanks for the patch. Comments:
1. Please follow HACKING with your if/foreach statements.
2. Instead of "if (ignored_special_cases == null) { BuildSpecialCases (); }" you can just make BuildSpecialCases the static constructor.
3. We have the normalized/stripped values for artist/album/title cached (in *Lowered columns) but for Genre, Composer etc we need to call LowerAndStripIgnored (which I'd rather we called Normalize) in the actual SQL, no?
4. Instead of bumping the MetadataVersion, let's just add a migration and call "UPDATE CoreTracks SET TitleLowered = BANSHEE_NORMALIZE(Title)", no?
5. I'd rather not add the FuzzyStringQueryValue class - instead we should just change the existing ToLower/LOWER code in src/Libraries/Hyena/Hyena.Query/QueryField.cs's ToSql method. This *will* normalize all strings, but...maybe we can add a Normalize property to StringQueryValue that defaults to false to be able to turn it off. Are there any strings for which we don't want normalization done?
Ok, made a couple of typos/thinkos in that comment.
#4: we need to call it for all *Lowered columns, of course, and it should be a db migration.
#5: I meant the Normalize value would default to true.
has_ascii = (c >= 'a' && c <= 'z') should be |=, no? Can you write some unit tests for LowerAndStripIgnored in src/Libraries/Hyena/Hyena/Tests/StringUtilTests.cs please?
Woo hoo! I haven't tested this thoroughly, but it solves the initial problem I described. Thanks John! I'll continue using this and let you know if I run into any problems. Somehow I missed your comment about screwing up sorting. I'll keep an eye out for that, too...
> We have the normalized/stripped values for artist/album/title cached (in
> *Lowered columns) but for Genre, Composer etc we need to call
> LowerAndStripIgnored (which I'd rather we called Normalize) in the actual
> SQL, no?
Yes, that'll need to become a custom SQLite function. Unless we want to cache those fields also. I don't want to call the function `Normalize`, because what it's doing isn't normalization -- it's irreversibly mangling the text to remove certain characters. Maybe `SearchKey`?
> I'd rather not add the FuzzyStringQueryValue class - instead we should just
> change the existing ToLower/LOWER code in
> src/Libraries/Hyena/Hyena.Query/QueryField.cs's ToSql method. This *will*
> normalize all strings, but...maybe we can add a Normalize property to
> StringQueryValue that defaults to false to be able to turn it off. Are there
> any strings for which we don't want normalization done?
Probably shouldn't strip anything from MIME-type or file location. Could change `StringQueryValue` into an abstract class with two subclasses, `FuzzyStringQueryValue` (default) and `StrictStringQueryValue`, then set the location and MIME-type fields to use strict query values.
> has_ascii = (c >= 'a' && c <= 'z') should be |=, no?
No, straight assignment is correct. Combining characters should be stripped only if *the previous* character was Latin. Perhaps `has_ascii` is a poor name, how's `previous_latin`?
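The "previous character" rule can be sketched in Python as follows. This is an illustration of the idea only, not a port of the C# patch: the function body is an assumption, it skips the punctuation handling mentioned earlier, and it leaves the flag unchanged while consuming marks so that stacked marks on one Latin letter are all removed (the patch may behave differently there).

```python
import unicodedata


def search_key(text: str) -> str:
    """Lower-case the text and strip marks that follow a Latin letter."""
    out = []
    previous_latin = False
    for c in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(c):
            if not previous_latin:
                out.append(c)  # keep marks attached to non-Latin bases
            continue
        # Straight assignment: the flag describes only the most recent
        # base character, so a Latin letter earlier in the string does
        # not cause later kana marks to be stripped.
        previous_latin = "a" <= c <= "z"
        out.append(c)
    # Recompose whatever marks were kept.
    return unicodedata.normalize("NFC", "".join(out))


latin_key = search_key("Bézèd'h")
kana_key = search_key("が")
mixed_key = search_key("aが")
```

`latin_key` is `"bezed'h"`, while `kana_key` stays `が` and `mixed_key` stays `aが`. With `|=` instead of assignment, the flag would stay set after the `a`, and the last case would lose its voicing mark.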
Created attachment 128135
Ignore accents when searching (v2)
* Name the accent-stripping function `SearchKey`.
* Add the custom SQLite function `HYENA_SEARCH_KEY`, and use it for searching in string columns.
* Add a new query value type, `ExactStringQueryValue`, used for URIs and MIME-types.
* Use a DB migration, rather than metadata version bump, to refresh search keys.
John, this is nearly perfect - great job! And thanks for explaining the logic in SearchKey. Two things:
1) In the change to QueryField.cs, the AppendFormat that you modify used to produce SQL like this:
  (Foo = [value] AND/OR lower(Foo) = [value.ToLower])
but it should now produce just
  HYENA_SEARCH_KEY(Foo) = [SearchKey(value)]
and the AppendFormat right before this one (for 'pre-lowered' columns) should produce
  Foo = [SearchKey(value)]
where things in brackets are done in managed code once and String.Format'd in.
2) Unit tests please :)
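For readers unfamiliar with custom SQLite functions, the two query shapes above can be tried out with Python's built-in `sqlite3` module. Everything here is a stand-in: `search_key` is a simplified key that strips every combining mark, the table layout is assumed (only `CoreTracks`, `Title` and `TitleLowered` appear in this thread; the `Genre` column is hypothetical), and values are bound as parameters rather than formatted into the SQL string.

```python
import sqlite3
import unicodedata


def search_key(text):
    """Simplified stand-in for SearchKey: lower-case and strip marks."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name used in the patch.
conn.create_function("HYENA_SEARCH_KEY", 1, search_key)

conn.execute("CREATE TABLE CoreTracks (Title TEXT, TitleLowered TEXT, Genre TEXT)")
conn.execute(
    "INSERT INTO CoreTracks (Title, Genre) VALUES (?, ?)",
    ("Bézèd'h", "Céltic Rock"),
)
# Migration-style refresh of the cached search key column.
conn.execute("UPDATE CoreTracks SET TitleLowered = HYENA_SEARCH_KEY(Title)")

# Pre-computed column: compare the cached key with a key built in app code.
cached = conn.execute(
    "SELECT Title FROM CoreTracks WHERE TitleLowered = ?",
    (search_key("BÉZÈD'H"),),
).fetchall()

# Column without a cache: apply the custom function inside the SQL.
uncached = conn.execute(
    "SELECT Title FROM CoreTracks WHERE HYENA_SEARCH_KEY(Genre) = ?",
    (search_key("celtic rock"),),
).fetchall()
```

Both `cached` and `uncached` return the Bézèd'h row. The cached form avoids calling the function for every row at query time, which is presumably why the *Lowered columns exist.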
Created attachment 128416
Ignore accents when searching (v3)
Adds unit tests, and cleans up `QueryField.ToSql()` a bit. Since `StringQueryValue.ToSql()` now returns lower-case values, much of the LOWER-related code can be removed.
BTW, [sorry for bugspam!], what about transliteration? For example, when you have Banshee in Polish, the keyboard is set to Polish, and you're searching for a Russian item. It would be very nice.
> what about transliteration? For example when you have Banshee in Polish and
> keyboard's set to Polish and you're searching a Russian item.
Here be dragons. Automatic transliteration is difficult, even between such similar scripts as Latin and Cyrillic. Take the name Чайковский, for example, which can be reasonably transliterated to any of:
  Chaikovski
  Chaikovskii
  Ciaikovsky
  Csajkovszkij
  Tchaikovski
  Tchaikovsky
  Tchaikowsky
  Tsaikovsky
  Tschaikowsky
depending on local spelling conventions, language, and which transliteration system is used.
Alright, committed! Thanks John for your perseverance! I added a few more unit tests to the patch.
*** Bug 583328 has been marked as a duplicate of this bug. ***
*** Bug 583379 has been marked as a duplicate of this bug. ***