GNOME Bugzilla – Bug 709753
Add helpers for string matches when using GtkSearchBar-like widget
Last modified: 2013-10-15 13:16:10 UTC
This is a follow-up to bug #621393. When searching for elements matching a string, it is often not as trivial as a g_strcmp(). For example, in Empathy, when searching for contacts, we have code that matches contact names regardless of case, accented letters, and word boundaries. A few examples:

- when searching for "cl", the string "Xavier Claessens" should match, because "cl" is the beginning of a word.
- when searching for "ier", the string "Xavier Claessens" should NOT match, because it is not at the beginning of a word.
- when searching for "fred", the string "Frédéric" should match.

We have an algorithm for that in the EmpathyLiveSearch widget, inherited from Maemo N900 code. I think it would be a useful helper to have in GLib somewhere, or maybe in GtkSearchBar?
I agree that we want this API, because we will need a similar API for g_desktop_app_info_search(). There, we currently do the following:

- tokenise (by a simple algorithm: a token is 1 or more Unicode alphanumeric characters separated by 1 or more Unicode non-alphanumeric characters)
- UTF-8 normalise
- casefold
- prefix match, as you describe here

We do not do any sort of special handling of accents, however. I've heard from some French speakers before that this is not desired. However, I've also heard it mentioned a couple of times recently that this _is_ desired. In any case, _if_ we do it, I'd insist that we only do it in one direction. Example: the user typing "fred" matches "Frédéric", but the user typing "fréd" does NOT match "Frederick".
Created attachment 256848 [details] [review] Add search utilities functions
Created attachment 256851 [details] [review] Add search utilities

This is code copy/pasted from the Empathy source tree, with just the function names changed. It could use a bit more polishing, but it gives the idea. I'm a French speaker, and I think it is much easier to search without caring about accents. However, when I worked on that code for Nokia, we had a few issues, like (IIRC) German users expecting "oe" to match "ö" and not simply "o", which is a much more complicated decomposition, because it depends on the language, and back then the Unicode tables were not even correct about it (I don't remember the details). I think Empathy's solution is easy enough, and nobody has complained about it AFAIK.
Comment on attachment 256851 [details] [review] Add search utilities

Hm, my "git bz" did strange things... :/
(In reply to comment #1)
> In any case, _if_ we do it I'd insist that we only do it in one direction.
> Example: user typing "fred" matches "Frédéric" but user typing "fréd" does NOT
> match "Frederick".

My proposed patch matches in both cases, but that could be changed. Not sure it really matters.
Comment on attachment 256848 [details] [review] Add search utilities functions

>+ ch = g_unichar_tolower (ch);
>+ g_unichar_fully_decompose (ch, FALSE, &retval, 1);

You can't just use g_unichar_tolower(), because then "straße" won't match "STRASSE". And you can't use g_unichar_fully_decompose() that way either, because then "sœur" won't match "soeur". I think the only fully-correct way to do what you want with the APIs GLib provides is to g_utf8_normalize(G_NORMALIZE_DEFAULT) and g_utf8_casefold() the whole string, and then compare the resulting characters.

>+ /* If we want to go to next word, ignore alpha-num chars */
>+ if (next_word && g_unichar_isalnum (sc))

>+ /* Ignore word separators */
>+ if (!g_unichar_isalnum (sc))

This won't work well in languages that don't use spaces between words. I'm not sure how to fix that, though.
Here's an algorithm I've been using in the desktop file work:

static void
dfi_text_index_add_folded (GPtrArray   *array,
                           const gchar *start,
                           const gchar *end)
{
  gchar *normal;

  normal = g_utf8_normalize (start, end - start, G_NORMALIZE_ALL_COMPOSE);

  /* TODO: Invent time machine. Converse with Mustafa Ataturk... */
  if (strstr (normal, "ı") || strstr (normal, "İ"))
    {
      gchar *s = normal;
      GString *tmp;

      tmp = g_string_new (NULL);

      while (*s)
        {
          gchar *i, *I, *e;

          i = strstr (s, "ı");
          I = strstr (s, "İ");

          if (!i && !I)
            break;
          else if (i && !I)
            e = i;
          else if (I && !i)
            e = I;
          else if (i < I)
            e = i;
          else
            e = I;

          g_string_append_len (tmp, s, e - s);
          g_string_append_c (tmp, 'i');
          s = g_utf8_next_char (e);
        }

      g_string_append (tmp, s);

      g_free (normal);
      normal = g_string_free (tmp, FALSE);
    }

  g_ptr_array_add (array, g_utf8_casefold (normal, -1));
  g_free (normal);
}

static gchar **
dfi_text_index_split_words (const gchar *value)
{
  const gchar *start = NULL;
  GPtrArray *result;
  const gchar *s;

  result = g_ptr_array_new ();

  for (s = value; *s; s = g_utf8_next_char (s))
    {
      gunichar c = g_utf8_get_char (s);

      if (start == NULL)
        {
          if (g_unichar_isalnum (c))
            start = s;
        }
      else
        {
          if (!g_unichar_isalnum (c))
            {
              dfi_text_index_add_folded (result, start, s);
              start = NULL;
            }
        }
    }

  if (start)
    dfi_text_index_add_folded (result, start, s);

  g_ptr_array_add (result, NULL);

  return (gchar **) g_ptr_array_free (result, FALSE);
}

This should cover most cases. We should consider that the corpus over which we are searching quite possibly comprises many languages. Consider (for example) the case of searching a list of applications while running in a non-English locale. It's quite likely that some of those applications have names that have not been translated to your locale and therefore show up in a different language (either English or another fallback, as per the LANGUAGE environment variable).
In light of this mixed input, we can't really make language-specific assumptions about how to do matching (unless we track the source language of each string, which I don't think is an ability afforded to us by GKeyFile/GAppInfo, much less gettext, and even less still for user input). This is my argument for wanting to avoid special handling of accented characters. For the French, treating 'é' as 'e' is easy enough, but if you're German, you would probably expect 'ü' to be treated as 'ue' rather than 'u', and this is exactly the sort of thing that we cannot reasonably do because of the above; 'ü' -> 'ue' may be appropriate for German, but perhaps not for other languages. Even Turkish causes trouble (as you can see in the attached code), but this is an extremely exceptional situation and I think it deserves to be special-cased (and again, we could only handle this properly if we knew the source language, which we probably can't know in many cases).
Created attachment 257285 [details] [review] Add g_str_is_ascii() Add a function for checking if a string is pure ASCII.
Created attachment 257286 [details] [review] Add simple string "search" function Add a pair of functions to make it easier to do simple string matching. This will be useful for use with things like GtkSearchBar and will also be the basis of the searching done by the (soon to appear) g_desktop_app_info_search()
Created attachment 257287 [details] [review] Add testcase for search utilities functions
Review of attachment 257285 [details] [review]:

Doesn't hurt.

::: glib/gstrfuncs.c
@@ +1535,3 @@
+ * contains no bytes with the high bit set.
+ *
+ * Returns: %TRUE if @string is ascii

The Since: tag is missing.
Review of attachment 257286 [details] [review]:

::: glib/gstrfuncs.c
@@ +2943,3 @@
+ * @translit_locale: (allow none): the language code (like 'de' or
+ * 'en_GB') from which @string originates
+ * @ascii_alternates: (out): a return location for ascii alternates

Does this need a transfer annotation, and also allow-none? Also, let's be consistent and capitalize ASCII throughout.

@@ +2948,3 @@
+ *
+ * A token is a non-empty sequence of (Unicode) alphanumeric/"mark"
+ * characters separated in the source string by non-alphanumeric/"mark"

This is unclear to me - are marks supposed to be part of the tokens, or to separate tokens? It would probably be better to spell out '/' as 'or' here, and not to quote 'mark'.

@@ +2973,3 @@
+ if (ascii_alternates && g_str_is_ascii (string))
+ {
+ *ascii_alternates = g_new0 (gchar *, 0 + 1);

Is this useful? Also, the documentation does not really make it clear that ascii_alternates will _always_ return an allocated string array.

@@ +2989,3 @@
+
+ for (i = 0; i < n; i++)
+ if (!g_str_is_ascii (result[i]))

I prefer to always have {} around the loop body, even if it's just an if, but not a big deal.

@@ +3008,3 @@
+ }
+
+ (*ascii_alternates)[j] = NULL;

It wasn't clear to me at all from the documentation that ascii_alternates would be littered with NULLs.

@@ +3030,3 @@
+ * A hit occurs when each folded token in @search_term is a prefix of a
+ * folded token from @potential_hit.
+

A few examples of search_terms and potential_hits that would and would not generate hits would be very helpful here.
Review of attachment 257287 [details] [review]:

Sure.
Attachment 257285 [details] pushed as 4c51080 - Add g_str_is_ascii()
Attachment 257286 [details] pushed as 38dfce5 - Add simple string "search" function
Attachment 257287 [details] pushed as 0420295 - Add testcase for search utilities functions
Bug 710142 aims to add locale-dependent transliteration.