GNOME Bugzilla – Bug 303239
"False positive" searching through document
Last modified: 2011-06-01 00:24:11 UTC
Please describe the problem:
Searching for the 'º' character in a UTF-8 document gave false positives: 'o', 'ó', etc. were reported as matches for 'º'. If I enable case-sensitive matching, only the desired character is matched.

Steps to reproduce:
1. Open a document. I used: http://cvs.sourceforge.net/viewcvs.py/*checkout*/inkscape/inkscape/po/ca.po?rev=1.87
2. Search for the character 'º' with case-sensitive matching disabled.

Actual results:
Characters like 'o' and 'ó' are also matched.

Expected results:
I would expect 'º' to be the only match. I wouldn't expect 'o', 'ó', etc. to be the lowercase or uppercase form of 'º' either.

Does this happen every time?
Yes

Other information:
Sorry for the delay... I'm moving this bug to gtk+ (since search is implemented there). However I suspect that the behavior is intentional and follows some international standard.
I would think this should move to gtksourceview first, since you are using the gtksourceview implementation of searching. Feel free to move back to gtk+ if you have a testcase showing the same behaviour using straight gtk_text_iter_search_forward()...
Mattias, you forgot to reassign to owner. Confirming that this still occurs in 2.10.
I confirm this bug using gedit. To investigate the problem I have performed some tests using the following code:

    casefold = g_utf8_casefold ("o", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of o : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("O", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of O : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("º", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of º : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("ò", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of ò : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("ó", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of ó : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

And I got:

    Case fold of o : o - Normalized ALL: o - NFD: o
    Case fold of O : o - Normalized ALL: o - NFD: o
    Case fold of º : º - Normalized ALL: o - NFD: º
    Case fold of ò : ò - Normalized ALL: ò - NFD: ò
    Case fold of ó : ó - Normalized ALL: ó - NFD: ó

This shows us two problems:

1. We are using G_NORMALIZE_ALL, which maps 'º' to 'o'. We should use NFD as described in http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (page 91). This is easy to fix.
2. Searching for "o" matches "ò", since we probably make only a partial comparison. This is also confirmed by the fact that searching for "ò" does not match "o".

Reading the Unicode document, I have seen that there is a possible optimization we can introduce in our code:

"The invocations of normalization before folding in the above definitions are to catch very infrequent edge cases. Normalization is not required before folding, except for the character U+0345 ͅ COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of their decomposition, such as U+1FC3 ῃ GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI. In practice, optimized versions of implementations can catch these special cases and, thereby, avoid an extra normalization."
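To illustrate problem 2, here is a minimal sketch of how a partial comparison on NFD text can make "o" match "ò", and one possible way to reject such matches. This is not the gtksourceview code: the function names are hypothetical, the inputs are assumed to be already case-folded and NFD-normalized UTF-8, and the combining-mark check only covers U+0300..U+036F, where real code would consult the full Unicode data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the UTF-8 sequence at p encodes a code point
 * in the Combining Diacritical Marks block (U+0300..U+036F). This range is
 * an assumption made for the sketch; it is not a complete test. */
static bool starts_with_combining_mark(const char *p)
{
    unsigned char b0, b1;

    assert(p != NULL);
    b0 = (unsigned char)p[0];
    if (b0 != 0xCC && b0 != 0xCD)
        return false;
    b1 = (unsigned char)p[1];
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;   /* U+0300..U+033F */
    return b1 >= 0x80 && b1 <= 0xAF;       /* U+0340..U+036F */
}

/* Partial comparison: only checks that the needle's bytes are a prefix. */
static bool naive_prefix_match(const char *haystack, const char *needle)
{
    return strncmp(haystack, needle, strlen(needle)) == 0;
}

/* Same comparison, but the match is rejected when it would end between a
 * base letter and a combining mark that follows it in the haystack. */
static bool whole_char_prefix_match(const char *haystack, const char *needle)
{
    size_t n = strlen(needle);

    if (strncmp(haystack, needle, n) != 0)
        return false;
    return !starts_with_combining_mark(haystack + n);
}
```

With the NFD form of "ò" written as "o\xCC\x80" (o followed by U+0300), naive_prefix_match reports a match for the needle "o", while whole_char_prefix_match does not. The reverse search, needle "o\xCC\x80" against haystack "o", fails in both, which is consistent with the asymmetry described above.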
Since fixing this bug (together with bug #168247) requires some serious thinking, I'm going to fix problem 1 first (s/G_NORMALIZE_ALL/G_NORMALIZE_NFD) and then we will work on a proper fix.

Other tests:

    casefold = g_utf8_casefold ("ͅ", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("ῃ", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of ῃ : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("ß", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of ß : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

    casefold = g_utf8_casefold ("SS", -1);
    normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
    normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
    g_print ("Case fold of SS : %s - Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
    g_free (casefold);
    g_free (normalized_all);
    g_free (normalized_nfd);

With results:

    Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : ι - Normalized ALL: ι - NFD: ι
    Case fold of ῃ : ηι - Normalized ALL: ηι - NFD: ηι
    Case fold of ß : ss - Normalized ALL: ss - NFD: ss
    Case fold of SS : ss - Normalized ALL: ss - NFD: ss
The specific problem reported by Josep has been fixed with the patch I have just committed. However, the 2nd problem I reported in comment #2 is still valid.

From IRC:

<paolo> furthermore I think we use "normalize" only once while we should use it twice
<paolo> i.e. NFD(toCasefold(NFD(X)))
<paolo> yep, it seems we only perform
<paolo> NFD(toCasefold(X))
<paolo> so actually we already have the "optimization"

ChangeLog entry for the committed patch:

2005-08-04  Paolo Maggi  <paolo@gnome.org>

	* gtksourceview/gtksourceiter.c (pointer_from_offset_skipping_decomp)
	(g_utf8_strcasestr) (g_utf8_strrcasestr) (g_utf8_caselessnmatch)
	(forward_chars_with_skipping) (strbreakup):
	s/G_NORMALIZE_ALL/G_NORMALIZE_NFD. See bug #303239 for more info.
(In reply to comment #6)
> The specific problem reported by Josep has been fixed with the patch I have just
> committed.
> Though the 2nd problem I reported in comment #2 is still valid.

(In reply to comment #4)
> 2. Searching for "o" matches "ò" since we probably make only a partial
> comparison. This is also confirmed by the fact that searching for "ò" does not
> match "o".

Hi, searching for "o" no longer matches "ó", and searching for "ó" no longer matches "o" either. If I understood correctly, this was the only thing left, so I'm closing the bug.