Bug 303239 - "False positive" searching through document
Status: RESOLVED FIXED
Product: gtksourceview
Classification: Platform
Component: General
Version: unspecified
Hardware: Other All
Importance: High critical
Target Milestone: ---
Assigned To: GTK Sourceview maintainers
Depends on: 348754
Blocks:
 
 
Reported: 2005-05-06 12:00 UTC by Josep Puigdemont
Modified: 2011-06-01 00:24 UTC
See Also:
GNOME target: ---
GNOME version: 2.9/2.10



Description Josep Puigdemont 2005-05-06 12:00:13 UTC
Please describe the problem:
Searching for the 'º' character in a UTF-8 document gave false positives: 'o',
'ó', etc. were reported as matches for 'º'.
If I enable case-sensitive matching, only the desired character is matched.


Steps to reproduce:
1. Open a document, I used:
http://cvs.sourceforge.net/viewcvs.py/*checkout*/inkscape/inkscape/po/ca.po?rev=1.87
2. Search for character 'º', without upper/lowercase match activated.



Actual results:
Characters like 'o', 'ó' are also matched.

Expected results:
I would expect 'º' to be the only match.
I wouldn't expect 'o'/'ó'... to be the lower/uppercase of 'º' either.


Does this happen every time?
yes

Other information:
Comment 1 Paolo Borelli 2005-05-23 07:54:13 UTC
Sorry for the delay... I'm moving this bug to gtk+ (since search is implemented
there).

However I suspect that the behavior is intentional and follows some
international standard.
Comment 2 Matthias Clasen 2005-05-23 16:17:53 UTC
I would think this should move to gtksourceview first, since you are using the
gtksourceview implementation of searching. Feel free to move back to gtk+ if you
have a testcase showing the same behaviour using straight
gtk_text_iter_search_forward()...
Comment 3 Luis Villa 2005-07-14 17:52:31 UTC
Matthias, you forgot to reassign to owner. Confirming that this still occurs in 2.10.
Comment 4 Paolo Maggi 2005-08-04 08:18:17 UTC
I confirm this bug using gedit.

To investigate the problem I have performed some tests using the following code:

	casefold = g_utf8_casefold ("o", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of o : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("O", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of O : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("º", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of º : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("ò", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ò : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("ó", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ó : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

And I got:

Case fold of o : o - Normalized ALL: o - NFD: o
Case fold of O : o - Normalized ALL: o - NFD: o
Case fold of º : º - Normalized ALL: o - NFD: º
Case fold of ò : ò - Normalized ALL: ò - NFD: ò
Case fold of ó : ó - Normalized ALL: ó - NFD: ó

This shows us two problems:
1. We are using G_NORMALIZE_ALL. We should use NFD as described in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (page 91). This is easy to
fix.

2. Searching for "o" matches "ò" since we probably make only a partial
comparison. This is also confirmed by the fact that searching for "ò" does not
match "o".
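
The asymmetry in problem 2 can be illustrated with a minimal sketch in plain C. It is a simplified model, not the gtksourceview code: it assumes both strings are already in NFD (so "ò" is stored as "o" followed by U+0300) and that the comparison only looks at the first strlen(needle) bytes. The function names are hypothetical, and only the U+0300..U+036F block of combining marks is checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the UTF-8 sequence at p encodes a code point in the Combining
 * Diacritical Marks block (U+0300..U+036F), i.e. the bytes
 * 0xCC 0x80..0xBF or 0xCD 0x80..0xAF. Other combining blocks are ignored. */
static bool
starts_with_combining_mark (const char *p)
{
	unsigned char b0 = (unsigned char) p[0];
	unsigned char b1;

	if (b0 != 0xCC && b0 != 0xCD)
		return false;

	b1 = (unsigned char) p[1];
	if (b0 == 0xCC)
		return b1 >= 0x80 && b1 <= 0xBF;
	return b1 >= 0x80 && b1 <= 0xAF;
}

/* Partial comparison: only the first strlen (needle) bytes are compared,
 * so a needle "o" matches a haystack that starts with "o" + U+0300. */
static bool
prefix_match (const char *haystack, const char *needle)
{
	return strncmp (haystack, needle, strlen (needle)) == 0;
}

/* Same comparison, but the match is rejected when it would separate a base
 * character from a combining mark that immediately follows it. */
static bool
prefix_match_whole_chars (const char *haystack, const char *needle)
{
	size_t n = strlen (needle);

	if (strncmp (haystack, needle, n) != 0)
		return false;

	return !starts_with_combining_mark (haystack + n);
}
```

With the decomposed string "o\xCC\x80" standing for "ò", prefix_match ("o\xCC\x80", "o") succeeds while prefix_match ("o", "o\xCC\x80") fails, which mirrors the one-directional match described above. prefix_match_whole_chars rejects the first case because the byte after the match starts a combining mark.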

Reading the Unicode document I have seen that there is a possible optimization we
can introduce in our code:
"The invocations of normalization before folding in the above definitions are to
catch very infrequent edge cases. Normalization is not required before folding,
except for the character U+0345 ◌ͅ COMBINING GREEK YPOGEGRAMMENI and any
characters that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
SMALL LETTER ETA WITH YPOGEGRAMMENI.
In practice, optimized versions of implementations can catch these special cases
and, thereby, avoid an extra normalization."

Comment 5 Paolo Maggi 2005-08-04 08:45:37 UTC
Since fixing this bug (together with bug #168247) requires some serious
thinking, I'm going to fix problem 1 first (s/G_NORMALIZE_ALL/G_NORMALIZE_NFD/)
and then we will work on a proper fix.

Other tests:

	casefold = g_utf8_casefold ("ͅ", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : %s - "
	         "Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("ῃ", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ῃ : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("ß", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ß : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("SS", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of SS : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

With results:

Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : ι - Normalized ALL: ι - NFD: ι
Case fold of ῃ : ηι - Normalized ALL: ηι - NFD: ηι
Case fold of ß : ss - Normalized ALL: ss - NFD: ss
Case fold of SS : ss - Normalized ALL: ss - NFD: ss

Comment 6 Paolo Maggi 2005-08-04 09:56:34 UTC
The specific problem reported by Josep has been fixed with the patch I have just
committed.
Though the 2nd problem I reported in comment #4 is still valid.

From IRC:

<paolo> furthermore I think we use "normalize" only once while we should use it
twice
<paolo> i.e. NFD(toCasefold(NFD(X))) 
<paolo> yep, it seems we only perform
<paolo> NFD(toCasefold(X))
<paolo> so actually we already have the "optimization"

ChangeLog entry for the committed patch:

2005-08-04  Paolo Maggi  <paolo@gnome.org>

	* gtksourceview/gtksourceiter.c
	(pointer_from_offset_skipping_decomp) (g_utf8_strcasestr)
	(g_utf8_strrcasestr) (g_utf8_caselessnmatch)
	(forward_chars_with_skipping) (strbreakup): 
	s/G_NORMALIZE_ALL/G_NORMALIZE_NFD. See bug #303239 for more info.

Comment 7 Carnë Draug 2011-06-01 00:24:11 UTC
(In reply to comment #6)
> The specific problem reported by Josep has been fixed with the patch I have just
> committed.
> Though the 2nd problem I reported in comment #4 is still valid.

(In reply to comment #4)
> 
> 2. Searching for "o" matches "ò" since we probably make only a partial
> comparison. This is also confirmed by the fact that searching for "ò" does not
> match "o".

Hi
Searching for "o" no longer matches "ó", and searching for "ó" does not match "o" either. If I understood correctly, this was the only thing left, so I'm closing the bug.