Bug 303239 – "False positive" searching through document


Summary:	"False positive" searching through document


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	General
Version:	unspecified
Hardware:	Other All

Importance:	High critical
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Depends on:	348754
Blocks:

Reported:	2005-05-06 12:00 UTC by Josep Puigdemont
Modified:	2011-06-01 00:24 UTC

See Also:
GNOME target:	---
GNOME version:	2.9/2.10

Description Josep Puigdemont 2005-05-06 12:00:13 UTC

Please describe the problem:
Searching for the 'º' character in a UTF-8 document gives false positives: 'o',
'ó', etc. are reported as matches for 'º'.
If I enable case-sensitive matching, only the desired character is matched.


Steps to reproduce:
1. Open a document, I used:
http://cvs.sourceforge.net/viewcvs.py/*checkout*/inkscape/inkscape/po/ca.po?rev=1.87
2. Search for character 'º', without upper/lowercase match activated.



Actual results:
Characters like 'o', 'ó' are also matched.

Expected results:
I would expect 'º' to be the only match.
I wouldn't expect 'o'/'ó'... to be the lower/uppercase of 'º' either.


Does this happen every time?
yes

Other information:

Comment 1 Paolo Borelli 2005-05-23 07:54:13 UTC

Sorry for the delay... I'm moving this bug to gtk+ (since search is implemented
there).

However I suspect that the behavior is intentional and follows some
international standard.

Comment 2 Matthias Clasen 2005-05-23 16:17:53 UTC

I would think this should move to gtksourceview first, since you are using the
gtksourceview implementation of searching. Feel free to move back to gtk+ if you
have a testcase showing the same behaviour using straight
gtk_text_iter_search_forward()...

Comment 3 Luis Villa 2005-07-14 17:52:31 UTC

Matthias, you forgot to reassign to owner. Confirming that this still occurs in 2.10.

Comment 4 Paolo Maggi 2005-08-04 08:18:17 UTC

I confirm this bug using gedit.

To investigate the problem I have performed some tests using the following code:

	gchar *casefold, *normalized_all, *normalized_nfd;

	casefold = g_utf8_casefold ("o", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of o : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("O", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of O : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("º", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of º : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("ò", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ò : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("ó", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ó : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

And I got:

Case fold of o : o - Normalized ALL: o - NFD: o
Case fold of O : o - Normalized ALL: o - NFD: o
Case fold of º : º - Normalized ALL: o - NFD: º
Case fold of ò : ò - Normalized ALL: ò - NFD: ò
Case fold of ó : ó - Normalized ALL: ó - NFD: ó
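
The 'º' row is the interesting one: G_NORMALIZE_ALL applies compatibility decompositions as well as canonical ones, while G_NORMALIZE_NFD applies only canonical ones. A minimal sketch of that difference, using a hand-written three-entry table rather than real Unicode data (toy_decomp, toy_table and toy_decompose are hypothetical names, not GLib API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy decomposition table: a hand-picked subset for illustration only,
 * not generated from the Unicode data files. */
typedef struct {
    uint32_t cp;      /* code point to decompose */
    uint32_t out[2];  /* decomposition; second slot is 0 when unused */
    bool     compat;  /* true: compatibility mapping, skipped by NFD */
} toy_decomp;

static const toy_decomp toy_table[] = {
    { 0x00BA, { 0x006F, 0 },      true  }, /* º -> <super> o        */
    { 0x00F2, { 0x006F, 0x0300 }, false }, /* ò -> o + U+0300 grave */
    { 0x00F3, { 0x006F, 0x0301 }, false }, /* ó -> o + U+0301 acute */
};

/* Writes the decomposition of cp into out and returns its length (1 or 2).
 * allow_compat=false mimics canonical decomposition (NFD);
 * allow_compat=true mimics compatibility decomposition (G_NORMALIZE_ALL). */
size_t toy_decompose(uint32_t cp, bool allow_compat, uint32_t out[2])
{
    assert(out != NULL);
    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++) {
        const toy_decomp *d = &toy_table[i];
        if (d->cp != cp || (d->compat && !allow_compat))
            continue;
        out[0] = d->out[0];
        if (d->out[1] == 0)
            return 1;
        out[1] = d->out[1];
        return 2;
    }
    out[0] = cp; /* no applicable mapping: the code point stands for itself */
    return 1;
}
```

With allow_compat set, U+00BA collapses to a plain 'o', which is why a search for 'º' could hit every 'o'. Without it, U+00BA is left alone, matching the NFD column above.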

This shows us two problems:
1. We are using G_NORMALIZE_ALL. We should use NFD as described in
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (page 91). This is easy to
fix.

2. Searching for "o" matches "ò" since we probably make only a partial
comparison. This is also confirmed by the fact that searching for "ò" does not
match "o".
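
One plausible reading of problem 2, sketched in plain C: after NFD, "ò" is the two code points 'o' + U+0300, so a byte-wise substring search for "o" succeeds on the base letter alone, while the reverse search fails. The sketch assumes both strings are already case-folded and NFD-normalized UTF-8; find_whole_sequence and starts_with_combining_mark are hypothetical helpers, not the gtksourceview code, and only the U+0300..U+036F block is treated as combining.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if p starts with a UTF-8 encoded code point in U+0300..U+036F
 * (Combining Diacritical Marks): 0xCC 0x80..0xBF or 0xCD 0x80..0xAF. */
static bool starts_with_combining_mark(const char *p)
{
    unsigned char b0 = (unsigned char)p[0];
    if (b0 != 0xCC && b0 != 0xCD)
        return false;
    unsigned char b1 = (unsigned char)p[1];
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;
    return b1 >= 0x80 && b1 <= 0xAF;
}

/* Like strstr(), but rejects a hit that is immediately followed by a
 * combining mark, i.e. a hit that covers only part of a decomposed
 * character such as 'o' + U+0300. */
const char *find_whole_sequence(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(needle);
    if (n == 0)
        return haystack;
    for (const char *hit = strstr(haystack, needle);
         hit != NULL;
         hit = strstr(hit + 1, needle)) {
        if (!starts_with_combining_mark(hit + n))
            return hit;
    }
    return NULL;
}
```

A plain strstr() for "o" in "o\xCC\x80" finds a match and the reverse search does not, which reproduces the asymmetry described above. The boundary check makes both directions fail.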

Reading the Unicode document, I have seen that there is a possible optimization we
can introduce in our code:
"The invocations of normalization before folding in the above definitions are to
catch very infrequent edge cases. Normalization is not required before folding,
except for the character U+0345 ͅ COMBINING GREEK YPOGEGRAMMENI and any
characters that have it as part of their decomposition, such as U+1FC3 ῃ GREEK
SMALL LETTER ETA WITH YPOGEGRAMMENI.
In practice, optimized versions of implementations can catch these special cases
and, thereby, avoid an extra normalization."
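
A rough sketch of how such a special-case check might look, assuming valid UTF-8 input. It is deliberately conservative: instead of listing exactly which characters decompose to U+0345, it flags U+0345 itself and the whole Greek Extended block (U+1F00..U+1FFF), so it may ask for a normalization that is not strictly needed. may_need_normalize_before_fold is a hypothetical helper, not part of GLib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the string contains U+0345 or any code point in the
 * Greek Extended block, in which case the caller should normalize before
 * case folding. Returns false when that first normalization can be skipped. */
bool may_need_normalize_before_fold(const char *utf8)
{
    assert(utf8 != NULL);
    const unsigned char *p = (const unsigned char *)utf8;
    while (*p != '\0') {
        /* U+0345 COMBINING GREEK YPOGEGRAMMENI encodes as 0xCD 0x85. */
        if (p[0] == 0xCD && p[1] == 0x85)
            return true;
        /* U+1F00..U+1FFF encode as 0xE1 0xBC..0xBF followed by one
         * continuation byte, so the first two bytes are enough. */
        if (p[0] == 0xE1 && p[1] >= 0xBC && p[1] <= 0xBF)
            return true;
        p++;
    }
    return false;
}
```

Since 0xCD and 0xE1 only ever appear as lead bytes in valid UTF-8, scanning byte by byte cannot produce a false hit in the middle of another character.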

Comment 5 Paolo Maggi 2005-08-04 08:45:37 UTC

Since fixing this bug (together with bug #168247) requires some serious
thinking, I'm going first to fix problem 1 (s/G_NORMALIZE_ALL/G_NORMALIZE_NFD)
and then we will work on a proper fix.

Other tests:

	casefold = g_utf8_casefold ("ͅ", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : %s - "
	         "Normalized ALL: %s - NFD: %s\n", casefold, normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("ῃ", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ῃ : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);
	
	casefold = g_utf8_casefold ("ß", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of ß : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

	casefold = g_utf8_casefold ("SS", -1);
	normalized_all = g_utf8_normalize (casefold, -1, G_NORMALIZE_ALL);
	normalized_nfd = g_utf8_normalize (casefold, -1, G_NORMALIZE_NFD);
	g_print ("Case fold of SS : %s - Normalized ALL: %s - NFD: %s\n", casefold,
normalized_all, normalized_nfd);
	g_free (casefold);
	g_free (normalized_all);
	g_free (normalized_nfd);

With results:

Case fold of ͅ (U+0345 COMBINING GREEK YPOGEGRAMMENI) : ι - Normalized ALL: ι - NFD: ι
Case fold of ῃ : ηι - Normalized ALL: ηι - NFD: ηι
Case fold of ß : ss - Normalized ALL: ss - NFD: ss
Case fold of SS : ss - Normalized ALL: ss - NFD: ss

Comment 6 Paolo Maggi 2005-08-04 09:56:34 UTC

The specific problem reported by Josep has been fixed with the patch I have just
committed.
Though the 2nd problem I reported in comment #4 is still valid.

From IRC:

<paolo> furthermore I think we use "normalize" only once while we should use it
twice
<paolo> i.e. NFD(toCasefold(NFD(X))) 
<paolo> yep, it seems we only perform
<paolo> NFD(toCasefold(X))
<paolo> so actually we already have the "optimization"

ChangeLog entry for the committed patch:

2005-08-04  Paolo Maggi  <paolo@gnome.org>

	* gtksourceview/gtksourceiter.c
	(pointer_from_offset_skipping_decomp) (g_utf8_strcasestr)
	(g_utf8_strrcasestr) (g_utf8_caselessnmatch)
	(forward_chars_with_skipping) (strbreakup): 
	s/G_NORMALIZE_ALL/G_NORMALIZE_NFD. See bug #303239 for more info.

Comment 7 Carnë Draug 2011-06-01 00:24:11 UTC

(In reply to comment #6)
> The specific problem reported by Josep has been fixed with the patch I have just
> committed.
> Though the 2nd problem I reported in comment #4 is still valid.

(In reply to comment #4)
> 
> 2. Searching for "o" matches "ò" since we probably make only a partial
> comparison. This is also confirmed by the fact that searching for "ò" does not
> match "o".

Hi,
searching for "o" no longer matches "ó", and searching for "ó" does not match "o" either. If I understood correctly, this was the only thing left, so I'm closing the bug.