Bug 339805 – Find/Replace does not differentiate accented chars in with normal latin chars in it/es locales




Summary:	Find/Replace does not differentiate accented chars in with normal latin chars...


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	General
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Depends on:	348754
Blocks:

Reported:	2006-04-26 12:21 UTC by Matt Keenan (IRC:MattMan)
Modified:	2007-09-04 15:02 UTC

See Also:
GNOME target:	---
GNOME version:	2.13/2.14

Attachments
gtksourceiter.c (17.92 KB, text/plain) 2007-01-10 21:40 UTC, Yevgen Muntyan		Details
patch (1.72 KB, patch) 2007-01-10 21:43 UTC, Yevgen Muntyan	none	Details \| Review
patch (3.27 KB, patch) 2007-01-11 01:16 UTC, Yevgen Muntyan	accepted-commit_now	Details \| Review

Description Matt Keenan (IRC:MattMan) 2006-04-26 12:21:28 UTC

Please describe the problem:
Find/Replace in gedit does not differentiate between normal latin characters and
accented ones.

Steps to reproduce:
1. Open a file containing localized characters in the gedit application
2. Enter the following words
- sí (LATIN SMALL LETTER I WITH ACUTE) – means “yes” in English
- si (Note: i is without an accent mark) – means “if” in English
3. Click on Replace
4. The Replace dialog window is launched
5. Type the localized string si (Note: i without an accent - “if” in English)
and click on the Find button

Actual results:
The string sí (Note: i with an accent - “yes” in English) is highlighted
in the document, which is not the string being searched for. Further, if a
Replace with string is given, the string sí would also be replaced with the
given Replace with string.

Expected results:
Find/Replace should only highlight/replace specific characters that are searched
for.

Does this happen every time?
yes

Other information:

Comment 1 Paolo Maggi 2007-01-10 18:17:18 UTC

<behdad> paolo: ok, how about after the strncmp, checking that the next char in normalized_s1 is not a g_unichar_iszerowidth()?
<behdad> paolo: g_unichar_iszerowidth() is new in trunk.  though, it may match to some chars you don't want to.
<paolo> hmm... what do I obtain doing so?
<behdad> paolo: "si" will not match to "si followed by an accent"
<behdad> paolo: you just want the ISZEROWIDTHTYPE check from g_unichar_iszerowidth() btw.
<paolo> oh, since normalization split sì in si' or something like that
<behdad> yeah
<paolo> yep, it could be the problem
<paolo> behdad: thanks
<paolo> you said I need only the "if (G_UNLIKELY (ISZEROWIDTHTYPE (c))) return TRUE;" part of the function
<paolo> right?
<behdad> paolo: yeah. or return FALSE, depending on what the return value means.

Comment 2 Yevgen Muntyan 2007-01-10 18:40:17 UTC

Interesting that pcre copes with this case correctly, doesn't match "sí" when looking for "si".

Comment 3 Yevgen Muntyan 2007-01-10 18:47:18 UTC

G_NORMALIZE_ALL_COMPOSE instead of G_NORMALIZE_ALL helps here too. Not in all cases perhaps.
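
To make the difference concrete, here is a minimal sketch in plain C (no GLib). The two byte strings are hard-coded stand-ins for what a decomposing normalization (such as G_NORMALIZE_ALL) and a composing one (such as G_NORMALIZE_ALL_COMPOSE) would be expected to produce for the word "sí"; the names has_byte_prefix, SI_DECOMPOSED and SI_COMPOSED are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "sí" decomposed: 's', 'i', U+0301 COMBINING ACUTE ACCENT (UTF-8: CC 81). */
const char SI_DECOMPOSED[] = "si\xCC\x81";

/* "sí" composed: 's', U+00ED LATIN SMALL LETTER I WITH ACUTE (UTF-8: C3 AD). */
const char SI_COMPOSED[] = "s\xC3\xAD";

/* Hypothetical helper: byte-wise prefix test, the way a bare strncmp
 * over normalized strings behaves. */
bool has_byte_prefix(const char *text, const char *prefix)
{
    assert(text != NULL && prefix != NULL);
    return strncmp(text, prefix, strlen(prefix)) == 0;
}
```

With the decomposed form, the bytes of "si" are a literal prefix of "sí", so a byte comparison reports a match; with the composed form the second character is a different code point and the comparison fails. This only covers the case where a precomposed character exists, which may be why it does not help in all cases.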

Comment 4 Yevgen Muntyan 2007-01-10 21:27:32 UTC

<muntyan> behdad: UCD.html says "Changed general category of Zero Width Space (U+200B) from Zs to Cf.", so Zero Width Space falls into G_UNICODE_FORMAT?
<behdad> muntyan: yes
<muntyan> behdad: but don't we want to ignore it when searching for text? i.e. to treat it not like accent mark
<muntyan> (ISZEROWIDTHTYPE includes G_UNICODE_FORMAT)
<behdad> muntyan: in that case, my fault.  just check for the _MARK types.
<muntyan> behdad: ISMARK, right?
<behdad> muntyan: yeah, exactly.
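
The approach discussed in the two IRC excerpts can be sketched roughly as follows. This is not the attached patch: it is a self-contained illustration in plain C, where decode_utf8, is_mark_simplified and exact_prefix_match are hypothetical names, and the mark test is deliberately simplified to the Combining Diacritical Marks block (U+0300 to U+036F). In GLib the category would instead come from g_unichar_type(), compared against the mark types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Decode the code point starting at s. Assumes s points at valid UTF-8. */
uint32_t decode_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] < 0x80)
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    if ((p[0] & 0xF0) == 0xE0)
        return ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6) | (uint32_t)(p[2] & 0x3F);
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
}

/* Simplified stand-in for a real Unicode mark check (assumption: only
 * U+0300..U+036F is treated as a mark). Format characters such as
 * U+200B ZERO WIDTH SPACE are intentionally not included. */
bool is_mark_simplified(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* True if text starts with prefix and the prefix is not immediately
 * followed by a mark that would belong to its last character. */
bool exact_prefix_match(const char *text, const char *prefix)
{
    size_t n;

    assert(text != NULL && prefix != NULL);
    n = strlen(prefix);
    if (strncmp(text, prefix, n) != 0)
        return false;
    if (text[n] == '\0')
        return true;
    return !is_mark_simplified(decode_utf8(text + n));
}
```

Under these assumptions, "si" no longer matches the decomposed "sí" (the next code point is a combining accent), while "si" followed by a Zero Width Space still matches, because only marks are rejected and not the wider zero-width set.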

Comment 5 Yevgen Muntyan 2007-01-10 21:40:37 UTC

Created attachment 79991 [details]
gtksourceiter.c

I've cooked this.

Comment 6 Yevgen Muntyan 2007-01-10 21:43:47 UTC

Created attachment 79992 [details] [review]
patch

Sorry, this is what I wanted to post.

Comment 7 Paolo Borelli 2007-01-11 00:24:28 UTC

It's late so I am prolly missing something obvious, but what does this part of the patch have to do with the rest?

+#define g_utf8_strcasestr	gtk_source_strcasestr
+#define g_utf8_strrcasestr	gtk_source_strrcasestr
+#define g_utf8_caselessnmatch	gtk_source_caselessnmatch

The other part makes sense to me (as much as I understood what behdad said), the only nitpick is that we usually do not use 'inline'[1]

1) I understand that it makes sense to inline the function since it's used only in that place, but as far as I know a) gcc will figure that out b) inline is not available on all the compilers we support (sun etc)

Comment 8 Yevgen Muntyan 2007-01-11 00:39:21 UTC

(In reply to comment #7)
> It's late so I am prolly missing something obvious, but what does this part of
> the patch have to do with the rest?
> 
> +#define g_utf8_strcasestr      gtk_source_strcasestr
> +#define g_utf8_strrcasestr     gtk_source_strrcasestr
> +#define g_utf8_caselessnmatch  gtk_source_caselessnmatch

Um, didn't clean up the patch. That's what I have here to avoid name clash with glib.

> The other part makes sense to me (as much as I understood what behdad said),
> the only nitpick is that we usually do not use 'inline'[1]

C++-ism, can't get rid of it. Totally agree it should not be there.

Comment 9 Yevgen Muntyan 2007-01-11 01:16:34 UTC

Created attachment 80004 [details] [review]
patch

Real thing now (not sure if it's nice though, as I said it's "what I cooked here").

Comment 10 Paolo Maggi 2007-01-11 08:54:04 UTC

Yevgen: thanks for the patch.

It probably solves the specific problem reported here, so it can go in as a first step.
I don't think it is generic enough to solve, for example, the problem of searching "s" in a text containing "ß".

Maybe Behdad has another great idea on how to solve this.

Please, commit the patch in both HEAD and latest branch.

Comment 11 Paolo Borelli 2007-01-11 08:59:14 UTC

/me puts on his pain-in-the-ass hat

1 - can you add a little comment above 

+	return type != G_UNICODE_NON_SPACING_MARK &&
+		type != G_UNICODE_ENCLOSING_MARK &&
+		type != G_UNICODE_NON_SPACING_MARK;

 saying what we are doing


2 - for the namespace clashing: what about gtk_source_utf8_strcasestr etc? (that is, keep utf8 in the name)

Comment 12 Yevgen Muntyan 2007-02-10 16:07:42 UTC

Committed, finally. Anyway, what's the problem with searching "s" in a text containing "ß"? And what are the other problems of search? It always worked for me in Russian, so I assumed it's working fine :)

Comment 13 Yevgen Muntyan 2007-09-04 15:02:50 UTC

Didn't close it back then because I couldn't close it.