GNOME Bugzilla – Bug 339805
Find/Replace does not differentiate accented chars from normal latin chars in it/es locales
Last modified: 2007-09-04 15:02:50 UTC
Please describe the problem:
Find/Replace in gedit does not differentiate between normal Latin characters and accented ones.

Steps to reproduce:
1. Open a file containing localized characters in the gedit application.
2. Enter the following words:
   - sí (with LATIN SMALL LETTER I WITH ACUTE), which means "yes" in English
   - si (the i has no accent mark), which means "if" in English
3. Click on Replace.
4. The Replace dialog window is launched.
5. Type the localized string si (i without an accent, "if" in English) and click on the Find button.

Actual results:
The string sí (i with an accent, "yes" in English) is highlighted in the document, which is not the string that was searched for. Further, if a "Replace with" string is given, the string sí would also be replaced with it.

Expected results:
Find/Replace should only highlight/replace the specific characters that are searched for.

Does this happen every time?
Yes

Other information:
<behdad> paolo: ok, how about after the strncmp, checking that the next char in normalized_s1 is not a g_unichar_iszerowidth()?
<behdad> paolo: g_unichar_iszerowidth() is new in trunk. though, it may match to some chars you don't want to.
* cworth has quit (bye)
<paolo> hmm... what do I obtain doing so?
<behdad> paolo: "si" will not match to "si followed by an accent"
<behdad> paolo: you just want the ISZEROWIDTHTYPE check from g_unichar_iszerowidth() btw.
<paolo> oh, since normalization split sì in si' or something like that
<behdad> yeah
<paolo> yep, it could be the problem
<paolo> behdad: thanks
* fer (~fherrera@a88-115-27-99.elisa-laajakaista.fi) has joined #gtk+
<paolo> you said I need only the "if (G_UNLIKELY (ISZEROWIDTHTYPE (c))) return TRUE;" part of the function
<paolo> right?
* iago has quit (bye!)
<behdad> paolo: yeah. or return FALSE, depending on what the return value means.
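To make the failure discussed in the log concrete, here is a minimal sketch in plain C, with no GLib involved. The decomposed string is written out by hand as an assumption about what full normalization produces for "sí" (s, i, then U+0301 COMBINING ACUTE ACCENT). The name naive_prefix_match is hypothetical and only stands in for the strncmp step mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "sí" in decomposed form, written by hand for illustration:
 * 's', 'i', then U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes 0xCC 0x81).
 * In the real code this string would come out of the normalization step. */
static const char DECOMPOSED_SI_ACUTE[] = "si\xCC\x81";

/* Hypothetical helper: a byte-wise prefix test, i.e. the bare strncmp step. */
static bool naive_prefix_match(const char *text, const char *needle)
{
    assert(text != NULL && needle != NULL);
    return strncmp(text, needle, strlen(needle)) == 0;
}
```

Calling naive_prefix_match(DECOMPOSED_SI_ACUTE, "si") reports a match, because the first two bytes really are "si" and the accent only follows them. The precomposed form "s\xC3\xAD" (U+00ED) does not match, since its second byte already differs from 'i'.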
Interesting that pcre copes with this case correctly, doesn't match "sí" when looking for "si".
G_NORMALIZE_ALL_COMPOSE instead of G_NORMALIZE_ALL helps here too. Not in all cases perhaps.
<muntyan> behdad: UCD.html says "Changed general category of Zero Width Space (U+200B) from Zs to Cf.", so Zero Width Space falls into G_UNICODE_FORMAT?
<behdad> muntyan: yes
* bandini has quit (Ex-Chat)
* mmc (~ercmarusk@83-103-88-29.ip.fastwebnet.it) has joined #gtk+
<muntyan> behdad: but don't we want to ignore it when searching for text? i.e. to treat it not like accent mark
<muntyan> (ISZEROWIDTHTYPE includes G_UNICODE_FORMAT)
<behdad> muntyan: in that case, my fault. just check for the _MARK types.
behdad benzea
<muntyan> behdad: ISMARK, right?
<behdad> muntyan: yeah, exactly.
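The check agreed on above (accept the prefix match only if the next character is not a mark) can be sketched roughly as follows. This is a simplified stand-in in plain C, not the attached patch. The names decode_utf8, is_combining_mark and exact_prefix_match are hypothetical. The mark test only covers the Combining Diacritical Marks block U+0300..U+036F, whereas real code would presumably ask GLib for the character type via g_unichar_type() and compare against the _MARK types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one code point; the input is assumed to be valid, NUL-terminated UTF-8. */
static uint32_t decode_utf8(const char *p)
{
    const unsigned char *s = (const unsigned char *)p;

    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)
        return ((uint32_t)(s[0] & 0x0F) << 12) |
               ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
    return ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
           ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
}

/* Simplification (assumption): only U+0300..U+036F counts as a mark here.
 * Zero-width format characters such as U+200B are deliberately not included. */
static bool is_combining_mark(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* True if text starts with prefix and the prefix is not followed by a mark
 * that would turn its last letter into a different, accented character. */
static bool exact_prefix_match(const char *text, const char *prefix)
{
    size_t prefix_len;

    assert(text != NULL && prefix != NULL);
    prefix_len = strlen(prefix);

    if (strncmp(text, prefix, prefix_len) != 0)
        return false;
    if (text[prefix_len] == '\0')
        return true;
    return !is_combining_mark(decode_utf8(text + prefix_len));
}
```

With this, exact_prefix_match("si\xCC\x81", "si") is rejected while exact_prefix_match("si no", "si") still matches. A needle followed by U+200B ZERO WIDTH SPACE ("si\xE2\x80\x8B") also still matches, which is the behaviour muntyan asks for: format characters are not treated like accents.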
Created attachment 79991 [details] gtksourceiter.c I've cooked this.
Created attachment 79992 [details] [review] patch Sorry, this is what I wanted to post.
It's late so I am probably missing something obvious, but what does this part of the patch have to do with the rest?

+#define g_utf8_strcasestr gtk_source_strcasestr
+#define g_utf8_strrcasestr gtk_source_strrcasestr
+#define g_utf8_caselessnmatch gtk_source_caselessnmatch

The other part makes sense to me (as much as I understood what behdad said); the only nitpick is that we usually do not use 'inline' [1].

[1] I understand that it makes sense to inline the function since it's used only in that place, but as far as I know a) gcc will figure that out and b) inline is not available on all the compilers we support (Sun etc.)
(In reply to comment #7)
> It's late so I am probably missing something obvious, but what does this part of
> the patch have to do with the rest?
>
> +#define g_utf8_strcasestr gtk_source_strcasestr
> +#define g_utf8_strrcasestr gtk_source_strrcasestr
> +#define g_utf8_caselessnmatch gtk_source_caselessnmatch

Um, didn't clean up the patch. That's what I have here to avoid a name clash with glib.

> The other part makes sense to me (as much as I understood what behdad said),
> the only nitpick is that we usually do not use 'inline'[1]

C++-ism, can't get rid of it. Totally agree it should not be there.
Created attachment 80004 [details] [review] patch Real thing now (not sure if it's nice though, as I said it's "what I cooked here").
Yevgen: thanks for the patch. It probably solves the specific problem reported here, so it can go in as a first step. I don't think it is generic enough to solve, for example, the problem of searching for "s" in a text containing "ß". Maybe Behdad has another great idea on how to solve this. Please commit the patch to both HEAD and the latest branch.
/me puts on his pain-in-the-ass hat

1 - can you add a little comment above

+ return type != G_UNICODE_COMBINING_MARK &&
+        type != G_UNICODE_ENCLOSING_MARK &&
+        type != G_UNICODE_NON_SPACING_MARK;

saying what we are doing

2 - for the namespace clashing: what about gtk_source_utf8_strcasestr etc? (that is, keep utf8 in the name)
Committed, finally. Anyway, what's the problem with searching for "s" in a text containing "ß"? And what are the other problems with search? It always worked for me in Russian, so I assumed it's working fine :)
Didn't close it back then because I couldn't close it.