GNOME Bugzilla – Bug 168247
wrong selection of characters when searching for single characters
Last modified: 2011-06-01 00:31:25 UTC
Please describe the problem: Repeated search for single characters sometimes selects wrong characters. Steps to reproduce: 1. copy the following text into gedit: Amrum im November Wer nicht gerade alleine anreist, mietet sich ja fast immer eine Ferienwohnung oder ein ganzes Haus. Und die meisten dieser Häuser sind komfortabel und gemütlich eingerichtet, damit man sich als Gast darin wohlfühlt. Und wann fühlt man sich drinnen am wohlsten? Wenn es draußen regnet und stürmt. Der November gehört also zu den Monaten, in denen sich die Gemütlichkeit der Amrumer Häuser am besten auskosten lässt. 2. Press <CTRL>-F and enter 'ü' (without quotes) as search term. 3. Repeat search with <CTRL>-G or by pressing the 'search' button 4 times. Actual results: In the word 'stürmt' the characters 'rm' are selected. Expected results: In the word 'stürmt' the character 'ü' should be selected. Does this happen every time? yes Other information: Problem occured when searching for Umlauts 'ä,ö,ü' and german 'ß'. Same problem when searching for standard characters like 'o'. When searching for 'o' in above text, after the 8th repeat character 'v' is selected instead of 'o' in word 'November'.
which version of gtksourceview do you have? I seem to recall that this is a problem in gtksourceview that has been fixed in newer versions...
I have libgtksourceview 1.1.1-1 (from debian testing).
I think this was fixed in http://bugzilla.gnome.org/show_bug.cgi?id=141756 IIRC GtkSourceView 1.1.91 is currently in debian experimental, it will not require updated dependencies, can you upgrade and confirm that the bug is fixed?
Sorry, I cannot install 1.1.91, but I checked the Changelog of gtksourceview 1.1.1 and found: 2004-07-21 Gustavo Giráldez <gustavo.giraldez@gmx.net> * gtksourceview/gtksourceiter.c (forward_chars_with_skipping) (pointer_from_offset_skipping_decomp): Use g_utf8_normalize() instead of g_unicode_canonical_decomposition() to calculate normalized character lengths. If this entry refers to Bug 141756 from gtksourceview it should already be resolved in my version 1.1.1 if there haven't been more changes. Maybe someone with the most recent version of gtksourceview can try to reproduce the bug and report here.
It seems 'ß' is the key of the problem. The same problem can be easily reproduced searching for 'a' in a document containing only the string "ßabc". Note that seaching for a in the string "ßa" makes gedit/gtksourceview go in a endless loop. I think this is related to the fact the the uppercase of "ß" is "SS". If you search for "ß" in "SSabc", "SS" is correctly selected, but if you search for "ß" in "ßabc", "ßa" is selected. At the same way, if you search for "SS" in "SSabc", "SS" is correctly selected, but if you search for "SS" in "ßabc", "ßa" is selected. As abvious, all the previous comments apply only if "Match case" is not selected. BTW, a part the problem with the endless cycle (that I just discovered), this is a problem I was aware of (when I implemented the code, it was in the pre-2.0 period), but that I removed from my mind since it seemed no one was affecting by it. Now it is time to try again to fix it... but I can anticipate it will be a challenging problem ;) Moving product to GtkSourceView and changing Priority/Severity to High/Critical due to the endless loop. I don't know if I will have time to look at it before the "code freeze" for 2.10, so it will be cool if someone (pbor? Gustavo? random person?) could try to give a look at it.
with regard to the infinite loop... I found a strange thing: if I open a new file with just "ßa" in it and search for "a" I can reproduce the infinite loop, but if I save the file to disk it doesn't go in an infinite loop, but the selction is wrong, it goes from after the "a" to the end of the line.
Created attachment 37932 [details] [review] patch Yay! I think I nailed it. At least the infinite loop.
Where is the magic? Why does it work? (I ask because I'd like to know the rationale of the fix) Does it completely fix all the problems reported in this bug report? Have you tested it hard? Have you seen if it broke again old bugs? We really need a "search test suite" (bug #136434) Gustavo: may you test it too? What do you think of this solution?
the rationale is the follwing: "offset" passed to pointer_from_offset_skipping_decomp is the number of chars before the match in the string after both casefold and normalize. The function walks the original string one character at a time and to find out how much to substract from offset it has to both casefold and normalize the current char. Before the patch it didn't casefold the current char, and since casefold may change the lenght of the string it ended up returning a pointer to the wrong character in the string. The patch fixes all the problems in this bugreport, but it still needs more testing to make sure we haven't regressed.
Created attachment 38001 [details] [review] more complete patch I think the previous patch is right on its own, but it uncovers another problem: when searching for "ß" in "ßa" both ß and a are selected. I think it is due to the following line in lines_match() /* Go to end of search string */ forward_chars_with_skipping (&next, g_utf8_strlen (*lines, -1), visible_only, !slice, TRUE); Note that *lines has already been normalized and casefolded so g_utf8_strlen (*lines, -1) may not be the right number of characters for moving the text iter . The solution is to *not* casefold and normalize the lines in strbreakup, but just do it on the needle in string matching function themselves (g_utf8_strcasestr and g_utf8_strrcasestr). Note also that g_utf8_caselessnmatch was already casefolding and normalizing both strings.
darn... the latest patch seems to break multiline search in some cases for instance when searching "I have also noticed this with the ´ (′ in html) character. In this case,\n it is reliably that the ´prime´ character and also the „following„ character are\n selected by the find." in the testcase of bug #141756
Created attachment 38013 [details] [review] Update version of the patch by pbor
Created attachment 38033 [details] [review] Simpler algorithm, fixes all the problems described in this bug. Taking the idea of returning the start and end of the match from the original haystack from the last patch by Paolo Borelli, I've come up with a much simpler approach. It's probably slower than the previous implementation, but it works ok. As far as I could test it, it doesn't regress any of the previous search bugs and fixes the issues described here. The core of the algorithm is g_utf8_caselessmatch() which works by taking a character at a time from both strings and comparing the normalized versions of them. The algorithm also works ok in comparing a decomposed character against a non-decomposed character. I hope there's still time before the freeze to get this one in.
Created attachment 38041 [details] [review] minimal bandaid I have not looked deeply in Gustavo's patch, but I am a bit worried of totally rewriting the algorithm few hours before the freeze... after all we loived with this bug for a few releases. Maybe for 1.2.0 we should apply this minimal bandaid which just avoids the infinite loop without fixing the real bug.
See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for more info about Unicode caseless matching
a couple more datapoints. small snippet from a irc discussion: pbor gustavo: forgive my utf8 ignorance, but I have a cople of questions about your patch... are you sure that normalizing character by character always yelds the rigth results? pbor I mean, aren't there some sequences of characters the if combined together can give a different normalized result? gustavo pbor: well, i'm not entirely sure, but we already did that before :-) pbor well, afaics before we did it only once we already found the match with strncmp, so it may have been way more difficult to hit the wrong case... ... gustavo just found this: http://www.unicode.org/unicode/reports/tr15/ gustavo it's about normalization ... pbor I have that doubt because I once asked owen what would be necessary to move the caseless search to GtkTextView... pbor and he said he would not accept a patch which normalizes the full string (for memory consumption reasons) pbor so he pointed me to a bug open in bugzilla where it was requested a way to do normalization incrementally * pbor looks for the bug pbor http://bugzilla.gnome.org/show_bug.cgi?id=123406
what about implementing g_utf8_caselessmatch in gustavo's patch in the following way (pseudo code follows). It would avoid doing char by char normalization which I suspect it'snot utf8 correct... do you think the performance would be really really bad? pseudocode: s1norm = casefold and normalize (s1) s2norm = casefold and normalize (s2) if !strncmp(s1norm, s2norm, strlen (s2norm)) return FALSE // no match // match!, now we must find out where the match ends in s1 p = s1 + strlen (s2norm) tmp = casefold and normalize (p - s1) strncmp (tmp, s2norm, strlen (s2norm)) ok? return No? try with p moved one char back and moved one char forward Still No? Move p by two chars, 3 chars etc until we moved it by the same len of s2norm Of course it leads to a really slow worst case, but - the common case is pretty fast - s2 is supposed to be fairly short, so it's not *that* bad
I have committed the pbor's patches of comment #7 and #14 as a 7emporary solution. We will have to go with the gustavo's solution for the future.
Created attachment 38124 [details] [review] gustavo's patch + alternate caseless match I'm dumping here a first cut of the idea I described above: basically it's gustavo's patch with a different g_utf8_caselessmatch function: *not* much tested (I just tried a couple of searches tomake sure gedit didn't crash)
Created attachment 38145 [details] [review] fix obvious problem in the above patch
What's the status of this bug?
Jeroen: See comment #18
This came up on irc and mclasen pointed out this useful link to a free Boyer-Moore implementation for UTF-8 text: http://crl.nmsu.edu/~mleisher/ucdata.html
*** Bug 539293 has been marked as a duplicate of this bug. ***
*** Bug 563905 has been marked as a duplicate of this bug. ***
This bug seems to be fixed on the current release. Searching for ü in the text of the original report, or for "a" in "ßabc" whether the file is saved on disk or not, selects the right character. Closing.