GNOME Bugzilla – Bug 168247
wrong selection of characters when searching for single characters
Last modified: 2011-06-01 00:31:25 UTC
Please describe the problem:
Repeated search for single characters sometimes selects wrong characters.
Steps to reproduce:
1. copy the following text into gedit:
Amrum im November
Wer nicht gerade alleine anreist, mietet sich ja fast immer eine Ferienwohnung
oder ein ganzes Haus. Und die meisten dieser Häuser sind komfortabel und
gemütlich eingerichtet, damit man sich als Gast darin wohlfühlt. Und wann fühlt
man sich drinnen am wohlsten? Wenn es draußen regnet und stürmt. Der November
gehört also zu den Monaten, in denen sich die Gemütlichkeit der Amrumer Häuser
am besten auskosten lässt.
2. Press <CTRL>-F and enter 'ü' (without quotes) as search term.
3. Repeat search with <CTRL>-G or by pressing the 'search' button 4 times.
In the word 'stürmt' the characters 'rm' are selected.
In the word 'stürmt' the character 'ü' should be selected.
Does this happen every time?
Problem occured when searching for Umlauts 'ä,ö,ü' and german 'ß'.
Same problem when searching for standard characters like 'o'. When searching for
'o' in above text, after the 8th repeat character 'v' is selected instead of 'o'
in word 'November'.
which version of gtksourceview do you have?
I seem to recall that this is a problem in gtksourceview that has been fixed in
I have libgtksourceview 1.1.1-1 (from debian testing).
I think this was fixed in http://bugzilla.gnome.org/show_bug.cgi?id=141756
IIRC GtkSourceView 1.1.91 is currently in debian experimental, it will not
require updated dependencies, can you upgrade and confirm that the bug is fixed?
Sorry, I cannot install 1.1.91, but I checked the Changelog of gtksourceview
1.1.1 and found:
2004-07-21 Gustavo Giráldez <firstname.lastname@example.org>
* gtksourceview/gtksourceiter.c (forward_chars_with_skipping)
(pointer_from_offset_skipping_decomp): Use g_utf8_normalize()
instead of g_unicode_canonical_decomposition() to calculate
normalized character lengths.
If this entry refers to Bug 141756 from gtksourceview it should already be
resolved in my version 1.1.1 if there haven't been more changes.
Maybe someone with the most recent version of gtksourceview can try to reproduce
the bug and report here.
It seems 'ß' is the key of the problem.
The same problem can be easily reproduced searching for 'a' in a document
containing only the string "ßabc".
Note that seaching for a in the string "ßa" makes gedit/gtksourceview go in a
I think this is related to the fact the the uppercase of "ß" is "SS".
If you search for "ß" in "SSabc", "SS" is correctly selected, but if you search
for "ß" in "ßabc", "ßa" is selected. At the same way, if you search for "SS" in
"SSabc", "SS" is correctly selected, but if you search for "SS" in "ßabc", "ßa"
As abvious, all the previous comments apply only if "Match case" is not selected.
BTW, a part the problem with the endless cycle (that I just discovered), this is
a problem I was aware of (when I implemented the code, it was in the pre-2.0
period), but that I removed from my mind since it seemed no one was affecting by
it. Now it is time to try again to fix it... but I can anticipate it will be a
challenging problem ;)
Moving product to GtkSourceView and changing Priority/Severity to High/Critical
due to the endless loop.
I don't know if I will have time to look at it before the "code freeze" for
2.10, so it will be cool if someone (pbor? Gustavo? random person?) could try to
give a look at it.
with regard to the infinite loop... I found a strange thing:
if I open a new file with just "ßa" in it and search for "a" I can reproduce the
infinite loop, but if I save the file to disk it doesn't go in an infinite loop,
but the selction is wrong, it goes from after the "a" to the end of the line.
Created attachment 37932 [details] [review]
Yay! I think I nailed it. At least the infinite loop.
Where is the magic? Why does it work? (I ask because I'd like to know the
rationale of the fix)
Does it completely fix all the problems reported in this bug report?
Have you tested it hard? Have you seen if it broke again old bugs?
We really need a "search test suite" (bug #136434)
Gustavo: may you test it too? What do you think of this solution?
the rationale is the follwing: "offset" passed to
pointer_from_offset_skipping_decomp is the number of chars before the match in
the string after both casefold and normalize. The function walks the original
string one character at a time and to find out how much to substract from offset
it has to both casefold and normalize the current char. Before the patch it
didn't casefold the current char, and since casefold may change the lenght of
the string it ended up returning a pointer to the wrong character in the string.
The patch fixes all the problems in this bugreport, but it still needs more
testing to make sure we haven't regressed.
Created attachment 38001 [details] [review]
more complete patch
I think the previous patch is right on its own, but it uncovers another
problem: when searching for "ß" in "ßa" both ß and a are selected. I think it
is due to the following line in lines_match()
/* Go to end of search string */
forward_chars_with_skipping (&next, g_utf8_strlen (*lines, -1), visible_only,
Note that *lines has already been normalized and casefolded so g_utf8_strlen
(*lines, -1) may not be the right number of characters for moving the text iter
The solution is to *not* casefold and normalize the lines in strbreakup, but
just do it on the needle in string matching function themselves
(g_utf8_strcasestr and g_utf8_strrcasestr). Note also that
g_utf8_caselessnmatch was already casefolding and normalizing both strings.
darn... the latest patch seems to break multiline search in some cases for
instance when searching
"I have also noticed this with the ´ (′ in html) character. In this case,\n
it is reliably that the ´prime´ character and also the „following„ character are\n
selected by the find."
in the testcase of bug #141756
Created attachment 38013 [details] [review]
Update version of the patch by pbor
Created attachment 38033 [details] [review]
Simpler algorithm, fixes all the problems described in this bug.
Taking the idea of returning the start and end of the match from the original
haystack from the last patch by Paolo Borelli, I've come up with a much simpler
It's probably slower than the previous implementation, but it works ok. As far
as I could test it, it doesn't regress any of the previous search bugs and
fixes the issues described here.
The core of the algorithm is g_utf8_caselessmatch() which works by taking a
character at a time from both strings and comparing the normalized versions of
them. The algorithm also works ok in comparing a decomposed character against a
I hope there's still time before the freeze to get this one in.
Created attachment 38041 [details] [review]
I have not looked deeply in Gustavo's patch, but I am a bit worried of totally
rewriting the algorithm few hours before the freeze... after all we loived with
this bug for a few releases.
Maybe for 1.2.0 we should apply this minimal bandaid which just avoids the
infinite loop without fixing the real bug.
See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for more info about
Unicode caseless matching
a couple more datapoints.
small snippet from a irc discussion:
pbor gustavo: forgive my utf8 ignorance, but I have a cople of questions about
your patch... are you sure that normalizing character by character always yelds
the rigth results?
pbor I mean, aren't there some sequences of characters the if combined together
can give a different normalized result?
gustavo pbor: well, i'm not entirely sure, but we already did that before :-)
pbor well, afaics before we did it only once we already found the match with
strncmp, so it may have been way more difficult to hit the wrong case...
gustavo just found this: http://www.unicode.org/unicode/reports/tr15/
gustavo it's about normalization
pbor I have that doubt because I once asked owen what would be necessary to move
the caseless search to GtkTextView...
pbor and he said he would not accept a patch which normalizes the full string
(for memory consumption reasons)
pbor so he pointed me to a bug open in bugzilla where it was requested a way to
do normalization incrementally
* pbor looks for the bug
what about implementing g_utf8_caselessmatch in gustavo's patch in the following
way (pseudo code follows).
It would avoid doing char by char normalization which I suspect it'snot utf8
correct... do you think the performance would be really really bad?
s1norm = casefold and normalize (s1)
s2norm = casefold and normalize (s2)
if !strncmp(s1norm, s2norm, strlen (s2norm))
return FALSE // no match
// match!, now we must find out where the match ends in s1
p = s1 + strlen (s2norm)
tmp = casefold and normalize (p - s1)
strncmp (tmp, s2norm, strlen (s2norm))
No? try with p moved one char back and moved one char forward
Still No? Move p by two chars, 3 chars etc until we moved it by the same len of
Of course it leads to a really slow worst case, but
- the common case is pretty fast
- s2 is supposed to be fairly short, so it's not *that* bad
I have committed the pbor's patches of comment #7 and #14 as a 7emporary solution.
We will have to go with the gustavo's solution for the future.
Created attachment 38124 [details] [review]
gustavo's patch + alternate caseless match
I'm dumping here a first cut of the idea I described above: basically it's
gustavo's patch with a different g_utf8_caselessmatch function: *not* much
tested (I just tried a couple of searches tomake sure gedit didn't crash)
Created attachment 38145 [details] [review]
fix obvious problem in the above patch
What's the status of this bug?
Jeroen: See comment #18
This came up on irc and mclasen pointed out this useful link to a free
Boyer-Moore implementation for UTF-8 text: http://crl.nmsu.edu/~mleisher/ucdata.html
*** Bug 539293 has been marked as a duplicate of this bug. ***
*** Bug 563905 has been marked as a duplicate of this bug. ***
This bug seems to be fixed on the current release.
Searching for ü in the text of the original report, or for "a" in "ßabc" whether the file is saved on disk or not, selects the right character.