Bug 168247 – wrong selection of characters when searching for single characters

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 168247 - wrong selection of characters when searching for single characters


Summary:	wrong selection of characters when searching for single characters


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	General
Version:	git master
Hardware:	Other All

Importance:	High critical
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Duplicates:	539293 563905 (view as bug list)
Depends on:
Blocks:	348754

Reported:	2005-02-23 11:49 UTC by Tobias Eitel
Modified:	2011-06-01 00:31 UTC

See Also:
GNOME target:	---
GNOME version:	2.7/2.8

Attachments
patch (883 bytes, patch) 2005-02-25 10:34 UTC, Paolo Borelli	none	Details \| Review
more complete patch (4.18 KB, patch) 2005-02-27 15:01 UTC, Paolo Borelli	none	Details \| Review
Update version of the patch by pbor (4.93 KB, patch) 2005-02-27 18:11 UTC, Paolo Maggi	none	Details \| Review
Simpler algorithm, fixes all the problems described in this bug. (15.59 KB, patch) 2005-02-28 00:30 UTC, Gustavo Giráldez	none	Details \| Review
minimal bandaid (733 bytes, patch) 2005-02-28 09:19 UTC, Paolo Borelli	none	Details \| Review
gustavo's patch + alternate caseless match (13.43 KB, patch) 2005-03-01 21:29 UTC, Paolo Borelli	none	Details \| Review
fix obvious problem in the above patch (13.54 KB, patch) 2005-03-02 10:53 UTC, Paolo Borelli	none	Details \| Review

Description Tobias Eitel 2005-02-23 11:49:56 UTC

Please describe the problem:
Repeated search for single characters sometimes selects wrong characters.

Steps to reproduce:
1. copy the following text into gedit:

Amrum im November

Wer nicht gerade alleine anreist, mietet sich ja fast immer eine Ferienwohnung
oder ein ganzes Haus. Und die meisten dieser Häuser sind komfortabel und
gemütlich eingerichtet, damit man sich als Gast darin wohlfühlt. Und wann fühlt
man sich drinnen am wohlsten? Wenn es draußen regnet und stürmt. Der November
gehört also zu den Monaten, in denen sich die Gemütlichkeit der Amrumer Häuser
am besten auskosten lässt. 


2. Press <CTRL>-F and enter 'ü' (without quotes) as search term.
3. Repeat search with <CTRL>-G or by pressing the 'search' button 4 times. 


Actual results:
In the word 'stürmt' the characters 'rm' are selected.

Expected results:
In the word 'stürmt' the character 'ü' should be selected.

Does this happen every time?
yes

Other information:
Problem occured when searching for Umlauts 'ä,ö,ü' and german 'ß'.  
Same problem when searching for standard characters like 'o'. When searching for
'o' in above text, after the 8th repeat character 'v' is selected instead of 'o'
in word 'November'.

Comment 1 Paolo Borelli 2005-02-23 12:11:16 UTC

which version of gtksourceview do you have?

I seem to recall that this is a problem in gtksourceview that has been fixed in
newer versions...

Comment 2 Tobias Eitel 2005-02-23 12:44:15 UTC

I have libgtksourceview 1.1.1-1 (from debian testing).

Comment 3 Paolo Borelli 2005-02-23 13:11:39 UTC

I think this was fixed in http://bugzilla.gnome.org/show_bug.cgi?id=141756

IIRC GtkSourceView 1.1.91 is currently in debian experimental, it will not
require updated dependencies, can you upgrade and confirm that the bug is fixed?

Comment 4 Tobias Eitel 2005-02-23 18:54:47 UTC

Sorry, I cannot install 1.1.91, but I checked the Changelog of gtksourceview
1.1.1 and found:
2004-07-21  Gustavo Giráldez  <gustavo.giraldez@gmx.net>
* gtksourceview/gtksourceiter.c (forward_chars_with_skipping) 
(pointer_from_offset_skipping_decomp): Use g_utf8_normalize()
instead of g_unicode_canonical_decomposition() to calculate
normalized character lengths.

If this entry refers to Bug 141756 from gtksourceview it should already be
resolved in my version 1.1.1 if there haven't been more changes. 
Maybe someone with the most recent version of gtksourceview can try to reproduce
the bug and report here.

Comment 5 Paolo Maggi 2005-02-23 23:10:53 UTC

It seems 'ß' is the key of the problem.

The same problem can be easily reproduced searching for 'a' in a document
containing only the string "ßabc".
Note that seaching for a in the string "ßa" makes gedit/gtksourceview go in a
endless loop.
I think this is related to the fact the the uppercase of "ß" is "SS".

If you search for "ß" in "SSabc", "SS" is correctly selected, but if you search
for "ß" in "ßabc", "ßa" is selected. At the same way, if you search for "SS" in
"SSabc", "SS" is correctly selected, but if you search for "SS" in "ßabc", "ßa"
is selected.

As abvious, all the previous comments apply only if "Match case" is not selected.

BTW, a part the problem with the endless cycle (that I just discovered), this is
a problem I was aware of (when I implemented the code, it was in the pre-2.0 
period), but that I removed from my mind since it seemed no one was affecting by
it. Now it is time to try again to fix it... but I can anticipate it will be  a
challenging problem ;)

Moving product to GtkSourceView and changing Priority/Severity to High/Critical
due to the endless loop. 
I don't know if I will have time to look at it before the "code freeze" for
2.10, so it will be cool if someone (pbor? Gustavo? random person?) could try to
give a look at it.

Comment 6 Paolo Borelli 2005-02-24 17:16:52 UTC

with regard to the infinite loop... I found a strange thing:

if I open a new file with just "ßa" in it and search for "a" I can reproduce the
infinite loop, but if I save the file to disk it doesn't go in an infinite loop,
but the selction is wrong, it goes from after the "a" to the end of the line.

Comment 7 Paolo Borelli 2005-02-25 10:34:02 UTC

Created attachment 37932 [details] [review]
patch

Yay! I think I nailed it. At least the infinite loop.

Comment 8 Paolo Maggi 2005-02-25 12:05:13 UTC

Where is the magic? Why does it work? (I ask because I'd like to know the
rationale of the fix) 
Does it completely fix all the problems reported in this bug report?
Have you tested it hard? Have you seen if it broke again old bugs? 
We really need a "search test suite" (bug #136434)

Gustavo: may you test it too? What do you think of this solution?

Comment 9 Paolo Borelli 2005-02-25 12:36:39 UTC

the rationale is the follwing: "offset" passed to
pointer_from_offset_skipping_decomp is the number of chars before the match in
the string after both casefold and normalize. The function walks the original
string one character at a time and to find out how much to substract from offset
it has to both casefold and normalize the current char. Before the patch it
didn't casefold the current char, and since casefold may change the lenght of
the string it ended up returning a pointer to the wrong character in the string.

The patch fixes all the problems in this bugreport, but it still needs more
testing to make sure we haven't regressed.

Comment 10 Paolo Borelli 2005-02-27 15:01:39 UTC

Created attachment 38001 [details] [review]
more complete patch

I think the previous patch is right on its own, but it uncovers another
problem: when searching for "ß" in "ßa" both ß and a are selected. I think it
is due to the following line in lines_match()

/* Go to end of search string */
forward_chars_with_skipping (&next, g_utf8_strlen (*lines, -1), visible_only,
!slice, TRUE);

Note that *lines has already been normalized and casefolded so g_utf8_strlen
(*lines, -1) may not be the right number of characters for moving the text iter
.

The solution is to *not* casefold and normalize the lines in strbreakup, but
just do it on the needle in string matching function themselves
(g_utf8_strcasestr and g_utf8_strrcasestr). Note also that
g_utf8_caselessnmatch was already casefolding and normalizing both strings.

Comment 11 Paolo Borelli 2005-02-27 16:10:10 UTC

darn... the latest patch seems to break multiline search in some cases for
instance when searching

"I have also noticed this with the ´ (&prime; in html) character.  In this case,\n
it is reliably that the ´prime´ character and also the „following„ character are\n
selected by the find."

in the testcase of bug #141756

Comment 12 Paolo Maggi 2005-02-27 18:11:01 UTC

Created attachment 38013 [details] [review]
Update version of the patch by pbor

Comment 13 Gustavo Giráldez 2005-02-28 00:30:05 UTC

Created attachment 38033 [details] [review]
Simpler algorithm, fixes all the problems described in this bug.

Taking the idea of returning the start and end of the match from the original
haystack from the last patch by Paolo Borelli, I've come up with a much simpler
approach.

It's probably slower than the previous implementation, but it works ok. As far
as I could test it, it doesn't regress any of the previous search bugs and
fixes the issues described here.

The core of the algorithm is g_utf8_caselessmatch() which works by taking a
character at a time from both strings and comparing the normalized versions of
them. The algorithm also works ok in comparing a decomposed character against a
non-decomposed character.

I hope there's still time before the freeze to get this one in.

Comment 14 Paolo Borelli 2005-02-28 09:19:00 UTC

Created attachment 38041 [details] [review]
minimal bandaid

I have not looked deeply in Gustavo's patch, but I am a bit worried of totally
rewriting the algorithm few hours before the freeze... after all we loived with
this bug for a few releases.

Maybe for 1.2.0 we should apply this minimal bandaid which just avoids the
infinite loop without fixing the real bug.

Comment 15 Paolo Maggi 2005-02-28 22:48:36 UTC

See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for more info about
Unicode caseless matching

Comment 16 Paolo Borelli 2005-02-28 22:56:25 UTC

a couple more datapoints.

small snippet from a irc discussion:

pbor gustavo: forgive my utf8 ignorance, but I have a cople of questions about
your patch... are you sure that normalizing character by character always yelds
the rigth results?
pbor I mean, aren't there some sequences of characters the if combined together
can give a different normalized result?
gustavo pbor: well, i'm not entirely sure, but we already did that before :-)
pbor well, afaics before we did it only once we already found the match with
strncmp, so it may have been way more difficult to hit the wrong case...

...

gustavo just found this: http://www.unicode.org/unicode/reports/tr15/
gustavo it's about normalization

...

pbor I have that doubt because I once asked owen what would be necessary to move
the caseless search to GtkTextView...
pbor and he said he would not accept a patch which normalizes the full string
(for memory consumption reasons)
pbor so he pointed me to a bug open in bugzilla where it was requested a way to
do normalization incrementally
* pbor looks for the bug
pbor http://bugzilla.gnome.org/show_bug.cgi?id=123406

Comment 17 Paolo Borelli 2005-03-01 10:26:35 UTC

what about implementing g_utf8_caselessmatch in gustavo's patch in the following
way (pseudo code follows).

It would avoid doing char by char normalization which I suspect it'snot utf8
correct... do you think the performance would be really really bad?


pseudocode:

s1norm = casefold and normalize (s1)
s2norm = casefold and normalize (s2)

if !strncmp(s1norm, s2norm, strlen (s2norm))
        return FALSE // no match

// match!, now we must find out where the match ends in s1

p = s1 + strlen (s2norm)

tmp = casefold and normalize (p - s1)
strncmp (tmp, s2norm, strlen (s2norm))

ok? return

No? try with p moved one char back and moved one char forward

Still No? Move p by two chars, 3 chars etc until we moved it by the same len of
s2norm

Of course it leads to a really slow worst case, but
 - the common case is pretty fast
 - s2 is supposed to be fairly short, so it's not *that* bad

Comment 18 Paolo Maggi 2005-03-01 10:43:08 UTC

I have committed the pbor's patches of comment #7 and #14 as a 7emporary solution.
We will have to go with the gustavo's solution for the future.

Comment 19 Paolo Borelli 2005-03-01 21:29:24 UTC

Created attachment 38124 [details] [review]
gustavo's patch + alternate caseless match

I'm dumping here a first cut of the idea I described above: basically it's
gustavo's patch with a different g_utf8_caselessmatch function: *not* much
tested (I just tried a couple of searches tomake sure gedit didn't crash)

Comment 20 Paolo Borelli 2005-03-02 10:53:09 UTC

Created attachment 38145 [details] [review]
fix obvious problem in the above patch

Comment 21 Jeroen Zwartepoorte 2005-07-25 14:13:02 UTC

What's the status of this bug?

Comment 22 Paolo Maggi 2005-07-25 14:21:45 UTC

Jeroen: See comment #18

Comment 23 Paolo Borelli 2005-12-07 18:10:53 UTC

This came up on irc and mclasen pointed out this useful link to a free
Boyer-Moore implementation for UTF-8 text: http://crl.nmsu.edu/~mleisher/ucdata.html

Comment 24 Ignacio Casal Quinteiro (nacho) 2009-12-30 16:27:10 UTC

*** Bug 539293 has been marked as a duplicate of this bug. ***

Comment 25 Ignacio Casal Quinteiro (nacho) 2009-12-31 14:48:25 UTC

*** Bug 563905 has been marked as a duplicate of this bug. ***

Comment 26 Carnë Draug 2011-06-01 00:31:25 UTC

This bug seems to be fixed on the current release.

Searching for ü in the text of the original report, or for "a" in "ßabc" whether the file is saved on disk or not, selects the right character.

Closing.