Bug 496780 - Replace of ligatures replaces two letters not one
Status: RESOLVED FIXED
Product: gtksourceview
Classification: Platform
Component: General
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: GTK Sourceview maintainers
QA Contact: GTK Sourceview maintainers
Depends on:
Blocks:
 
 
Reported: 2007-11-14 16:06 UTC by Aidan Delaney
Modified: 2008-11-29 22:15 UTC
See Also:
GNOME target: ---
GNOME version: 2.21/2.22



Description Aidan Delaney 2007-11-14 16:06:05 UTC
Please describe the problem:
Using the Replace functionality (Ctrl+h) on a UTF-8 document containing common ligatures (e.g. ff [U+FB00 LATIN SMALL LIGATURE FF] or fi [U+FB01 LATIN SMALL LIGATURE FI]): if one tries to replace the ligature itself with two letters, then both the ligature and the letter after it are replaced.

Steps to reproduce:
1. Create a UTF-8 file containing the word "different" where it is spelt as d-i-ff-e-r-e-n-t where 'ff' is the ligature (I'm trying to write in ASCII so that there are not encoding screw-ups).
2. Hit Ctrl+h
3. Replace 'ff' (the ligature) with 'f-f' (two f letters).
4. Notice that the word is now spelt d-i-f-f-r-e-n-t. Replacing 'ff' with 'f-f' has replaced two characters rather than one.
5. This does not happen if one replaces the letter 'd' with 'a-b'.


Actual results:


Expected results:
I'd expect the word d-i-ff-e-r-e-n-t to be spelt d-i-f-f-e-r-e-n-t.

Does this happen every time?
Yes, when concerning ligatures.

Other information:
I have only tried this in the following locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_PAPER="en_IE.UTF-8"
LC_NAME="en_IE.UTF-8"
LC_ADDRESS="en_IE.UTF-8"
LC_TELEPHONE="en_IE.UTF-8"
LC_MEASUREMENT="en_IE.UTF-8"
LC_IDENTIFICATION="en_IE.UTF-8"
LC_ALL=
Comment 1 Paolo Borelli 2008-11-29 13:52:50 UTC
reassigning to gtksourceview, search&replace is implemented there.
Comment 2 Yevgen Muntyan 2008-11-29 16:03:47 UTC
The bug is caused by the following: if you have that small double f thing, then g_utf8_normalize() on it gives you one double-f character, while g_utf8_normalize(g_utf8_casefold()) gives two latin f characters. And gtksourceview code doesn't call g_utf8_casefold() where it computes the end of the match (inside forward_chars_with_skipping(), that's the guilty place).

Now, is it a glib bug? We can fix it in gtksourceview or do nothing, depending on what glib thinks. CC'ing Owen.

You can try the following code to see what I am talking about:

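#include <glib.h>

int main(void)
{
    const char *s = "\357\254\200"; /* U+FB00 LATIN SMALL LIGATURE FF */
    /* Normalization alone keeps the ligature as a single character. */
    char *plain = g_utf8_normalize (s, -1, G_NORMALIZE_NFD);
    /* Case folding first turns it into two latin f characters. */
    char *folded = g_utf8_casefold (s, -1);
    char *folded_norm = g_utf8_normalize (folded, -1, G_NORMALIZE_NFD);

    g_print ("%s\n", plain);
    g_print ("%s\n", folded_norm);
    /* g_utf8_strlen() returns a glong, so print it with %ld */
    g_print ("%ld\n", g_utf8_strlen (plain, -1));
    g_print ("%ld\n", g_utf8_strlen (folded_norm, -1));

    g_free (plain); g_free (folded); g_free (folded_norm);
    return 0;
}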
Comment 3 Behdad Esfahbod 2008-11-29 19:30:21 UTC
The glib behavior is consistent with what Unicode defines and is expected.
Comment 4 Yevgen Muntyan 2008-11-29 22:15:54 UTC
2008-11-29  Yevgen Muntyan  <muntyan@tamu.edu>

	* gtksourceview/gtksourceiter.c: (forward_chars_with_skipping):
	call g_utf8_casefold() before normalizing, to match behavior
	of the search code. Bug #496780.
	* tests/test-widget.c: added Find command.