Bug 496780 – Replace of ligatures replaces two letters not one


Summary:	Replace of ligatures replaces two letters not one


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	General
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-11-14 16:06 UTC by Aidan Delaney
Modified:	2008-11-29 22:15 UTC

See Also:
GNOME target:	---
GNOME version:	2.21/2.22

Description Aidan Delaney 2007-11-14 16:06:05 UTC

Please describe the problem:
Using the Replace functionality (Ctrl+H) on a UTF-8 document containing common ligatures (e.g. ﬀ [U+FB00 LATIN SMALL LIGATURE FF] or ﬁ [U+FB01 LATIN SMALL LIGATURE FI]), if one tries to replace the ligature itself with two letters, then both the ligature and the letter after it are replaced.

Steps to reproduce:
1. Create a UTF-8 file containing the word "diﬀerent" where it is spelt as d-i-ff-e-r-e-n-t where 'ff' is the ligature (I'm trying to write in ASCII so that there are not encoding screw-ups).
2. Hit Ctrl+h
3. Replace 'ff' (the ligature) with 'f-f' (two f letters).
4. Notice that the word is now spelt d-i-f-f-r-e-n-t. Replacing 'ff' with 'f-f' has replaced two characters rather than one.
5. This does not happen if one tries to replace the letter 'd' with 'a-b'.


Actual results:


Expected results:
I'd expect the word d-i-ff-e-r-e-n-t to be spelt d-i-f-f-e-r-e-n-t.

Does this happen every time?
Yes, when concerning ligatures.

Other information:
I have only tried this in the following locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_PAPER="en_IE.UTF-8"
LC_NAME="en_IE.UTF-8"
LC_ADDRESS="en_IE.UTF-8"
LC_TELEPHONE="en_IE.UTF-8"
LC_MEASUREMENT="en_IE.UTF-8"
LC_IDENTIFICATION="en_IE.UTF-8"
LC_ALL=

Comment 1 Paolo Borelli 2008-11-29 13:52:50 UTC

Reassigning to gtksourceview; search and replace is implemented there.

Comment 2 Yevgen Muntyan 2008-11-29 16:03:47 UTC

The bug is caused by the following: if you have that small double f thing, then g_utf8_normalize() on it gives you one double-f character, while g_utf8_normalize(g_utf8_casefold()) gives two latin f characters. And gtksourceview code doesn't call g_utf8_casefold() where it computes the end of the match (inside forward_chars_with_skipping(), that's the guilty place).

Now, is it a glib bug? We can fix it in gtksourceview or do nothing, depending on what glib thinks. CC'ing Owen.

You can try the following code to see what I am talking about:

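#include <glib.h>
int main(void)
{
    const char *s = "\357\254\200"; /* U+FB00, the double-f thingie */
    char *norm = g_utf8_normalize (s, -1, G_NORMALIZE_NFD);
    char *folded = g_utf8_casefold (s, -1);
    char *folded_norm = g_utf8_normalize (folded, -1, G_NORMALIZE_NFD);

    g_print ("%s\n", norm);        /* normalized only */
    g_print ("%s\n", folded_norm); /* casefolded, then normalized */
    /* g_utf8_strlen() returns a glong, so print it with %ld */
    g_print ("%ld\n", g_utf8_strlen (norm, -1));
    g_print ("%ld\n", g_utf8_strlen (folded_norm, -1));

    /* the glib functions above return newly allocated strings */
    g_free (norm);
    g_free (folded);
    g_free (folded_norm);
    return 0;
}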

Comment 3 Behdad Esfahbod 2008-11-29 19:30:21 UTC

The glib behavior is consistent with what Unicode defines, and is expected.

Comment 4 Yevgen Muntyan 2008-11-29 22:15:54 UTC

2008-11-29  Yevgen Muntyan  <muntyan@tamu.edu>

	* gtksourceview/gtksourceiter.c: (forward_chars_with_skipping):
	call g_utf8_casefold() before normalizing, to match behavior
	of the search code. Bug #496780.
	* tests/test-widget.c: added Find command.