GNOME Bugzilla – Bug 496780
Replace of ligatures replaces two letters not one
Last modified: 2008-11-29 22:15:54 UTC
Please describe the problem:
Using the Replace functionality (Ctrl+h) on a UTF-8 document containing common ligatures (e.g. ff [U+FB00 LATIN SMALL LIGATURE FF] or fi [U+FB01 LATIN SMALL LIGATURE FI]), if one tries to replace the ligature itself with two letters, then both the ligature and the letter after the ligature are replaced.

Steps to reproduce:
1. Create a UTF-8 file containing the word "different", spelt d-i-ff-e-r-e-n-t, where 'ff' is the ligature (I'm writing in ASCII here so that there are no encoding screw-ups).
2. Hit Ctrl+h.
3. Replace 'ff' (the ligature) with 'f-f' (two f letters).
4. Notice that the word is now spelt d-i-f-f-r-e-n-t. Replacing 'ff' with 'f-f' has replaced two characters rather than one.
5. This does not happen if one replaces the letter 'd' with 'a-b'.

Actual results:

Expected results:
I'd expect the word d-i-ff-e-r-e-n-t to be spelt d-i-f-f-e-r-e-n-t.

Does this happen every time?
Yes, when ligatures are involved.

Other information:
I have only tried this in the following locale:
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
LC_NUMERIC="en_IE.UTF-8"
LC_TIME="en_IE.UTF-8"
LC_COLLATE="en_IE.UTF-8"
LC_MONETARY="en_IE.UTF-8"
LC_MESSAGES="en_IE.UTF-8"
LC_PAPER="en_IE.UTF-8"
LC_NAME="en_IE.UTF-8"
LC_ADDRESS="en_IE.UTF-8"
LC_TELEPHONE="en_IE.UTF-8"
LC_MEASUREMENT="en_IE.UTF-8"
LC_IDENTIFICATION="en_IE.UTF-8"
LC_ALL=
Reassigning to gtksourceview; search and replace is implemented there.
The bug is caused by the following: if you have that small double-f thing, then g_utf8_normalize() on it gives you one double-f character, while g_utf8_normalize(g_utf8_casefold()) gives two Latin f characters. The gtksourceview code doesn't call g_utf8_casefold() where it computes the end of the match (inside forward_chars_with_skipping(), that's the guilty place), so the match length is counted differently from how the search itself compared the text.

Now, is it a glib bug? We can fix it in gtksourceview or do nothing, depending on what glib thinks. CC'ing Owen.

You can try the following code to see what I am talking about:

    #include <glib.h>

    int main (void)
    {
        const char *s = "\357\254\200"; /* double-f thingie, U+FB00 */
        char *norm = g_utf8_normalize (s, -1, G_NORMALIZE_NFD);
        char *folded = g_utf8_casefold (s, -1);
        char *folded_norm = g_utf8_normalize (folded, -1, G_NORMALIZE_NFD);

        g_print ("%s\n", norm);
        g_print ("%s\n", folded_norm);
        /* g_utf8_strlen() returns a glong, so print it with %ld */
        g_print ("%ld\n", g_utf8_strlen (norm, -1));
        g_print ("%ld\n", g_utf8_strlen (folded_norm, -1));

        g_free (norm);
        g_free (folded);
        g_free (folded_norm);
        return 0;
    }
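To make the mismatch concrete, here is a minimal, self-contained C sketch of the end-of-match computation. It is not the gtksourceview code: folded_length() and match_end() are hypothetical names, the text is held as an array of code points rather than UTF-8, and the small ligature table only stands in for what g_utf8_casefold() followed by normalization would report for those characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "case fold, then normalize": how many
 * characters one code point expands to. Only a few Latin ligatures
 * are listed; everything else is assumed to stay one character. */
static size_t folded_length(uint32_t cp)
{
    switch (cp) {
    case 0xFB00: /* LATIN SMALL LIGATURE FF */
    case 0xFB01: /* LATIN SMALL LIGATURE FI */
    case 0xFB02: /* LATIN SMALL LIGATURE FL */
        return 2;
    case 0xFB03: /* LATIN SMALL LIGATURE FFI */
    case 0xFB04: /* LATIN SMALL LIGATURE FFL */
        return 3;
    default:
        return 1;
    }
}

/* Return the index one past the last buffer character covered by a
 * match that starts at `start` and spans `folded_count` characters of
 * the folded search key. With casefold == 0 every buffer character is
 * counted as one, which is the mismatch described in this bug. */
static size_t match_end(const uint32_t *text, size_t text_len, size_t start,
                        size_t folded_count, int casefold)
{
    size_t pos = start;
    size_t covered = 0;

    assert(start <= text_len);
    while (pos < text_len && covered < folded_count) {
        covered += casefold ? folded_length(text[pos]) : 1;
        pos++;
    }
    return pos;
}
```

For the word from the report stored as { 'd', 'i', 0xFB00, 'e', 'r', 'e', 'n', 't' }, the search key (the ligature) folds to two characters. Calling match_end(text, 8, 2, 2, 0) returns 4, so the match swallows the ligature and the following 'e'; calling match_end(text, 8, 2, 2, 1) returns 3, so only the ligature is replaced.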
The glib behavior is consistent with what Unicode defines, and it is the expected behavior.
2008-11-29  Yevgen Muntyan  <muntyan@tamu.edu>

	* gtksourceview/gtksourceiter.c: (forward_chars_with_skipping):
	call g_utf8_casefold() before normalizing, to match behavior of
	the search code. Bug #496780.

	* tests/test-widget.c: added Find command.