GNOME Bugzilla – Bug 700103
wrong word boundary detection around MidLetter characters defined by Unicode UAX TR29
Last modified: 2018-05-22 13:08:52 UTC
Steps to reproduce:
1.- Open gedit, GIMP, or any other GNOME application with a text edit field.
2.- Type any string containing "l·l", such as "goril·les" or "paral·leles".
3.- Double-click on the word.

Expected result: the whole word is selected ("goril·les" or "paral·leles").

Obtained result: the word is segmented around the "·" character, so only the first or second part of the word is selected ("goril" or "les", if you typed "goril·les").

Tested on Windows (GIMP) and GNOME (gedit and GIMP).

AFAIK, Pango follows [1] the Unicode Text Segmentation algorithm [2]. According to [2], the "·" character (U+00B7) is a MidLetter, and rules WB6 and WB7 forbid a word break here. So I don't know where the problem is, but there is one somewhere.

This bug is annoying because it affects not only double-click selection but also spell-checking. For instance, "goril·les" is split and two words are passed to the spell-checker, "goril" and "les", so no good spell-checking is possible for Catalan.

See related bug 610106

[1] https://git.gnome.org/browse/pango/tree/pango/break.c?id=5b38ec2ff9f26b0e3204ba79c1d1b5c0d2b92edb#n20
[2] http://www.unicode.org/reports/tr29/
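For readers unfamiliar with the rules cited above, here is a minimal sketch of how WB6 and WB7 keep a MidLetter between two letters inside one word. It is not Pango's code. The function names are made up for illustration, `is_aletter` only recognizes ASCII letters as a stand-in for the real ALetter property, and the MidLetter table holds only the code points mentioned in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: ASCII letters stand in for Word_Break=ALetter.
 * A real implementation would consult the Unicode property data. */
static bool is_aletter(uint32_t cp)
{
    return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
}

/* Only the MidLetter code points mentioned in this bug report. */
static bool is_midletter(uint32_t cp)
{
    static const uint32_t midletter[] = {
        0x003A, 0x00B7, 0x02D7, 0x0387, 0x05F4,
        0x2027, 0xFE13, 0xFE55, 0xFF1A
    };
    for (size_t i = 0; i < sizeof midletter / sizeof midletter[0]; i++)
        if (midletter[i] == cp)
            return true;
    return false;
}

/* Hypothetical helper: returns true if a word break is allowed between
 * text[pos - 1] and text[pos]. Only WB5, WB6, WB7 and the "break
 * everywhere else" fallback are modeled. */
bool break_allowed(const uint32_t *text, size_t len, size_t pos)
{
    assert(text != NULL);

    /* Start and end of text are always boundaries. */
    if (pos == 0 || pos >= len)
        return true;

    uint32_t prev = text[pos - 1];
    uint32_t next = text[pos];

    /* WB5: ALetter x ALetter */
    if (is_aletter(prev) && is_aletter(next))
        return false;

    /* WB6: ALetter x MidLetter ALetter */
    if (is_aletter(prev) && is_midletter(next) &&
        pos + 1 < len && is_aletter(text[pos + 1]))
        return false;

    /* WB7: ALetter MidLetter x ALetter */
    if (pos >= 2 && is_aletter(text[pos - 2]) &&
        is_midletter(prev) && is_aletter(next))
        return false;

    /* Otherwise, break. */
    return true;
}
```

With "goril·les" stored as an array of code points, `break_allowed` returns false both before and after U+00B7, which is the behavior expected from double-click selection and the spell-checker.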
(In reply to comment #0)
> .. See related bug 610106...

Oops, I meant bug 692156, sorry.
Created attachment 259800 [details]
MidLetter demo test for UAX TR29

This little program shows the wrong boundary detection around MidLetter characters defined in UAX TR29 http://www.unicode.org/reports/tr29/#MidLetter
I've found the same problem with all MidLetter characters defined in Unicode UAX TR29. See attachment 259800 [details] or just:

1.- Open gedit.
2.- Paste the following text:
----------8<--8<--8<----------
U+00B7--> abc·def
U+0387--> abc·def
U+05F4 --> abc״def
U+2027 --> abc‧def
U+003A --> abc:def
U+FE13 --> abc︓def
U+FE55 --> abc﹕def
U+FF1A --> abc:def
U+02D7 --> abc˗def
----------8<--8<--8<----------
3.- Try to select those "abc?def" strings with a double-click, or try Ctrl + arrows.

Such strings are not treated as a single word.
Created attachment 265078 [details]
test pango word boundaries

pango-only test
Running the attached program makes it clear that Pango marks word starts and ends wrongly, and that they are not consistent with the word boundaries.

./test_pango
Enter a test string (finish with RET):
abc·def abc״def abc‧def abc:def abc︓def abc﹕def abc:def abc˗def
Language is "ca-es".
Mark character is "|".
Word start:
|abc·|def |abc״|def |abc‧|def |abc:|def |abc︓|def |abc﹕|def |abc:|def |abc˗|def
Word end:
abc|·def| abc|״def| abc|‧def| abc|:def| abc|︓def| abc|﹕def| abc|:def| abc|˗def|
Word boundary:
|abc·def| |abc״def| |abc‧def| |abc:def| |abc︓def| |abc﹕def| |abc:def| |abc|˗|def|
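The mismatch in that output can be stated as a simple check: a position marked as a word start or word end should also be marked as a word boundary. The sketch below is not the attached test program. `struct word_flags` and `count_inconsistent` are hypothetical names, and the struct only mirrors the three `PangoLogAttr` flags of the same names so that the example compiles without Pango.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the three PangoLogAttr flags used here. */
struct word_flags {
    bool is_word_start;
    bool is_word_end;
    bool is_word_boundary;
};

/* Counts positions that are flagged as a word start or a word end
 * but are not flagged as a word boundary. */
size_t count_inconsistent(const struct word_flags *flags, size_t n)
{
    assert(flags != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        bool start_or_end = flags[i].is_word_start || flags[i].is_word_end;
        if (start_or_end && !flags[i].is_word_boundary)
            count++;
    }
    return count;
}
```

For "abc·def" as reported above (starts at positions 0 and 4, ends at 3 and 7, boundaries at 0 and 7), this check flags positions 3 and 4, the two sides of the "·" character.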
*** Bug 692156 has been marked as a duplicate of this bug. ***
Fixing this bug is required to fix bug 131576, an extremely annoying bug that breaks spell-checking for many common words in gedit and other applications. I hope it can be resolved soon, as this bug has been outstanding since 2004!
Anyone who has a week to hack on this can fix it.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/218.