Bug 700103 - wrong word boundary detection around MidLetter characters defined by Unicode UAX TR29
Status: RESOLVED OBSOLETE
Product: pango
Classification: Platform
Component: coretext
Version: unspecified
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: gtk-quartz maintainers
QA Contact: pango-maint
Duplicates: 692156
Depends on: 97545
Blocks: 692156
 
 
Reported: 2013-05-10 19:08 UTC by jmontane
Modified: 2018-05-22 13:08 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
MidLetter demo test for UAX TR29 (2.91 KB, text/x-csrc)
2013-11-14 12:13 UTC, jmontane
test pango word boundaries (2.95 KB, text/x-csrc)
2013-12-31 16:52 UTC, jmontane

Description jmontane 2013-05-10 19:08:03 UTC
1.- Open gedit, GIMP, or any other GNOME application with an editable text field.
2.- Type any string containing "l·l", such as "goril·les" or "paral·leles".
3.- Double-click the word.

Expected result: the whole word should be selected ("goril·les" or "paral·leles")

Obtained: the word is split around the "·" character, so only the first or second part of the word is selected ("goril" or "les", if you typed "goril·les").

Tested in Windows (GIMP) and GNOME (gedit and GIMP).

AFAIK, Pango follows [1] the Unicode Text Segmentation algorithm [2].

According to [2], the "·" character (U+00B7) is a MidLetter, and rules WB6 and WB7 forbid a word break on either side of it when it sits between two letters.
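
To make the expectation concrete, here is a minimal Python sketch of rules WB5, WB6 and WB7 from [2]. It is only an illustration and not Pango's code: the names `MID_LETTER`, `break_allowed` and `split_words` are made up for this example, the ALetter class is approximated with `str.isalpha()`, the MidLetter set is just the list of code points discussed in this report, and every rule other than WB5 to WB7 is collapsed into "break everywhere else".

```python
# MidLetter code points discussed in this report (assumption: this is not
# the complete, current UAX #29 property data).
MID_LETTER = {0x00B7, 0x0387, 0x05F4, 0x2027, 0x003A,
              0xFE13, 0xFE55, 0xFF1A, 0x02D7}


def is_letter(ch):
    # Rough stand-in for the ALetter class: any alphabetic character.
    return ch.isalpha()


def is_mid_letter(ch):
    return ord(ch) in MID_LETTER


def break_allowed(text, i):
    """Return True if a word break is allowed between text[i-1] and text[i]."""
    if i <= 0 or i >= len(text):
        return True  # WB1/WB2: always break at start and end of text
    prev, nxt = text[i - 1], text[i]
    if is_letter(prev) and is_letter(nxt):
        return False  # WB5: letter x letter
    if (is_letter(prev) and is_mid_letter(nxt)
            and i + 1 < len(text) and is_letter(text[i + 1])):
        return False  # WB6: letter x MidLetter letter
    if (i >= 2 and is_letter(text[i - 2])
            and is_mid_letter(prev) and is_letter(nxt)):
        return False  # WB7: letter MidLetter x letter
    return True  # simplification: break everywhere else


def split_words(text):
    """Split text at every position where break_allowed() is True."""
    pieces, start = [], 0
    for i in range(1, len(text) + 1):
        if break_allowed(text, i):
            pieces.append(text[start:i])
            start = i
    return pieces


print(split_words("goril·les es"))  # ['goril·les', ' ', 'es']
print(split_words("abc: def"))      # ['abc', ':', ' ', 'def']
```

Under these assumptions "goril·les" stays in one piece, while a MidLetter that is not followed by a letter (the colon in "abc: def") still ends the word.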

So I don't know where the problem is, but there is one somewhere.

This bug is annoying because it affects not only double-click selection but also spell-checking. For instance, "goril·les" is split and two words are passed to the spell-checker, "goril" and "les", so no good spell-checking is possible for Catalan. See related bug 610106

[1] https://git.gnome.org/browse/pango/tree/pango/break.c?id=5b38ec2ff9f26b0e3204ba79c1d1b5c0d2b92edb#n20

[2] http://www.unicode.org/reports/tr29/
Comment 1 jmontane 2013-05-21 06:56:05 UTC
(In reply to comment #0)
> .. See related bug 610106...

Oops, I meant bug 692156, sorry.
Comment 2 jmontane 2013-11-14 12:13:51 UTC
Created attachment 259800 [details]
MidLetter demo test for UAX TR29

This little program shows the wrong boundary detection around MidLetter characters defined in UAX TR29

http://www.unicode.org/reports/tr29/#MidLetter
Comment 3 jmontane 2013-11-14 12:20:02 UTC
I've found the same problem with all MidLetter characters defined in Unicode UAX TR29.

See attachment 259800 [details] or just,

1.- Open gedit
2.- Paste the following text:
----------8<--8<--8<----------
U+00B7--> abc·def
U+0387--> abc·def
U+05F4 --> abc״def
U+2027 --> abc‧def
U+003A --> abc:def
U+FE13 --> abc︓def
U+FE55 --> abc﹕def
U+FF1A --> abc:def
U+02D7 --> abc˗def
----------8<--8<--8<----------

3.- Try to select those "abc?def" strings with double-click or try Ctrl + arrows.

Such strings are not considered as a single word.
Comment 4 jmontane 2013-12-31 16:52:35 UTC
Created attachment 265078 [details]
test pango word boundaries

pango-only test
Comment 5 jmontane 2013-12-31 16:54:30 UTC
Running the attached program makes it clear that Pango marks word starts and ends incorrectly, and that those marks are inconsistent with the word boundaries it reports.


 ./test_pango 
Enter a test string (finish with RET): abc·def abc״def abc‧def abc:def abc︓def abc﹕def abc:def abc˗def
Language is "ca-es".
Mark character is "|".
Word start:         |abc·|def |abc״|def |abc‧|def |abc:|def |abc︓|def |abc﹕|def |abc:|def |abc˗|def
Word end:           abc|·def| abc|״def| abc|‧def| abc|:def| abc|︓def| abc|﹕def| abc|:def| abc|˗def|
Word boundary:      |abc·def| |abc״def| |abc‧def| |abc:def| |abc︓def| |abc﹕def| |abc:def| |abc|˗|def|
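
The inconsistency in the output above can also be checked mechanically. The following Python sketch is only an illustration (the helper names `mark_positions` and `inconsistent_marks` are made up here and are not part of the attached program): it reads the "|" markers from one sample of each output line and reports word starts and word ends that do not fall on a reported word boundary.

```python
MARK = "|"


def mark_positions(marked, mark=MARK):
    """Return the offsets, in the unmarked text, where `mark` was inserted."""
    positions = set()
    offset = 0  # index into the text with the markers removed
    for ch in marked:
        if ch == mark:
            positions.add(offset)
        else:
            offset += 1
    return positions


def inconsistent_marks(starts, ends, boundaries):
    """Return (starts, ends) offsets that are not on a word boundary."""
    on_boundary = mark_positions(boundaries)
    return (mark_positions(starts) - on_boundary,
            mark_positions(ends) - on_boundary)


# First sample from the output above (U+00B7).
bad_starts, bad_ends = inconsistent_marks("|abc·|def", "abc|·def|", "|abc·def|")
print(sorted(bad_starts), sorted(bad_ends))  # [4] [3]
```

For the U+00B7 sample this reports a word start at offset 4 and a word end at offset 3 with no boundary at either place. For the last sample (U+02D7) the three lines agree with each other, because a boundary is reported on both sides of the character there.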
Comment 6 Sébastien Wilmet 2014-04-12 11:24:16 UTC
*** Bug 692156 has been marked as a duplicate of this bug. ***
Comment 7 John Baptist 2014-04-18 11:38:56 UTC
Fixing this bug is needed to fix bug 131576, an extremely annoying bug that breaks spell-checking for many common words in gedit and other applications. I hope it can be resolved soon, as that bug has been outstanding since 2004!
Comment 8 Behdad Esfahbod 2014-04-18 16:51:38 UTC
Anyone who has a week to hack on this can fix it.
Comment 9 GNOME Infrastructure Team 2018-05-22 13:08:52 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new issue on our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/218.