GNOME Bugzilla – Bug 576156
Extended grapheme clusters should not apply to cursor movement
Last modified: 2012-12-02 15:22:12 UTC
Cursor positions for Thai text as found in gedit: |เป็|น|เรื่|อ|ง|เป็|น|รา|ว| What's expected: |เ|ป็|น|เ|รื่|อ|ง|เ|ป็|น|ร|า|ว| That is, Thai users expect to be able to place cursor after characters classified by UAX#29 as "Prepend", and before Thai characters classified as "Extend", although line breaking is not allowed at those positions. Besides, users expect the Delete key to only delete the "Prepend" character, not the whole chunk with succeeding characters included. That is, hitting Delete in front of "เก" should only delete "เ", not the whole "เก". For some examples of such expectation found elsewhere: - http://code.google.com/p/chromium/issues/detail?id=3523#c27 - http://bugs.icu-project.org/trac/ticket/6774 - https://bugzilla.mozilla.org/show_bug.cgi?id=474068
Created attachment 131070 [details] [review] Initial patch This is my initial hack. It delays classification of Thai/Lao Prepend/Extend to be done after is_cursor_position setting, so word boundary or line break can still benefit from it. Alternatively, it may be removed totally.
Is the patch OK? Please consider fixing this bug. I start to feel keeping rebuilding the patched Pango debs for Thai people [1] for over a year, as well as seeing a lot of comments in related Mozilla bugs [2] [3] [4] [5], not fun at all. [1] http://ftp.debianclub.org/debclub/pool/main/p/pango1.0/ [2] https://bugzilla.mozilla.org/show_bug.cgi?id=474068 [3] https://bugzilla.mozilla.org/show_bug.cgi?id=554363 [4] https://bugzilla.mozilla.org/show_bug.cgi?id=512835 [5] https://bugzilla.mozilla.org/show_bug.cgi?id=546636 FYI, similar ICU bug has been fixed [6]. [6] http://bugs.icu-project.org/trac/ticket/6774
Unless this is fixed i the Unicode spec (which I suggest you follow-up with), it can't go into break.c. It should go into the thai-lang module instead.
Can you please list the characters that you think the spec is wrong about? Is it all the Thai/Lao characters listed in the spec table? Or just the Prepend ones?
I don't think UAX#29 is meant to be read this narrowly. Here are some quotes: # §2 ... This specification defines *default* mechanisms; more sophisticated # implementations can *and should* tailor them for particular locales # or environments. # §3 ... it is important that an editing interface present a uniform # implementation of what the user thinks of as characters. Grapheme # clusters commonly behave as units in terms of mouse selection, arrow # key movement, backspacing, and so on ... However, in some cases editing # a grapheme cluster element by element may be preferable. For example, on # a given system the backspace key might delete by code point, while the # delete key may delete an entire cluster. # ... Grapheme clusters can be tailored to meet further requirements. # Such tailoring is permitted, but the possible rules are outside of # the scope of this document. (All emphasis in original.) So if we have the word of a native Thai speaker (Theppitak) that 'Thai users expect to be able to place cursor after characters classified by UAX#29 as "Prepend", and before Thai characters classified as "Extend", although line breaking is not allowed at those positions', it seems to me that UAX#29 already licenses us to make that change in Pango, without any need for fixes to the standard.
(In reply to comment #5) > So if we have the word of a native Thai speaker (Theppitak) that 'Thai users > expect to be able to place cursor after characters classified by UAX#29 as > "Prepend", and before Thai characters classified as "Extend", although line > breaking is not allowed at those positions', it seems to me that UAX#29 already > licenses us to make that change in Pango, without any need for fixes to the > standard. Using the same logic I can say "Pango already licenses you to make changes in it"... That doesn't mean that it's not a bug in the spec. Imagine tens of such diversions from the spec go in pango... would be a maintenance disaster. Except that you don't need to imagine, ask anyone familiar with the Indic shaping situation... Anyone, the pango bug is pending on feedback from Thep. Lets follow up there.
(In reply to comment #4) > Can you please list the characters that you think the spec is wrong about? Is > it all the Thai/Lao characters listed in the spec table? Or just the Prepend > ones? The spec for "Prepend" and "Extend" characters should not be applied to cursor movement. Prepend characters: U+0E40..U+0E44 (Thai); U+0EC0..U+0EC4 (Lao) Extend characters: U+0E30, U+0E32, U+0E33, U+0E45 (Thai); U+0EB0, U+0EB2, U+0EB3 (Lao) (In reply to comment #3) > Unless this is fixed i the Unicode spec (which I suggest you follow-up with), > it can't go into break.c. It should go into the thai-lang module instead. That's fine. So, we can undo it by adding is_cursor_position attribute to those characters in thai-lang module. Then, how about Lao?
(In reply to comment #7) > That's fine. So, we can undo it by adding is_cursor_position attribute to those > characters in thai-lang module. Then, how about Lao? Then, how about Lao?
Created attachment 165403 [details] [review] Override in thai-lang This patch overrides caret positions for GB_Prepend and GB_Extend characters for Thai in thai-lang engine. Lao is not done yet.
Created attachment 165415 [details] [review] thai-lang with Lao support This patch adds Lao script support to thai-lang just for caret movement overriding. Lao word break can be added to the engine in the future.
Update: the patch in comment 10, while works well for most GNOME apps, causes regression in Mozilla, resulting in non-functioning Thai word break. FYI, Mozilla calls pango_get_log_attr() with hard-coded "en" PangoLanguage. Changing this to pango_language_get_default() doesn't help.
(In reply to comment #11) > Update: the patch in comment 10, while works well for most GNOME apps, causes > regression in Mozilla, resulting in non-functioning Thai word break. > > FYI, Mozilla calls pango_get_log_attr() with hard-coded "en" PangoLanguage. > Changing this to pango_language_get_default() doesn't help. That's because Thai language engine gets called twice. The first round is called with analysis->script == PANGO_SCRIPT_THAI (or PANGO_SCRIPT_LAO), and the second round with analysis->script == PANGO_SCRIPT_COMMON. As a result, the word break routine is blocked by the 'if (analysis->script != PANGO_SCRIPT_THAI)' check in the second round. This situation doesn't happen with other GTK+ apps, which is weird. However, as the engine is now shared with Lao in this patch, allowing Thai break routine on PANGO_SCRIPT_COMMON cannot be done, either. This means analysis->script cannot be relied on here. So, I update the patch to check for unicode range of the text content instead.
Created attachment 173147 [details] [review] thai-lang which checks content
Just wanted to add a note to say that I'm not ignoring these issues but I'm currently focusing on getting shaping done correctly in harfbuzz. I'll get to lang modules hopefully when harfbuzz is done...
The Thai UAX#29 question was raised at the last UTC meeting. The conclusion at that meeting was not to change the wording of UAX#29 too much, at least not to change the extended grapheme cluster definition. Instead, a tailoring for Thai was added to CLDR: http://unicode.org/repos/cldr/trunk/common/segments/th.xml This isn't exactly satisfactory since now we have to add the same thing to all the other languages that use Thai script. Anyway, I think we have enough in the standards documentation to go ahead and fix this bug. In fact a simple solution for all the languages and affected scripts would be to not implement the extended grapheme cluster specification and just to remove rules G9a and G9b completely. But then as the comment following states, you would want to add back sara am and so part of rule G9a. All in all, it's a bit of a mess. But I would suggest folks not be too surprised. If this community considers it necessary for UAX#29 to be further changed, then please let me know as soon as possible. Then we might be able to get some changes in before UAX#29 goes out for public review again.
Retaining the extended grapheme clusters specification is absolutely pointless, as it's against all existing usages. Removing it completely should be the most appropriate thing to do. See comments 65-67 in Mozilla #474068 [1] for examples. [1] https://bugzilla.mozilla.org/show_bug.cgi?id=474068#c65 However, if that's not possible, and omission is still allowed, we should just take it, and ignore the useless parts completely.
Created attachment 184301 [details] [review] Undo Thai/Lao extended grapheme completely As UAX#29 amendment seems sensible and quite possible, I'm getting back to the initial patch. But it broke testboundaries somehow. So, I update it by removing the Thai/Lao extended grapheme clustering completely.
UAX#29 has been amended and released with Unicode 6.1. Thai and Lao characters have been removed from both Extend and Prepend categories, although SARA AM is still retained in "SpacingMark". Should Pango be updated now?
*ping* Unicode 6.1 has been released for almost a year now. Could Pango be updated to the new specification?
Thep, I don't have time to check this, but if you are saying that your patch reflects the Unicode update exactly, please feel free to push it out. One day I update break.c to fully reflect Unicode...
Comment on attachment 173147 [details] [review] thai-lang which checks content Obsolete old patch, for clarity.
Created attachment 230433 [details] [review] Remove Thai/Lao hard-coding Behdad, thanks for the permission. I've adjusted the patch a little bit before actually committing. Since the concept of "Prepend" GCB property has not been dropped completely from UAX#29, just being left empty, I adjust the patch so its logic is still retained. Only Thai/Lao hard-coding is removed.
Pushed: http://git.gnome.org/browse/pango/commit/?id=9a182f7360adce9ffc258cf259f93bdbf54e5a29
Thanks. I also reviewed the change. Looks good.