Bug 576156 – Extended grapheme clusters should not apply to cursor movement

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 576156 - Extended grapheme clusters should not apply to cursor movement


Summary:	Extended grapheme clusters should not apply to cursor movement


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	1.23.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-03-21 09:19 UTC by Theppitak Karoonboonyanan
Modified:	2012-12-02 15:22 UTC

See Also:
GNOME target:	---
GNOME version:	2.25/2.26

Attachments
Initial patch (2.44 KB, patch) 2009-03-21 09:59 UTC, Theppitak Karoonboonyanan	none	Details \| Review
Override in thai-lang (1.71 KB, patch) 2010-07-07 08:10 UTC, Theppitak Karoonboonyanan	none	Details \| Review
thai-lang with Lao support (3.02 KB, patch) 2010-07-07 15:06 UTC, Theppitak Karoonboonyanan	none	Details \| Review
thai-lang which checks content (3.14 KB, patch) 2010-10-25 03:14 UTC, Theppitak Karoonboonyanan	none	Details \| Review
Undo Thai/Lao extended grapheme completely (1.33 KB, patch) 2011-03-26 16:19 UTC, Theppitak Karoonboonyanan	none	Details \| Review
Remove Thai/Lao hard-coding (1.24 KB, patch) 2012-12-02 09:05 UTC, Theppitak Karoonboonyanan	none	Details \| Review

Description Theppitak Karoonboonyanan 2009-03-21 09:19:04 UTC

Cursor positions for Thai text as found in gedit:

|เป็|น|เรื่|อ|ง|เป็|น|รา|ว|

What's expected:

|เ|ป็|น|เ|รื่|อ|ง|เ|ป็|น|ร|า|ว|

That is, Thai users expect to be able to place cursor after characters classified by UAX#29 as "Prepend", and before Thai characters classified as "Extend", although line breaking is not allowed at those positions.

Besides, users expect the Delete key to only delete the "Prepend" character, not the whole chunk with succeeding characters included. That is, hitting Delete in front of "เก" should only delete "เ", not the whole "เก".

For some examples of such expectation found elsewhere:
- http://code.google.com/p/chromium/issues/detail?id=3523#c27
- http://bugs.icu-project.org/trac/ticket/6774
- https://bugzilla.mozilla.org/show_bug.cgi?id=474068

Comment 1 Theppitak Karoonboonyanan 2009-03-21 09:59:42 UTC

Created attachment 131070 [details] [review]
Initial patch

This is my initial hack. It delays classification of Thai/Lao Prepend/Extend to be done after is_cursor_position setting, so word boundary or line break can still benefit from it. Alternatively, it may be removed totally.

Comment 2 Theppitak Karoonboonyanan 2010-04-09 02:31:36 UTC

Is the patch OK? Please consider fixing this bug. I start to feel keeping rebuilding the patched Pango debs for Thai people [1] for over a year, as well as seeing a lot of comments in related Mozilla bugs [2] [3] [4] [5], not fun at all.

  [1] http://ftp.debianclub.org/debclub/pool/main/p/pango1.0/
  [2] https://bugzilla.mozilla.org/show_bug.cgi?id=474068
  [3] https://bugzilla.mozilla.org/show_bug.cgi?id=554363
  [4] https://bugzilla.mozilla.org/show_bug.cgi?id=512835
  [5] https://bugzilla.mozilla.org/show_bug.cgi?id=546636

FYI, similar ICU bug has been fixed [6].

  [6] http://bugs.icu-project.org/trac/ticket/6774

Comment 3 Behdad Esfahbod 2010-04-22 00:25:03 UTC

Unless this is fixed i the Unicode spec (which I suggest you follow-up with), it can't go into break.c.  It should go into the thai-lang module instead.

Comment 4 Behdad Esfahbod 2010-04-22 00:30:45 UTC

Can you please list the characters that you think the spec is wrong about?  Is it all the Thai/Lao characters listed in the spec table?  Or just the Prepend ones?

Comment 5 Zack Weinberg 2010-04-22 00:56:27 UTC

I don't think UAX#29 is meant to be read this narrowly.  Here are some quotes:

# §2 ... This specification defines *default* mechanisms; more sophisticated
# implementations can *and should* tailor them for particular locales
# or environments.

# §3 ... it is important that an editing interface present a uniform
# implementation of what the user thinks of as characters. Grapheme
# clusters commonly behave as units in terms of mouse selection, arrow
# key movement, backspacing, and so on ... However, in some cases editing
# a grapheme cluster element by element may be preferable. For example, on
# a given system the backspace key might delete by code point, while the
# delete key may delete an entire cluster.

# ... Grapheme clusters can be tailored to meet further requirements. 
# Such tailoring is permitted, but the possible rules are outside of 
# the scope of this document.

(All emphasis in original.)

So if we have the word of a native Thai speaker (Theppitak) that 'Thai users expect to be able to place cursor after characters classified by UAX#29 as "Prepend", and before Thai characters classified as "Extend", although line breaking is not allowed at those positions', it seems to me that UAX#29 already licenses us to make that change in Pango, without any need for fixes to the standard.

Comment 6 Behdad Esfahbod 2010-04-22 01:21:24 UTC

(In reply to comment #5)

> So if we have the word of a native Thai speaker (Theppitak) that 'Thai users
> expect to be able to place cursor after characters classified by UAX#29 as
> "Prepend", and before Thai characters classified as "Extend", although line
> breaking is not allowed at those positions', it seems to me that UAX#29 already
> licenses us to make that change in Pango, without any need for fixes to the
> standard.

Using the same logic I can say "Pango already licenses you to make changes in it"...  That doesn't mean that it's not a bug in the spec.  Imagine tens of such diversions from the spec go in pango... would be a maintenance disaster.  Except that you don't need to imagine, ask anyone familiar with the Indic shaping situation...

Anyone, the pango bug is pending on feedback from Thep.  Lets follow up there.

Comment 7 Theppitak Karoonboonyanan 2010-04-22 09:19:55 UTC

(In reply to comment #4)
> Can you please list the characters that you think the spec is wrong about?  Is
> it all the Thai/Lao characters listed in the spec table?  Or just the Prepend
> ones?

The spec for "Prepend" and "Extend" characters should not be applied to cursor movement.

Prepend characters: U+0E40..U+0E44 (Thai); U+0EC0..U+0EC4 (Lao)
Extend characters: U+0E30, U+0E32, U+0E33, U+0E45 (Thai); U+0EB0, U+0EB2, U+0EB3 (Lao)

(In reply to comment #3)
> Unless this is fixed i the Unicode spec (which I suggest you follow-up with),
> it can't go into break.c.  It should go into the thai-lang module instead.

That's fine. So, we can undo it by adding is_cursor_position attribute to those characters in thai-lang module. Then, how about Lao?

Comment 8 Theppitak Karoonboonyanan 2010-07-06 08:26:38 UTC

(In reply to comment #7)

> That's fine. So, we can undo it by adding is_cursor_position attribute to those
> characters in thai-lang module. Then, how about Lao?

Then, how about Lao?

Comment 9 Theppitak Karoonboonyanan 2010-07-07 08:10:35 UTC

Created attachment 165403 [details] [review]
Override in thai-lang

This patch overrides caret positions for GB_Prepend and GB_Extend characters for Thai in thai-lang engine. Lao is not done yet.

Comment 10 Theppitak Karoonboonyanan 2010-07-07 15:06:42 UTC

Created attachment 165415 [details] [review]
thai-lang with Lao support

This patch adds Lao script support to thai-lang just for caret movement overriding. Lao word break can be added to the engine in the future.

Comment 11 Theppitak Karoonboonyanan 2010-10-24 12:21:21 UTC

Update: the patch in comment 10, while works well for most GNOME apps, causes regression in Mozilla, resulting in non-functioning Thai word break.

FYI, Mozilla calls pango_get_log_attr() with hard-coded "en" PangoLanguage. Changing this to pango_language_get_default() doesn't help.

Comment 12 Theppitak Karoonboonyanan 2010-10-25 03:12:52 UTC

(In reply to comment #11)
> Update: the patch in comment 10, while works well for most GNOME apps, causes
> regression in Mozilla, resulting in non-functioning Thai word break.
> 
> FYI, Mozilla calls pango_get_log_attr() with hard-coded "en" PangoLanguage.
> Changing this to pango_language_get_default() doesn't help.

That's because Thai language engine gets called twice. The first round is called with analysis->script == PANGO_SCRIPT_THAI (or PANGO_SCRIPT_LAO), and the second round with analysis->script == PANGO_SCRIPT_COMMON. As a result, the word break routine is blocked by the 'if (analysis->script != PANGO_SCRIPT_THAI)' check in the second round. This situation doesn't happen with other GTK+ apps, which is weird.

However, as the engine is now shared with Lao in this patch, allowing Thai break routine on PANGO_SCRIPT_COMMON cannot be done, either. This means analysis->script cannot be relied on here. So, I update the patch to check for unicode range of the text content instead.

Comment 13 Theppitak Karoonboonyanan 2010-10-25 03:14:47 UTC

Created attachment 173147 [details] [review]
thai-lang which checks content

Comment 14 Behdad Esfahbod 2010-11-18 01:29:31 UTC

Just wanted to add a note to say that I'm not ignoring these issues but I'm currently focusing on getting shaping done correctly in harfbuzz.  I'll get to lang modules hopefully when harfbuzz is done...

Comment 15 Martin Hosken 2011-02-03 08:44:37 UTC

The Thai UAX#29 question was raised at the last UTC meeting. The conclusion at that meeting was not to change the wording of UAX#29 too much, at least not to change the extended grapheme cluster definition. Instead, a tailoring for Thai was added to CLDR: http://unicode.org/repos/cldr/trunk/common/segments/th.xml

This isn't exactly satisfactory since now we have to add the same thing to all the other languages that use Thai script. Anyway, I think we have enough in the standards documentation to go ahead and fix this bug. In fact a simple solution for all the languages and affected scripts would be to not implement the extended grapheme cluster specification and just to remove rules G9a and G9b completely. But then as the comment following states, you would want to add back sara am and so part of rule G9a.

All in all, it's a bit of a mess. But I would suggest folks not be too surprised.

If this community considers it necessary for UAX#29 to be further changed, then please let me know as soon as possible. Then we might be able to get some changes in before UAX#29 goes out for public review again.

Comment 16 Theppitak Karoonboonyanan 2011-02-03 09:46:49 UTC

Retaining the extended grapheme clusters specification is absolutely pointless, as it's against all existing usages.  Removing it completely should be the most appropriate thing to do.  See comments 65-67 in Mozilla #474068 [1] for examples.

  [1] https://bugzilla.mozilla.org/show_bug.cgi?id=474068#c65

However, if that's not possible, and omission is still allowed, we should just take it, and ignore the useless parts completely.

Comment 17 Theppitak Karoonboonyanan 2011-03-26 16:19:55 UTC

Created attachment 184301 [details] [review]
Undo Thai/Lao extended grapheme completely

As UAX#29 amendment seems sensible and quite possible, I'm getting back to the initial patch. But it broke testboundaries somehow. So, I update it by removing the Thai/Lao extended grapheme clustering completely.

Comment 18 Theppitak Karoonboonyanan 2012-02-22 08:26:43 UTC

UAX#29 has been amended and released with Unicode 6.1. Thai and Lao characters have been removed from both Extend and Prepend categories, although SARA AM is still retained in "SpacingMark". Should Pango be updated now?

Comment 19 Theppitak Karoonboonyanan 2012-11-20 02:12:35 UTC

*ping*

Unicode 6.1 has been released for almost a year now. Could Pango be updated to the new specification?

Comment 20 Behdad Esfahbod 2012-12-01 15:54:37 UTC

Thep, I don't have time to check this, but if you are saying that your patch reflects the Unicode update exactly, please feel free to push it out.  One day I update break.c to fully reflect Unicode...

Comment 21 Theppitak Karoonboonyanan 2012-12-02 08:36:03 UTC

Comment on attachment 173147 [details] [review]
thai-lang which checks content

Obsolete old patch, for clarity.

Comment 22 Theppitak Karoonboonyanan 2012-12-02 09:05:44 UTC

Created attachment 230433 [details] [review]
Remove Thai/Lao hard-coding

Behdad, thanks for the permission. I've adjusted the patch a little bit before actually committing.

Since the concept of "Prepend" GCB property has not been dropped completely from UAX#29, just being left empty, I adjust the patch so its logic is still retained. Only Thai/Lao hard-coding is removed.

Comment 23 Theppitak Karoonboonyanan 2012-12-02 09:14:00 UTC

Pushed:

http://git.gnome.org/browse/pango/commit/?id=9a182f7360adce9ffc258cf259f93bdbf54e5a29

Comment 24 Behdad Esfahbod 2012-12-02 15:22:12 UTC

Thanks.  I also reviewed the change.  Looks good.