After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 782813 - Patch for Emoji Grapheme Boundaries in pango_default_break function
Patch for Emoji Grapheme Boundaries in pango_default_break function
Status: RESOLVED FIXED
Product: pango
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: pango-maint
pango-maint
Depends on:
Blocks:
 
 
Reported: 2017-05-19 02:33 UTC by Peng Wu
Modified: 2017-08-08 01:27 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Update pango_default_break function for Emoji ZWJ sequence (5.31 KB, patch)
2017-05-19 02:38 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Emoji ZWJ sequence (5.65 KB, patch)
2017-05-19 09:54 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Emoji ZWJ sequence (6.28 KB, patch)
2017-05-22 03:16 UTC, Peng Wu
none Details | Review
Update GraphemeBreakTest.txt to Unicode 9.0.0 (87.43 KB, patch)
2017-05-22 03:20 UTC, Peng Wu
committed Details | Review
Update pango_default_break function for Emoji ZWJ sequence (7.81 KB, patch)
2017-06-01 09:17 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Emoji ZWJ sequence (6.31 KB, patch)
2017-06-13 06:51 UTC, Peng Wu
committed Details | Review
Update pango_default_break function for Word Boundary (5.63 KB, patch)
2017-06-15 07:42 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Word Boundary (5.52 KB, patch)
2017-06-28 05:38 UTC, Peng Wu
committed Details | Review
Update pango_default_break function for Sentence Boundary (12.53 KB, patch)
2017-07-05 09:13 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Sentence Boundary (23.79 KB, patch)
2017-07-19 03:42 UTC, Peng Wu
none Details | Review
Update pango_default_break function for Sentence Boundary (24.26 KB, patch)
2017-07-31 06:38 UTC, Peng Wu
committed Details | Review

Description Peng Wu 2017-05-19 02:33:54 UTC
Update pango_default_break function for Emoji ZWJ sequence.
The patch has reduced fails in GraphemeBreakTest.txt of Unicode 9.0.

Please review it, thanks!
Comment 1 Peng Wu 2017-05-19 02:38:23 UTC
Created attachment 352119 [details] [review]
Update pango_default_break function for Emoji ZWJ sequence

Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Comment 2 Peng Wu 2017-05-19 09:54:04 UTC
Created attachment 352145 [details] [review]
Update pango_default_break function for Emoji ZWJ sequence

Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Comment 3 Christian Persch 2017-05-19 18:08:13 UTC
+    case G_UNICODE_OTHER_SYMBOL:
+      if (G_UNLIKELY(wc == 0x261D ||
+                     wc == 0x26F9 ||
+                     (wc >= 0x270A && wc <= 0x270D) ||
+                     wc == 0x1F385 ||
[and so on]

Instead of hardcoding these overlong lists of condition cascades, you probably should  use a script to process emoji-data.txt and emoji-zwj-sequences.txt to extract the necessary data and put it in a form suitable for fast lookup.
Comment 4 Behdad Esfahbod 2017-05-21 03:27:48 UTC
What Christian said...

In general, that code needs to be rewritten, possibly using only data-driven state-machines.  Also, please hook up the tests first, see if you can fix them all.  With these added complexity, we better have tests.  The driver is in tests/testboundaries_ucd.c, but only runs if data files are available.
Comment 5 Peng Wu 2017-05-22 03:16:37 UTC
Created attachment 352327 [details] [review]
Update pango_default_break function for Emoji ZWJ sequence

Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Comment 6 Peng Wu 2017-05-22 03:20:02 UTC
Created attachment 352328 [details] [review]
Update GraphemeBreakTest.txt to Unicode 9.0.0
Comment 7 Peng Wu 2017-05-22 03:28:22 UTC
I use the python script in https://pwu.fedorapeople.org/pango/generate.py .

Test with GraphemeBreakTest.txt of Unicode 9.0, it passed the tests.

Should I change the code to use switch statement in some auxiliary function?
Comment 8 Peng Wu 2017-06-01 09:17:51 UTC
Created attachment 352987 [details] [review]
Update pango_default_break function for Emoji ZWJ sequence

Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Comment 9 Peng Wu 2017-06-01 09:22:07 UTC
I updated the python script to simulate 4-way balanced search.
The performance is improved I think.
URL: https://pwu.fedorapeople.org/pango/generate.py

Another method maybe use perfect hash.
URL: http://www.theiling.de/projects/lookuptable.html#examples

Also I draw a graph for the state machine.
URL: https://pwu.fedorapeople.org/pango/emoji-state.svg
Comment 10 Peng Wu 2017-06-12 06:49:10 UTC
Thanks very much on the comments!

Sorry, I think my previous patch may have some problems.

Currently my plan is that try to pass more tests with the slow code
of single if statement, then optimize the lookup performance later.
Comment 11 Peng Wu 2017-06-13 06:51:57 UTC
Created attachment 353653 [details] [review]
Update pango_default_break function for Emoji ZWJ sequence

Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Comment 12 Peng Wu 2017-06-13 06:56:49 UTC
After re-checked with UAX #29, I removed the GB_InEmoji state.
Comment 13 Peng Wu 2017-06-14 10:57:32 UTC
Just found the Grapheme Cluster Boundary Rules and Word Boundary Rules handle the following sequence different, may need to handle regional indicator different.

pango/tests/GraphemeBreakTest.txt: line 834 failed
   expected: ÷ 0061 ÷ 1F1E6 × 200D ÷ 1F1E7 × 1F1E8 ÷ 0062 ÷	

pango/tests/WordBreakTest.txt: line 1964 failed
   expected: ÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
Comment 14 Peng Wu 2017-06-15 07:42:27 UTC
Created attachment 353800 [details] [review]
Update pango_default_break function for Word Boundary

Improve Word Boundary Rule for Hebrew_Letter, Single_Quote,
Double_Quote and Regional Indicator.
Comment 15 Peng Wu 2017-06-15 07:47:16 UTC
The above patch passed WordBreakTest.txt tests.

Please review it, thanks!
Comment 16 Peng Wu 2017-06-28 05:38:47 UTC
Created attachment 354607 [details] [review]
Update pango_default_break function for Word Boundary

Improve Word Boundary Rule for Hebrew_Letter, Single_Quote,
Double_Quote and Regional Indicator.
Comment 17 Peng Wu 2017-07-05 09:13:48 UTC
Created attachment 354912 [details] [review]
Update pango_default_break function for Sentence Boundary

Re-write the code for Sentence Boundary,
and use the code style like Grapheme Boundary and Word Boundary.
Comment 18 Peng Wu 2017-07-05 09:18:04 UTC
BTW, I comment out some code by using #if 0 ... #endif.
Comment 19 Peng Wu 2017-07-19 03:42:47 UTC
Created attachment 355911 [details] [review]
Update pango_default_break function for Sentence Boundary

Re-write the code for Sentence Boundary,
and use the code style like Grapheme Boundary and Word Boundary.
Comment 20 Behdad Esfahbod 2017-07-28 14:54:25 UTC
Lgtm except for the looooooooooooooooooooooooooooooong if conditions...
Comment 21 Matthias Clasen 2017-07-30 17:01:20 UTC
  CC       break.lo
break.c: In function ‘pango_default_break’:
break.c:1084:18: warning: variable ‘script’ set but not used [-Wunused-but-set-variable]
      PangoScript script;
                  ^~~~~~
break.c:490:16: warning: variable ‘prev_type’ set but not used [-Wunused-but-set-variable]
   GUnicodeType prev_type;

These two warnings should be fixed.
Comment 22 Matthias Clasen 2017-07-30 17:02:05 UTC
I've downloaded the test files for Unicode 10, and see a bunch of failures with the patches applied - are those expected ?
Comment 23 Matthias Clasen 2017-07-30 17:12:39 UTC
Worth noting, the failures are only in LineBreakTest.txt, maybe that is expected. So I am not going to hold the patch for that. But please fix the warnings.
Comment 24 Peng Wu 2017-07-31 06:38:54 UTC
Created attachment 356623 [details] [review]
Update pango_default_break function for Sentence Boundary

Re-write the code for Sentence Boundary,
and use the code style like Grapheme Boundary and Word Boundary.
Comment 25 Peng Wu 2017-07-31 06:48:46 UTC
Removed two unused variables from the patch.
Comment 26 Peng Wu 2017-07-31 06:50:00 UTC
Sorry, the LineBreakTest.txt needs UAX #14.
URL: http://www.unicode.org/reports/tr14/tr14-37.html

I am still reading the document, not started to fix it yet.
Comment 27 Matthias Clasen 2017-07-31 17:54:24 UTC
Attachment 352328 [details] pushed as 59ccc00 - Update GraphemeBreakTest.txt to Unicode 9.0.0
Attachment 353653 [details] pushed as 93474c3 - Update pango_default_break function for Emoji ZWJ sequence
Attachment 354607 [details] pushed as 238ac31 - Update pango_default_break function for Word Boundary
Attachment 356623 [details] pushed as 284d357 - Update pango_default_break function for Sentence Boundary
Comment 28 Peng Wu 2017-08-01 05:27:01 UTC
Thanks for the review! :)
Comment 29 Behdad Esfahbod 2017-08-01 11:27:34 UTC
Ok, tested master and grapheme boundaries seems to work fine.

However:

- Line-breaking needs more work.  It breaks at ZWJ in family emoji right now.

- Backspacing is broken with family emoji as well as region flags.

- Word-breaks: we should provide a word break around all emoji IMO.
Comment 30 Peng Wu 2017-08-02 04:37:19 UTC
Okay, I will check the code of line break, and will open another bug for patch review.
Comment 31 Jeremy Bicha 2017-08-08 01:27:38 UTC
I updated Ubuntu 17.10's pango from 1.40.6 to 1.40.8 today and I'm now getting test failures in gtk3. See
https://bugzilla.gnome.org/show_bug.cgi?id=785978

Also, I am seeing test failures in libgtk2-perl:
http://autopkgtest.ubuntu.com/packages/libg/libgtk2-perl/artful/amd64