GNOME Bugzilla – Bug 782813
Patch for Emoji Grapheme Boundaries in pango_default_break function
Last modified: 2017-08-08 01:27:38 UTC
Update pango_default_break function for Emoji ZWJ sequence. The patch has reduced fails in GraphemeBreakTest.txt of Unicode 9.0. Please review it, thanks!
Created attachment 352119 [details] [review] Update pango_default_break function for Emoji ZWJ sequence Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Created attachment 352145 [details] [review] Update pango_default_break function for Emoji ZWJ sequence Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
+ case G_UNICODE_OTHER_SYMBOL: + if (G_UNLIKELY(wc == 0x261D || + wc == 0x26F9 || + (wc >= 0x270A && wc <= 0x270D) || + wc == 0x1F385 || [and so on] Instead of hardcoding these overlong lists of condition cascades, you probably should use a script to process emoji-data.txt and emoji-zwj-sequences.txt to extract the necessary data and put it in a form suitable for fast lookup.
What Christian said... In general, that code needs to be rewritten, possibly using only data-driven state-machines. Also, please hook up the tests first, see if you can fix them all. With these added complexity, we better have tests. The driver is in tests/testboundaries_ucd.c, but only runs if data files are available.
Created attachment 352327 [details] [review] Update pango_default_break function for Emoji ZWJ sequence Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
Created attachment 352328 [details] [review] Update GraphemeBreakTest.txt to Unicode 9.0.0
I use the python script in https://pwu.fedorapeople.org/pango/generate.py . Test with GraphemeBreakTest.txt of Unicode 9.0, it passed the tests. Should I change the code to use switch statement in some auxiliary function?
Created attachment 352987 [details] [review] Update pango_default_break function for Emoji ZWJ sequence Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
I updated the python script to simulate 4-way balanced search. The performance is improved I think. URL: https://pwu.fedorapeople.org/pango/generate.py Another method maybe use perfect hash. URL: http://www.theiling.de/projects/lookuptable.html#examples Also I draw a graph for the state machine. URL: https://pwu.fedorapeople.org/pango/emoji-state.svg
Thanks very much on the comments! Sorry, I think my previous patch may have some problems. Currently my plan is that try to pass more tests with the slow code of single if statement, then optimize the lookup performance later.
Created attachment 353653 [details] [review] Update pango_default_break function for Emoji ZWJ sequence Support Grapheme Boundaries Rule GB10, GB11, GB12 and GB13.
After re-checked with UAX #29, I removed the GB_InEmoji state.
Just found the Grapheme Cluster Boundary Rules and Word Boundary Rules handle the following sequence different, may need to handle regional indicator different. pango/tests/GraphemeBreakTest.txt: line 834 failed expected: ÷ 0061 ÷ 1F1E6 × 200D ÷ 1F1E7 × 1F1E8 ÷ 0062 ÷ pango/tests/WordBreakTest.txt: line 1964 failed expected: ÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
Created attachment 353800 [details] [review] Update pango_default_break function for Word Boundary Improve Word Boundary Rule for Hebrew_Letter, Single_Quote, Double_Quote and Regional Indicator.
The above patch passed WordBreakTest.txt tests. Please review it, thanks!
Created attachment 354607 [details] [review] Update pango_default_break function for Word Boundary Improve Word Boundary Rule for Hebrew_Letter, Single_Quote, Double_Quote and Regional Indicator.
Created attachment 354912 [details] [review] Update pango_default_break function for Sentence Boundary Re-write the code for Sentence Boundary, and use the code style like Grapheme Boundary and Word Boundary.
BTW, I comment out some code by using #if 0 ... #endif.
Created attachment 355911 [details] [review] Update pango_default_break function for Sentence Boundary Re-write the code for Sentence Boundary, and use the code style like Grapheme Boundary and Word Boundary.
Lgtm except for the looooooooooooooooooooooooooooooong if conditions...
CC break.lo break.c: In function ‘pango_default_break’: break.c:1084:18: warning: variable ‘script’ set but not used [-Wunused-but-set-variable] PangoScript script; ^~~~~~ break.c:490:16: warning: variable ‘prev_type’ set but not used [-Wunused-but-set-variable] GUnicodeType prev_type; These two warnings should be fixed.
I've downloaded the test files for Unicode 10, and see a bunch of failures with the patches applied - are those expected ?
Worth noting, the failures are only in LineBreakTest.txt, maybe that is expected. So I am not going to hold the patch for that. But please fix the warnings.
Created attachment 356623 [details] [review] Update pango_default_break function for Sentence Boundary Re-write the code for Sentence Boundary, and use the code style like Grapheme Boundary and Word Boundary.
Removed two unused variables from the patch.
Sorry, the LineBreakTest.txt needs UAX #14. URL: http://www.unicode.org/reports/tr14/tr14-37.html I am still reading the document, not started to fix it yet.
Attachment 352328 [details] pushed as 59ccc00 - Update GraphemeBreakTest.txt to Unicode 9.0.0 Attachment 353653 [details] pushed as 93474c3 - Update pango_default_break function for Emoji ZWJ sequence Attachment 354607 [details] pushed as 238ac31 - Update pango_default_break function for Word Boundary Attachment 356623 [details] pushed as 284d357 - Update pango_default_break function for Sentence Boundary
Thanks for the review! :)
Ok, tested master and grapheme boundaries seems to work fine. However: - Line-breaking needs more work. It breaks at ZWJ in family emoji right now. - Backspacing is broken with family emoji as well as region flags. - Word-breaks: we should provide a word break around all emoji IMO.
Okay, I will check the code of line break, and will open another bug for patch review.
I updated Ubuntu 17.10's pango from 1.40.6 to 1.40.8 today and I'm now getting test failures in gtk3. See https://bugzilla.gnome.org/show_bug.cgi?id=785978 Also, I am seeing test failures in libgtk2-perl: http://autopkgtest.ubuntu.com/packages/libg/libgtk2-perl/artful/amd64