GNOME Bugzilla – Bug 780669
Some emoji which should render as one character with the “Noto Color Emoji” font render as several characters
Last modified: 2017-07-31 06:41:11 UTC
Created attachment 348915 [details] test-text.txt Test text attached. Showing the test text like this pango-view --font='Noto Color Emoji 48' ~/test-text.txt on Fedora 25 shows the emoji not as a single character but as two. On Fedora 24 (and openSUSE Leap 42.2 and Ubuntu 16.04) this works. The version of “Noto Color Emoji” used in all these tests is the latest one from https://www.google.com/get/noto/ which has this file size: -rw-r-----. 1 mfabian mfabian 5987004 10月 20 11:46 NotoColorEmoji.ttf The problem is the same when using the “Emoji One” font from: https://github.com/Ranks/emojione/blob/master/assets/fonts/emojione-android.ttf
Created attachment 348916 [details] pango-view-fedora-25.png Broken display of the test-text.txt by pango-view --font='Noto Color Emoji 48' ~/test-text.txt on Fedora 25.
Created attachment 348917 [details] hb-view-fedora-25-correct.png Correct display of the same test text using: hb-view --font-size=24 --text-file=/home/mfabian/test-text.txt --output-file=/tmp/hb.png --output-format=png /usr/share/fonts/google-noto-emoji/NotoColorEmoji.ttf
Created attachment 348918 [details] pango-view-fedora-24.png Correct display of the same test-text.txt on Fedora 24 using pango-view --font='Noto Color Emoji 48' ~/test-text.txt
hb-view --font-size=24 --text-file=/home/mfabian/test-text.txt --output-file=/tmp/hb.png --output-format=png /usr/share/fonts/google-noto-emoji/NotoColorEmoji.ttf also work correctly on Fedora 24. So it seems the problem is not in harfbuzz.
OK, the problem is caused by the width changes in glibc: Fedora 24: (gdb) p g_unichar_iswide(0x1f469) $4 = 0 (gdb) p g_unichar_iswide(0x200d) $5 = 0 (gdb) p g_unichar_iswide(0x2708) $6 = 0 (gdb) Fedora 25: (gdb) p g_unichar_iswide(0x1f469) $39 = 1 (gdb) p g_unichar_iswide(0x200d) $40 = 0 (gdb) p g_unichar_iswide(0x2708) $41 = 0 (gdb)
Here is the function in pango-context.c which causes the break of the run: /* g_unichar_iswide() uses EastAsianWidth, which is broken. * We should switch to using VerticalTextLayout: * http://www.unicode.org/reports/tr50/#Data50 * * In the mean time, fixup Hangul jamo to be all wide so we * don't break run in the middle. The EastAsianWidth has * 'W' for L-jamo, and 'N' for T and V jamo! * * https://bugzilla.gnome.org/show_bug.cgi?id=705727 */ static gboolean width_iter_iswide (gunichar ch) { if ((0x1100u <= ch && ch <= 0x11FFu) || (0xA960u <= ch && ch <= 0xA97Cu) || (0xD7B0u <= ch && ch <= 0xD7FBu)) return TRUE; return g_unichar_iswide (ch); }
Maybe I should rewrite the function width_iter_next(PangoWidthIter* iter) not to break emoji-zwj sequences: https://git.gnome.org/browse/pango/tree/pango/pango-context.c#n866
Created attachment 348925 [details] [review] 0001-Bug-780669-Do-not-start-a-new-run-at-a-zero-width-jo.patch Patch to fix the problem
Created attachment 348975 [details] [review] Updated patch I updated my patch a bit, pango should not break a run at skin tone modifiers either, otherwise ✌
otherwise pango would break between U+270C VICTORY HAND (which is single width) and U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-5 which is double width.
Maybe it is even necessary to add 0x1f3f4 to the exceptions where a break at a width change is prevented: diff --git a/pango/pango-context.c b/pango/pango-context.c index f0cea73..2739b2d 100644 --- a/pango/pango-context.c +++ b/pango/pango-context.c @@ -876,7 +876,18 @@ width_iter_next(PangoWidthIter* iter) while (iter->end < iter->text_end) { gunichar ch = g_utf8_get_char (iter->end); - if (width_iter_iswide (ch) != iter->wide) + if (ch == 0x200d || ch == 0x1f3f4 || (ch >= 0x1f3fb && ch <= 0x1f3ff)) + { + /* do not break at a zero-width-joiner or skin tone modifiers*/ + iter->end = g_utf8_next_char (iter->end); + if (iter-> end < iter->text_end) + { + ch = g_utf8_get_char (iter->end); + iter->wide = width_iter_iswide (ch); + } + continue; + } + else if (width_iter_iswide (ch) != iter->wide) break; iter->end = g_utf8_next_char (iter->end); } Because flag sequences like this one: 1F3F4 E0067 E0062 E0073 E0063 E0074 E007F; Emoji_Tag_Sequence; Scotland start with U+1F3F4 WAVING BLACK FLAG, which is a wide character and the rest of the sequence is narrow characters.
Created attachment 349280 [details] [review] Update width_iter_next function to handle emoji sequence
Based on Mike FABIAN's work, I improved the patch.
Created attachment 349282 [details] [review] Update width_iter_next function to handle emoji sequence
Review of attachment 349282 [details] [review]: This patch makes all emoji-zwj-sequences and all emoji sequences with skin tone modifiers and the new flag sequences like the Scottish flag work for me.
Review of attachment 349282 [details] [review]: This makes the function very messy and hard to understand, I'm afraid. Can this be rewritten in terms of some auxiliary function ?
Bug 485556 might be relevant here ?
Created attachment 349659 [details] [review] Update width_iter_next function to handle emoji sequence
The above attachment is simplified patch, please review it, thanks!
Review of attachment 349659 [details] [review]: This looks more understandable, thanks. Please add some details to the commit message, like what the observed misbehavior was that this patch fixes. I hate to ask for more changes, but it would be great to have a few tests for this. Not sure if there is great api to observe the outcome of these changes. ::: pango/pango-utils.c @@ +886,1 @@ } This change looks unrelated and is not mentioned in the commit message at all. Please move it to a separate commit
The new patch does not work for the sequence: U+1F469 U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468
The new patch from comment#19 also breaks the flag sequence for Scotland: U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F The new patch breaks after U+1F3F4 here. The old patch from comment#14 does keep that sequence together.
Created attachment 349767 [details] [review] Update width_iter_next function to handle emoji sequence
Thanks for the review! I will add some details to the commit message, and try to write some test cases for it.
(In reply to Peng Wu from comment #23) > Created attachment 349767 [details] [review] [review] > Update width_iter_next function to handle emoji sequence This patch works for me for all emoji sequences.
Comment on attachment 349767 [details] [review] Update width_iter_next function to handle emoji sequence >diff --git a/pango/pango-context.c b/pango/pango-context.c >index f0cea73..8254d4d 100644 >--- a/pango/pango-context.c >+++ b/pango/pango-context.c >@@ -876,6 +877,32 @@ width_iter_next(PangoWidthIter* iter) >+ /* for variation selector-16, tag and emoji modifier. */ >+ if (G_UNLIKELY(ch == 0xFE0F Why don't you check 0xfe0e besides 0xfe0f? I think users can replace 0xfe0f with 0xfe0e. http://www.unicode.org/emoji/charts/emoji-variants.html >+ || (ch >= 0xE0020 && ch <= 0xE007F) >+ || (ch >= 0x1F3FB && ch <= 0x1F3FF))) >+ { >+ iter->end = g_utf8_next_char (iter->end); >+ continue; >+ } >+ > if (width_iter_iswide (ch) != iter->wide) > break; > iter->end = g_utf8_next_char (iter->end);
Created attachment 349851 [details] [review] pango: Support emoji sequence in Unicode Checked several emoji sequences, the sequence pattern is like follows: 1. Use zero width joiner to combine two characters 2. Append some variation selector, tag and emoji modifier to the end of some character
(In reply to Takao Fujiwara from comment #26) > Comment on attachment 349767 [details] [review] [review] > Why don't you check 0xfe0e besides 0xfe0f? > I think users can replace 0xfe0f with 0xfe0e. > http://www.unicode.org/emoji/charts/emoji-variants.html > Sorry, I just updated the patch to include 0xFE0E besides 0xFE0F.
See also: https://bugzilla.gnome.org/show_bug.cgi?id=781123
(In reply to Peng Wu from comment #28) > (In reply to Takao Fujiwara from comment #26) > > Comment on attachment 349767 [details] [review] [review] [review] > > Why don't you check 0xfe0e besides 0xfe0f? > > I think users can replace 0xfe0f with 0xfe0e. > > http://www.unicode.org/emoji/charts/emoji-variants.html > > > > Sorry, I just updated the patch to include 0xFE0E besides 0xFE0F. This patch also works for me for all emoji-sequences. (When using it together with Fujiwara San’s patch from https://bugzilla.gnome.org/show_bug.cgi?id=781123 Fujiwara San’s patch is necessary in addition to the patch in this bug here).
Created attachment 350185 [details] [review] pango: Support emoji sequence in Unicode Checked several emoji sequences, the sequence pattern is like follows: 1. Use zero width joiner to combine two characters 2. Append some variation selector, tag and emoji modifier to the end of some character
minor changes, add unsigned integer modifiers.
Review of attachment 350185 [details] [review]: I think the commit message should say something about width - this is what we are changing here after all. Maybe: Ignore emoji, zero-width jointers and variation selectors for the purposes of finding a run or uniform-width characters. I'm ok with committing this with a better commit message, but I still want to see tests added that verify this for at least some representative emoji sequences.
In Chrome, we introduced a emoji itemizer to detect all sequences that need to be rendered with color by default. We should eventually do something like that in Pango as well. Also note that Pango currently does not support variation selector sequences at all!
Sorry, to support the variation selector sequences, it needs another patch in https://bugzilla.gnome.org/show_bug.cgi?id=781123 .
Created attachment 350919 [details] [review] pango: Support emoji sequence in Unicode Checked several emoji sequences, the sequence pattern is like follows: 1. Use zero width joiner to combine two characters with any width 2. Ignore the width of emoji, tag and variation selectors for the purposes of finding a run or uniform-width characters.
Created attachment 350920 [details] [review] pango: Support emoji sequence in Unicode Checked several emoji sequences, the sequence pattern is like follows: 1. Use zero width joiner to combine two characters with any width 2. Ignore the width of variation selector, tag and emoji modifier for the purposes of finding a run or uniform-width characters.
Updated commit message. Please review it, thanks!
Created attachment 350921 [details] [review] some test cases
I tried to write some tests, dunno whether the test case is right.
Test with GraphemeBreakTest.txt of Unicode 9.0, and tried to fix it in: https://bugzilla.gnome.org/show_bug.cgi?id=782813
Could you review and merge the code?
lgtm.
Attachment 350920 [details] pushed as 734e442 - pango: Support emoji sequence in Unicode Attachment 350921 [details] pushed as 77b56b4 - some test cases
Thanks for the review! :) Maybe we can update test case of GraphemeBreakTest.txt to Unicode 10.0 in bug 782813.