Bug 780669 – Some emoji which should render as one character with the “Noto Color Emoji” font render as several characters

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 780669 - Some emoji which should render as one character with the “Noto Color Emoji” font render as several characters


Summary:	Some emoji which should render as one character with the “Noto Color Emoji” f...


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	1.40.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-03-29 06:14 UTC by Mike FABIAN
Modified:	2017-07-31 06:41 UTC

See Also:	https://bugzilla.redhat.com/show_bug.cgi?id=1436077
GNOME target:	---
GNOME version:	---

Attachments
test-text.txt (11 bytes, text/plain) 2017-03-29 06:14 UTC, Mike FABIAN		Details
pango-view-fedora-25.png (6.53 KB, image/png) 2017-03-29 06:15 UTC, Mike FABIAN		Details
hb-view-fedora-25-correct.png (305.86 KB, image/png) 2017-03-29 06:17 UTC, Mike FABIAN		Details
pango-view-fedora-24.png (395.76 KB, image/png) 2017-03-29 06:18 UTC, Mike FABIAN		Details
0001-Bug-780669-Do-not-start-a-new-run-at-a-zero-width-jo.patch (1.12 KB, patch) 2017-03-29 10:29 UTC, Mike FABIAN	none	Details \| Review
Updated patch (1.21 KB, patch) 2017-03-30 08:41 UTC, Mike FABIAN	none	Details \| Review
Update width_iter_next function to handle emoji sequence (1.70 KB, patch) 2017-04-05 09:31 UTC, Peng Wu	none	Details \| Review
Update width_iter_next function to handle emoji sequence (1.79 KB, patch) 2017-04-05 09:48 UTC, Peng Wu	none	Details \| Review
Update width_iter_next function to handle emoji sequence (1.94 KB, patch) 2017-04-11 06:53 UTC, Peng Wu	none	Details \| Review
Update width_iter_next function to handle emoji sequence (1.65 KB, patch) 2017-04-13 08:46 UTC, Peng Wu	none	Details \| Review
pango: Support emoji sequence in Unicode (1.86 KB, patch) 2017-04-14 06:32 UTC, Peng Wu	none	Details \| Review
pango: Support emoji sequence in Unicode (1.86 KB, patch) 2017-04-21 08:14 UTC, Peng Wu	none	Details \| Review
pango: Support emoji sequence in Unicode (1.91 KB, patch) 2017-05-03 02:57 UTC, Peng Wu	none	Details \| Review
pango: Support emoji sequence in Unicode (1.91 KB, patch) 2017-05-03 03:02 UTC, Peng Wu	committed	Details \| Review
some test cases (1022 bytes, patch) 2017-05-03 03:09 UTC, Peng Wu	committed	Details \| Review

Description Mike FABIAN 2017-03-29 06:14:49 UTC

Created attachment 348915 [details]
test-text.txt

Test text attached. Showing the test text like this

pango-view --font='Noto Color Emoji 48' ~/test-text.txt

on Fedora 25 shows the emoji not as a single character but as two.

On Fedora 24 (and openSUSE Leap 42.2 and Ubuntu 16.04) this works.

The version of “Noto Color Emoji” used in all these tests is
the latest one from https://www.google.com/get/noto/

which has this file size:

-rw-r-----. 1 mfabian mfabian 5987004 10月 20 11:46 NotoColorEmoji.ttf

The problem is the same when using the “Emoji One” font from:
https://github.com/Ranks/emojione/blob/master/assets/fonts/emojione-android.ttf

Comment 1 Mike FABIAN 2017-03-29 06:15:55 UTC

Created attachment 348916 [details]
pango-view-fedora-25.png

Broken display of the test-text.txt by 

pango-view --font='Noto Color Emoji 48' ~/test-text.txt

on Fedora 25.

Comment 2 Mike FABIAN 2017-03-29 06:17:12 UTC

Created attachment 348917 [details]
hb-view-fedora-25-correct.png

Correct display of the same test text using:

hb-view --font-size=24 --text-file=/home/mfabian/test-text.txt --output-file=/tmp/hb.png --output-format=png /usr/share/fonts/google-noto-emoji/NotoColorEmoji.ttf

Comment 3 Mike FABIAN 2017-03-29 06:18:15 UTC

Created attachment 348918 [details]
pango-view-fedora-24.png


Correct display of the same test-text.txt on Fedora 24 using

pango-view --font='Noto Color Emoji 48' ~/test-text.txt

Comment 4 Mike FABIAN 2017-03-29 06:19:38 UTC

hb-view --font-size=24 --text-file=/home/mfabian/test-text.txt --output-file=/tmp/hb.png --output-format=png /usr/share/fonts/google-noto-emoji/NotoColorEmoji.ttf

also work correctly on Fedora 24.

So it seems the problem is not in harfbuzz.

Comment 5 Mike FABIAN 2017-03-29 06:20:02 UTC

OK, the problem is caused by the width changes in glibc:

Fedora 24:

(gdb) p g_unichar_iswide(0x1f469)
$4 = 0
(gdb) p g_unichar_iswide(0x200d)
$5 = 0
(gdb) p g_unichar_iswide(0x2708)
$6 = 0
(gdb) 

Fedora 25:

(gdb) p g_unichar_iswide(0x1f469)
$39 = 1
(gdb) p g_unichar_iswide(0x200d)
$40 = 0
(gdb) p g_unichar_iswide(0x2708)
$41 = 0
(gdb)

Comment 6 Mike FABIAN 2017-03-29 06:20:29 UTC

Here is the function in pango-context.c which causes the break of the run:

/* g_unichar_iswide() uses EastAsianWidth, which is broken.
 * We should switch to using VerticalTextLayout:
 * http://www.unicode.org/reports/tr50/#Data50
 *
 * In the mean time, fixup Hangul jamo to be all wide so we
 * don't break run in the middle.  The EastAsianWidth has
 * 'W' for L-jamo, and 'N' for T and V jamo!
 *
 * https://bugzilla.gnome.org/show_bug.cgi?id=705727
 */
static gboolean
width_iter_iswide (gunichar ch)
{
  if ((0x1100u <= ch && ch <= 0x11FFu) ||
      (0xA960u <= ch && ch <= 0xA97Cu) ||
      (0xD7B0u <= ch && ch <= 0xD7FBu))
    return TRUE;

  return g_unichar_iswide (ch);
}

Comment 7 Mike FABIAN 2017-03-29 06:20:44 UTC

Maybe I should rewrite the function width_iter_next(PangoWidthIter* iter)
not to break emoji-zwj sequences:

https://git.gnome.org/browse/pango/tree/pango/pango-context.c#n866

Comment 8 Mike FABIAN 2017-03-29 10:29:25 UTC

Created attachment 348925 [details] [review]
0001-Bug-780669-Do-not-start-a-new-run-at-a-zero-width-jo.patch

Patch to fix the problem

Comment 9 Mike FABIAN 2017-03-30 08:41:01 UTC

Created attachment 348975 [details] [review]
Updated patch

I updated my patch a bit, pango should not break a run at 
skin tone modifiers either, otherwise ✌

Comment 10 Mike FABIAN 2017-03-30 08:43:41 UTC

otherwise pango would break between U+270C VICTORY HAND (which
is single width) and U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-5
which is double width.

Comment 11 Mike FABIAN 2017-03-30 12:03:36 UTC

Maybe it is even necessary to add 0x1f3f4 to the exceptions where
a break at a width change is prevented:

diff --git a/pango/pango-context.c b/pango/pango-context.c
index f0cea73..2739b2d 100644
--- a/pango/pango-context.c
+++ b/pango/pango-context.c
@@ -876,7 +876,18 @@ width_iter_next(PangoWidthIter* iter)
   while (iter->end < iter->text_end)
     {
       gunichar ch = g_utf8_get_char (iter->end);
-      if (width_iter_iswide (ch) != iter->wide)
+      if (ch == 0x200d || ch == 0x1f3f4 || (ch >= 0x1f3fb && ch <= 0x1f3ff))
+        {
+          /* do not break at a zero-width-joiner or skin tone modifiers*/
+          iter->end = g_utf8_next_char (iter->end);
+          if (iter-> end < iter->text_end)
+            {
+              ch = g_utf8_get_char (iter->end);
+              iter->wide = width_iter_iswide (ch);
+            }
+          continue;
+        }
+      else if (width_iter_iswide (ch) != iter->wide)
         break;
       iter->end = g_utf8_next_char (iter->end);
     }

Because flag sequences like this one:

1F3F4 E0067 E0062 E0073 E0063 E0074 E007F; Emoji_Tag_Sequence; Scotland

start with U+1F3F4 WAVING BLACK FLAG, which is a wide character
and the rest of the sequence is narrow characters.

Comment 12 Peng Wu 2017-04-05 09:31:40 UTC

Created attachment 349280 [details] [review]
Update width_iter_next function to handle emoji sequence

Comment 13 Peng Wu 2017-04-05 09:38:24 UTC

Based on Mike FABIAN's work, I improved the patch.

Comment 14 Peng Wu 2017-04-05 09:48:17 UTC

Created attachment 349282 [details] [review]
Update width_iter_next function to handle emoji sequence

Comment 15 Mike FABIAN 2017-04-05 10:12:47 UTC

Review of attachment 349282 [details] [review]:

This patch makes all emoji-zwj-sequences and all emoji sequences with skin tone modifiers and the new flag sequences like the Scottish flag work for me.

Comment 16 Matthias Clasen 2017-04-05 10:55:15 UTC

Review of attachment 349282 [details] [review]:

This makes the function very messy and hard to understand, I'm afraid. Can this be rewritten in terms of some auxiliary function ?

Comment 17 Matthias Clasen 2017-04-08 04:01:48 UTC

Bug 485556 might be relevant here ?

Comment 18 Peng Wu 2017-04-11 06:53:06 UTC

Created attachment 349659 [details] [review]
Update width_iter_next function to handle emoji sequence

Comment 19 Peng Wu 2017-04-11 10:19:48 UTC

The above attachment is simplified patch, please review it, thanks!

Comment 20 Matthias Clasen 2017-04-11 19:45:17 UTC

Review of attachment 349659 [details] [review]:

This looks more understandable, thanks.

Please add some details to the commit message, like what the observed misbehavior was that this patch fixes.

I hate to ask for more changes, but it would be great to have a few tests for this. Not sure if there is great api to observe the outcome of these changes.

::: pango/pango-utils.c
@@ +886,1 @@
 }

This change looks unrelated and is not mentioned in the commit message at all. Please move it to a separate commit

Comment 21 Mike FABIAN 2017-04-13 06:46:57 UTC

The new patch does not work for the sequence: 

U+1F469 U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468

Comment 22 Mike FABIAN 2017-04-13 06:51:15 UTC

The new patch from comment#19 also breaks the
flag sequence for Scotland:

U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F

The new patch breaks after U+1F3F4 here.

The old patch from comment#14 does keep that sequence together.

Comment 23 Peng Wu 2017-04-13 08:46:41 UTC

Created attachment 349767 [details] [review]
Update width_iter_next function to handle emoji sequence

Comment 24 Peng Wu 2017-04-13 08:49:46 UTC

Thanks for the review!

I will add some details to the commit message,
and try to write some test cases for it.

Comment 25 Mike FABIAN 2017-04-13 10:17:09 UTC

(In reply to Peng Wu from comment #23)
> Created attachment 349767 [details] [review] [review]
> Update width_iter_next function to handle emoji sequence

This patch works for me for all emoji sequences.

Comment 26 Takao Fujiwara 2017-04-14 02:46:21 UTC

Comment on attachment 349767 [details] [review]
Update width_iter_next function to handle emoji sequence

>diff --git a/pango/pango-context.c b/pango/pango-context.c
>index f0cea73..8254d4d 100644
>--- a/pango/pango-context.c
>+++ b/pango/pango-context.c
>@@ -876,6 +877,32 @@ width_iter_next(PangoWidthIter* iter)
>+      /* for variation selector-16, tag and emoji modifier. */
>+      if (G_UNLIKELY(ch == 0xFE0F

Why don't you check 0xfe0e besides 0xfe0f?
I think users can replace 0xfe0f with 0xfe0e.
http://www.unicode.org/emoji/charts/emoji-variants.html

>+                    || (ch >= 0xE0020 && ch <= 0xE007F)
>+                    || (ch >= 0x1F3FB && ch <= 0x1F3FF)))
>+        {
>+          iter->end = g_utf8_next_char (iter->end);
>+          continue;
>+        }
>+
>       if (width_iter_iswide (ch) != iter->wide)
>         break;
>       iter->end = g_utf8_next_char (iter->end);

Comment 27 Peng Wu 2017-04-14 06:32:20 UTC

Created attachment 349851 [details] [review]
pango: Support emoji sequence in Unicode

Checked several emoji sequences, the sequence pattern is
like follows:
1. Use zero width joiner to combine two characters
2. Append some variation selector, tag and emoji modifier
to the end of some character

Comment 28 Peng Wu 2017-04-14 06:37:00 UTC

(In reply to Takao Fujiwara from comment #26)
> Comment on attachment 349767 [details] [review] [review]
> Why don't you check 0xfe0e besides 0xfe0f?
> I think users can replace 0xfe0f with 0xfe0e.
> http://www.unicode.org/emoji/charts/emoji-variants.html
> 

Sorry, I just updated the patch to include 0xFE0E besides 0xFE0F.

Comment 29 Mike FABIAN 2017-04-17 06:24:37 UTC

See also: https://bugzilla.gnome.org/show_bug.cgi?id=781123

Comment 30 Mike FABIAN 2017-04-17 06:56:17 UTC

(In reply to Peng Wu from comment #28)
> (In reply to Takao Fujiwara from comment #26)
> > Comment on attachment 349767 [details] [review] [review] [review]
> > Why don't you check 0xfe0e besides 0xfe0f?
> > I think users can replace 0xfe0f with 0xfe0e.
> > http://www.unicode.org/emoji/charts/emoji-variants.html
> > 
> 
> Sorry, I just updated the patch to include 0xFE0E besides 0xFE0F.

This patch also works for me for all emoji-sequences.

(When using it together with Fujiwara San’s patch from 
https://bugzilla.gnome.org/show_bug.cgi?id=781123
Fujiwara San’s patch is necessary in addition to the
patch in this bug here).

Comment 31 Peng Wu 2017-04-21 08:14:04 UTC

Created attachment 350185 [details] [review]
pango: Support emoji sequence in Unicode

Checked several emoji sequences, the sequence pattern is
like follows:
1. Use zero width joiner to combine two characters
2. Append some variation selector, tag and emoji modifier
to the end of some character

Comment 32 Peng Wu 2017-04-21 08:15:22 UTC

minor changes, add unsigned integer modifiers.

Comment 33 Matthias Clasen 2017-04-27 02:21:35 UTC

Review of attachment 350185 [details] [review]:

I think the commit message should say something about width - this is what we are changing here after all.

Maybe:

Ignore emoji, zero-width jointers and variation selectors for the purposes of finding a run or uniform-width characters.

I'm ok with committing this with a better commit message, but I still want to see tests added that verify this for at least
some representative emoji sequences.

Comment 34 Behdad Esfahbod 2017-04-27 02:41:02 UTC

In Chrome, we introduced a emoji itemizer to detect all sequences that need to be rendered with color by default.  We should eventually do something like that in Pango as well.

Also note that Pango currently does not support variation selector sequences at all!

Comment 35 Peng Wu 2017-04-27 03:46:49 UTC

Sorry, to support the variation selector sequences, it needs another patch
in https://bugzilla.gnome.org/show_bug.cgi?id=781123 .

Comment 36 Peng Wu 2017-05-03 02:57:13 UTC

Created attachment 350919 [details] [review]
pango: Support emoji sequence in Unicode

Checked several emoji sequences, the sequence pattern is
like follows:
1. Use zero width joiner to combine two characters with
any width
2. Ignore the width of emoji, tag and variation selectors
for the purposes of finding a run or uniform-width characters.

Comment 37 Peng Wu 2017-05-03 03:02:11 UTC

Created attachment 350920 [details] [review]
pango: Support emoji sequence in Unicode

Checked several emoji sequences, the sequence pattern is
like follows:
1. Use zero width joiner to combine two characters with
any width
2. Ignore the width of variation selector, tag and emoji modifier
for the purposes of finding a run or uniform-width characters.

Comment 38 Peng Wu 2017-05-03 03:05:40 UTC

Updated commit message. Please review it, thanks!

Comment 39 Peng Wu 2017-05-03 03:09:10 UTC

Created attachment 350921 [details] [review]
some test cases

Comment 40 Peng Wu 2017-05-03 03:11:25 UTC

I tried to write some tests, dunno whether the test case is right.

Comment 41 Peng Wu 2017-05-24 02:51:05 UTC

Test with GraphemeBreakTest.txt of Unicode 9.0,
and tried to fix it in:
https://bugzilla.gnome.org/show_bug.cgi?id=782813

Comment 42 Peng Wu 2017-07-19 03:11:36 UTC

Could you review and merge the code?

Comment 43 Behdad Esfahbod 2017-07-28 14:59:10 UTC

lgtm.

Comment 44 Matthias Clasen 2017-07-29 17:28:53 UTC

Attachment 350920 [details] pushed as 734e442 - pango: Support emoji sequence in Unicode
Attachment 350921 [details] pushed as 77b56b4 - some test cases

Comment 45 Peng Wu 2017-07-31 06:41:11 UTC

Thanks for the review! :)

Maybe we can update test case of GraphemeBreakTest.txt to Unicode 10.0 in bug 782813.