Bug 767529 – improved emoji support (unicode TR51)

Bug 767529 - (vteemoji) improved emoji support (unicode TR51)

Summary:	improved emoji support (unicode TR51)


Status:	RESOLVED OBSOLETE

Product:	vte
Classification:	Core
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	VTE Maintainers
QA Contact:	VTE Maintainers

URL:
Whiteboard:

Duplicates:	767457 777624 781676 785563 (view as bug list)
Depends on:
Blocks:

Reported:	2016-06-11 12:16 UTC by Christian Persch
Modified:	2021-06-10 15:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Christian Persch 2016-06-11 12:16:39 UTC

In unicode 9, there's a change to East_Asian_Width property that makes emojis Wide, see http://www.unicode.org/reports/tr11/tr11-30.html :

'''
ED4. East Asian Wide (W): All other characters that are always wide. These characters occur only in the context of East Asian typography where they are wide characters (such as the Unified Han Ideographs or Squared Katakana Symbols). This category includes characters that have explicit halfwidth counterparts, along with characters that have the [UTR51] property Emoji_Presentation, with the exception of the range U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A through U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z.
'''

and

'''
Emoji style standardized variation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.
'''

So it's not as simple as just a change coming through glib's iswide[_cjk] API; we'll need access to the Emoji_Presentation property (either through glib or, more simply, by adding it to vte itself).

Comment 1 Christian Persch 2017-04-30 20:43:19 UTC

*** Bug 781676 has been marked as a duplicate of this bug. ***

Comment 2 Christian Persch 2017-04-30 21:02:39 UTC

*** Bug 767457 has been marked as a duplicate of this bug. ***

Comment 3 Christian Persch 2017-07-15 20:17:21 UTC

I was working a bit on this, and several questions came up:

* I'm adding a sequence to select between text and presentation style by default (OSC, unless sth else is more appropriate?). Should that be global, or per screen?

* With the above setting, should we also have an SGR code to select text/presentation style, or solely do that with the VS15/VS16 variation selectors?

* Should we show text attributes (bold, italic, underline etc) on emojis? IMHO bold/italic make no sense, unsure about the others.

* I currently hardcode "Noto (Color) Emoji" as font; do we need API to select the font(s)?

* When the font doesn't have a glyph for a ZWJ sequence or emoji+modifier, they'll be shown as multiple glyphs which of course are much too wide. But the sequences are stuffed together into the cell at the processing layer, which doesn't (and shouldn't) have access to the font in the drawing layer. What should we do in this situation? Could either show a 'missing glyph' on draw, or disallow (ie not compose into one cell) ZWJ and modifier sequences.

Comment 4 Christian Persch 2017-07-29 15:37:32 UTC

*** Bug 785563 has been marked as a duplicate of this bug. ***

Comment 5 Christian Persch 2017-07-29 18:57:46 UTC

*** Bug 777624 has been marked as a duplicate of this bug. ***

Comment 6 Egmont Koblinger 2017-12-17 16:48:51 UTC

Not sure if I understand correctly, but... can a VS15/VS16 turn the preceding character from narrow to wide one?

If so then we need some special casing in the terminal emulation logic (which we'd need around spacing marks too, see bug 584160 comment 45), since there's no guarantee that VTE receives both codepoints in a single run. We need to handle the narrow one first, and maybe later make it wider, potentially overflowing to the next line.
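
A minimal sketch of that special casing, not VTE's actual code: the TinyScreen class and its row/cell layout are hypothetical, and every character other than VS16 is assumed to be narrow to keep the example short. It shows a narrow cell being widened when VS16 arrives in a later chunk, and moved to the next line when the row has no room left.

```python
VS16 = "\ufe0f"  # emoji presentation selector


class TinyScreen:
    """Hypothetical screen model: rows of [text, width] cells."""

    def __init__(self, columns):
        self.columns = columns
        self.rows = [[]]

    def _used(self, row):
        return sum(width for _, width in row)

    def feed(self, chunk):
        for ch in chunk:
            row = self.rows[-1]
            if ch == VS16 and row and row[-1][1] == 1:
                # The previous character was already placed as narrow;
                # widen it now that the selector has arrived.
                cell = row[-1]
                cell[0] += ch
                cell[1] = 2
                if self._used(row) > self.columns:
                    # No room left: the widened cell overflows to a new line.
                    row.pop()
                    self.rows.append([cell])
            elif ch == VS16:
                # Already wide, or nothing to attach to: no width change.
                if row:
                    row[-1][0] += ch
            else:
                # Simplification: all other input is treated as narrow.
                if self._used(row) + 1 > self.columns:
                    row = []
                    self.rows.append(row)
                row.append([ch, 1])


screen = TinyScreen(columns=3)
screen.feed("ab1")   # "1" lands in the last column as a narrow cell
screen.feed(VS16)    # arrives in a separate run; "1" becomes wide and wraps
```

The reverse case (wide to narrow) would require undoing a wrap or a scroll that already happened, which this model has no way to express.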

Can it happen the other way around, a Unicode suffix turning a character from wide to narrow? That would be a true nightmare to implement. (Oops, we shouldn't have scrolled the terminal, let's undo it...)

Comment 7 Christian Persch 2017-12-17 17:15:03 UTC

Yes. UAX11 says under "Recommendations":
"""
[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.
"""

and UTS51 defines this as (emoji presentation selector is VS16):
"""
ED-9a. emoji presentation sequence — A variation sequence consisting of an emoji character followed by a emoji presentation selector.

emoji_presentation_sequence := emoji_character emoji_presentation_selector
"""

So yes, VS16 can turn the character wide, for example <1, VS16> makes the normally narrow "1" into a wide character. Luckily, the other direction does NOT happen; a wide character (which includes all characters which have emoji presentation) followed by VS15 still is wide.
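
As a rough illustration of that asymmetry, here is a sketch of a string width function. It is a simplification and not any library's API: string_width and base_width are hypothetical names, and it widens any narrow character followed by VS16 instead of checking that the pair is a valid emoji presentation sequence.

```python
import unicodedata

VS15 = "\ufe0e"  # text presentation selector
VS16 = "\ufe0f"  # emoji presentation selector


def base_width(ch):
    """Width of a single code point, ignoring variation selectors."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def string_width(text):
    total = 0
    prev = 0  # width of the most recent non-zero-width character
    for ch in text:
        if ch == VS16 and prev == 1:
            # VS16 turns the preceding narrow character wide.
            total += 1
            prev = 2
        elif ch in (VS15, VS16):
            # VS15 never narrows; VS16 after a wide character changes nothing.
            continue
        else:
            w = base_width(ch)
            total += w
            if w:
                prev = w
    return total


print(string_width("1"), string_width("1" + VS16))
```

Note that the result for a string is no longer the sum of per-character widths, since the selector's contribution depends on what precedes it.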

Comment 8 Egmont Koblinger 2018-09-24 11:00:19 UTC

Doesn't VS16 break the principle that wcwidth() and wcswidth() return the width taken up in terminal emulators?

wcswidth() might still get it right for the entire string, but then it needs to do more complex logic than just the sum of the wcwidth()s of its characters.

Or shall we just say that it's not our problem? :P

Comment 9 George Nachman 2018-10-19 03:24:29 UTC

FWIW Apple's Terminal does not treat U+2764 U+FE0F (heart with VS16) as wide. Has anyone found an application that expects this behavior? I'd like to do the Right Thing in iTerm2 but I have a feeling it would cause more problems than it would solve.

Comment 10 Egmont Koblinger 2018-10-19 07:31:44 UTC

Let's also mention that Unicode 11 adds a new note to tr11 section 2, marked with yellow at https://www.unicode.org/reports/tr11/tr11-34.html :

"Note: The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time."

I totally don't get their intent. If not for terminal emulators then for who the hell are the East_Asian_Width properties at all? Looks to me that they're just making everything less controlled, more flexible => more of a mess.

It's also unclear to me whether section 5's recommendation "emoji presentation sequences behave as though they were East Asian Wide" is subject to this new clause under section 2 or not. A really not careful quick glimpse suggests to me that it's probably not: section 2 allows terminals to override the definitions coming from the East_Asian_Width file when they are resolved to actual width values, whereas section 5's recommendation talks about the resolved values stating that they should be wide.

So, George, I'm not sure if it's relevant to your question :) probably only to the extent that the "Right Thing" you're looking for may not even exist.

I tend to agree that deviating from wcwidth() would cause more problems than it would solve. And it's not a realistic expectation for bash, zsh, ncurses, you name it to special case VS16, especially if the road to this behavior is via temporary breakages. I'm really clueless what to do.

Comment 11 Egmont Koblinger 2018-10-19 07:53:53 UTC

Linking back to https://gitlab.com/gnachman/iterm2/issues/7239.

Comment 12 Christian Persch 2018-10-19 21:36:52 UTC

I've been thinking about this problem for a while, and while I don't have a completely thought-out solution, I do have some ideas.

While diverging from wcwidth() clearly has disadvantages, wcwidth() itself has some problems:
* Unicode version difference: when libc (e.g. remote) and terminal use data from different versions, the result may differ.
* wcwidth() is not stable: it changes when a future Unicode version decides to make an existing character Emoji_Presentation
* Ambiguous width: when the terminal makes ambiguous-width characters wide, the app will still think they're narrow (afaik glibc wontfixed ambiguous=wide locale variants).

Since the above-cited addition to TR11 basically tells us to come up with a standard of our own, I think we should take that opportunity. 

A solution should, most importantly, be stable: character width should not change in future. (Except if there was an actual *bug* in the standard, and the error is grave.) To achieve this even when following unicode and emoji standards, characters that are likely or possibly being emojified in future should be wide, that is not only Emoji_Presentation, but also Extended_Pictographic and probably even the whole blocks (Dingbats, Miscellaneous Symbols and Pictographs, Emoticons, ...).
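
A sketch of the "whole blocks are wide" idea, under the assumption that only the three blocks named above are covered (the "..." is left open). The names FUTURE_WIDE_BLOCKS and stable_width are hypothetical, and Emoji_Presentation / Extended_Pictographic are not checked here because Python's unicodedata module does not expose them.

```python
import unicodedata

# Block ranges from the Unicode block list; only the blocks named above.
FUTURE_WIDE_BLOCKS = (
    (0x2700, 0x27BF),    # Dingbats
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
)


def stable_width(ch):
    """Width that treats whole pictographic blocks as wide up front."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in FUTURE_WIDE_BLOCKS):
        return 2
    # Otherwise fall back to the East_Asian_Width property.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


# U+2764 HEAVY BLACK HEART is East Asian Neutral, but sits in Dingbats.
print(stable_width("\u2764"), stable_width("a"))
```

With a table like this, a later Unicode version giving a Dingbats character Emoji_Presentation would not change its width.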

As to VS15/16, IMO both VS15 and VS16 should make a narrow Emoji character wide, regardless of whether the terminal will actually use text or emoji presentation for it in the end.

The application could get the actual width of a character/string either by a custom library we provide (shared, or copy-pasted), by calculating it itself after something like a glyph width request [https://uobikiemukot.github.io/yaft/glyph_width_report.html], or by a new DCS sequence to measure the width of the passed DATA string.

The use of this standard (instead of libc wcwidth()) could even be gated on a private DECSET mode.

Comment 13 George Nachman 2018-10-20 00:29:51 UTC

> A solution should, most importantly, be stable: character width should not change in future. (Except if there was an actual *bug* in the standard, and the error is grave.)

Would we be in agreement that Unicode 9's change to make Emoji wide was a bug fix?

> The application could get the actual width of a character/string either by a custom library we provide (shared, or copy-pasted), by calculating it itself after something like a glyph width request [https://uobikiemukot.github.io/yaft/glyph_width_report.html], or by a new DCS sequence to measure the width of the passed DATA string.
> The use of this standard (instead of libc wcwidth()) could even be gated on a private DECSET mode.

A library would be nice since you need to have a TTY to do status reports. Sending a request for a status report is fraught with peril since not all terminals will respond, and then you're blocked reading until a timeout (or use some hack like sending two reports and seeing which one comes back first). But if you do a library, how will an application know whether the terminal it's talking to right now supports that width calculation? The terminal will need a way to tell the library what it can do. More on this later.

One downside of a library is that it will get out of date as quickly as wcwidth. You'll need private DECSET to communicate the library version. I dislike DECSET for this because when your ssh connection dies or your program crashes it's left in the wrong mode. Based on the issue reports I get, this is the source of a lot of confusion (paste bracketing, mouse reporting, and focus reporting being the main culprits). Consider what iTerm2 did with a stack for Unicode version as an alternative. It offers a recovery scenario for those who are interested. Search https://iterm2.com/documentation-escape-codes.html for "Unicode Version".
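
To illustrate why a stack recovers better than a plain mode switch, here is a toy model. It is not iTerm2's implementation and does not reproduce its escape sequences: the VersionStack class, its method names and the labelled-pop semantics (discard everything saved after the label) are assumptions made for this sketch.

```python
class VersionStack:
    """Hypothetical terminal-side state for the active Unicode version."""

    def __init__(self, default_version):
        self.current = default_version
        self._saved = []  # (label, version) pairs, most recent last

    def push(self, label=None):
        # Remember the version in effect so it can be restored later.
        self._saved.append((label, self.current))

    def set(self, version):
        self.current = version

    def pop(self, label=None):
        # Restore saved versions until the matching label is reached.
        # Without a label, only the most recent entry is restored.
        while self._saved:
            saved_label, version = self._saved.pop()
            self.current = version
            if label is None or saved_label == label:
                return


term = VersionStack(default_version=8)
term.push("shell")
term.set(9)
term.push("editor")
term.set(11)
# Suppose the editor crashes here without popping its entry.
term.pop("shell")  # the shell prompt unwinds past the stale entry
print(term.current)
```

With a bare DECSET-style switch there is nothing to unwind to, so the terminal stays in whatever mode the crashed program left behind.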

Finally, we now live in a nightmarish hellscape where the interface to wcwidth can no longer be a single code point, but some variable-length number of code points. At a minimum, two (DINGBAT + VS16); but I wouldn't be the least bit surprised to discover that there are longer sequences that can change the length of a character. So it's not a plug and play replacement for wcwidth. It means ncurses needs to change for this to have any hope of being useful. Terminal emulators will need to cope with existing characters changing widths when new input is received (which is annoying but much more tractable than fixing all existing software that uses wcwidth). All of these problems can be overcome.

In thinking about querying the terminal emulator, besides the aforementioned downsides, is the issue of which queries do you send it? I suppose just Emoji to begin with. It's impractical to query every grapheme cluster. But if all you want is a fix for Emoji, well, a library (plus a reliable way of communicating its capabilities) sure is simpler.

Having written all this, my thinking is this:

1. wcwidth is inadequate, and should be left for dead ASAP
2. querying the terminal emulator for every new grapheme cluster is impractical. Querying it only for Emoji only solves a very narrow slice of the problem. You don't know what's going to change next year, so you don't know what to query for. Querying doesn't solve an important problem unless you're willing to send a *lot* of queries.
3. A library fixes the Emoji problem but is only an adequate solution if it can discover from the terminal emulator what width algorithm it's using. Capability discovery is not a well solved problem.

In conclusion, if we had a good way for applications to discover the capabilities of the terminal emulator, it would give them the ability to know with some degree of confidence what the width of any given character is, and to partake in a negotiation with the terminal emulator (via private DECSET or the like) to come to a mutual agreement.

# Terminal Capabilities

There are a few ways that the terminal emulator can communicate its capabilities to an application, and they are all horribly flawed.

TERM - Most terminals lie and say they're xterm so it will work when you ssh to some ancient system.
LC_*, LANG - Owned by libc, unlikely to change semantics.
TERM_PROGRAM - Apple-only AFAIK.
The various DEVICE ATTRIBUTES control sequences — Asynchronous API with rough edges and ill-defined semantics. The least terrible option.

Synchronizing on glyph size is a problem emblematic of those we'd like to see fixed by communicating a tiny bit of information about the terminal emulator's capabilities to the running application. What we'd like to say is "I support Unicode 9" or "I support VTE's fancy wcwidth replacement, but if you don't then please fall back to Unicode 9". Another example of a capability that would be nice to advertise is support for 24-bit color, which as I understand has not had any success getting added to terminfo.

I would propose a new environment variable that gives a list of capabilities in some compact and extensible representation. Fixing wcwidth is important enough that it could help get this off the ground.

By default, OpenSSH passes all environment variables beginning with LC_ through. This is hacky, to be sure, but is the best communication channel that we have. Something like SECONDARY DEVICE ATTRIBUTE could be investigated as another option, but it has its own problems.

I would propose something like:

LC_TERMINAL_CAPABILITIES=[comma separated list of capabilities]

Capabilities would be short strings describing a single feature of a terminal emulator. For example:
U9 (unicode 9)
Vwc (VTE's fancy upcoming wcwidth)
24 (24-bit color support)

There should be a central authority (which I'm happy to help with in any capacity) to manage the namespace of these capabilities to ensure they are well documented and well behaved, with future and backward compatibility taken into consideration.
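
On the application side, reading such a variable could look like the sketch below. LC_TERMINAL_CAPABILITIES and the capability names are only the proposal from this comment, not an existing convention, and parse_capabilities / pick_width_mode are hypothetical helpers.

```python
import os


def parse_capabilities(environ=os.environ):
    """Return the set of capability tokens advertised by the terminal."""
    raw = environ.get("LC_TERMINAL_CAPABILITIES", "")
    return {token.strip() for token in raw.split(",") if token.strip()}


def pick_width_mode(capabilities):
    # Prefer the richest width algorithm both sides understand,
    # falling back to plain libc wcwidth() when nothing is advertised.
    if "Vwc" in capabilities:
        return "vte-wcwidth"
    if "U9" in capabilities:
        return "unicode9"
    return "libc-wcwidth"


caps = parse_capabilities({"LC_TERMINAL_CAPABILITIES": "U9,Vwc,24"})
print(sorted(caps), pick_width_mode(caps))
```

Unknown tokens are simply ignored, which is what lets the list grow without breaking older applications.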

Apologies for the dissertation. I think Egmont's style has rubbed off on me ;-)

Comment 14 Egmont Koblinger 2018-10-20 11:21:44 UTC

Seems that in addition to addressing the particular pretty complex scenario around widths, this thread is now also starting to discuss more generic troubles around feature reporting (TERM and friends) and ways to fix/extend that.

I've been thinking about the latter for a while, I have a couple of ideas. Maybe it's the right time for me to finally write them down. I'll try to do it next week.

I'll also come back and respond to the particular width issue, please hang on.

> Apologies for the dissertation. I think Egmont's style has rubbed off on me ;-)

I hate this style of mine, I just find it hard to concisely express myself in a language that I'm not that good in; especially ever since I stopped speaking English on a daily basis. (E.g. when the BiDi draft reached ~30 pages I started shortening it. After plenty of attempts to shorten it I managed to cut it down to 50 :-D)

Comment 15 Egmont Koblinger 2018-10-30 11:45:26 UTC

I created this page about my draft idea how terminal feature reporting could be fixed:

https://gist.github.com/egmontkob/9a718bc1c82eed354dd0ad8e2b53007a

Do you guys think it makes sense for us to move forward in that direction?

---

(In reply to Christian Persch from comment #12)

> While diverging from wcwidth() clearly has disadvantages, wcwidth() itself
> has some problems: [...]
> 
> Since the above-cited addition to TR11 basically tells us to come up with a
> standard of our own, I think we should take that opportunity.

Yup. Although I believe wcwidth has to remain our default, since that's the one used by the vast majority of applications. (And while it breaks over ssh across different wcwidth implementations, that's still IMHO a relatively rare case compared to running apps locally.)

> A solution should, most importantly, be stable: character width should
> not change in future.

I disagree here. Temporary breakages (like the one we saw with Unicode 9.0 turning the high voltage sign and many others into a wide one) suck, but they're mostly gone after a year or two. IMO the only thing worse than a temporary breakage is a permanent breakage, if something's wrong for decades because we're afraid of breaking it for a year or two. If Unicode or wcwidth changes, we need to follow that.

> The application could get the actual width of a character/string either by a
> custom library we provide (shared, or copy-pasted)

I personally don't feel like working (even partially) on such a library, especially if it has to be cross-platform.

> by calculating it itself
> after something like a glyph width request
> [https://uobikiemukot.github.io/yaft/glyph_width_report.html],

With VS16 we might be entering a new era when the width cannot be computed on a per-character basis. Or at least apps would need to know exactly where to cut the string to get individual characters (e.g. that VS* sticks to the preceding codepoint and form a character together).

> or by a new
> DCS sequence to measure the width of the passed DATA string.

Asynchronous escape sequences are bad (as I outline in the aforementioned doc). Slow, fragile, unreliable. They are a firm no-go from me.

We can't expect let's say a text editor to run all the pageful of characters through such asynchronous request after every PageDown keypress, and/or cache the result for each character appearing in the document.

If we provided a method to measure the width of the passed data, the caller would potentially need to call it multiple times (kinda like a binary search) to figure out where to logically crop the string so that the longest possible fragment ends up being displayed. This would mean that it would need to wait for the roundtrip time not once but several times.
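
The binary search can be sketched as follows. The measure callback is a hypothetical stand-in for one asynchronous width query to the terminal; here it is replaced by a local function that counts how often it is called, and it assumes that the width of a prefix never shrinks as the prefix grows.

```python
def longest_fitting_prefix(text, columns, measure):
    """Largest n such that text[:n] fits in `columns` cells."""
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if measure(text[:mid]) <= columns:  # one roundtrip per probe
            lo = mid
        else:
            hi = mid - 1
    return lo


roundtrips = 0


def fake_measure(fragment):
    """Stand-in for a terminal query: one cell per code point."""
    global roundtrips
    roundtrips += 1
    return len(fragment)


cut = longest_fitting_prefix("abcdefghij", 4, fake_measure)
print(cut, roundtrips)
```

Even for this ten-character string the search needs several probes, and each probe in the real scheme would cost a full roundtrip to the terminal.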

Instead, IMO iterm2's approach of switching between Unicode versions (and reporting which ones are supported by the emulator) is the way to go.

> The use of this standard (instead of libc wcwidth()) could even be gated on
> a private DECSET mode.

+1 for gating.

---

(In reply to George Nachman from comment #13)

> One downside of a library is that it will get out of date as quickly as
> wcwidth. You'll need private DECSET to communicate the library version. I
> dislike DECSET for this because when your ssh connection dies or your
> program crashes it's left in the wrong mode. Based on the issue reports I
> get, this is the source of a lot of confusion (paste bracketing, mouse
> reporting, and focus reporting being the main culprits). Consider what
> iTerm2 did with a stack for Unicode version as an alternative. It offers a
> recovery scenario for those who are interested. Search
> https://iterm2.com/documentation-escape-codes.html for "Unicode Version".

I think DECSET is okay. A breaking ssh is probably a relatively rare scenario, and already suffers from other problems as you mentioned. A nice fix to them is to type "reset", or to stuff your local PS1 with all kinds of escape sequences that individually reset the relevant properties (which should be done by distributions anyway).

That being said, the idea of pushing/popping with identifiers is also a nice one, sounds fine to me.

> 1. wcwidth is inadequate, and should be left for dead ASAP

I'm afraid it needs to remain the default for quite a while because that's what almost everyone uses. That being said, we should move forward by providing better alternatives.

> 2. querying the terminal emulator for every new grapheme cluster is
> impractical. [...]

Fully agree.

> In conclusion, if we had a good way for applications to discover the
> capabilities of the terminal emulator, it would give them the ability to
> know with some degree of confidence what the width of any given character
> is, and to partake in a negotiation with the terminal emulator (via private
> DECSET or the like) to come to a mutual agreement.

Yup, and this is where my new proposal comes into play :)

> By default, OpenSSH passes all environment variables beginning with LC_
> through.

No, OpenSSH's default config doesn't contain SendEnv / AcceptEnv lines. Some vendors add "LC_*" there, some add each known LC_THIS and LC_THAT spelled out individually.

> Capabilities would be short strings describing a single feature of a
> terminal emulator. For example:
> U9 (unicode 9)
> Vwc (VTE's fancy upcoming wcwidth)
> 24 (24-bit color support)

Nitpicking, but I'm not a fan of such squeezed names (or terminfo's ones). It doesn't hurt to spell out "Unicode9" or "Truecolor". Ssh could compress on the fly while you're wasting megabytes by listening to music and watching kittens online :-D

> There should be a central authority (which I'm happy to help with in any
> capacity) to manage the namespace of these capabilities to ensure they are
> well documented and well behaved, with future and backward compatibility
> taken into consideration.

That sounds great!

Comment 16 towo 2019-02-22 14:04:29 UTC

> > 1. wcwidth is inadequate, and should be left for dead ASAP

> I'm afraid it needs to remain the default for quite a while because that's what almost everyone uses.

Terminals support legacy applications and frameworks/libraries that largely rely on wcwidth. In order to provide emojis consistently, I prefer the backward approach (as implemented in mintty): cell width is always as wcwidth reports, emojis are adjusted to it, i.e. possibly squeezed or stretched or aligned in a wider cell range.
See https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9#note_121976

Comment 17 GNOME Infrastructure Team 2021-06-10 15:14:24 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2317.