Bug 787228 - better support for designate-other-coding-system (DOCS)
Status: RESOLVED OBSOLETE
Product: vte
Classification: Core
Component: general
Version: unspecified
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: VTE Maintainers
Depends on: vteparser
Blocks:
 
 
Reported: 2017-09-03 23:04 UTC by Mike Frysinger
Modified: 2021-06-10 15:24 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Mike Frysinger 2017-09-03 23:04:59 UTC
unless i'm misreading the vte source, it currently recognizes (but then ignores) these sequences:
- \e%@ - switch to ECMA-35/ISO-2022 mode
- \e%G - switch to UTF-8 mode

ideally it would use these as a signal to control things like GL/GR maps.  specifically, if in UTF-8 mode, GL/GR maps would be ignored completely.  this would prevent accidental switching to graphics maps by LS1/LS0/SCS/etc... sequences (the common "my terminal is corrupted" complaint).

even better would be to support the one-way transition sequences:
- \e%/G - one-way switch to UTF-8 level 1
- \e%/H - one-way switch to UTF-8 level 2
- \e%/I - one-way switch to UTF-8 level 3
i believe VTE already (or at least intends to) support UTF-8 level 3, so making them aliases for each other should be fine.  basically these would be like \e%G, except \e%@ would be ignored after the fact.
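
For reference, a minimal C sketch of the byte strings behind the sequences listed above. The names `DocsKind`, `docs_sequence` and `docs_emit` are made up for this illustration and are not part of VTE or any other library; only the escape sequences themselves come from the description.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical names for the DOCS sequences discussed in this bug. */
typedef enum {
    DOCS_ISO2022,     /* ESC % @   : switch to ECMA-35/ISO-2022 mode */
    DOCS_UTF8,        /* ESC % G   : switch to UTF-8, return allowed */
    DOCS_UTF8_LEVEL1, /* ESC % / G : one-way switch to UTF-8 level 1 */
    DOCS_UTF8_LEVEL2, /* ESC % / H : one-way switch to UTF-8 level 2 */
    DOCS_UTF8_LEVEL3  /* ESC % / I : one-way switch to UTF-8 level 3 */
} DocsKind;

/* Return the escape sequence for a kind, or NULL for an unknown value. */
const char *docs_sequence(DocsKind kind)
{
    switch (kind) {
    case DOCS_ISO2022:     return "\033%@";
    case DOCS_UTF8:        return "\033%G";
    case DOCS_UTF8_LEVEL1: return "\033%/G";
    case DOCS_UTF8_LEVEL2: return "\033%/H";
    case DOCS_UTF8_LEVEL3: return "\033%/I";
    }
    return NULL;
}

/* Write the sequence to a stream; returns 0 on success, -1 on error. */
int docs_emit(FILE *out, DocsKind kind)
{
    const char *seq = docs_sequence(kind);

    assert(out != NULL && seq != NULL);
    return fputs(seq, out) == EOF ? -1 : 0;
}
```

A shell init script would achieve the same with a single printf of the chosen sequence; the C form is only meant to pin down the exact bytes.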

even better would be to make the initial mode a configuration option, and by default, VTE starts off in UTF-8 mode.  or maybe does so by detecting the active locale (if it's a Unicode env, default to UTF-8).

mosh mentions this in their tech page:
  https://mosh.org/#techinfo
  Only Mosh will never get stuck in hieroglyphs when a nasty program writes to the terminal.

it also refers to Markus Kuhn's page:
  http://www.cl.cam.ac.uk/~mgk25/unicode.html#term
Comment 1 Egmont Koblinger 2017-09-04 00:04:59 UTC
(In reply to Mike Frysinger from comment #0)

> unless i'm misreading the vte source, it currently recognizes (but then
> ignores) these sequences:
> - \e%@ - switch to ECMA-35/ISO-2022 mode
> - \e%G - switch to UTF-8 mode

That's right. Support for runtime encoding switching via escape sequences was removed recently as part of a cleanup/bugfixing spree, see bug 731208 & bug 732586 (and a tiny followup bug 777747). The encoding (i.e. charset) can only be changed via gnome-terminal's profile prefs and menu entries, that is, via VTE's API. (Plus there's the internal state about those few maps, box drawing mode etc.)

> even better would be to support the one-way transition sequences:

I never understood the philosophy of these one-way transitions. I mean, there must be a way to revert what happened. Obviously opening a new terminal tab and closing the current one does that. How about the Reset menu entry, should that revert? I guess so, I think that's what the user expects. How about the "reset" command? I don't think it should, otherwise it wouldn't be called one-way transition if an escape sequence could revert it, would it? And then would it be fine for users if these two did something different? I don't think so. Oops... What was the case with hardware terminals? Did they revert on a physical power off only? (I guess you didn't have to throw them out and buy a new one :-))

Anyhow, further input/research is needed for me on this as well as the LS1 etc. stuff you mentioned.

Our current code is quite simple'n'stupid here (we put quite some effort into killing bugs by simplifying), the modification you suggest would make it more complicated.

What would be the practical benefits of implementing these changes? Who, when and why would emit those one-way switch escape sequences? Wouldn't it cause even more unexpected behavior for the user, by sometimes getting stuck in a different state? – quite the opposite of the goal you're trying to achieve right now with a solution to the "stuck in hieroglyphs" problem, if I understand you correctly.

> mosh mentions this in their tech page:
>   https://mosh.org/#techinfo
>   Only Mosh will never get stuck in hieroglyphs when a nasty program writes
> to the terminal.
> 
> it also refers to Markus Kuhn's page:
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html#term

I guess this is what PuTTY does too, and it breaks Midnight Commander's frames (if compiled against ncurses), and perhaps plenty of other apps.

Markus Kuhn is absolutely right that theoretically UTF-8 mode should be stateless, whereas he also mentions that many apps use escape sequences to switch to box-drawing mode. Just think of e.g. someone's in-house legacy app from decades ago, full in English, not supporting any non-ASCII at all except for box drawing, not caring about locales or charsets at all. We probably cannot afford to suddenly break this app. That's why none of the emulators I'm aware of except PuTTY (and maybe Mosh, I'm not familiar with that) follows his recommendation.

It's not really okay to require users to run a middle layer (e.g. luit) (well maybe it's okay if it's really just a very few apps), nor to work around mc's (and other apps') box drawing bugs by manually exporting NCURSES_NO_UTF8_ACS=1. I'd be glad though if you could convince ncurses's maintainer to make it the default, it'd take us noticeably closer to Markus's goal.

As for getting stuck in hieroglyphs, I believe distros should begin setting up their default prompt so that it performs some soft reset, e.g. restores the default charset, default colors and attributes etc.
Comment 2 Mike Frysinger 2017-09-05 17:53:03 UTC
(In reply to Egmont Koblinger from comment #1)

i would lean towards the UI->Reset not resetting things, but that'd be up for GNOME.  for the terminal sequences (e.g. `reset`), then no, it should not be possible to transition back to ISO-2022 mode.

these sequences don't show up randomly.  users send them because that's what they want -- they're trying to set their runtime env so that random binary data doesn't screw up their encoding/charmaps.  like putting them into their shell init scripts.  it's hard to argue that the terminal should be interpreting these in other ways or second guessing explicit user signals.

wrt implementation, when i implemented this in hterm, it was fairly straight forward.  the % escape sequences controlled two bools ("utf8-mode" and "utf8-locked"), and the first is consulted before deciding whether to filter output through the NRCS maps.  i understand the trade-offs between less code and bug rates and test coverage, so let's not rehash those generalities.  although if you dropped support for charsets entirely, then in your own words, there would be fewer bugs :).
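
A rough C sketch of the two-bool approach described above. This is not hterm's actual code (hterm is JavaScript); the struct and function names are invented for the illustration, and only the behavior of the two flags follows the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state holding the two bools described above. */
typedef struct {
    bool utf8_mode;   /* true: output bypasses the NRCS/charset maps */
    bool utf8_locked; /* true: a one-way switch was seen, ESC % @ is ignored */
} DocsState;

/*
 * Apply the bytes that follow "ESC %" (for example "@", "G" or "/G").
 * Returns true if the sequence is one of the known DOCS sequences,
 * even when it has no effect because the state is locked.
 */
bool docs_apply(DocsState *st, const char *seq)
{
    assert(st != NULL && seq != NULL);

    if (strcmp(seq, "@") == 0) {
        if (!st->utf8_locked)
            st->utf8_mode = false;
        return true;
    }
    if (strcmp(seq, "G") == 0) {
        st->utf8_mode = true;
        return true;
    }
    if (strcmp(seq, "/G") == 0 || strcmp(seq, "/H") == 0 ||
        strcmp(seq, "/I") == 0) {
        /* All three UTF-8 levels are treated as aliases here. */
        st->utf8_mode = true;
        st->utf8_locked = true;
        return true;
    }
    return false;
}

/* Consulted before deciding whether to filter output through the maps. */
bool docs_should_use_charset_maps(const DocsState *st)
{
    assert(st != NULL);
    return !st->utf8_mode;
}
```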

yes, if you tried to use an app that tries to use the translation maps after issuing a UTF-8 transition, they would not render correctly.  that would be the user's choice though (see above).  i disagree it's as bad as "stuck in graphics mode", but deliberating degrees of badness is probably a pointless exercise :).

i would highlight that some terminal emulators (like mosh) are already in this situation, and the sky hasn't fallen.  there might be some apps (like mc built against ncurses), but i think we should look at this long term: let's find the few apps and get them updated.  personally i've been using mosh and these modes on other terminal emulators and have yet to see a problem in the programs i use.  so i don't think "plenty of other apps" is an accurate assessment (i grok that my experience is anecdotal, but i believe the number of people using mosh is not insignificant).  mosh does export NCURSES_NO_UTF8_ACS when starting up the session (thanks for that pointer btw).

that's why my suggestion is to continue to default to ISO-2022: no existing apps will break.  the only way they would is if someone switched their terminal to UTF-8 mode, but then the behavior is what the user requested.  by respecting these transition escape sequences, we can make progress by letting people put their terminal into this mode and find things that misbehave.  we can't do that today because there's no way to even notice when the terminal is using these legacy charsets.

i'm not suggesting luit be something we expect people to use on a daily basis.  but (hopefully) at some point in the future, we'll be confident that the packages included in distros no longer utilize these legacy modes so we're only left with one-off legacy apps.  it's impossible to say "no one will ever need these again", but i think if it's only legacy/one-offs, asking people to run through luit is reasonable.

i'll investigate & push on the ncurses guys.  i agree that's something we shouldn't need anymore at least when the locale is UTF8 based.
Comment 3 Egmont Koblinger 2017-09-05 20:59:23 UTC
(In reply to Mike Frysinger from comment #2)

Disclaimer: I'm not the main guy on VTE, the main guy might override what I'm saying :)

> these sequences don't show up randomly.  users send them because that's what
> they want

How many users would you guess are out there who'd explicitly emit this sequence (let's say from .bashrc)? I'd guess it's negligible.

I'm a kind of guy who generally tries to focus on usability, user experience and the big picture rather than fully obeying the specs in all cases. I understand that from the spec's point of view, what you say is what we should be doing. Let's take one step back.


What's the current problem we're trying to address? As far as I understand, it's these two:

- User experience: The terminal can remain in "hieroglyphs mode". It's bad. I can see two main reasons for this to happen: an unclean exit (crash) of an app, or cat'ing a binary file. Both happen quite rarely. The Internet is full of pages providing a workaround for this, users can get used to blindly typing "reset" or invoking the menu entry, or even place the undoing escape sequence in their shell prompt.

- Technical correctness: It's truly ugly technically and against the stateless philosophy of UTF-8 that box drawing mode exists at all. From a technical point of view, it would really be nice to get rid of it. However, in the terminal emulation world, this would break backwards compatibility with US ASCII. UTF-8 was designed to be backwards compatible with US ASCII; however, terminal emulators used to do US ASCII (or Latin-X) + box drawing, so the only way to remain backwards compatible is to support UTF-8 + box drawing which is no longer stateless UTF-8.


What would we have with your suggestion, the one-way switch to "correct" stateless UTF-8 without box drawing mode?

- User experience: Users get to choose between two slightly broken behaviors: occasional hieroglyphs mode, or broken box characters in some apps.

- User experience: Those who choose the first (current and new default) would still occasionally get switched to the other mode. I can see two possible reasons: cat'ing a binary file, or running a smart ass app that for whatever reason believes it's a good idea to emit that escape sequence (I'm pretty sure such apps exist). It would probably occur less frequently than the current hieroglyphs mode, and would have much less impact on usability, but still, it could happen. I'm afraid it would be a problem that's pretty hard to google and track down for users. Plus, there would be no way to undo that by typing a command (or automated by placing something in PS1). I'm not convinced this is any better than what we have now.

- Probably only a negligible fraction of our users would figure out the existence and switch to this new mode.

- Technically: Backwards compatibility broken in the new mode.


If anything at all, I'm personally more open to introducing a new setting under Profile Preferences -> Compatibility for these two modes, rather than the one-way escape sequence. That would at least have the advantage that the mode cannot "accidentally" be switched.

But then still, the second mode would probably only be used by very few people (who care enough about the hieroglyphs problem that they switch, yet don't mind the apps that break), so I'm still not absolutely sure that it's worth it. Maybe just an env variable to control VTE's behavior would be enough??


With the escape sequence approach, the default would obviously need to stay the current behavior (as you also said). With the config option instead, it'd still need to stay the default (for a couple of years at least), as we cannot break compatibility and mc's (and other apps') look.

What I cannot see is the big picture, the transition plan. Hardly any folks will switch to the new mode. So how will apps that misbehave be caught? I guess we'd need a transition plan with buy-in from key players (at the very least xterm/ncurses's maintainer plus one or two key distributions). I think some major distro should switch to the new mode for a development cycle to get plenty of users to use the new mode and let bug reports flow in, and unless all such bugs can be fixed quickly, the behavior should be reverted to the old one shortly before a stable release. Then this repeated through a couple of development cycles if necessary. Also, preferably this change should be made in all the popular terminal emulators in parallel; after putting so much work in VTE I really wouldn't want to see people switching to another one just because they blame VTE for the broken look of apps during these experiments.

By the way, my biggest open source hobby project is VTE/gnome-terminal, and my second biggest is Midnight Commander. I hope you now understand why I'm strongly against breaking mc's look in VTE :-)

I wouldn't want VTE to export NCURSES_NO_UTF8_ACS as a workaround, first it's ugly design as hell, second it wouldn't work across ssh, su and friends.

> wrt implementation, when i implemented this in hterm, it was fairly straight
> forward.

Implementing indeed sounds fairly straightforward, it's not what I'm worried about. It's a few lines of extra code, it's special casing not to handle a certain escape sequence in UTF-8 mode only plus if another escape sequence has been seen.

Even if the transition succeeds and no remaining app uses box drawing mode in UTF-8 anymore, we still couldn't really further simplify the code. In order to do so, and have a big final code cleanup in place that removes box drawing mode, we'd need to get rid of all charsets except UTF-8. I don't think it's a feasible step in the foreseeable future. Maybe in (wild guess) 10 years or so??

Summary:

What do you think of a config option, or env variable controlling VTE's behavior rather than the escape sequence?

What's the big transition plan? Are there going to be a few dedicated people who test all the packages of a distro, or would we rely on the crowd? How to do the latter without making them switch away from VTE to an emulator where the problematic app is not "broken"?
Comment 4 Egmont Koblinger 2017-09-05 21:57:34 UTC
By the way...

Plenty of things suck big time in terminal emulation.

It expects that apps play nicely and clean up after themselves. If they don't, bad things happen. Outputting random stuff (e.g. cat'ing a binary file) can similarly cause tons of weirdness.

Mouse mode can remain enabled, causing unexpected highlighting experience. Bracketed paste mode can remain active, causing unexpected paste behavior. Colors, attributes can be nondefault, even unreadable. The alternate screen might remain the visible one. Keyboard might remain in application mode, generating different escape sequences. Cursor might remain hidden. Autowrapping might remain turned off. And so on and so forth for pretty much all the DEC(RE)SET escape sequences.

The output might not terminate with a newline, typically causing line editing issues after the non-left-aligned prompt.

Printed escape sequences (part of a binary file) might generate response escape sequences that appear as input (later at your shell prompt) as if you've typed them, completely fooling you.

An escape sequence might remain unterminated in the output, causing you to believe that the terminal froze. See bug 779518 and the ones linked from there.

Terminal line settings (stty) might remain at nondefault values, causing what you type at the shell not to appear, or something similar.

Apps started in the background might produce output that garble your screen, cause your fullscreen app to fall apart, or even split an escape sequence in half, breaking things in totally uncontrollable ways. Other utilities (e.g. write(1)) can also mess with your screen.

Plus sure a whole lot more that didn't occur to me right now...

And sure, you might get stuck in "hieroglyphs mode" too.

It's a world that sucks big time in this regard. If apps don't play nicely, things can easily go haywire. IMO addressing the hieroglyphs issue is not going to make this significantly suck less.
Comment 5 Egmont Koblinger 2017-09-15 12:00:30 UTC
Please see also bug 787701 comment 5.

A recent change in terminfo caused ncurses to drift away further from this goal. Until now, an app compiled against ncursesw running with NCURSES_NO_UTF8_ACS=1 used to emit raw UTF-8 line drawing (btw the name of the env var is totally misleading).

From now on, it only emits raw UTF-8 for the line drawing chars that are not repeated. Repeated ones were reverted to be emitted in the old way that you want to get rid of.
Comment 6 Christian Persch 2018-02-25 22:50:12 UTC
(In reply to Egmont Koblinger from comment #3)
> Disclaimer: I'm not the main guy on VTE, the main guy might override what
> I'm saying :)

I think you have it right; I agree with you :-)

> What do you think of a config option, or env variable controlling VTE's
> behavior rather than the escape sequence?

I've been thinking about this, and I think what we could have is API to choose whether we obey the ECMA-35 designation sequences, or not. Something like this:

--8<--
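typedef enum {
  VTE_CODING_SYSTEM_NATIVE,     /* 'ECMA-35': what we have now */
  VTE_CODING_SYSTEM_UTF8,       /* UTF-8 with standard return (ESC % @) */
  VTE_CODING_SYSTEM_UTF8_LOCKED /* UTF-8 without standard return */
} VteCodingSystem;

void vte_terminal_set_default_coding_system(VteTerminal *terminal,
                                            VteCodingSystem coding_system);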
-->8--

We'd then have gnome-terminal use that new API, either via a pref-only, or with UI (perhaps on the Compatibility page of the profile prefs).

NATIVE mode would be what we have now (which isn't really 'ECMA-35', but rather some weird thing where we recognise the charset designation sequences, but which really is more like DECPCTERM in that we also have an underlying input encoding setting).

UTF8 would be the UTF-8 mode with standard return (ESC % @), and UTF8_LOCKED would be the proposed UTF-8 mode without standard return. 

Escape sequences ESC % / [GHI] would switch from NATIVE to UTF8_LOCKED, and then we'd recognise *no* charset designation sequences anymore. ESC % G would switch from NATIVE to UTF8 with the possibility of using the ESC % @ return to NATIVE mode.

I also agree with Egmont in that reset (both hard and soft) should restore the API-set default (that is, you're not stuck in UTF8 or UTF8_LOCKED).
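
The transitions proposed above could be sketched roughly as follows. This is only an illustration with made-up names (`CodingSystem`, `CodingState` and the `coding_state_*` functions), not VTE code or API. It also assumes that ESC % / G, H and I are accepted from UTF8 as well as from NATIVE, which the proposal does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the three proposed modes. */
typedef enum {
    CODING_SYSTEM_NATIVE,
    CODING_SYSTEM_UTF8,
    CODING_SYSTEM_UTF8_LOCKED
} CodingSystem;

typedef struct {
    CodingSystem default_system; /* chosen by the embedder through the API */
    CodingSystem current;        /* changed by escape sequences at runtime */
} CodingState;

void coding_state_init(CodingState *st, CodingSystem default_system)
{
    assert(st != NULL);
    st->default_system = default_system;
    st->current = default_system;
}

/* seq holds the bytes after "ESC %"; unknown sequences are ignored. */
void coding_state_designate(CodingState *st, const char *seq)
{
    assert(st != NULL && seq != NULL);

    /* Once locked, no designation sequence is recognised any more. */
    if (st->current == CODING_SYSTEM_UTF8_LOCKED)
        return;

    if (strcmp(seq, "@") == 0)
        st->current = CODING_SYSTEM_NATIVE;
    else if (strcmp(seq, "G") == 0)
        st->current = CODING_SYSTEM_UTF8;
    else if (strcmp(seq, "/G") == 0 || strcmp(seq, "/H") == 0 ||
             strcmp(seq, "/I") == 0)
        st->current = CODING_SYSTEM_UTF8_LOCKED;
}

/* Hard and soft reset both restore the API-set default. */
void coding_state_reset(CodingState *st)
{
    assert(st != NULL);
    st->current = st->default_system;
}
```

With this shape, reset is the only way out of the locked mode, so a terminal is never permanently stuck in UTF8 or UTF8_LOCKED.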

A further question I have is what should happen in these UTF8{,_LOCKED} modes with the input encoding (vte_terminal_set_encoding()). Should switching to UTF8{,_LOCKED} keep using that possibly non-UTF-8 encoding, or should it also switch the input to UTF-8?

Egmont, does that proposal sound reasonable to you? (It'll be very easy and not much code to implement in the new parser world.)

(In reply to Egmont Koblinger from comment #5)
> A recent change in terminfo caused ncurses to drift away further from this
> goal. Until now, an app compiled against ncursesw running with
> NCURSES_NO_UTF8_ACS=1 used to emit raw UTF-8 line drawing (btw the name of
> the env var is totally misleading).
> 
> From now on, it only emits raw UTF-8 for the line drawing chars that are not
> repeated. Repeated ones were reverted to be emitted in the old way that you
> want to get rid of.

I guess that's because of the xterm limitation that REP only works for characters < 256 there.
Comment 7 Egmont Koblinger 2018-08-17 11:35:11 UTC
Sorry for the long delay.

How much is this still relevant after the parser rewrite, and the deprecation of set_encoding()?

I'm usually in favor of _simplifying_ things. Dropping support for runtime encoding change (via escape sequences) was in this direction, and at some point in the future dropping support of 8-bit charsets (via API) will also be in this direction.

This bug here is primarily about getting rid of the special legacy "box drawing" mode within UTF-8, that is, making it real stateless UTF-8, as per Markus's recommendation and PuTTY's behavior.

Alas, for compatibility reasons, we can't just simply do it without breaking some apps. PuTTY users often run into the problem of seeing "qqq" (and a batman symbol) instead of box drawing chars, it would be bad if this happened in VTE too.

So the only thing we could do is to _add_ more complexity to our current state. Instead of having a "stateful UTF-8 + box drawing", we'd have yet another state that chooses between "stateful UTF-8 + box drawing" versus "real stateless UTF-8". Do I understand the essence of your proposal correctly?

I'm not explicitly against this approach, so if you strongly feel like doing it then go for it, but I'm not really supportive of this idea either. IMHO the best we can do is just leave it as it is.

If I saw that adding this complexity now is temporary only and helps transitioning to the much simpler fully stateless version, I'd be in favor of it. As much as I wish it was the case, unfortunately I can't see how we could get rid of the statefulness in the foreseeable future.
Comment 8 GNOME Infrastructure Team 2021-06-10 15:24:34 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2427.