Bug 769316 – vte loses tab characters at soft line breaks



Summary:	vte loses tab characters at soft line breaks


Status:	RESOLVED OBSOLETE

Product:	vte
Classification:	Core
Component:	general
Version:	0.34.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	VTE Maintainers
QA Contact:	VTE Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2016-07-30 00:52 UTC by Luke Hutchison
Modified:	2021-06-10 15:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Luke Hutchison 2016-07-30 00:52:39 UTC

When I print TSV (tab-separated values) data to the terminal and a tab character (or several tab characters) falls at a "soft line break" (where the line hasn't ended, but the right edge of the console is reached), the tab character(s) are dropped when I select and copy a block of text containing that line. Tab characters are significant in TSV, so this is a big problem when working with tab-delimited data in the console.

(Sorry I'm testing on such an old version (0.34.x), it is the only version that we can run on my work machine.)

Comment 1 Luke Hutchison 2016-07-30 00:53:14 UTC

PS I suspect the same is also true of other whitespace characters at the ends of lines, e.g. standard spaces?

Comment 2 Egmont Koblinger 2016-07-30 07:13:15 UTC

VTE is one of the few terminal emulators that try to remember tabs and allow copy-pasting them in certain typical (but not all) scenarios. Most emulators just copy spaces instead.

This (I mean the behavior of most other emulators) is because when it comes to terminal emulation, tab is a control character (similarly to escape sequences), not a printable one. You can't copy-paste other escape sequences either. A tab advances the cursor to the next tab stop, leaving the characters underneath unchanged.

The visible behavior of tab is already terribly weird at the right margin for legacy reasons, and unfortunately there's no way to fix it (make it wrap to the next line) without breaking backwards compatibility, and there's no sane way to rewrap the contents upon a horizontal resize.

Copy-paste _could_ be made not to lose them in the most typical scenario (the one that you describe) by remembering the number of tabs printed at the end of the line, by using tons of special one-off code just for the sake of tabs. But it would break as soon as you resize the terminal; I have no idea how it could work nicely together with rewrapping.

Tabs are, to my knowledge, one of the two issues where printing some stuff and then rewrapping causes a different result than printing with the new width in the first place. (The other one is bce: bug 754596.)

Binary data can't be copy-pasted safely (e.g. some systems use ^A (0x01) delimited fields, those _should_ be swallowed by VTE when printed). Consider tabs to be binary stuff.

I'm afraid tabs are just not meant to be handled in terminals the way you try to use them. As much as I'd love to fix them, I have no idea how to. :(

In my personal opinion (see also the neverending flamewar of indenting with spaces vs tabs) tab characters should never be stored in text files (just as you don't store backspaces for example). It is the only ASCII code that is meant to denote semantics rather than the plain look of the file, and there's no need for this (and only this) semantics concept on the level of plain text files. It introduces an ambiguity leading to these flamewars which ambiguity wouldn't exist otherwise and everyone would just be happy with spaces, it wouldn't even occur to anyone to introduce them. (There could be a physical tab key on the keyboard sending the 0x09 ascii code, similarly to the backspace key, but all text editing software, including the kernel's tty driver in cooked mode, would be responsible for converting to spaces.) Okay, sorry, I probably should have omitted this paragraph, I definitely don't want a flamewar here :)

Comment 3 Luke Hutchison 2016-07-30 12:02:48 UTC

Hi Egmont, thanks for the detailed explanation.

I know you have tons of weird legacy stuff that VTE needs to support. Although to my knowledge, there are very few commonly-used terminal modes on modern Linux systems, and there are some very common usecases on modern Linux systems that legacy stuff makes it hard to support.

> In my personal opinion (see also the neverending flamewar of indenting with
> spaces vs tabs) tab characters should never be stored in text files (just as
> you don't store backspaces for example).

This used to be the case, but is no longer the case. Tab is now a very important delimiter character. In fact, TSV data is pretty much the de facto format for flat, 2D, text-encoded tables in data science today. The reason is that it is significantly easier to parse and to work with than CSV data, because you don't have to escape delimiters or quote fields most of the time, as string data values rarely need to include tab characters. This means that TSV data can be used easily with tools like GNU Textutils (cut, etc.). Spaces are no good for this, because string fields often need to include spaces.

Additionally, printing a tab character always results in at least one visible space, so it works as a visual delimiter even in the console (which other control characters do not), and on average you get several characters of visual separation, and some rough visual alignment with columns. This makes it a good delimiter for quickly eyeballing datasets on the console.

> It introduces an ambiguity leading to these flamewars which ambiguity
> wouldn't exist otherwise

This is true of using tab characters for indentation of program code, but it is not true of the use case of using tab as a semantically-important delimiter in datafiles. There is no ambiguity to using it as a delimiter: it always denotes the end of one field and the start of the next. Two tabs in a row indicate an empty field between two other fields, etc.
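
To make the delimiter argument concrete, here is a minimal Python sketch. The sample row is invented for illustration and is not taken from this report. It shows that a plain split on the tab character recovers every field, including the empty one produced by two consecutive tabs, while the same data written as CSV needs quoting.

```python
import csv
import io

# Hypothetical sample row: three fields, the middle one empty, the last one
# containing both a comma and a space.
row = "alice\t\tNew York, NY"

# TSV: splitting on the tab character recovers all fields; the two
# consecutive tabs yield an empty field in the middle.
tsv_fields = row.split("\t")

# CSV: the same fields need quoting because one of them contains a comma.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(tsv_fields)
csv_line = buf.getvalue().rstrip("\n")

print(tsv_fields)
print(csv_line)
```

If one of those tabs is lost on copy-paste, the split yields one field fewer and every later column shifts left, which is why the dropped tabs matter for this use case.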

> there's no sane way to rewrap the contents upon a horizontal resize

I have noticed that VTE now does rewrapping of text, which is an awesome feature. I assumed that under the hood, there were two representations of what is being displayed: the character grid of the current terminal view, and the raw characters that produced that character grid -- then on rewrapping, the raw characters were re-interpreted to perform the rewrapping. But I think you're saying that there is only one version of what's currently in the console, and each time the window size changes, rewrapping happens in-place based on the currently-displayed characters? But then how are actual line breaks (newline characters) distinguished from wrapped lines? You must still store where actual line breaks occur.

Comment 4 Egmont Koblinger 2016-07-30 12:31:10 UTC

(In reply to Luke Hutchison from comment #3)

> I have noticed that VTE now does rewrapping of text, which is an awesome
> feature. I assumed that under the hood, there were two representations of
> what is being displayed: the character grid of the current terminal view,
> and the raw characters that produced that character grid -- then on
> rewrapping, the raw characters were re-interpreted to perform the
> rewrapping.

Not quite. Control characters and escape sequences are not remembered as-is in the flow, they are interpreted (e.g. they move the cursor, change color, etc.) as soon as they are encountered and then they are forgotten. It's just the resulting text (along with explicit newlines, color information for each cell etc.) that is remembered.

Remembering and replaying the entire stream that was received by vte so far would be both extremely expensive and lead to incorrect results in many cases.

Take a look at how printing a tab near the end of the line works. It sucks big time. It never overflows to the next line, the cursor just gets stuck at the right margin.

As said above, in terminals tab is a control character. So in this case it is not remembered, it just performs its task (which is a no-op if the cursor is already in the rightmost column) and then it's forgotten.

So, a utility might print tons of tab characters there, but only those that actually moved the cursor will be remembered.
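
The behavior described in this comment can be sketched with a toy model in Python. This is not VTE's code: the function name `feed`, the fixed tab stops every 8 columns, and the simplified cursor handling are all assumptions made for illustration. The model only counts which tabs moved the cursor and which were no-ops at the right margin.

```python
def feed(data, width, tab_width=8):
    """Feed `data` to a toy terminal `width` cells wide.

    Returns (rows, moved, swallowed): the rendered rows as strings, the
    number of tabs that moved the cursor, and the number that did nothing.
    """
    rows = [[]]   # each row is a list of cell characters
    col = 0       # cursor column; col == width means "wrap on next print"
    moved = 0
    swallowed = 0
    for ch in data:
        if ch == "\n":
            rows.append([])
            col = 0
        elif ch == "\t":
            # A tab goes to the next tab stop but never past the last column.
            next_stop = (col // tab_width + 1) * tab_width
            target = min(next_stop, width - 1)
            if target > col:
                col = target
                moved += 1
            else:
                swallowed += 1   # cursor already at the margin: a no-op
        else:
            if col == width:     # soft wrap happens only on a printable char
                rows.append([])
                col = 0
            row = rows[-1]
            row.extend(" " * (col + 1 - len(row)))  # pad skipped cells
            row[col] = ch
            col += 1
    return ["".join(r) for r in rows], moved, swallowed


rows, moved, swallowed = feed("abcdefghijklm\t\t\tX", width=16)
print(rows, moved, swallowed)
```

In this model, three tabs are sent near the right edge of a 16-column row, but only the first one moves the cursor. The other two leave no trace in the grid, and the following character is printed in the last column of the same row.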

Comment 5 Luke Hutchison 2016-07-30 21:05:04 UTC

> Take a look at how printing a tab near the end of the line works. It sucks big
> time. It never overflows to the next line, the cursor just gets stuck at the
> right margin.

Yes, this is exactly the problem.

> As said above, in terminals tab is a control character.

So is a newline character, by every definition!

> It's just the resulting text (along with explicit newlines, color information
> for each cell etc.) that is remembered.

Can't you just remember explicit tabs and spaces too, applying very similar logic to explicit newlines? I don't see how this is not technically feasible. This would solve the problem with rewrapping, and also make copying text from the terminal lossless.

In modern usage, you can consider only two characters with values below 33 worth preserving: space and tab. Tab is no more a control character than space, from a textfile content semantics point of view.

No characters other than tab or space with a value below 33 are likely to ever be encountered in any reasonable modern textfile. (There was a time when README files also contained form feed / character 12 at the end of each section, but that day is long past.) Tab usage as a non-control character in data sciences will only increase.

The interpretation of tab as a control character, as you describe, with the weird end-of-line semantics etc., is a weird quirk of the world of terminal control character interpretation, which very few people these days have any understanding of (or any desire to understand), and does not reflect current real-world usage or expectations. There must be a non-disruptive way to flag tabs and spaces in the VTE data structure, just like explicit newlines.

Comment 6 Egmont Koblinger 2016-07-30 21:23:53 UTC

(In reply to Luke Hutchison from comment #5)

> > Take a look at how printing a tab near the end of the line works. It sucks big
> > time. It never overflows to the next line, the cursor just gets stuck at the
> > right margin.
> 
> Yes, this is exactly the problem.

But I'm afraid we can't change this behavior, I'm sure it would break a lot of legacy apps. The only thing we can think about is to somehow remember these tabs for copy-paste purposes even though they do not appear on the UI.

> > As said above, in terminals tab is a control character.
> 
> So is a newline character, by every definition!

However, having a newline is a must in order to be able to store a 2 dimensional grid of characters (that is, a plain text document) in a 1 dimensional storage (i.e. a file). Tab is still a character whose existence is IMO heavily questionable. Anyway, let's put this aside...

> Can't you just remember explicit tabs and spaces too, applying very similar
> logic to explicit newlines? I don't see how this is not technically
> feasible. This would solve the problem with rewrapping, and also make
> copying text from the terminal lossless.

Currently VTE's data structures are built around what's on the display, remember this, and convert to a continuous text stream when a given row scrolls out.

As a result, we can only remember data that is tied to a character cell (plus newlines). E.g. if I recall correctly, we only remember a limited amount of zero-width combining accents, and after a while we start dropping them. Probably we immediately drop them if they appear at the beginning of the line. I think we also drop all the zero-width non-printable control characters such as BiDi marks. That is, apart from tabs there are quite a few other characters that we don't remember exactly and are lost on copy-paste.

We _might_ introduce the concept of data that should be copy-pasted but does not belong to a particular cell. But this would require:

- heavy modifications throughout the entire vte codebase,

- think about when and how this data is wiped out (e.g. when the previous or the next cell is overwritten?),

- DoS considerations (e.g. something that is visually just a few cells can be arbitrarily large data when copy-pasted),

- figure out how it plays together with rewrap-on-resize,

etc... I'm afraid it's a much more complex change than the one you hoped for.

Comment 7 Egmont Koblinger 2016-07-30 21:35:13 UTC

Just for the record, quoting from xterm's manual page:

CONTROL SEQUENCES AND KEYBOARD
       Applications can send sequences of characters to the terminal to change
       its behavior.  Often they are referred to as “ANSI escape sequences” or
       just plain “escape sequences” but both terms are misleading:

       [...]

       ·   Some  of  the  sequences (in particular, the single-character func‐
           tions such as tab and backspace) do not include the escape  charac‐
           ter.

       With  all  of  that  in mind, the standard refers to these sequences of
       characters as “control sequences”.

Comment 8 Luke Hutchison 2016-08-09 10:38:07 UTC

A counter-point: from the man page for "cut" in the GNU CoreUtils package:

  `-d input_delim_byte'
  `--delimiter=input_delim_byte'
      For `-f', fields are separated in the input by the first character in
      input_delim_byte (default is TAB).

So GNU CoreUtils treats TAB as a semantic record separator char (in fact, as the most important semantic record separator char), and not as a control char, since it is the default field separator.

Comment 9 Egmont Koblinger 2016-08-09 10:59:20 UTC

A "character" does not necessarily mean "printable character", or could even be just poor wording in the manual, or poor wording by me in this or any previous comment. Also, "cut" has nothing to do with terminals, it's a great tool to process text files that are structured around this control char, or control sequence, whatever you call it, without being printed to a terminal.

Look, I fully understand that this situation sucks, but as much as I'd love to, I'm afraid we cannot change the behavior of TABs in terminals after so many decades. I don't know how many things it would break. Terminal emulator is a world where we heavily carry (and suffer from) the legacy cr@p of multiple decades and unfortunately cannot restart from scratch.

There are 3 possibilities I can see:

- Leave everything as-is.

- Remember tabs that do not advance the cursor at all, and as such, cannot be tied to a particular character cell (rather reside between two cells). Requires coming up with a reasonable specification about which neighbor cell's modification would wipe out these tabs, preferably coordinated among the few most popular terminal emulators and implemented consistently. And then implement it, which is really heavy work in case of vte, since it is structured around the visible cells and for each of them the data contained within. All this for a change in copy-paste behavior, but no otherwise visible on-screen behavior. I see no chance for this to happen.

- This one occurred to me a couple of days ago: Come up with an escape sequence which modifies the handling of tabs. Each tab would then have a visual effect, move the cursor by at least 1 cell, wrapping to the next line if necessary. This would require way less code modification, although there are still some nontrivial issues to address (what about tab stops; after wrapping is it the logical or the physical column that matters, can a tab be split at line boundary etc). And it wouldn't be an out-of-the-box fix, you'd have to print that escape sequence to achieve this new behavior, hence it would remain a niche feature used by very few people.
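
As an illustration of the third possibility only, here is a Python sketch that contrasts the legacy tab behavior with one possible reading of the proposed mode. No such escape sequence exists in this report. The function `tab_target`, its `wrap_mode` flag, fixed tab stops every 8 physical columns, and the choice to wrap to column 0 without splitting the tab are all assumptions, and they sidestep the open questions listed above.

```python
def tab_target(col, width, tab_width=8, wrap_mode=False):
    """Return (rows_down, new_col) for a tab received at cursor column `col`.

    `col` is assumed to be in the range 0 .. width - 1.
    """
    next_stop = (col // tab_width + 1) * tab_width
    last_col = width - 1
    if not wrap_mode:
        # Legacy behavior: clamp at the right margin, never change rows.
        return (0, min(next_stop, last_col))
    if next_stop <= last_col:
        return (0, next_stop)
    # Hypothetical new mode: the tab wraps to the start of the next row,
    # so it always has a visible effect.
    return (1, 0)


print(tab_target(13, 16))                  # legacy, near the margin
print(tab_target(15, 16))                  # legacy, already at the margin
print(tab_target(13, 16, wrap_mode=True))  # hypothetical wrapping mode
```

Under these assumptions every tab moves the cursor, so each one could be tied to a cell and remembered the same way mid-line tabs are today.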

Comment 10 Luke Hutchison 2016-08-09 11:12:38 UTC

> Look, I fully understand that this situation sucks, but as much as I'd love
> to, I'm afraid we cannot change the behavior of TABs in terminals after so
> many decades. I don't know how many things it would break.

Not to flog a dead horse, but this particular use case has already been broken all those decades :-)  It just wasn't as important to fix it before TSV data became much more common than CSV data, which has been really just the last 5-10 years in my observation.

Copying and pasting tabs works when the tab is not at the end of the line, so the only problematic case is that there is some number of tab characters at the end of the line that may get swallowed.

Therefore I suggest a much simpler solution: just add an attribute for each line that says how many tab characters were "swallowed" (not causing the cursor to advance) at the end of the line, before either a printable character that wrapped to the next line, or a newline character that caused a line break. Then when copying the text, insert that number of tabs back into the copied text before the next character on the following line, or before the newline character.
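
A rough Python sketch of this suggestion follows. The `Row` class, its field names, and the `copy_text` helper are hypothetical and do not correspond to VTE's actual data structures. Each row carries a count of swallowed tabs, and the copy routine re-inserts them before the next row's text or before the explicit newline.

```python
from dataclasses import dataclass


@dataclass
class Row:
    text: str                # what the terminal already remembers for this row
    hard_break: bool         # True if the row ended with an explicit newline
    swallowed_tabs: int = 0  # hypothetical new per-row attribute


def copy_text(rows):
    """Join rows for the clipboard, restoring swallowed end-of-line tabs."""
    out = []
    for row in rows:
        out.append(row.text)
        out.append("\t" * row.swallowed_tabs)
        if row.hard_break:
            out.append("\n")
    return "".join(out)


# One logical line, soft-wrapped after "zip" with two swallowed tabs.
rows = [
    Row("name\tcity\tzip", hard_break=False, swallowed_tabs=2),
    Row("next-field", hard_break=True),
]
copied = copy_text(rows)
print(repr(copied))
```

The sketch covers only the copy step. It does not address what happens to the counter when the row is overwritten or the terminal is resized.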

You're already inserting tab characters in the right place for tab characters that do cause the cursor to display, so I don't think this is unreasonable behavior.

I can't think of a way this could possibly break anything. (Certainly it's not any more potentially invasive than the hoops VTE already jumps through to re-wrap text on window size changes.)

Comment 11 Luke Hutchison 2016-08-09 11:13:39 UTC

Typo: "that do cause the cursor to display" -> "that do cause the cursor to advance".

Comment 12 Egmont Koblinger 2016-08-09 11:21:30 UTC

(In reply to Luke Hutchison from comment #10)

> Therefore I suggest a much simpler solution: just add an attribute for each

Such end-of-line swallowed tab characters might become middle-of-line swallowed characters upon a rewrap-on-resize. Hence we have to be able to remember them at any position.
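
A small Python sketch of why a per-row counter is not enough: if the swallowed tabs are thought of as sitting at an offset within the logical (unwrapped) line, that offset maps to different grid positions at different widths. The `locate` helper and the offset value are invented for illustration, and the model ignores how visible tabs would re-expand on resize.

```python
def locate(offset, width):
    """Return (row, column) of a logical-line offset on a grid `width` wide."""
    return divmod(offset, width)


# Hypothetical: tabs were swallowed right after the 16th cell of a line.
swallowed_at = 16

narrow = locate(swallowed_at, 16)  # boundary between row 0 and row 1
wide = locate(swallowed_at, 32)    # middle of row 0 after widening

print(narrow, wide)
```

At 16 columns the offset falls exactly on a row boundary, so an end-of-line attribute could hold it. At 32 columns the same offset lands in the middle of a row, where no end-of-line attribute applies.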

The onscreen bits are scrolled in memory according to the grid (for each cell we know the character, the color, plus some per-row information is available too). The scrolled out bits are stored in 3 file descriptors, one for continuous UTF-8 text, one for attributes, and one for indexes per row. For each format we'd need to figure out how to handle these invisible tabs, and properly convert back-n-forth.

I'm not saying it's not doable. I'm saying it's way too much work for relatively low benefits. However, patches are welcome :)

Comment 13 Egmont Koblinger 2016-08-09 11:22:08 UTC

The onscreen bits are *stored* in memory...

Comment 14 Luke Hutchison 2016-08-09 11:28:52 UTC

> Such end-of-line swallowed tab characters might become middle-of-line swallowed
> characters upon a rewrap-on-resize. Hence we have to be able to remember them
> at any position.

Good point, but they shouldn't be swallowed if they're displayed in the middle of the line, they should be displayed and cause the cursor to advance. Yes, writing reliable code to convert back and forth from this format will probably suck :-/

> However, patches are welcome

Oh, you probably don't want patches from me. The first patch I would submit would strip out the decades' worth of cruft, and start with a clean slate that would treat the console contents as a normal UTF8-encoded text buffer, and would make absolute cursor positioning and character overwriting a second-class citizen at the mercy of "normal" formatting and wrapping rules, rather than the other way around...

Comment 15 Christian Persch′ 2016-08-20 19:28:02 UTC

I just wanted to add that wanting tabs being preserved to be able to copypaste from the terminal also ignores that we in no way guarantee that the rest of the text you cat to the terminal comes back out as-is. I.e. vte is well within its right to do e.g. unicode normalisation (we don't currently, but we could).

Comment 16 Egmont Koblinger 2016-08-20 21:29:00 UTC

I recall this being discussed in a ticket, perhaps it was Behdad pointing out that we shouldn't do this especially because of copy-paste. (I cannot find that bug right now.) But yeah we could.

That being said, I think OP's request is a valid one. He mentions scientific data as a use case, and there I guess most data is ASCII only, or uses a very limited set of Unicode (e.g. Latin characters) – including tabs. There you don't have any fancy Unicode to worry about, only those damn tabs.

Comment 17 Egmont Koblinger 2018-02-11 07:46:53 UTC

The same bug for iTerm2: https://gitlab.com/gnachman/iterm2/issues/6488.

Comment 18 GNOME Infrastructure Team 2021-06-10 15:14:58 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2323.