Bug 149631 - gnome-terminal doesn't combine combining chars in utf8
Status: RESOLVED FIXED
Product: vte
Classification: Core
Component: general
Version: unspecified
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: VTE Maintainers
QA Contact: VTE Maintainers
Duplicates: 162262 302171 340745 476691 495818 512772 541720
Depends on:
Blocks:
Reported: 2004-08-08 09:43 UTC by Hans de Goede
Modified: 2009-02-26 14:18 UTC
See Also:
GNOME target: ---
GNOME version: 2.5/2.6


Attachments
Vectors above (40.10 KB, image/png)
2008-03-22 20:33 UTC, Anatolii Sakhnik

Description Hans de Goede 2004-08-08 09:43:43 UTC
-export LANG=en_US.UTF8
-start gnome-terminal
-select MiscFixed as font
-less: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

After the first ascii box, in the line containing Stargate, the vector symbols
above the a and b (and others) are behind the a and b instead of above them.
This is strange since other gtk apps do render these correctly.
Comment 1 Hans de Goede 2004-08-08 10:42:05 UTC
That should be LANG=en_US.UTF-8
Comment 2 Allison Karlitskaya (desrt) 2004-08-09 08:29:21 UTC
I see the vector symbols behind the a and b (instead of above them) when viewing
the file in epiphany.
Comment 3 Hans de Goede 2004-08-09 14:16:22 UTC
Yes, gecko doesn't handle combining chars all that well. But that doesn't mean it
is OK for gnome-terminal to do the same; check out the file in gedit, or in plain
old xterm.
Comment 4 Behdad Esfahbod 2006-04-11 05:45:35 UTC
Not sure if the terminal should do that, but leaving open.  Xterm does it IIRC.
Comment 5 Behdad Esfahbod 2006-04-28 02:28:45 UTC
*** Bug 302171 has been marked as a duplicate of this bug. ***
Comment 6 Behdad Esfahbod 2006-05-05 15:19:54 UTC
*** Bug 340745 has been marked as a duplicate of this bug. ***
Comment 7 Behdad Esfahbod 2006-05-13 16:46:21 UTC
*** Bug 162262 has been marked as a duplicate of this bug. ***
Comment 8 Denis Jacquerye 2006-05-13 17:17:55 UTC
Apparently it is suggested to use OpenType features to position, and zero the width of, non-zero-width combining diacritics. Anchors could also be used, but I have no idea what feature tag would be used for changing the width.
Should vte use OpenType positioning when available in the font?
Should vte only use monowidth fonts? 
Comment 9 Behdad Esfahbod 2006-05-13 18:59:05 UTC
Vte cannot support OpenType unless it uses Pango.  In that case Pango does take care of everything, given that the font has a glyph for the combining mark.  But the Pango backend for Vte is too slow to enable by default.

But before that, one needs to add support in Vte for zero-width combining marks so that they don't take a cell, and that's not easy AFAIU, as the code currently assumes one char per cell, except for some cells that are continuation cells, which is what enables double-width chars.
Comment 10 Egmont Koblinger 2006-09-22 14:38:00 UTC
As far as I see, the correct solution would require huge code rework, which is unlikely to happen in the near future. So let's take a look at the possible workarounds.

What vte currently does: it prints the zero-width combining accents on their own, occupying one more character cell. Advantage: accents at least somehow appear on the screen, so it's readable if it's cat'ed to the terminal. Huge disadvantage: the whole screen might get damaged if for example a full screen ncurses application expects that the cursor won't move when printing a zero-width character.

What could be very easily done: simply ignore these characters. Advantage: full screen applications stay in control. Disadvantage: NFD accents are simply ignored, so important information is lost and the user is misled.

IMHO it would be quite important to merge the advantages of both of these solutions. Many text editors (including my favourite one, joe) assume that when printing a printable character, the cursor advances by exactly as many columns as libc's wcwidth() call reports for that symbol. As far as I know, this is a correct assumption, and this is exactly what wcwidth() is for. When a zero-width combining character is printed by an application, the cursor advances only in vte, not in typical other terminal emulators. So when editing for example .desktop files with joe, vte's screen often gets garbled, while xterm, pterm(putty) or konsole are fine.

Here's my short-term proposal mostly solving both problems, until the real solution (more chars per cell) is implemented. I guess it shouldn't be too hard to do it.

First of all, vte should make sure to move the cursor by exactly wcwidth(received_character) cells (except of course when wcwidth returns -1 which is an undefined case, but moving to the left would be weird). This way full screen applications can stay in control, vte really prints the characters where the applications expect it to print them.
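For illustration, a small standalone program (not vte code; it assumes a UTF-8 locale such as en_US.UTF-8 is installed) showing the widths wcwidth() reports for the kinds of characters involved:

  #define _XOPEN_SOURCE 700
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main(void)
  {
      /* wcwidth() only gives meaningful answers in a multibyte locale. */
      if (setlocale(LC_ALL, "en_US.UTF-8") == NULL)
          return 1;

      printf("wcwidth('a')    = %d\n", wcwidth(L'a'));    /* 1: ordinary char   */
      printf("wcwidth(U+0301) = %d\n", wcwidth(0x0301));  /* 0: combining acute */
      printf("wcwidth(U+4E00) = %d\n", wcwidth(0x4E00));  /* 2: CJK ideograph   */
      return 0;
  }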

When a combining character is received, vte should check what the previous character was, and combine them together (convert NFD to NFC) and replace the previous one with this. For example, if an "a" is printed, and then a combining accent acute is received, that "a" would be replaced by "á" ("a with acute accent", U+00E1). Perhaps care should be taken not to do this if there were some special ansi escapes (color change, cursor movement) between the letter and the accent, in this case the accent could be silently dropped.
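A minimal sketch of that precompose step, assuming GLib is available; the helper name and structure are made up for illustration, this is not a patch against vte:

  #include <glib.h>

  /* Hypothetical helper: given the character already stored in the previous
   * cell and an incoming combining mark, look for a precomposed (NFC) form.
   * Returns TRUE and writes the composed character if one exists; otherwise
   * returns FALSE, in which case the mark would be dropped. */
  static gboolean
  try_precompose(gunichar base, gunichar mark, gunichar *composed)
  {
      gchar buf[13];
      gint len;
      gchar *nfc;
      gboolean ok = FALSE;

      /* Encode "base + mark" as UTF-8 and normalize it to NFC. */
      len  = g_unichar_to_utf8(base, buf);
      len += g_unichar_to_utf8(mark, buf + len);
      buf[len] = '\0';

      nfc = g_utf8_normalize(buf, -1, G_NORMALIZE_NFC);
      if (nfc != NULL && g_utf8_strlen(nfc, -1) == 1) {
          *composed = g_utf8_get_char(nfc);
          ok = TRUE;
      }
      g_free(nfc);
      return ok;
  }

  int main(void)
  {
      gunichar out;
      if (try_precompose('a', 0x0301, &out))
          g_print("composed to U+%04X\n", out);   /* U+00E1, "á" */
      return 0;
  }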

Advantages: Correct solution for one of the problems, and a not-so-bad solution for the other. Probably much easier to implement than the real solution.

Disadvantages: Only supports accented letters that have a precomposed form, you cannot combine any character and any accent. If you copy-paste the text with the mouse, it undergoes NFD->NFC conversion.

What do you think about this proposal?
Comment 11 Behdad Esfahbod 2006-09-22 19:09:17 UTC
What you propose is not significantly easier to implement than the correct behavior.

FWIW vte already supports multi-width chars, just not zero-width.
Comment 12 Samuel Thibault 2006-09-23 17:04:55 UTC
And it doesn't help very much: when precombined forms are available, people usually use them instead of combining chars...
Comment 13 Matthias Urlichs 2006-09-24 16:40:33 UTC
People on Linux, maybe. But you certainly can't depend on that.

Apple's OS X, for instance, always decomposes file names.
Comment 14 André Klapper 2007-01-16 14:38:05 UTC
(obviously confirming.)
Comment 15 Seán de Búrca 2007-08-25 05:05:49 UTC
As this bug is still open, I'll add an ancillary problem; other bugs have been closed as duplicates of this one despite subtle differences. When moving through characters, a combining sequence will throw off the cursor position, leading to editing the wrong part of a string or to characters being left over after deleting a line.

The apparent cause of this problem is that vte recognizes combining sequences and treats them as a single character. Unfortunately, because the glyphs are not combined, this causes the visual cursor to be placed incorrectly. Further, it has been pointed out (the location escapes me, sorry) that this treatment of the sequence as a single character for backspacing is incorrect.

When removing characters, combining characters should be removed one by one. In addition to this, until such time as proper shaping is implemented, the visible cursor should either move past all displayed glyphs in a combining sequence at once or the "actual" cursor should move through the characters one by one as well. The former seems to be more correct but it may startle users to have their cursor jumping around the line. Regardless, this problem is quite annoying and should be fixed quickly.
Comment 16 Egmont Koblinger 2007-08-27 11:29:59 UTC
Sean: I guess you refer to the case when you use a full-screen text editor such as vim, joe... The display problems occur because the cursor is not where the application expects it to be: vte's cursor jumps, but the application expects it to stay put when a combining character is printed. As I suggested in comment #10, it might be easy to change vte to fully drop these characters. In that case some valuable information might be lost, but at least the cursor would always stand where it has to. Note that one or two months ago I sent fixes to the Linux kernel's console terminal driver and had to face the same issue. Even though my personal preference would be to omit these combining characters, people voted for displaying them separately (what vte does now) so that you can see that there's something, but then the cursor is not aligned correctly and you may have trouble when using text editors.

Removing characters is an interesting issue. First of all, we have to distinguish between 'raw' and 'cooked' terminal modes. Raw mode is when your application handles all keypresses immediately; this is the case when a text editor (vim, joe...) or a readline-aware application (bash, bc...) is running. Cooked mode is when the terminal handles line editing locally and only sends a full line to the application when you hit Enter; this is the case if you execute the command "cat", for example, or any command that doesn't do special terminal handling. Second, we have to clearly distinguish input backspace (you hit backspace on your keyboard) from output backspace (an application prints a backspace). Basically it's output backspace that we care about now.

It's been discussed and clearly agreed on the linux-i18n mailing list that the purpose of output backspace is to move the cursor one single cell to the left (and not erase anything), independently of what character is there. So for example if you want to erase a double-width character that precedes the cursor, you should print two (2) backspaces, two spaces, and two backspaces again. In raw mode, it's really the application's (or a library's, such as ncurses') job to figure out what to do when it receives an input backspace. If you want to strip the combining accent off a character, I guess you have to reprint that character without the accent, for example.
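As a throwaway illustration of that erase sequence (plain C, nothing vte-specific), assuming a double-width character sits just left of the cursor:

  #include <stdio.h>

  int main(void)
  {
      /* Back over both cells of the double-width character, blank them,
       * then back up again so the cursor ends up where it started. */
      fputs("\b\b  \b\b", stdout);
      fflush(stdout);
      return 0;
  }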

I don't know what the desired behavior would be in cooked mode. Currently the terminal emulators I've tried (vte, konsole, pterm (putty), xterm, linux console) all move the cursor one cell to the left; in the case of CJK or combining accents I don't think this is the desired behavior, but at least vte is not worse than other emulators here. There might be some technical issues as well. AFAIK it's not really the terminal emulator application, but rather the general tty code in the kernel that holds the buffer that's being edited. The kernel already has a very simple algorithm to remove one whole UTF-8 character if this behavior is requested by "stty iutf8", so the logical contents of the buffer are okay (the correct line gets sent to the app if you type blindly). However, in order to reflect it on the terminal's display, the terminal emulator would need to know what character is being removed, and to do so it would have to either duplicate the kernel's job of maintaining the buffer or query the character that's being removed -- neither of which is straightforward. So what you see when you're editing the line is faulty if not all the characters are single-width, and I'm afraid this is simply a "wontfix" due to design limitations.
Comment 17 Seán de Búrca 2007-08-27 12:48:58 UTC
A few things: first, I am NOT referring to being in a text editor. I refer to handling strings at a bash command prompt. Second, removing characters (even visually) isn't acceptable at all. This would certainly cause information loss and undoubtedly serve only to confuse the user. Finally, I'm not sure you're correct about the way that characters vs. glyphs are handled. Certainly double-width characters don't exhibit any problems for me. It's only with a combining sequence that I see problems and these problems do NOT exist in xterm. vte needs to be made aware of how exactly glyphs and characters match up and furthermore it needs to remove characters (including combining characters) one at a time.
Comment 18 Denis Jacquerye 2007-08-27 13:06:30 UTC
Where is it pointed out that treating base and combining characters as one unit when backspacing is wrong? I disagree; in Latin it makes sense. I can't speak for other scripts.

It makes sense to have equivalent strings behave similarly. If one backspaces on an NFC string, whole units are removed; it's good to have the same behaviour on decomposed strings. If you choose to remove combining characters one at a time, it would make sense to be able to do the same with precomposed characters, i.e. precomposed characters would be decomposed when backspacing.
Comment 19 Egmont Koblinger 2007-08-27 13:09:43 UTC
The bash prompt is logically the same as a text editor; however, there might be bugs in bash itself as well (but 3.2 is mostly okay).

So we're back at the fact that combining characters are not handled correctly by vte, as opposed to xterm for example. Yes, this is what this bug report is about :-) Vte does not _print_ these combining accents correctly. As I tried to point out, how to _remove_ them is a quite different issue; it's usually the applications' responsibility, not vte's.

Oh, and I really might have been sloppy with the terms character vs. glyph, sorry for that. I actually do not know what these terms exactly mean with respect to combining accents. I also do not know what hitting backspace after entering a combining accent is supposed to do -- remove the whole character or only remove the accent from it -- but as I've pointed out, this is out of vte's scope; it's an application (e.g. bash) issue.
Comment 20 Seán de Búrca 2007-08-27 13:38:03 UTC
Denis: as I said, I don't remember the location of the discussion but it was pointed out to me that especially in languages such as Hindi, backspacing through an entire combining sequence is problematic. In Hindi, combining sequences can become quite long and it makes it very annoying to have to retype a (relatively) large number of characters with every backspace. The backspacing behavior I mentioned is also the behavior one would find in any gtk+ application. (I'm not sure exactly where this occurs.) A good experiment is editing sequences from Unicode's NormalizationTest.txt (http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt) in both gedit and a vte terminal.

Egmont: I can assure you that this is not a problem with bash. As I said, the behavior in xterm is fine and I'm using bash there as well. Further, the broken behavior is present in both gnome-terminal and xfce's terminal, both of which use vte. As such, I'm pretty much certain that this IS within vte's scope. Furthermore, bugs addressing basically what I'm talking about have been closed as duplicates of this one, so that leads me to believe that this is the correct bug on which to file my comments. If not, someone let me know and I'll open a new bug.

Character and glyph have very specific meanings in text handling. A character is the data that represents some symbol, a glyph the visual representation of the character. Thus the byte 0x61 is a UTF-8 character, the glyph of which you will see as 'a'. When a base character is followed by combining characters, these will often create a single glyph which represents multiple characters. With vte's current behavior, multiple characters are represented by multiple glyphs but backspacing through such a sequence removes multiple characters and a single glyph. It SHOULD remove a single character and (for now) a single glyph. If and when combining characters are more fully supported, it should remove a single character and, if that character is a combining character, the appropriate part of the corresponding glyph.
Comment 21 Denis Jacquerye 2007-08-27 14:01:21 UTC
(In reply to comment #20)
> The backspacing
> behavior I mentioned is also the behavior one would find in any gtk+
> application. (I'm not sure exactly where this occurs.) A good experiment is
> editing sequences from Unicode's NormalizationTest.txt
> (http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt) in both gedit and
> a vte terminal.
> 

This behaviour doesn’t apply to all scripts.
In regular gtk apps:
य़ (NFC) and य़ (NFD) will have the same backspace behaviour, what you are suggesting, removing one character at a time
é (NFC) and é (NFD) will have the same backspace behaviour, but the opposite of what you are suggesting, removing the whole unit and not one character at a time.

vte has those behaviours depending on the scripts. The bug is that it doesn’t display or remove them from screen properly.
Comment 22 Denis Jacquerye 2007-08-27 14:06:42 UTC
> vte has those behaviours depending on the scripts. The bug is that it doesn’t
> display or remove them from screen properly.
> 
Sorry, scratch that. vte only has the second behaviour. It doesn’t change according to the script. I got confused by the fact that vte still had य on screen after a backspace, yet there’s no character actually left.
Comment 23 Matthias Urlichs 2007-08-27 16:37:30 UTC
First things first: output NFD sequences correctly, please!

What happens when editing NFD sequences in line mode is a separate issue which, frankly, I'd be perfectly happy for VTE not to handle correctly for another year or so. Getting printing to work is more important than quibbling about backspaces through these things.

Anyway: IMHO the point of line mode is that you have a cursor which occupies one character cell, moves one character cell at a time (OK, OK -- two if there's a double-width character there), and thus deletes or overwrites the whole cell if the user presses the appropriate keys. You get bonus points for replacing an e with an é if you see a combining ´ right after moving the cursor on top of the former – but *that's*it*.

Anything else and/or more involved should be handled by a raw-mode editor. I don't know about the average user, but personally I type things that are complicated enough to require more complexity than that with a "real" editor, using a readable font with variable-width spacing.
Comment 24 Anatolii Sakhnik 2008-03-22 13:54:59 UTC
Is anyone fighting against this bug now? Is it going to be resolved at all?
Comment 25 Anatolii Sakhnik 2008-03-22 20:33:16 UTC
Created attachment 107833 [details]
Vectors above

By the way, both vector arrows in the stargate sentence are drawn correctly with the 'Monospace' font. Does that mean that some work has been done recently? Can other combining characters be easily treated the same way?
Comment 26 André Klapper 2008-08-21 21:03:29 UTC
(In reply to comment #24)
> Is anyone fighting against this bug now? Is it going to be resolved at all?

Probably not, otherwise it would be in assigned state. Feel free to provide a patch against current svn trunk...
Comment 27 Behdad Esfahbod 2008-08-21 21:09:22 UTC
Nothing has been done recently, no.  And not very likely to happen anytime soon (by that I mean the next year).  Fixing this requires extensive changes to the internals of vte.
Comment 28 Christian Persch 2008-08-23 12:32:50 UTC
*** Bug 541720 has been marked as a duplicate of this bug. ***
Comment 29 Behdad Esfahbod 2008-11-29 08:14:22 UTC
*** Bug 512772 has been marked as a duplicate of this bug. ***
Comment 30 Behdad Esfahbod 2008-11-29 08:26:07 UTC
*** Bug 476691 has been marked as a duplicate of this bug. ***
Comment 31 Behdad Esfahbod 2008-11-29 08:29:04 UTC
*** Bug 495818 has been marked as a duplicate of this bug. ***
Comment 32 Behdad Esfahbod 2008-12-12 07:00:24 UTC
Actually I found a very smart way to implement this.  Six hours of hacking and it's now done.  Not the stupid way that xterm and konsole do it (by normalizing the input).  The real thing.  Try copy/paste on it too.  Check it out!

2008-12-12  Behdad Esfahbod  <behdad@gnome.org>

        Bug 149631 – gnome-terminal doesn't combine combining chars in utf8

        * src/vteunistr.c:
        * src/vteunistr.h:
        An extended UTF-32 type that assigns numeric values to UTF-8 sequences
        on demand.  Can be used to efficiently store a string, instead of one
        character, at each cell.

        * src/vte-private.h:
        * src/vte.c:
        * src/vtedraw.c:
        * src/vtedraw.h:
        * src/vtepangocairo.c:
        Port to vteunistr instead of gunichar for cell content.  
        
        * src/vte.c: 
        Handle zerowidth insertions by sticking it on the previous cell.

        * src/iso2022.h:
        * src/iso2022.c:
        Cleanup ambiguous-width handling.  Handle zero-width chars.
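
For readers following along, here is a rough sketch of the idea the ChangeLog describes: values past the Unicode range act as handles for interned (base, mark) sequences, so a cell still stores a single integer. The names, constants and data structures below are guesses for illustration, not the actual src/vteunistr.c code.

  #include <glib.h>

  typedef guint32 vteunistr;          /* plain gunichar, or a synthetic id  */

  #define UNISTR_START 0x80000000u    /* first synthetic id (assumed value) */

  typedef struct { vteunistr base; gunichar mark; } Pair;

  static GHashTable *pair_to_id;      /* Pair* -> id, for interning          */
  static GPtrArray  *id_to_pair;      /* id - UNISTR_START -> Pair*, decode  */

  static guint
  pair_hash(gconstpointer p)
  {
      const Pair *q = p;
      return q->base * 31u + q->mark;
  }

  static gboolean
  pair_equal(gconstpointer a, gconstpointer b)
  {
      const Pair *x = a, *y = b;
      return x->base == y->base && x->mark == y->mark;
  }

  /* Append a combining mark to a unistr; the same (base, mark) pair always
   * maps to the same id, so repeated sequences cost no extra memory. */
  static vteunistr
  unistr_append(vteunistr base, gunichar mark)
  {
      Pair key = { base, mark };
      gpointer value;
      Pair *stored;
      vteunistr id;

      if (pair_to_id == NULL) {
          pair_to_id = g_hash_table_new(pair_hash, pair_equal);
          id_to_pair = g_ptr_array_new();
      }

      if (g_hash_table_lookup_extended(pair_to_id, &key, NULL, &value))
          return GPOINTER_TO_UINT(value);

      stored = g_new(Pair, 1);
      *stored = key;
      id = UNISTR_START + id_to_pair->len;
      g_ptr_array_add(id_to_pair, stored);
      g_hash_table_insert(pair_to_id, stored, GUINT_TO_POINTER(id));
      return id;
  }

  int main(void)
  {
      /* "a" + combining circumflex, then the same sequence again. */
      vteunistr s1 = unistr_append('a', 0x0302);
      vteunistr s2 = unistr_append('a', 0x0302);
      g_print("ids: 0x%X and 0x%X (interned: %s)\n",
              s1, s2, s1 == s2 ? "yes" : "no");
      return 0;
  }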

Comment 33 Egmont Koblinger 2008-12-12 12:54:23 UTC
Thanks very much for addressing this issue!

The solution is pretty interesting... On one hand, it's a very practical and clever idea. On the other hand, allocating memory dynamically and never freeing it worries me a bit. I think the current code is pretty much DoS-able by a malicious application. E.g. accidentally cating /dev/urandom, or curl-ing a malicious homepage containing endless number of composing characters will cause vte to run out of memory, or probably misbehave in certain strange ways if you have tons of memory and unistr_next overflows.

One quick workaround could be to introduce a reasonably low cap on the value of unistr_next (such as 0x81000000 - this allows for 16M combining sequences), and if it reaches that cap then simply make the append functions return their base string, that is, drop the composing character instead of consuming even more memory. Still, it's a bit weird that after accidentally cating /dev/urandom suddenly composing characters may stop working correctly, but probably better than an OOM or crash.

Another solution might be to clean up the data structure every once in a while (e.g. when unistr_next reaches the cap) to get rid of old entries. For example, a new structure could be built up from all those combining characters that are still present on the canvas, and then swapped into the old's place.

It seems to me that currently you allow unlimited number of composing characters for a cell. As far as I remember, the Unicode standard says that applications should support up to 2. Enforcing such a limit could protect against memory problems, because a vte canvas of a certain size could consume only up to a fixed amount of memory, instead of growing infinitely.

Thanks!
Comment 34 Behdad Esfahbod 2008-12-12 14:41:11 UTC
(In reply to comment #33)
> Thanks very much for addressing this issue!

Hey.  Thanks for the quick review.

> The solution is pretty interesting... On one hand, it's a very practical and
> clever idea.

Yeah, in hindsight it's so obvious.  I don't know how I didn't think about it before.

> On the other hand, allocating memory dynamically and never freeing
> it worries me a bit. I think the current code is pretty much DoS-able by a
> malicious application. E.g. accidentally cating /dev/urandom, or curl-ing a
> malicious homepage containing endless number of composing characters will cause
> vte to run out of memory, or probably misbehave in certain strange ways if you
> have tons of memory and unistr_next overflows.

Agreed.  Though I'd say by outputting a control sequence in a loop, not cating /dev/urandom.  The chances of /dev/urandom doing that are smaller than aliens coming and taking your computer away.

I thought about it a bit and decided to wait a bit before making the code ugly by fixing such issues :).


> One quick workaround could be to introduce a reasonably low cap on the value of
> unistr_next (such as 0x81000000 - this allows for 16M combining sequences), and
> if it reaches that cap then simply make the append functions return their base
> string, that is, drop the composing character instead of consuming even more
> memory. Still, it's a bit weird that after accidentally cating /dev/urandom
> suddenly composing characters may stop working correctly, but probably better
> than an OOM or crash.

While it's quite reasonable, I don't really like this.

> Another solution might be to clean up the data structure every once in a while
> (e.g. when unistr_next reaches the cap) to get rid of old entries. For example,
> a new structure could be built up from all those combining characters that are
> still present on the canvas, and then swapped into the old's place.

This one would be much harsher to the code.  Unless we do ref counting for them, which will be trickier.


> It seems to me that currently you allow unlimited number of composing
> characters for a cell. As far as I remember, the Unicode standard says that
> applications should support up to 2. Enforcing such a limit could protect
> against memory problems, because a vte canvas of a certain size could consume
> only up to a fixed amount of memory, instead of growing infinitely.

I don't remember any such limitations cited in Unicode.  I'm sure there are languages using more than 2 routinely.  Thai?

However, I agree that more than, say, four is not useful.  So we can limit the depth, and that automatically limits the maximum, though not to a usable limit.  But at least it protects against attacks as simple as:

  echo -n x; while :; do echo -n -e '\xcc\x80'; done

The only theoretically ugly part about that is that I would need to do strlen in append, and that's an asymptotic change from O(1) to O(n) :).

How does this sound?

I'll think about it more.

> Thanks!

Comment 35 Behdad Esfahbod 2008-12-12 14:45:28 UTC
Actually limiting to even two doesn't help.  There are more than 1000 combining marks encoded in Unicode, and more than 100,000 valid, assigned codepoints (let alone that we do not reject unassigned codepoints).  That's 100,000,000,000 sequences one can get right there.
Comment 36 Samuel Thibault 2008-12-12 17:15:03 UTC
Just tried current HEAD

€ hexdump blip
0000000  61 cc 82 61 0a
€ cat blip
âa

The text above shows up correctly in gedit & co (the circumflex shows up
on the first a), but in gnome-terminal with the current HEAD vte, the
circumflex shows up on the second a...

As a side note, seeing the memory of my xterms possibly be unbounded even
when the backlog is limited frightens me quite a bit.
Comment 37 Samuel Thibault 2008-12-12 17:20:33 UTC
Ergl, sorry, the problem is with the Monospace font; things are fine
with the code200 font.  I'll file a new bug for that.
Comment 38 Denis Jacquerye 2008-12-12 17:52:24 UTC
(In reply to comment #37)
> Ergl, sorry, the problem is with the Monospace font, things are fine
> with the code200 font.  I'll file a new bug for that.
> 

Which Monospace font? If you don't mind me asking here.
Comment 39 Behdad Esfahbod 2008-12-12 17:54:52 UTC
It's a font bug.  Try the same text in gedit using the same font and see.  Your example works for me with DejaVu Sans Mono.
Comment 40 Egmont Koblinger 2008-12-12 18:03:50 UTC
Serious comments:

I can't remember where I read about this "at most two" combining chars. Probably in "man unicode", which says "Up to two combining characters per base character for certain scripts (in particular Thai) are also supported by some UTF-8 terminal emulators and ISO 10646 fonts (level 2)", meaning that I remembered incorrectly.

Limiting to 4 isn't that bad either, and I wouldn't worry about O(n) as long as n==4 :) (Yeah, you had the same smiley, I know :)) Moreover, you could remember this value and just increase by one as long as "normal" bytes are coming, you'd only need to recalculate it if some escape sequence moves the cursor. (By the way, I have no clue what should happen if a cursor movement (especially if to the bottom left corner) is followed by a combining char.)

To your next comment: I meant limiting the number of combines per cell in combination with the cap on unistr_next - I think only the combo of these two could guarantee that you cannot run out of memory and up to n combining characters can be put on all the letters.

Not so serious comments:

I don't think it's that hopeless with urandom. You say there are >1000 combining chars; as far as I remember most of them take only two bytes in Unicode, which means the probability of the next two bytes encoding a combining char is above 1%, so once for every 100 bytes you create a combining sequence. Even with the loop you have to output a base char followed by the combining symbol, so there's a factor of 20-30 maybe in the bytes required in the two cases. Unless I calculated something wrong.

Samuel, I like your prompt :)
Comment 41 Egmont Koblinger 2008-12-12 18:05:21 UTC
sorry, I meant s/bottom left/top left/
Comment 42 Behdad Esfahbod 2008-12-12 19:26:37 UTC
(In reply to comment #40)
> Serious comments:
> 
> I can't remember where I read about this "at most two" combining chars.
> Probably in "man unicode", which says "Up to two combining characters per base
> character for certain scripts (in particular Thai) are also supported by some
> UTF-8 terminal emulators and ISO 10646 fonts (level 2)", meaning that I
> remembered incorrectly.

Fair enough.


> Limiting to 4 isn't that bad either, and I wouldn't worry about O(n) as long as
> n==4 :) (Yeah, you had the same smiley, I know :))

Indeed.

> Moreover, you could remember
> this value and just increase by one as long as "normal" bytes are coming, you'd
> only need to recalculate it if some escape sequence moves the cursor.

Nah, that's in a totally different piece of code.  The beauty of the current code is that the entire unistr implementation is in a 120 line function, and has only three simple functions as its interface.  And I only changed the rest of the code in two places: one in vte_terminal_insert_char() to combine, another in vtepangocairo.c:font_info_get_unistr_info() to convert the unistr to utf8 and pass to pango.  That's the beauty of it...

> (By the
> way, I have no clue what should happen if a cursor movement (especially if to
> the top left corner) is followed by a combining char.)

Here is how the current algorithm works:

                row_num = screen->cursor_current.row;
                row = NULL;
                if (col == 0) {
                        /* We are at first column.  See if the previous line softwrapped.
                         * If it did, move there.  Otherwise skip inserting. */

                        if (row_num > 0) {
                                row_num--;
                                row = _vte_terminal_find_row_data (terminal, row_num);

                                if (!row->soft_wrapped)
                                        row = NULL;
                                else
                                        col = row->cells->len;
                        }
                } else {
                        row = _vte_terminal_find_row_data (terminal, row_num);
                }

                if (G_UNLIKELY (!row || !col))
                        goto not_inserted;

                /* Combine it on the previous cell */


                col--;
                cell = _vte_row_data_find_charcell(row, col);

                if (G_UNLIKELY (!cell))
                        goto not_inserted;

                /* Find the previous cell */
                while (cell->attr.fragment && col > 0) {
                        cell = _vte_row_data_find_charcell(row, --col);
                }
                if (G_UNLIKELY (!cell || cell->c == '\t'))
                        goto not_inserted;

                /* Combine the new character on top of the cell string */
                c = _vte_unistr_append_unichar (cell->c, c);

                /* And set it */
                columns = cell->attr.columns;
                for (i = 0; i < columns; i++) {
                        cell = _vte_row_data_find_charcell(row, col++);
                        cell->c = c;
                }

                /* Always invalidate since we put the mark on the *previous* cell
                 * and the higher level code doesn't know this. */
                _vte_invalidate_cells(terminal,
                                      col - columns,
                                      columns,
                                      row_num, 1);

                goto done;


So, if you put it at the top left and there is no scrollback, it drops it.  If there is scrollback and the previous line is softwrapped, it puts it on the last cell of that line.


> To your next comment: I meant limiting the number of combines per cell in
> combination with the cap on unistr_next - I think only the combo of these two
> could guarantee that you cannot run out of memory and up to n combining
> characters can be put on all the letters.

Agreed.  I'll still think about it.


> Not so serious comments:
> 
> I don't think it's that hopeless with urandom. You say there are >1000
> combining chars, as far as I remember most of them take only two bytes in
> Unicode, which means the probability of the next two bytes encoding a combining
> char is above 1%, so once for every 100 bytes you create a combining sequence.
> Even with the loop you have to output a base char followed by the combining
> symbol, so there's a factor of 20-30 maybe in the bytes required in the two
> cases. Unless I calculated something wrong.

Each unistr takes about 32 bytes to store (8 bytes in the array, 12 in the hash table, times 1.6 to account for empty room in the structures).

The thing is, we need to see many *different* sequences of a base character followed by combining marks, as unistrs are interned.  Now, I was wrong to think that the probability of seeing new sequences diminishes fast.  As the total number of valid sequences is so large, the probability remains constant.
Also, only 277 of them are two bytes long.  Let's say 256.  So, the probability of any two bytes forming a combining mark is 2^-8 (256 out of 65536).  However, there are only so many different sequences you can see with an ASCII base: 128*256 different ones.  That's 1MB of memory only.

To see more unique sequences you need a 2-byte base.  The probability of any two bytes being a two-byte base is 2^-5.  So, the probability that a particular four bytes is a valid character followed by a mark is 2^-13.  There are 2^11*2^8=2^19 such different sequences.  Even if you see all of them, it takes only 16MB of memory.  Even those are not worth spending our time on.  Let's move on to three-byte base characters.

The probability of any three bytes forming a valid three-byte character is 2^-8.  So, the probability of seeing a three-byte base followed by a two-byte mark is 2^-16.  And there are 2^16*2^8=2^24 of them.  Seeing ALL of them consumes 2GB of RAM.  Let's call using 32MB of RAM for these sequences a disaster.  For that, we need to see 2^20 different sequences.  Let's even assume that the first 2^20 that we see are all different.  So we need to see 2^20 sequences of three-two.  The probability of seeing one was 2^-16.  So we need to see 2^36 bytes.  That is, 64GB of random data.

So, my estimate is, by cating 64GB of /dev/urandom data (if you have so much entropy!), we'd at most consume 64MB.  And 64MB is faaaaaaar from disaster.  I can continue calculating for 640MB :P.
Comment 43 Samuel Thibault 2008-12-12 19:33:01 UTC
Note that for instance the bogl terminal emulator supports combining
characters by just using an array of 5 unicode chars for each cell.

`The Unicode Standard does not restrict the number of combining characters
that may follow a base character.' (Unicode 4, 2.10)
Comment 44 Egmont Koblinger 2008-12-16 22:35:24 UTC
Behdad, thanks very much for your comment, it explained a lot. Also, your estimation seems to be way more precise than mine was :)

(One minor note: /dev/urandom doesn't need entropy (even though it's not random in that case), that's /dev/random, but still /dev/urandom is damn slow.)

Another issue that just came to my mind: I was worried about your change because I thought it was the first place where vte could be DoS-ed. I'm no longer sure about that. I just tried changing the window title to the contents of my /etc/services, and it worked (I could read back the whole file with xwininfo). So I have a feeling that overly long escape sequences can probably also cause infinitely growing memory. (Or the limit is somewhere higher than the length of this file.)

Samuel: using 5 chars per cell increases memory usage quite a lot, even if you don't use combining characters at all. A terminal that I'd call typical (100 columns, 10000-line scrollback buffer) would consume 16MB more, and of course this should be multiplied by the number of terminals you have. Behdad's idea is great for saving lots of memory on average; it only performs badly in the worst case, which might get DoS'ed under certain circumstances.
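For what it's worth, a back-of-the-envelope check of that figure, assuming 4-byte characters and 4 extra slots per cell as in the bogl scheme mentioned above:

  #include <stdio.h>

  int main(void)
  {
      /* 100 columns x 10000 scrollback lines, 4 extra 4-byte chars per
       * cell compared to storing a single character per cell. */
      long cells = 100L * 10000L;
      long extra_bytes = cells * 4 * 4;
      printf("%ld bytes extra per terminal (~16MB)\n", extra_bytes);
      return 0;
  }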
Comment 45 Behdad Esfahbod 2008-12-16 22:41:24 UTC
I'm not really concerned about the DoS case.  One should not run an untrusted application anyway!  But accidentally OOM'ing the terminal is not good.  That's why I like the limit-to-four idea.  Or even limit-to-10.  Having *some* limit.

As for OOM, it's interesting, people actually ask us to add an unlimited-scrollback mode to vte.  I have no idea why they insist that they need it.

Anyway, speaking of memory use, I'm going to make vte store UTF-8 for rows consisting of ASCII-only data.  And not store the attributes at all if the row doesn't have any special attributes.  That brings down memory-per-cell from 8 bytes to 1.  Now that would be an improvement.
Comment 46 Behdad Esfahbod 2008-12-17 08:26:25 UTC
2008-12-17  Behdad Esfahbod  <behdad@gnome.org>

        Bug 149631 – gnome-terminal doesn't combine combining chars in utf8

        * src/vteunistr.c:
        Limit to 9 combining marks per unistr, and a total of 100,000 unistrs.
        All to prevent OOM.
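
A hedged sketch of what guards like these could look like, building on the illustrative unistr_append() sketched under comment 32; the identifiers and the depth/count bookkeeping are assumptions, not the actual change:

  #define VTE_UNISTR_MAX_DEPTH  9        /* combining marks per cell, per the ChangeLog    */
  #define VTE_UNISTR_MAX_COUNT  100000   /* distinct interned sequences, per the ChangeLog */

  /* Hypothetical wrapper: refuse to deepen a sequence past MAX_DEPTH, and
   * stop interning new sequences once MAX_COUNT is reached, returning the
   * base unchanged so a runaway stream of marks cannot exhaust memory.
   * unistr_depth() and unistr_count() are assumed bookkeeping, not real API. */
  static vteunistr
  unistr_append_capped(vteunistr base, gunichar mark)
  {
      if (unistr_depth(base) >= VTE_UNISTR_MAX_DEPTH ||
          unistr_count() >= VTE_UNISTR_MAX_COUNT)
          return base;                   /* silently drop the mark */

      return unistr_append(base, mark);
  }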

Comment 47 Stanislav Brabec 2009-02-26 14:18:51 UTC
I just tested the rendering of combining characters.

It seems to work for Indic scripts, but for Latin Combining Diacritical Marks it seems to work incorrectly: instead of putting the accent above the previous character, it puts it above the next letter.

As it seems to work correctly when the Lucida Sans Typewriter font is selected, I guess that it is a fault of the fonts: http://bugs.freedesktop.org/show_bug.cgi?id=20330

The same problem affects gedit and all other widgets using a monospaced font. Widgets using variable-width fonts work correctly (most variable-width fonts are not affected).

How to reproduce:

Paste the following text with the selected font:
o̍g
Correct: accent above o
Incorrect: accent above g