After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 729204 - (vtesixel) sixel support
(vtesixel)
sixel support
Status: RESOLVED OBSOLETE
Product: vte
Classification: Core
Component: general
unspecified
Other All
: Normal enhancement
: ---
Assigned To: VTE Maintainers
VTE Maintainers
Depends on: vteparser
Blocks:
 
 
Reported: 2014-04-29 16:22 UTC by Christian Persch
Modified: 2021-06-10 14:50 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Initial implementation of SIXEL graphics feature (51.74 KB, patch)
2016-08-28 15:59 UTC, Hayaki Saito
none Details | Review

Description Christian Persch 2014-04-29 16:22:00 UTC
https://en.wikipedia.org/wiki/Sixel

xterm (since version 294) and various other terminal emulators support this, and e.g. gnuplot can use it. 

Could look at supporting ReGIS [https://en.wikipedia.org/wiki/ReGIS], too.
Comment 1 Egmont Koblinger 2014-06-23 17:29:59 UTC
This might be relevant too (iTerm2+Julia): https://twitter.com/gnachman/status/476878560582721536
Comment 2 Egmont Koblinger 2016-03-27 15:26:54 UTC
FYI: https://github.com/saitoha/libsixel
Comment 3 Hayaki Saito 2016-08-28 15:59:04 UTC
Created attachment 334314 [details] [review]
Initial implementation of SIXEL graphics feature

I've written a patch for libvte SIXEL graphics support.
Try attached patch if you like it.
https://youtu.be/mLQbAYJGZMA
Comment 4 Hayaki Saito 2016-10-10 17:11:51 UTC
Patch v2
https://github.com/saitoha/vte-sixel/commit/ef967c34c2efb9a51a385d96356658bf5ff697c3

Feel free to pick commits from my github branch as you need.
Comment 5 Christian Persch 2016-11-11 21:28:56 UTC
I've had a first look and while there are a number of issues (which I'll comment on later on), the most grave I've found is how this patch uses temp files. These need to be replaced by code that goes through vte's encryption layer, so that no unencrypted data is written to temp files.
Comment 6 Egmont Koblinger 2016-11-12 09:29:47 UTC
Hi guys,

Sorry, it fell throught the cracks when you posted the patch.

I'm really curious about this, while at the mean time I'm sorry I won't have time to do a proper review.

Also, I hardly know anything about this sixel stuff (took a vague look at wiki), so I'm wondering, could you please give like two or three examples that I could try out? Maybe one with raw escape sequences, one with an app producing something useful?

I think it would be also great to have a design doc that explains in big steps how it's implemented, e.g. what goes into the existing streams marking that there's an image to be embedded etc.

chpe's point about encryption is an excellent one; I'd like to extend it to make sure that maliciously crafted output cannot
- fill up /tmp without limits (e.g. "maximum scrollback lines" should also apply here somehow);
- create an uncontrolled number of files under /tmp (that is, preferably write all the images into a single file per vte);
- incraese the number of opened files at a moment by more than let's say 1 or 2 compared to no sixel support.

Another thing I noticed is that you create your own 256-color palette. Wouldn't it be better to use vte's palette? Or is this desired / written in the specs etc. that you need to diverge?

Also, this isn't your fault at all, but this feature feels quite old-fashioned for me, with the limitation like 6 pixels per vertical pixel column of a cell (did I understand it right?) and using palette, as well as requiring arbitrary and complex escape sequences. I'm wondering why we're going in this direction at all, rather than something like terminology's (and IIRC iterm2's) image support that can display any color at any pixel within a dedicated rectangular area, and uses a protocol that speaks some trivial encoding (maybe base64?) of well-known image formats (png, jpg...); and of course trying to get buy-in from whichever notable apps take advantage of sixel right now.

Thanks a lot!
Comment 7 Bastien Nocera 2016-11-13 16:12:00 UTC
Sixel support seems interesting because it's an existing protocol, not something supported by few terminals that likely won't find actual buy-in in those very old protocols.

There's a fair number of applications it seems:
https://github.com/saitoha/libsixel

GNUPlot is also pretty interesting a use case for this support.

Personally, I'd like want to use it with QEmu support. I could also see emulators for older systems (say, pre-1988) using it as it could match their graphics capabilities.

Though I'd say that I've lived without it until now, so I could probably carry on without it. It just opens new doors.
Comment 8 Bastien Nocera 2016-11-13 16:17:23 UTC
Looking at the code, the only comments that I could possibly make are about the coding style not matching the existing code (comments, function arguments, spaces before parens, etc.) so not useful in itself.

If there are bits of the code that's cut'n'paste from another library, I'd try to reduce the changes compared to the upstream libraries, and add those in separate commits before adding them to the build system to document where they're from, and how to update them.
Comment 9 Jürgen Mangler 2017-04-28 16:12:20 UTC
any progress?
Comment 10 Hayaki Saito 2017-05-19 06:23:40 UTC
Thanks for your reviews.

Patch v3: https://github.com/GNOME/vte/compare/master...saitoha:sixel-v3

Christian, I improved the patch by using vte's encrypted stream instead of sloppy use of temp files. Is this what you intended?

Egmont, with above patch, unused images no longer displayed are discarded with _vte_ring_shrink_image_stream() -> _vte_stream_advance_tail(), so now "maximum scrollback lines" is applied to images.

> Another thing I noticed is that you create your own 256-color palette. Wouldn't it be better to use vte's palette? Or is this desired / written in the specs etc. that you need to diverge?
At least in recent sixel decoder implementation(e.g. xterm), the sixel palette is independent of text palette. It prevents from breaking text palette with careless sixel.

> with the limitation like 6 pixels per vertical pixel column of a cell (did I understand it right?)
sixel cursor is not same as text cursor. It has diffirent height(6px) from text cell.
https://github.com/mintty/mintty/issues/572#issuecomment-236438360

> but this feature feels quite old-fashioned for me,
I see, but sufficient for me. The reason why I write this patch is just because I like sixel.

Bastien, 

> If there are bits of the code that's cut'n'paste from another library, I'd try to reduce the changes compared to the upstream libraries, and add those in separate commits before adding them to the build system to document where they're from, and how to update them.
sixel.cc and sixel_hls.cc are almost same as mintty's implementation(sixel.c/sixel_hls.c) once I wrote.
Comment 11 Christian Persch 2017-06-25 18:59:04 UTC
Thanks for the update.

This isn't a full review, but I have had a quick look at the patch.

* I think there should be API on VteTerminal to en/disable sixel.

* There needs to be a resource limit on the images data, and API to set it. When the limit is reached, either evict old images, or not append new ones?

* You freeze the images that have scrolled out-of-view in ::widget_draw()... IMHO that should happen when doing the scrolling itself.

* You're using first/last_displayed_row() repeatedly inside the loop over all images; just get their value once before the loop.

* Make VteImage a C++ class instead of using the _init/_fini design, and use a suitable std container for the list instead of a GList. Maybe sort the list by first (or last?) cell occupied by the image?

* On-screen images maybe should create a similar surface (gdk_window_create_similar_surface) for the image.

* A a few fundamental questions:
What should happen when you place an image on a cell that already holds some text? Should a cell be allowed to both have text and an image? Should creating an image erase all the cells in its area (maybe make it contain U+FFFC OBJECT REPLACEMENT CHARACTER, and not draw that as a character)? And then what should happen if a subsequent cursor move and insertion places a character inside the image's range; should that erase the image or not insert the character?
When I copy a selection range that includes an image in it, what should the resulting text be?
Comment 12 joel.forums 2017-07-10 15:45:32 UTC
This is great! Thank you so much for working on sixel support for vte!!!
Comment 13 Jürgen Mangler 2017-09-07 16:32:36 UTC
any progress?

@christian persch: i'm not entirely sure if i understand your question. I think an appropriate way to imagine sixel images is not as images, but as a collection of normal characters that just happen to each look like a small part of an image. Thus there is no holding of text and images in a "cell". It just holds a character, either a fancy one, or a normal one (i was reading http://vt100.net/docs/vt3xx-gp/ for entertainment).

In order to confirm the expected behavior i did some quick tests using mlterm (which has sixel support built in), and a patched testvte.

mlterm:
Printing an image behaves just like printing characters in this area. Placing a character leaves the rest of the image intact, just places the character with its default background and foreground color at that position ( little square with the character in it appears in the picture). mlterm thus from a user perspective behaves exactly as expected (although the implementation internally may be very different).

testvte:
Here the character inserted has hovers on top of the image with no separate background color. Not entirely the expected behavior, but i understand that is because of the way it is implemented. if as a trick all characters on top of the image could have background color, from user perspective it would be correct.

For both:
Copying a mix of "normal" characters and sixel characters to clipboard didn't do the thing I expected. Sixel content was replaced with spaces, all "normal" characters stayed in the right place. I expected to be even able to mark and copy partial sixel images. But i think this is not a big problem.

I think the patch would be a huge step in the right direction, although some little inconsistencies exist. Please make it happen.

cheers.
Comment 14 adam 2017-09-25 17:18:25 UTC
I'm also extremely excited about this. I add support for iTerm2's escape sequences to every programming environment I build. Looking forward to being able to the same for this family of terminal emulators!
Comment 15 Egmont Koblinger 2017-09-25 21:05:53 UTC
I've quickly glazed through the Sixel specs. It looks to me (correct me please if I'm wrong or missed something) that it's nothing more than an image format as an escape sequence. For what it's worth, it could be base64-encoded PNG embedded in a certain escape sequence, or anything like that. It's a quite simple format, writing a decoder is a pretty straightforward task (or we could also use an existing library).

It's also a quite old-fashioned format, has quite some limitations (e.g. it's unclear to me if 16M colors can be used without any problems – well, you'd have to define a palette for all the colors actually used, so it's sure problematic), plus the data stream size is enormous for photo-like pictures.

If we add Sixel support, I guess we should go for separating the possibility of displaying images from the input stream format, to easily allow other input formats (such as the aforementioned base64-encoded JPG or PNG) in the future.

For the rest of this comment I don't care about the input format. I'd like to discuss the big picture of the "images in terminal emulators" story.

When a Sixel image is displayed, xterm adds a gap in the text flow just high enough to contain the image – at the current font size. If you change the font size later, the text is resized but the image isn't, and the overall layout falls apart (either an extra unused gap appears, or the image overflows into the text if you've decreased the font size).

Now this is something I truly do not like.

If I had an infinite amount of time to work on VTE (I don't), I would probably do the model/view split (14-year-old bug 103770). There'd be a lower level library (with tons of unittests) doing headless terminal emulation, and on top of this there'd be a higher level doing the Gtk+ stuff to show its contents (and so there could easily be multiple views at different font sizes to the same terminal – or maybe no view at all at a given time). These would be two separate libraries, and one could easily implement cool-retro-term clones, or heavy use of ligatures, or improved complex text rendering etc. on top of the lower one only.

By the way, "screen" already has something analogous to this (see its "-x" "Multi display mode") and I'd be surprised if tmux didn't have one as well.

This direction, however, would exclude the possibility of certain features, e.g. reporting the cell size (bug 782576). And it would also make it pretty much impossible to show pictures the way currently xterm shows Sixels because the notion of cell size wouldn't exist, and as such, there would be no way to calculate the desired space for the picture.

Terminology implements something different. At least the "tyls" or "tyls -m" (multi-column) command shows thumbnails along the filenames that scale according to the font size change. I haven't studied the specs (escape sequence etc.) but I _guess_ it goes something like: hey, please display this image in the following WxH _character_ cells. And then terminology scales the image.

This approach still isn't absolutely free from problems. E.g. even the character cell's ratio isn't exactly known to apps, so is the image always scaled down so that it isn't cropped? Or can the emitting app define some stretching/scaling/cropping method? But the good news is, this model _is_ compatible with the headless / multi-view approach. As such, I like this one much better.

But I'm afraid Sixel is mostly used for displaying graphs and users probably wouldn't want terminal emulators to scale them straight away. And from most users' point of view, this model/view split and headless mode is something they don't care about, and they don't change font size too often so probably they don't mind it too much if then the display falls apart. Shall we say that this 14-year old feature request, internal cleanup, unittestability etc. probably won't ever get implemented, and we don't mind drifting further away from this goal by introducing features that depend on the current font size??

Can we maybe think of some compromise between these two worlds?? Maybe some compatibility layer somewhere, e.g. an external luit-like translator (how'd that know the font size tho')?? Or could we get buy-in from key apps that emit Sixel to add support for some new format as well?? Or if they all emit Sixel via a library then have a drop-in replacement for that library with the same API but different emitted escape sequences?? Or inline the images with a hardcoded scaling factor, and show the unscaled version on hover?? This is a pretty open-ended question.

Sounds to me that we have to make a decision in which way the project is heading, and sacrifice some (planned) features accordingly.

(By the way, yet another truly problematic question with both approaches: What shall "rewrap on resize" do when such inlined pictures are encountered?)
Comment 16 adam 2017-09-25 21:30:12 UTC
Egmont, you raise excellent points and highlight some important concerns. Thanks for your thorough response!

For my own part, I care most about the ability to embed bitmap images at the current cursor location. The ability to do this often drives which environment I decide to use. For example, I use iTerm on macOS instead of Terminal almost exclusively because of its image support. iTerm's implementation is essentially just an escape sequence followed by a base64 encoded png (or other image file): https://www.iterm2.com/documentation-images.html

One example of how I use this functionality is while writing tests for web apps. When a test fails, I drop into an interactive prompt that shows the current state of the headless browser as a bitmap image along with various helpful annotations (like the bounding box of the element that failed the assertion). I have many other examples that I am happy to share if they will help identify possible requirements or narrow the design space.

While there are a number of ways to break this functionality (including changing font size, word wrap, and use of \r like escape sequences), the ability to embed bitmaps in a terminal is absolutely transformational. As a user and library author, I really just want something that works for the most common use cases. It's fine (and understandable) if the implementation isn't perfect for all combinations of interactions. It's also fine if it's yet another escape sequence I need to emit, based on some detection heuristic, as long as it's possible.

Others may disagree, and perhaps the discussion should move to another thread if it's about something different than just Sixel support (so we don't hijack the intent of this thread)... but I'm EXTREMELY excited about a solution in this space.

P.S. The mosh project has a headless terminal emulator implementation they wrote when they couldn't find an existing one: https://github.com/mobile-shell/mosh/tree/master/src/terminal
Comment 17 Egmont Koblinger 2017-09-25 22:00:05 UTC
(In reply to adam from comment #16)

> One example of how I use this functionality is while writing tests for web
> apps. When a test fails, I drop into an interactive prompt that shows the
> current state of the headless browser as a bitmap image [...]

I assume you have a browser open as well if you're developing a web app, right?

Totally unrelated to this topic, iTerm2 and us have recently introduced a new escape sequence that allows arbitrary visible text to become hyperlinks to web pages. This would of course result in a totally different (and perhaps not that optimal) workflow for you, but you could quickly go from the terminal to a web page that shows the given picture and perhaps even more, and doesn't pollute your terminal with potentially very long URLs.

> I have many other examples that I am happy to share if they will
> help identify possible requirements or narrow the design space.

I'd love to hear them, I guess these stories would be truly useful!

> While there are a number of ways to break this functionality (including
> changing font size, word wrap, and use of \r like escape sequences), [...]

I totally understand your point as a user. As a developer the main reason I don't like this direction is because such sloppiness often leads to untestable, unmaintainable, buggy or if we're not truly careful then even crashing code. Or, just people who complain that "hey it's broken after a font change" and there's nothing you can do about that.

Another interesting question for Sixel (or any solution where you'd expect the bitmap to remain at the exact pixel size): If you have a certain font specified in your termial's settings and you zoom with Ctrl-Plus, would you expect the picture that's already on the screen to zoom? And would you expect newly appearing ones to appear magnified? I guess either approaches have their pros and cons.

> P.S. The mosh project has a headless terminal emulator implementation they
> wrote when they couldn't find an existing one:
> https://github.com/mobile-shell/mosh/tree/master/src/terminal

I've just come across libvterm the other day. If anything, looking at their APIs could be interesting. I definitely wouldn't want to replace our emulation layer with any of these.
Comment 18 adam 2017-09-29 19:48:22 UTC
> I assume you have a browser open as well if you're developing a web app, right?

Often I don't have a way to see the specific page in question, because I'm testing some complex interaction and using automation. Sometimes I'm running many tests in parallel which make this even harder. That's why I love dropping into a debugger with screenshots and annotations.

> Totally unrelated to this topic, iTerm2 and us have recently introduced a new escape sequence that allows arbitrary visible text to become hyperlinks to web pages. This would of course result in a totally different (and perhaps not that optimal) workflow for you, but you could quickly go from the terminal to a web page that shows the given picture and perhaps even more, and doesn't pollute your terminal with potentially very long URLs.

Oh, very very cool. Thanks for the pointer :)

> I'd love to hear them, I guess these stories would be truly useful!

The basic theme of my use cases all revolve around iterating on things that involve a visual or graphic aspect. I gave one example of writing UI tests. Other examples include:

- doing image manipulation and image generation
- computer vision, where you'd like to live inspect data sets
- machine learning, where you'd like to see a model's annotation of an image
- high performance computing, where you'd like to see visualizations of a large algorithm running

It's true that many of the above can be implemented in a browser, but the power of a terminal has yet to be matched in a browser. Rather than re-implement terminals in browsers, I'm hoping we can support some of the image formats that work in browsers in terminals.

Regarding resize semantics and relationships with fonts, I believe iTerm2 allows you to specify widths and heights in characters or pixels. Dimensions measured in characters obviously grow and shrink when a font is resized (I assume by resizing the image based on new coordinates). Dimensions measured in pixels mean the image doesn't grow or shrink. iTerm2's image support only gets weird when word wrap kicks in. Then you can see that images are actually drown in character-sized rows ;)

Does any of the above help? Would videos help drive home the power of supporting something like sixel or have the above posts made a strong enough case to move forward on even basic support? :)
Comment 19 Stuart Axon 2018-06-12 19:53:36 UTC
Hi,
   My web development workflow would definitely be improved by this as well.
I'm not a C++ developer, but had a quick look at the code.

Some minor things -

A couple of small things to improve the English 
- "freezed", should be "frozen"
- "is_freezed" should be "is_frozen"
- "thawn" should be "thawed.

The code looks pretty fairly understandable overall even without being a C++ coder :)

@adam, I reckon videos could definitely help, there are lots of workflows that can be improved with better integration between the textual work and the gui world.
Comment 20 Jürgen Mangler 2018-08-16 21:44:26 UTC
Any progress or plans for alternate implementations, that we should be exited about?
Comment 22 Hans Petter Jansson 2020-05-21 20:13:19 UTC
I'm interested in adding sixel support to VTE, as a new patch or by bringing saitoha's work to completion -- time permitting.

I think sixels is an ok choice for a graphics protocol, mostly because it's the most widely supported one, but also because I've managed to convince myself the quantization it requires on the application side can be fairly efficient. The palettized and RLE-encoded data strikes a reasonable balance between bandwidth and CPU usage with a fairly simple encoder. Palette size also provides a size/quality knob, much like it does in PNG.

On top of that, ssh allows for compression, which can mitigate the ASCII redundancy if you've got the CPU headroom for zlib and don't mind the cryptanalysis implications.

Here are some thoughts I've had wrt. points made above. Keep in mind this is pretty rough:

* While a sixel image is being received by the terminal, we may not know its correct size until all of the data has been processed. If the dimensions are specified up front, we can use it for optimization by preallocating the destination buffer and rendering to it in a single pass. If not, we need to analyze the data, establish the maximum dimensions and then allocate and render in a separate pass. If the size is specified, any image data that falls outside the bounding box should be discarded.

* If we're processing incoming data in two passes due to missing size metadata, there must be an upper limit on the size of the receive buffer. If the limit is exceeded, the remaining image data is ignored and the image is discarded.

* When a sixel image has been written to the terminal and converted to an internal pixmap, it is added to the tail of a double-ended queue. We perform rendering and pruning from the head of this queue (i.e. oldest first).

* In order to prevent DOS, we must establish and observe an upper limit on total in-memory image data. If a new image would cause us to exceed this limit, the queue is pruned, oldest first, until there's space for it. If a new image would exceed the limit by itself, it is immediately discarded and nothing gets pruned. The limit should not be set too high, as it'll be the primary mechanism by which images expire and we don't want to bloat the process too much.

* We may want an additional upper limit on image dimensions (say, 8192x8192) for rendering efficiency reasons. Images bigger than this would be discarded.

* When sixel scrolling is enabled, images set to scroll with the text can also expire when the surrounding text leaves the scrollback buffer. Sixel scrolling should probably be enabled by default. Cf. the DECSDM CSI.

* If an image is fully overwritten by newer images, it must expire. This is especially important to supporting animations efficiently. An implementation that only considers full overlap by a single image would be sufficient for this, but going down the image queue while accumulating a region would be even better.  An exception from this is when the image is left partially visible due to transparency.

* On clear screen: All sixel images that are fully or partially visible must expire. We could improve on this in the future by cropping out the visible parts of images, but it's probably not a must-have feature.

* On redraw: First all images are drawn, oldest first. Then the character cells are drawn, possibly overwriting the images. Spaces overwrite images too, so there must be a cell state that's explicitly transparent to images.

* On font change: Images in the queue should be rescaled so as to retain their dimensions in terms of character cells. In order to save memory, the original image should not be kept. This means that if there are many font changes, images will become progressively blurrier, but assuming font changes are rare, it seems like an acceptable tradeoff. Rescaling can be done lazily on the assumption that you may have a large number of images in scrollback. VTE should ensure struct winsize.ws_xpixel and .ws_ypixel are available via the TIOCGWINSZ ioctl.

* Scrolling: Images that are set to scroll with the text must do so vertically, but they should probably just ignore any attempts to shift individual rows left/right due to inserted characters and the like and instead allow themselves to be overwritten in those cells.

* Terminal resize: Similarly to scrolling, rows inserted/removed above/below images due to wrapping should cause them to scroll vertically. However, I doubt it's worth the effort to even consider tackling wrapping text around images when both are in the same row.

* Images don't interact with copy/paste at all.

* When there's doubt as to rendering priority, text always wins. E.g. rewrapped text is displayed on top of images.

* To keep it simple, no guarantee is made as to how images printed over text will look (though I tend towards "a cell contains either image or text, not both"). However, text cells printed over an image should replace, not blend, in order to retain legibility.

Thoughts?
Comment 23 xavier.bestel 2020-05-22 13:38:57 UTC
> * On font change: Images in the queue should be rescaled so as to retain their dimensions in terms of character cells. In order to save memory, the original image should not be kept. This means that if there are many font changes, images will become progressively blurrier, but assuming font changes are rare, it seems like an acceptable tradeoff. Rescaling can be done lazily on the assumption that you may have a large number of images in scrollback. VTE should ensure struct winsize.ws_xpixel and .ws_ypixel are available via the TIOCGWINSZ ioctl.

How about instead always storing the unscaled image and scaling at render time ?
Comment 24 Hans Petter Jansson 2020-05-22 14:14:04 UTC
(In reply to xavier.bestel from comment #23)
> > * On font change: Images in the queue should be rescaled so as to retain their dimensions in terms of character cells. In order to save memory, the original image should not be kept. This means that if there are many font changes, images will become progressively blurrier, but assuming font changes are rare, it seems like an acceptable tradeoff. Rescaling can be done lazily on the assumption that you may have a large number of images in scrollback. VTE should ensure struct winsize.ws_xpixel and .ws_ypixel are available via the TIOCGWINSZ ioctl.
> 
> How about instead always storing the unscaled image and scaling at render
> time ?

It's a possibility, but then there's a processing penalty on each redraw. I don't know exactly how the rendering code works (yet) -- if scaling is implicit and thus unavoidable, or if we scale and composite on the GPU, scaling at render would be the best solution.
Comment 25 Christian Persch 2020-05-22 15:04:40 UTC
> I'm interested in adding sixel support to VTE, as a new patch or by bringing saitoha's work to completion -- time permitting.

Even though I would have preferred having a native image protocol first and then implementing sixel as an afterthought, I do recognise that many of the problems that needs to be solved are applicable also to any image protocol not just sixel. So I'd be okay with finishing sixel support first. Also it's not every day we get an offer for a substantial amount of work, so thanks for that :-)

AFAIK there were major issues with the proof-of-concept patch, but I need to refresh my memory first. Let me do that and I'll have more comments tonight.
Comment 26 Egmont Koblinger 2020-05-22 21:43:00 UTC
Hans: Thanks a lot for your input and passion!

I'm especially interested in the bigger picture (no pun intended). This comment is about that, I'll have another for some of the concrete details. You have many good points, and some that make me worry (talking about text overwriting a picture). But I mostly fear about getting the fundamentals wrong, as sixel does.

---

In my opinion, the next thing that VTE and GNOME Terminal need the most, is image support.

In my opinion, the next thing that the terminal emulation ecosystem (terminals, screen drawing libraries, applications) need the most, is to agree on _one single_ properly designed image protocol, and start migrating.

Because currently there are at least 5 (!!!) image protocols (sixel, regis, iterm2, kitty, terminology), and at least one more proposed (a truly pathetic one that doesn't aim to address the problems with the earlier ones – terminal-wg/12). This situation is terrible. Terminals don't know which one to implement, apps don't know which one to implement, and it's getting worse by attempts of "improving" the situation by adding yet another "good enough for me" protocol. This mess is obviously the reason for not having image support in many apps that really wish to have it, and it's the reason for hacks like apps trying to find out where the terminal window is and painting natively there. It obviously cannot go on like this.

---

I have a draft document, which I'll finally have time next week, or the week after the latest (assuming nothing unexpected, e.g. sickness happens) to polish and publish, and I'm definitely planning to do so.

This document builds up step by step how, and why, a good image protocol – one that is suitable for all the terminals and all the applications to migrate to – needs to look like.

I have not yet looked at the regis, iterm2, kitty, terminology protocols. But I've looked at sixel, and I know that it's terrible. A quick teaser from the study I'll publish soon: It has at least 3 different fundamental flaws, each of which on its own would make it absolutely unsuitable to become that one protocol everyone could migrate to:
 1. It breaks the principle that the emulation behavior doesn't depend on the font size. (As a consequence, it's incompabile with our planned model/view split, as already mentioned in this thread.)
 2. It cannot implement complex layouts, required by presumably quite a few apps, e.g. tmux or mc.
 3. It cannot transfer pixel-perfect photo-quality images.

To address these and much more, as I'll come to the well-founded conclusion in my document, a good image support must be character cell based. Each character cell might contain either a Unicode letter (or half of a double-wide letter), as now, [exclusive]or a picture fragment. In the latter case: it contains a pointer to a full picture, a size in characters that that image is stretched to (e.g. 80 character columns wide, 50 character rows tall), a scaling/stretching method (e.g. keep the aspect ratio and crop to fill the available space), and naming which character position within that image is contained in this very cell (e.g. this character cell contains row 20, column 30 of that scaled/stretched image).

Please don't question now why it has to be like this, my document will explain. Also please don't complain about the lack of 1:1 size, it will be possible, the document will show how.

---

Having sixel support surely looks like a great win in the short term: more apps can display pictures, users are happier.

I'd argue however that in the long term, adding sixel support is straight harmful, for three main reasons.

One is that it pushes the entire ecosystem in the wrong direction. If VTE, which has about 50% of market share within Linux, add this, it will encourage more terminals and more apps to add this absolutely broken format, or happily stick to this if it's already added, rather than looking for a better solution. It makes it harder to argue and cooperate across terminals for a better one.

Another is that if an xterm-like sixel support is implemented, or if it's implemented in a way that even raises the question how two images or an image vs. text should overlap, then it's going in the wrong direction sourcecode-wise, presumably requiring complex solutions that wouldn't normally be necessary. The charcell-based approach that I advocate for is – perhaps surprisingly – not just by far the cleanest design and the best compromise for the entire terminal ecosystem, but is also the one that's the easiest to implement.

And the third is: This would prevent us from working on the model/view split. Or if we ever decide to complete that work, it'll have to kick out sixel support, and it's much harder to communicate such a regression to users than the complete lack of sixel support.

---

If it was me (it's not – Christian can override me), I'd veto sixel support.

If it's decided to have sixel support, though, then what I'm really asking for is to build up the _internals_ (in-memory storage, in-stream storage, painting etc.) according to the soon-to-be-published document, i.e. in a charcell-based approach; which by the way automatically solves the overwriting issue; and restrict sixel handling (along with its braindamaged, model/view-split-incompatible fontsize-dependant logic) to the input processing component.

---

In the mean time, work should begin to study the existing image protocols to see whether any of them matches the document, or could be adjusted to match – I'm secretly hoping for iterm2 to be a good one –; and start cooperating. If VTE and iTerm2 agrees, it's already quite nice, and then there's a good chance we could convince Windows Terminal, too.

---

A lot of these might not make sense yet. Please bear with me until I get to polish and publish that document, which I've been planning to for half a year, but next week is the first when I'll really have time for that.
Comment 27 Egmont Koblinger 2020-05-22 22:06:16 UTC
> * In order to prevent DOS, we must establish and observe an upper limit on
> total in-memory image data.

Currently the text contents (and their attributes, and hyperlink targets) are scrolled out to the so-called "streams" (3 of them) which are backed up by temporary files. This is done in order to support enormous (even unbounded) scrollback, and not to result in OOM but rather filling up the disk (which has typically much more space, and is much easier to recover from if filled up). Also there are many other inconvenient stuff, e.g. arrays that grow but never shrink, in order to avoid long-term memory fragmentation as well as OOMs.

Would be nice to store the images in a stream, too, so that we don't have to worry about memory problems.

> * [...] This is
> especially important to supporting animations efficiently.

To be honest, I don't care about animations at all. If one needs animation, the graphical subsystem (with frame clock / vsync, GPU rendering etc.) is a great platform for that.

> * On font change: Images in the queue should be rescaled so as to retain
> their dimensions in terms of character cells. In order to save memory, the
> original image should not be kept.

I disagree with this. First, I don't find it acceptable to lose quality on font size change. Second, it might be a cool feature to offer (in the right-click menu) to display the image at 1:1 size either by VTE or GNOME Terminal itself, or to open by an external app. Let's keep the original. Scaled versions can be cached in memory up to a limit (e.g. dropped from the in-memory cache once it's no longer in the current viewport).

> Rescaling can be
> done lazily on the assumption that you may have a large number of images in
> scrollback. VTE should ensure struct winsize.ws_xpixel and .ws_ypixel are
> available via the TIOCGWINSZ ioctl.

Yes and yes. For the latter see bug 782576. Also note that .ws_[xy]pixel must not include the padding.

> * Scrolling: Images that are set to scroll with the text

Are there other kinds of images? Ones that don't scroll? Please no!

> * Images don't interact with copy/paste at all.

Or maybe U+FFFC Object Replacement Character? Dunno.

> * To keep it simple, no guarantee is made as to how images printed over text
> will look (though I tend towards "a cell contains either image or text, not
> both"). However, text cells printed over an image should replace, not blend,
> in order to retain legibility.

Well, you've entered the territory of designing an image protocol, rather than just implementing one. I sincerely hope you'll like my document, it probably aligns with your imaginagion in many aspects – although sure not everywhere.

I've not responded to bullet points which don't make sense in the context of my document. Apologies for keeping you waiting with that :)
Comment 28 Christian Persch 2020-05-22 22:27:27 UTC
> then what I'm really asking for is to build up the _internals_ (in-memory storage, in-stream storage, painting etc.) according to the soon-to-be-published document, i.e. in a charcell-based approach; which by the way automatically solves the overwriting issue; and restrict sixel handling (along with its braindamaged, model/view-split-incompatible fontsize-dependant logic) to the input processing component.

Yes, that's just about the same what I meant with preferring to have a 'native' image format first: building the foundation on which the legacy sixel protocol can be bolted on.

> In the mean time, work should begin to study the existing image protocols to see whether any of them matches the document, or could be adjusted to match – I'm secretly hoping for iterm2 to be a good one –;

(I haven't seen the document obviously, so let me just briefly say that AFAIR *all* of the image protocols are unfixably bad. E.g. the iterm2 one uses a OSC with key=value arguments, plus a base64 data payload. All of that is not terminal-like. We have a perfectly fine parser for CSI and DCS sequences; we should use it here. And pushing the payload data through a 6-bit protocol when we have an 8-bit channel is just... no. :-)  Also separating the data payload upload from the controls to use the data is necessary, IMHO.)

---

I've looked at the sixel branch again, and while there had been a bit more work on it than what I remembered, and some of it is to start addressing some of the problems it had (i.e. the storage problem), it was done before vte really started using more C++, so it can probably only be taken as hints, not forward-ported. Plus the code style, etc.

I particularly didn't like the actual sixel parser in it; it mixes parsing of the actions and executing the actions, and therefore duplicates e.g. parsing the parameters. That should be split; the design of the main parser (parser.cc) is the model how this should look (IMO). Also the parser can't be hooked up to vte the way it used to (the branch is from 2017, and in 2018 vte got a completely new parser). I have some work *planned* to make it possible to hook up extra parsers (for DCS control strings), but until that's ready (*not for a while*), you can just parse the control string completely in ::DECSIXEL (but need to temporarily bump VTE_SEQ_STRING_MAX_CAPACITY).
Comment 29 Christian Persch 2020-05-22 22:31:48 UTC
> Would be nice to store the images in a stream, too, so that we don't have to worry about memory problems.

Yes. However, it might be acceptable as a first step to only have in-memory storage and brutally prune (like dropping images once they're scrolled out of the writable ring).

> > * Images don't interact with copy/paste at all.
> Or maybe U+FFFC Object Replacement Character? Dunno.

Well, we have copy-as-html, so... :-)
Comment 30 Egmont Koblinger 2020-05-22 22:56:16 UTC
> Yes. However, it might be acceptable as a first step to only have in-memory storage and brutally prune (like dropping images once they're scrolled out of the writable ring).

That's quite harsh. I'd love to do an `ls --thumbnails` in my image folder and scroll back :) Of course it's a nice first step during development.

Since the stream is quite a complex and special format, and – if we separate the upload from the use of the data – when to purge will be a tough story, we might be thinking in some brand new approaches.

Maybe a temporary directory for each VTE for its blob uploads, where each blob is stored in its own separate file, encrypted of course, under some nested directories (like first and second letter of the filename, as usual for potentially big directories).

The only problem is that it's not cleaned up on a crash, but meh, we never crash :-D
Comment 31 Hans Petter Jansson 2020-05-23 01:27:03 UTC
Thanks for the thorough replies! I'm going to cherry-pick a bit due to time constraints, hope that's ok. Addressing Egmont in comment #26 and comment #27:

> One is that it pushes the entire ecosystem in the wrong direction. If VTE,
> which has about 50% of market share within Linux, add this, it will encourage
> more terminals and more apps to add this absolutely broken format, or happily
> stick to this if it's already added, rather than looking for a better
> solution. It makes it harder to argue and cooperate across terminals for a
> better one.

The good is the enemy of the perfect? I think there will be appetite for improvement even after sixels. I see it more as a stepping stone, with the scarce resource being time and energy to implement solutions.

> Another is that if an xterm-like sixel support is implemented, or if it's 
> implemented in a way that even raises the question how two images or an image 
> vs. text should overlap, then it's going in the wrong direction sourcecode-
> wise, presumably requiring complex solutions that wouldn't normally be 
> necessary.

It's a question that must be answered, since you can place the cursor anywhere and then print either image data or text data. Either may support "transparency". The simplest answer is that printing is destructive, cells are all-or-nothing, and transparency blends to background color. And that's a fine answer.

> Because currently there are at least 5 (!!!) image protocols (sixel, regis, iterm2, kitty, terminology

Does Terminology actually have a graphics protocol? Last time I looked I only found the ability to embed images from local paths. If I do "tycat $FILE >out" it just hangs forever... I think Sixel, iTerm2 and Kitty are the only real contenders in our category, and the latter two are tied to their respective terminal implementations.

> > * [...] This is
> > especially important to supporting animations efficiently.
> 
> To be honest, I don't care about animations at all. If one needs animation,
> the graphical subsystem (with frame clock / vsync, GPU rendering etc.) is a
> great platform for that.

Nonetheless, it's better to just face the fact that people will successively print image frames on top of each other to get animation effects. It's easy enough to support, and there are terminals that perform fine in doing so.

> > * Scrolling: Images that are set to scroll with the text
> 
> Are there other kinds of images? Ones that don't scroll? Please no!

Ha ha! See here: https://vt100.net/docs/vt3xx-gp/chapter14.html under "Sixel Scrolling Mode". I've never seen fixed-position images being used before though, and I think VTE could get away with supporting scrolling images only. It'd certainly make for a saner world :)

> > * Images don't interact with copy/paste at all.
> 
> Or maybe U+FFFC Object Replacement Character? Dunno.

My worry is where in the stream it belongs if you, say, print an image in the left margin of text spanning multiple lines. It wouldn't be a great experience if you go to copy out the text and get image junk interleaved between the lines. You could make a detailed policy of the circumstances under which image data should be selected, and that might make sense to do. But if it were me implementing it I'd put it off until a later milestone.

> > * On font change: Images in the queue should be rescaled so as to retain
> > their dimensions in terms of character cells. In order to save memory, the
> > original image should not be kept.
> 
> I disagree with this. First, I don't find it acceptable to lose quality on 
> font size change. Second, it might be a cool feature to offer (in the right-
> click menu) to display the image at 1:1 size either by VTE or GNOME Terminal 
> itself, or to open by an external app. Let's keep the original. Scaled
> versions can be cached in memory up to a limit (e.g. dropped from the in-
> memory cache once it's no longer in the current viewport).

What we're discussing here is really just optimization; the simplest approach may be to just do what Xavier proposed in comment #23 and add caching later, guided by feedback and profiling.

> Well, you've entered the territory of designing an image protocol, rather than
> just implementing one. I sincerely hope you'll like my document, it probably
> aligns with your imaginagion in many aspects – although sure not everywhere.

On protocol design; I disagree. My comments on image/text priority are about how we reconcile the data models, which in the form they've been discussed so far, are hybrid internally. I agree with you that the ideal result as presented to the user should be character cell-based and not hybrid. However there's uncertainty there, as sixels support 1-bit transparency. I think the user expectation is that printing sixels over sixels will result in the underlying image showing through transparent areas, but I'm not 100% certain what the established expectation is for how sixel transparency interacts with text. We could probably just decide that sixels entering a cell destroy its previous contents, and leave it at that. It's not a big deal as I see it.

I suspect my mindset generally is a bit more pragmatic than yours; I'd like an image layer and (parts of) the most popular delivery protocol to be implemented first, and then for it to be refined by adding more protocols and behaviors later. The idea is that by having basic support for something that's a) simple and b) popular lowers the time to a functioning ecosystem and also reduces the risk of overengineering and adding features no one'll be using.

Perspective: There have been terminals with image support for almost 40 years -- most of them sixel-based -- and this bug has been around for six years so far :)

VTE has significant market share, and sixel support would add a lot of utility to the CLI space quickly. That buys time for the really good stuff.
Comment 32 Hans Petter Jansson 2020-05-23 20:40:46 UTC
(In reply to Christian Persch from comment #28)

> vte got a completely new parser). I have some work *planned* to make it
> possible to hook up extra parsers (for DCS control strings), but until
> that's ready (*not for a while*), you can just parse the control string
> completely in ::DECSIXEL (but need to temporarily bump
> VTE_SEQ_STRING_MAX_CAPACITY).

Understood. I've rebased Hayaki's latest patch now. I brought it up to speed with the C++ changes and got it building. It's not functional yet, though, as the parsing is not hooked up.

I plan to publish a new draft patch/branch once it's somewhat working. Christian, would an MR in gitlab work, or do you prefer attachments to this bug?

Hayaki -- since upstream changed significantly, the easiest way to rebase your work was to squash the commits from your v4 branch into a single diff and copy and paste from there. Please let me know if you have particular preferences as to how it's integrated. I'll make sure your authorship is reflected in the final commit log/copyright notices.
Comment 33 Egmont Koblinger 2020-05-23 21:08:06 UTC
(In reply to Hans Petter Jansson from comment #31)

> The good is the enemy of the perfect?

I guess you meant to say "the perfect is the enemy of the good"?

Yup. But sixel is bad, mmmkay :)

> It's a question that must be answered, since you can place the cursor
> anywhere and then print either image data or text data. Either may support
> "transparency". The simplest answer is that printing is destructive, cells
> are all-or-nothing, and transparency blends to background color. And that's
> a fine answer.

I agree. There's no support for overstriking two letters, and I can't recall anyone asking for that. I can't imagine overstriking two picture fragments, or a picture fragment and a letter (in either order) being useful anywhere, but supporting them would come with a giant cost, both protocol-wise (designing the exact behavior, e.g. how to enable this overstriking mode, how to wipe out everything), and implementation-wise. It's simply not worth it.

> Does Terminology actually have a graphics protocol? Last time I looked I
> only found the ability to embed images from local paths.

Maybe, I don't know. But I did mean to include those, too, I mean it also comes with an escape sequence and a documented behavior for that, right?

> Nonetheless, it's better to just face the fact that people will successively
> print image frames on top of each other to get animation effects. It's easy
> enough to support, and there are terminals that perform fine in doing so.

Well, if there's an image protocol defined, then certainly, yes, people will use it for animation, and it has to work functionally correctly, otherwise the implementation in the terminal is buggy.

There's no guarantee however that it'll be smooth, 60 Hz, aligned to vsync etc. Terminals are even free to do their own periodic GC, or written in a Java-like language that does it behind the curtains, i.e. sometimes stop for a noticeably longer time to internally clean up the expired images, and as a side effect, halt the animation for some time.

What I meant to say was, and sorry that I wasn't clean: as long as it's about overwriting one image with another, sure we must support that; but if it was about a guarantee that we can do smooth 60 Hz animation then that's something I really don't think we should aim to address, neither in the protocol (e.g. escape sequence to wait for the next clock frame) nor in the implementation (i.e. there's no point in optimizing image support more than necessary for a nice experience with static images).

> > Or maybe U+FFFC Object Replacement Character? Dunno.
> 
> My worry is

And also re Christian's comment:

I realize U+FFFC wasn't a good idea here. Let's just omit them. It's pretty much incompatible with the "each charcell holds a picture fragment" approach, I mean I wouldn't want to go into the business of connecting neighboring charcells that hold correctly neighboring pictures, and then figuring out how many FFFCs to emit, or handle partially visible images, or handle images spanning across multiple lines with text overflowing around them and convert to HTML's "float" and whatnot. Given how the terminal is a weird mixture of logical data (possibly spanning across multiple lines) vs. visual data (fields nicely aligned underneath each other), it's impossible to convert it to HTML which is logical only, and has to look nice at every possible width.

So yup, let's omit pictures when copy-pasting, even in HTML.

(What I mixed it up with for a moment when I commented that: My proposal says that for the sake of running the BiDi algorithm, as per its recommendation, these charcells should be treated as FFFCs.)

> On protocol design; I disagree. My comments on image/text priority are about
> how we reconcile the data models, which in the form they've been discussed
> so far, are hybrid internally. I agree with you that the ideal result as
> presented to the user should be character cell-based and not hybrid.

I'm firmly against implementing a hybrid data model. The data model must remain charcell-based.

> However
> there's uncertainty there, as sixels support 1-bit transparency. I think the
> user expectation is that printing sixels over sixels will result in the
> underlying image showing through transparent areas, but I'm not 100% certain
> what the established expectation is for how sixel transparency interacts
> with text. We could probably just decide that sixels entering a cell destroy
> its previous contents, and leave it at that. It's not a big deal as I see it.

I don't understand how it's related to the hybrid vs. charcell-based model (the first part of your paragraph, which I just split in two in order to comment in the middle). Anyway.

As for transparency, repeating myself from the top of this comment: I don't see the point in a mode where painting an image would preserve whatever is underneath at the transparent pixels of the new image. It's also against many design decisions we want to go for. Thus I fully support the last two quoted sentences.

> I suspect my mindset generally is a bit more pragmatic than yours;

I don't think so. :)

> I'd like
> an image layer and (parts of) the most popular delivery protocol to be
> implemented first, and then for it to be refined by adding more protocols
> and behaviors later.

I want the internals to be done right, really-really right. And right doesn't mean sixel-driven. I don't mean to fully support every single weird thing sixel might need. I want a strictly charcell-based picture fragment approach, which is much cleaner, much easier to implement than a hybrid one. This, in turn, prevents some of the misfeatures, such as overstriking one image with another in a way that the old one shines through transparent pixels of the new. I think we've agreed that it's fine to not support this.

This charcell-based internal representation is absolutely suitable for simple sixel handling (the behavior on font size change will differ from xterm's, IMHO for the better). This charcell-based approach is also what Christian wants, and you're fine with it too, if I understand you correctly.

Then the next step is to implement – or not to implement – sixel as (one of the) input format(s). I'd veto it, but it's not my decision, and Christian wants it, and you want it too, so apparently once the internals are done, parsing sixel will be added too. I've accepted that I've lost this battle. :)
Comment 34 Christian Persch 2020-05-23 22:39:08 UTC
> I plan to publish a new draft patch/branch once it's somewhat working.
> Christian, would an MR in gitlab work, or do you prefer attachments to this
> bug?

vte doesn't use MRs; a branch in either your gitlab.g.o fork of vte, or under wip/$username in the main vte repo is equally fine.


(In reply to Egmont Koblinger from comment #33)
> (In reply to Hans Petter Jansson from comment #31)
> I agree. There's no support for overstriking two letters, and I can't recall
> anyone asking for that.

Just wait until someone proposes the non-spacing mosaics sets C and D from ARIB B24 for unicode ;-)

> > > Or maybe U+FFFC Object Replacement Character? Dunno.
> > 
> > My worry is
> 
> And also re Christian's comment:
> 
> I realize U+FFFC wasn't a good idea here. Let's just omit them. It's pretty
> much incompatible with the "each charcell holds a picture fragment"
> approach, I mean I wouldn't want to go into the business of connecting
> neighboring charcells that hold correctly neighboring pictures, and then
> figuring out how many FFFCs to emit, or handle partially visible images, or
> handle images spanning across multiple lines with text overflowing around
> them and convert to HTML's "float" and whatnot. Given how the terminal is a
> weird mixture of logical data (possibly spanning across multiple lines) vs.
> visual data (fields nicely aligned underneath each other), it's impossible
> to convert it to HTML which is logical only, and has to look nice at every
> possible width.
> 
> So yup, let's omit pictures when copy-pasting, even in HTML.

Hmm maybe I misunderstood, or was misunderstood; I actually do think that copying images to plain text should emit OBJ (or *something*, not simply SPACE) for each cell that's covered by the image.
 
> > I'd like
> > an image layer and (parts of) the most popular delivery protocol to be
> > implemented first, and then for it to be refined by adding more protocols
> > and behaviors later.

There may be a missed opportunity that if we implement sixel first, applications adding graphis support will support that, whereas if we did a native protocol first, then there'd be more uptake right from the start.

> I want the internals to be done right, really-really right.

I agree.

> Then the next step is to implement – or not to implement – sixel as (one of
> the) input format(s). I'd veto it, but it's not my decision, and Christian
> wants it, and you want it too, so apparently once the internals are done,
> parsing sixel will be added too. I've accepted that I've lost this battle. :)

Well, I don't 'want it' in that I personally want to use it, just that I think that when the foundations are laid, adding sixel will be a low-cost way to add support for existing applications that already do support sixel. Although I may contradict myself here with my above observation about lost opportunities :-)
Comment 35 Hans Petter Jansson 2020-05-23 22:51:37 UTC
(In reply to Egmont Koblinger from comment #33)
> (In reply to Hans Petter Jansson from comment #31)
> 
> > The good is the enemy of the perfect?
> 
> I guess you meant to say "the perfect is the enemy of the good"?

No, but it was a play on it.

> Yup. But sixel is bad, mmmkay :)

It's bad, but if you factor in the ecosystem, it's also the best. And unfortunately, it's likely to stay that way for many years.

I think we've more or less reached agreement on the rest and are just talking past each other on some of the details. I'm looking forward to reading your spec!
Comment 36 Hans Petter Jansson 2020-05-24 01:44:37 UTC
It works well enough to play around with now, although it has a few obvious issues.

https://gitlab.gnome.org/hansp/vte/-/tree/sixel-v5-hpj
Comment 37 Hans Petter Jansson 2020-05-25 15:23:27 UTC
Christian, is there a place where we can have a more high-bandwidth discussion about implementation details? I don't want to spam this bug too much.
Comment 38 Christian Persch 2020-05-25 16:57:06 UTC
New issue(s) in gitlab.g.o/gnome/vte perhaps?
Comment 39 Egmont Koblinger 2020-05-27 08:56:43 UTC
(In reply to Christian Persch from comment #34)

> There may be a missed opportunity that if we implement sixel first,
> applications adding graphis support will support that, whereas if we did a
> native protocol first, then there'd be more uptake right from the start.

I'm glad that you – at least to some extent – share one of my big concerns with the sixel approach :-)
Comment 40 Egmont Koblinger 2020-05-27 09:33:04 UTC
As for implementation:

I was thinking about two areas.

---

In-memory storage:

I thought we were using about 12 bytes per charcell: 4 for the Unicode codepoint, 8 for colors, 1-ish for misc attributes, 2 for hyperlink target index – geez I had already done the math wrong :-) I was about to comment that I think it's fine to increase the storage, but it's already bigger, probably big enough.

The 4-byte field containing the Unicode character is already a "union" (not strictly as in a C union, but effectively it is): values 0..10FFFF are Unicode, values above U+80000000 are vteunistr (i.e. our indexed solution for addressing a base character + combining accents on 4 bytes). We could easily squeeze room for image pointers: a range of values would index to the pool of the "image, with size in characters, and scaling-aligning method" possibilities we have. Maybe the clean allocation is to move vteunistr to the 110000..7FFFFFFF range, and let images have the high bit. (And assert that our charset decoder cannot emit anything above 10FFFF.) Plus, we'd need two per-character fields, preferably at least 10 bits each (to support image dimensions somewhat larger than 256 characters) that specify which position within that image is contained in the particular cell.

If really-really required, we could steal some bits, e.g. foreground/underline color, explicit hyperlink target, bold, italic, reverse. But I wouldn't want to, I'd rather have a cleaner code, more flexible functionality, paying the price of somewhat larger memory usage (if needed). Foreground/underline colors are used for strikethrough/overlining/underlining, which I would still permit over images, so the feature can be used to display tiny letter-sized (1x1 and 2x1) images that still get all the visuals. Fg color might also be used when mouse highlighting. Also why not allow images to be explicit hyperlinks too.

---

Painting:

Right now we do a 2-pass painting (the two big loops of draw_rows()): backgrounds first, followed by letters. This is so that letters can overflow (to some extent) vertically or horizontally to neighboring cells. This way accents don't get cropped off, glyphs that are wider than their designated area are fully visible (even if this is just a final subpixel column, it makes the letter look much nicer than if cropped).

We should think about what we want to do on the boundary of text and image. I see three possibilities: Never allow a letter to overflow to the image area, allow overflow only into transparent regions of the image, always allow overflow.

We'll probably want to have a 3-pass painting: either the second or the third round devoted to images. The order, and the clipping regions will have to be chosen according to the behavior we wish to implement.

An extremely easy optimization, probably a must-have for performance reasons: Within a character row, locate continuous runs of picure fragments (i.e. cells witht he same image and size to stretch to and scaling-aligning method, also the character row position is the same, and the character column position is increasing(*) by 1), and crop+copy them from the actual image in a single run. Pretty much what we already do when painting the background. A trickier, optional optimization is to do this even across rows.

---

There are of course many other areas that I haven't touched now.
Comment 41 Egmont Koblinger 2020-05-27 09:38:03 UTC
(*) Special considerations will apply for RTL/BiDi as per the image protocol design doc.
Comment 42 Christian Persch 2020-05-27 10:39:50 UTC
(In reply to Egmont Koblinger from comment #40)
> ---
> 
> In-memory storage:
> 
> I thought we were using about 12 bytes per charcell: 4 for the Unicode
> codepoint, 8 for colors, 1-ish for misc attributes, 2 for hyperlink target
> index – geez I had already done the math wrong :-) I was about to comment
> that I think it's fine to increase the storage, but it's already bigger,
> probably big enough.
> 
> The 4-byte field containing the Unicode character is already a "union" (not
> strictly as in a C union, but effectively it is): values 0..10FFFF are
> Unicode, values above U+80000000 are vteunistr (i.e. our indexed solution
> for addressing a base character + combining accents on 4 bytes). We could
> easily squeeze room for image pointers: a range of values would index to the
> pool of the "image, with size in characters, and scaling-aligning method"
> possibilities we have. Maybe the clean allocation is to move vteunistr to
> the 110000..7FFFFFFF range, and let images have the high bit. (And assert
> that our charset decoder cannot emit anything above 10FFFF.) Plus, we'd need
> two per-character fields, preferably at least 10 bits each (to support image
> dimensions somewhat larger than 256 characters) that specify which position
> within that image is contained in the particular cell.
> 
> If really-really required, we could steal some bits, e.g.
> foreground/underline color, explicit hyperlink target, bold, italic,
> reverse. But I wouldn't want to, I'd rather have a cleaner code, more
> flexible functionality, paying the price of somewhat larger memory usage (if
> needed). Foreground/underline colors are used for
> strikethrough/overlining/underlining, which I would still permit over
> images, so the feature can be used to display tiny letter-sized (1x1 and
> 2x1) images that still get all the visuals. Fg color might also be used when
> mouse highlighting. Also why not allow images to be explicit hyperlinks too.

I think it would be bad for both memory efficiency and painting time to slice images up into tiny cell-sized bits and store/paint them individually. I'd rather have images be contiguous, and paint them in one go. (Esp. for the gtk4 future.) Of course need to make sure only to paint where not overwritten by actual text, but that can be done by clipping, or cutting a 'hole' in it, or overpainting (draw images first, text on top).

And I don't think we need to (nor should) support hyperlinks on images. Then the hyperlink_idx field is a much better candidate to be re-named and re-purposed for storing extra info: On an image cell (can one bit in the attrs) it carries image info, otherwise it carries hyperlink info.

> (And assert that our charset decoder cannot emit anything above 10FFFF.)

That much is already true; decoders output UTF-32.
Comment 43 Christian Persch 2020-05-27 10:48:11 UTC
> Foreground/underline colors are used for strikethrough/overlining/underlining, which I would still permit over images, so the feature can be used to display tiny letter-sized (1x1 and 2x1) images that still get all the visuals.

I don't think images should paint any attributes. Attributes apply to text; text and image content should be mutually exclusive.

Also, having that would encourage abusing images as a 'custom font' feature; but IMO if we want that feature, it should be done separately.

> Plus, we'd need
> two per-character fields, preferably at least 10 bits each (to support image
> dimensions somewhat larger than 256 characters) that specify which position
> within that image is contained in the particular cell.

AFAIR we limit to 511 columns/rows, so 2*9 bits would be enough, but even 2*10 bits fine into the 32 bits of the hyperlink_idx field, leaving 14 or 12 bits for e.g. an index into the pool of visible images.
Comment 44 Egmont Koblinger 2020-05-27 11:19:59 UTC
> I think it would be bad for both memory efficiency and painting time to slice images up into tiny cell-sized bits and store/paint them individually.

I definitely don't want to slice them up either in the storage. It would be the entire image (probably the original blob unchanged), and a cached variant thereof which is converted to pixmap and scaled/stretched to the desired size (and invalidated whenever the font size changes, or whenever else we please).

It's just that with the picture-fragment approach, according to the soon-to-be-published design doc, images can be cropped arbitrarily along character edges. E.g. the following is a perfectly valid "screenshot" from a two-panel Midnight Commander, quick-viewing the image in the right panel, with an overlapping dialog in the middle:

+------------------------+------------------------+
| file1      blah1       |      p      i          |
| file2      bla+--------+--------+        c      |
| file3      bla|   popup text    |           t   |
| file4      bla+--------+--------+   u           |
| file5      blah5       |      r        e        |
+------------------------+------------------------+

In this case, and even more complex cases, it's IMO cumbersome to write the code that sets up the clipping rectangles to then display the picture in a single step.

It's much easier to write the code so that only contiguous runs within a character row are located and painted in a single step. This _might_ be fast enough, I don't know. If it turns out to be slow then we definitely need the more complex logic: setting up the clipping area so that then it can be painted in one step. Or if whoever implements it decides to go for the complex logic and presumably faster painting right away, so it be! :) It's just that _I_ would probably start with the simpler logic.

> Also, having that would encourage abusing images as a 'custom font' feature

That's an interesting point.

I was thinking about it as a possibility to add a few glyphs that are usually missing, or provide powerline-like experience without having to install powerline. But it sure can be misused if used too agressively.

And this sure might deserve a different brand new feature, such as allowing to upload font files (TTF etc.) to the SGR 11..19 slots.

> AFAIR we limit to 511 columns/rows

We don't. It's a limitation of ncurses and slang, and IIRC newest versions of slang dropped this limitation. Maybe ncurses dropped it too, I don't know. Apparently there's public demand for more than 512 columns. I'm pretty sure VTE is good up to int16 (32767) columns and rows. [TAB handling stops at 999, IIRC.]

Also note that we're not talking about the screen column here, but the column within the image. Now, I don't really care about the use case when someone wants to say: "stretch this image to 100000x100000 charcells, and display this 80x24 part of it". But maybe someone wants to "just print" a heavily vertical picture, aligning its width to the terminal's (e.g. 500), and then it'll occupy much more than 500 char rows. IMO it doesn't hurt if we allow a bit more than what seems to be common sense use, e.g. 4096 charcells (1.5 bytes) per axis.

> but even 2*10 bits fine into the 32 bits of the hyperlink_idx field, leaving 14 or 12 bits for e.g. an index into the pool of visible images.

And then you haven't even used the Unicode codepoint for this. If you use the Unicode codepoint for the index, you can have image sizes up to 64k chars.

The point is: we have plenty of bits to choose from.

And even if we don't support underlines or hyperlilks over images, if one day there becomes a public demand for that, it'll be really easy to add it (perhaps paying the price of a few new bytes). So I'm not worried if we disallow them now.
Comment 45 Jürgen Mangler 2020-05-28 18:36:15 UTC
Regarding transparency:

> The simplest answer is that printing is destructive, cells are all-or-nothing, and transparency blends to background color.

> I think the user expectation is that printing sixels over sixels will result in the underlying image showing through transparent areas, but I'm not 100% certain what the established expectation is for how sixel transparency interacts with text. We could probably just decide that sixels entering a cell destroy its previous contents, and leave it at that.

I think the all or nothing approach makes for the cleanest/simplest implementation. Every newly printed cell erases all cells at the print position.

* Text printed over images 'punches a hole' into the image, text has the set background.
* Partial image over image is also destructive - no transparency. For transparent parts always the terminal background colour is used.
* Image over text erases the text, no matter the transparency of the image. Again background terminal colour is used for transparent parts.

If you have an app with overlapping pics, you can always use/design a clever lib that does that for you, *before* printing it. And nobody can and should stop anybody from mixing text into the image - simulating all kinds of transparency - through this fictional lib. In short - if a lib can do it, dont build it in.

As for hyperlinks: every cell no matter if it displays image or text should retain function as a hyperlink; i think that could be used in creative ways by app devs. Overline/under/striketrhough and other shenanigans seem pointless (see fictional lib).

I think the implementation should be as straight forward as possible, i think there is no need for any fancy combination and transparency behaviour. I also assume that this is a mental model that is easier to convey to developers than anything else.
Comment 46 Christian Persch 2020-05-28 21:51:07 UTC
> As for hyperlinks: every cell no matter if it displays image or text should retain function as a hyperlink; i think that could be used in creative ways by app devs.

Not that I agree that images *should* have hyperlinks, but since the hyperlink is set to the 'current hyperlink' when the character content of the cell is set, there could only be one hyperlink for the whole image (the current hyperlink at the time the image is put in that rect), not one per cell the image covers. So it can be stored in the image, not the cell, and they hyperlink_idx field remains free to use for the image.
Comment 47 Hans Petter Jansson 2020-05-28 23:43:21 UTC
(In reply to Christian Persch from comment #38)
> New issue(s) in gitlab.g.o/gnome/vte perhaps?

Sounds good.

https://gitlab.gnome.org/GNOME/vte/-/issues/253

I made a couple of sub-issues so we can compartmentalize the discussion and review part of the diff in each one.
Comment 48 Egmont Koblinger 2020-05-31 11:32:45 UTC
I've posted the document I promised:

https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/26
Comment 49 GNOME Infrastructure Team 2021-06-10 14:50:55 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2084.