Bug 779734 – Escape sequence and UI to mark text as hyperlink with URL

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 779734 - Escape sequence and UI to mark text as hyperlink with URL


Summary:	Escape sequence and UI to mark text as hyperlink with URL


Status:	RESOLVED OBSOLETE

Product:	gnome-terminal
Classification:	Core
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	GNOME Terminal Maintainers
QA Contact:	GNOME Terminal Maintainers

URL:	https://gist.github.com/egmontkob/eb1...
Whiteboard:

Depends on:	780785
Blocks:

Reported:	2017-03-07 22:38 UTC by Josh Triplett
Modified:	2021-06-10 21:15 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Demo implementation (vte bits) (15.78 KB, patch) 2017-03-13 23:23 UTC, Egmont Koblinger	none	Details \| Review
Demo implementation (gnome-terminal bits) (8.99 KB, patch) 2017-03-13 23:23 UTC, Egmont Koblinger	none	Details \| Review
Demo implementation (vte bits, v2) (16.88 KB, patch) 2017-03-14 09:43 UTC, Egmont Koblinger	none	Details \| Review
Test file (8.03 KB, text/plain) 2017-03-19 12:32 UTC, Egmont Koblinger		Details
VteCell refactoring, no more union with int (8.52 KB, patch) 2017-03-21 22:10 UTC, Egmont Koblinger	committed	Details \| Review
Vte OSC 8 placeholder (2.29 KB, patch) 2017-03-21 22:21 UTC, Egmont Koblinger	committed	Details \| Review
vte, work in progress 1 (34.29 KB, patch) 2017-03-24 00:27 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 2 (43.58 KB, patch) 2017-03-26 00:17 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 3 (49.62 KB, patch) 2017-03-27 00:03 UTC, Egmont Koblinger	none	Details \| Review
Test file, v2 (8.65 KB, text/plain) 2017-03-27 20:58 UTC, Egmont Koblinger		Details
vte, work in progress 4 (51.02 KB, patch) 2017-03-27 21:02 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 5 (55.28 KB, patch) 2017-03-28 22:50 UTC, Egmont Koblinger	none	Details \| Review
Test file, v3 (22.10 KB, text/plain) 2017-03-31 21:08 UTC, Egmont Koblinger		Details
vte, work in progress 6 (56.98 KB, patch) 2017-03-31 21:13 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 7 (59.31 KB, patch) 2017-04-01 22:50 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 8 (64.08 KB, patch) 2017-04-03 21:58 UTC, Egmont Koblinger	none	Details \| Review
gnome-terminal, work in progress 8 (14.00 KB, patch) 2017-04-03 22:00 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 9 (92.72 KB, patch) 2017-04-07 00:01 UTC, Egmont Koblinger	none	Details \| Review
gnome-terminal, work in progress 9 (13.64 KB, patch) 2017-04-07 00:01 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 9a (92.65 KB, patch) 2017-04-07 08:09 UTC, Egmont Koblinger	none	Details \| Review
Show file preview in tooltip (demo) (5.95 KB, patch) 2017-04-08 14:48 UTC, Egmont Koblinger	none	Details \| Review
gnome-terminal, work in progress 10 (14.98 KB, patch) 2017-04-10 21:40 UTC, Egmont Koblinger	none	Details \| Review
Show file preview in tooltip (demo), work in progress 10 (7.01 KB, patch) 2017-04-10 21:43 UTC, Egmont Koblinger	needs-work	Details \| Review
Screenshot of file preview in tooltip (71.26 KB, image/png) 2017-04-10 21:50 UTC, Egmont Koblinger		Details
ls --hyperlink (coreutils 8.27) draft patch (8.00 KB, patch) 2017-04-11 23:02 UTC, Egmont Koblinger	none	Details \| Review
less -R (487) draft patch (6.07 KB, patch) 2017-04-12 23:57 UTC, Egmont Koblinger	none	Details \| Review
less -R (487) draft patch v2 (7.13 KB, patch) 2017-04-15 13:03 UTC, Egmont Koblinger	reviewed	Details \| Review
vte, work in progress 11 (95.36 KB, patch) 2017-04-19 20:19 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 12 (103.84 KB, patch) 2017-04-24 00:07 UTC, Egmont Koblinger	none	Details \| Review
gnome-terminal, work in progress 12 (15.20 KB, patch) 2017-04-24 00:07 UTC, Egmont Koblinger	committed	Details \| Review
vte, work in progress 13 (104.62 KB, patch) 2017-04-24 16:32 UTC, Egmont Koblinger	none	Details \| Review
vte, work in progress 14 (109.96 KB, patch) 2017-04-24 23:04 UTC, Egmont Koblinger	committed	Details \| Review
less -R (520) draft patch (8.27 KB, patch) 2017-10-20 11:34 UTC, Egmont Koblinger	reviewed	Details \| Review

Description Josh Triplett 2017-03-07 22:38:05 UTC

(This would require changes in both VTE and gnome-terminal; filing it on gnome-terminal first for discussion.)

I'd like to have a terminal escape sequence that marks subsequent text as a hyperlink with a specified URL, and a sequence based on the same escape that marks the end of the link.  That would allow programs running in the terminal to provide hyperlinks based on context that wouldn't work via general URL recognition.

As an example: apt-listchanges knows that its content consists of Debian changelogs.  Debian changelogs have a standard pattern to specify a Debian bug number.  When displayed in a web browser, apt-listchanges turns those URLs into hyperlinks.  If gnome-terminal (and VTE) supported an escape sequence for hyperlinks, then apt-listchanges could emit that escape sequence to turn bug numbers into hyperlinks within less and gnome-terminal.

Within VTE, this would require adding such an escape sequence, remembering the URL, and providing that information to the program embedding VTE.  Within gnome-terminal, this would require extending the existing hyperlink treatment to show an URL on hover or in the context menu, and provide the "Open Link" and "Copy Link Address" menu items.

Does this seem like reasonable functionality to support?

Comment 1 Egmont Koblinger 2017-03-07 23:54:06 UTC

This is a truly interesting idea. I have also thought about the possibility a while ago.

Technically:

I think it's doable, although requires considerable amount of work in VTE (plus tons of boring boilerplate in both components).

The biggest part is probably figuring out how to store these meta-info in the scrollback. Probably we should go for something like run-length encoding just as with the current attribute stream. Given that URLs have no (reasonable) upper limit on their length, we can't go with fixed records. Maybe we'll need a fourth stream that just concatenates these URLs, and attr_stream would contain pointers to this. Luckily, increasing attr_stream's record size is not really an issue due to the compression. Having a 4th stream, however, would noticeably increase VTE's memory footprint (write buffer and read cache). Or maybe we can introduce some tricky extension to attr_stream's format so that it contains the URL inlined and then switches back to fixed records at an aligned offset.

Remembering these URLs for every single onscreen character cell is also nontrivial in the ring. Strdup'ing for each cell is a total waste. So, we'd need some kind of hashing/indexing, and refcount/gc blahblahblah...

Note that just by VTE and gnome-terminal supporting this, you don't get too much benefits. You mentioned "less", that would need to be patched too. Ncurses and Slang libs probably wouldn't receive support in the foreseeable future. Not sure if apt-get folks would be open to this idea. Also not sure how to detect whether the terminal emulator supports this feature, and/or what escape sequence should we invent that is safely ignored by all other terminal emulators.

Alas, there's a heavy resistance against innovation in the terminal emulator world, even in much smaller things (or maybe _because_ they are less important?). E.g. ncurses/terminfo's maintainer still refused to make mouse clicks beyond 223 work by default in ncurses apps, and refused to introduce a terminfo entry denoting truecolor support (even despite many terminal emulators and many key apps already supporting truecolor). I really don't see a chance he'd introduce a flag to denote this new feature in terminfo, so we'd have to go our way.

If we implement this feature, I'd like to at the very least:

- coordinate with iTerm2's and konsole's authors,

- figure out how to make it generic enough (so that it's not necessarily a webpage where the link points to -- maybe allowing other namespaces in addition to file/ftp/http/https such as "manpage:bash(1)" is enough, not sure at this moment).

From the users' point of view:

It would obviously be a great convenience feature in lots of situations. I can imagine "git diff" marking the commit IDs to links depending on the reporitory's configs, a file manager marking all filenames with "file:..." URLs for easy opening in a browser, etc.

It could help with line-breaking problems such as https://midnight-commander.org/ticket/3229 (although just as the hard vs. soft EOL issue is not solved in Ncurses/Slang, this feature would probably not be implemented either in these libs either), and could even address the case when the URL is partially scrolled out in the application.

On the other hand, it would kinda go against the basics of terminal emulation that the only data is whatever's printed on the screen, there's no "hidden" contents behind. Not sure if breaking this philosophy is okay or not.

Note that Ubuntu patches gnome-terminal so that it recognizes "LP:number"-style bug numbers. Debian seems to use "#number" which is quite generic. You're free to add a handler for this, taking you to that Debian number, and just live with the false positives this gives you. I understand it's not ideal and not suitable for mainstream or distro inclusion, but might be a perfect quick hack for your personal needs.

Comment 2 Josh Triplett 2017-03-08 03:32:39 UTC

(In reply to Egmont Koblinger from comment #1)
> The biggest part is probably figuring out how to store these meta-info in
> the scrollback. Probably we should go for something like run-length encoding
> just as with the current attribute stream. Given that URLs have no
> (reasonable) upper limit on their length, we can't go with fixed records.
> Maybe we'll need a fourth stream that just concatenates these URLs, and
> attr_stream would contain pointers to this. Luckily, increasing
> attr_stream's record size is not really an issue due to the compression.
> Having a 4th stream, however, would noticeably increase VTE's memory
> footprint (write buffer and read cache). Or maybe we can introduce some
> tricky extension to attr_stream's format so that it contains the URL inlined
> and then switches back to fixed records at an aligned offset.
> 
> Remembering these URLs for every single onscreen character cell is also
> nontrivial in the ring. Strdup'ing for each cell is a total waste. So, we'd
> need some kind of hashing/indexing, and refcount/gc blahblahblah...

How about an ordered list of spans (row and start/stop column), together with a map from row number to first span in that row?  The ordered list would make it easy to walk and render, append new spans as soon as you get the "end of hyperlink" sequence, and prune from the beginning when things fall off the start of a limited scrollback.  (You'd also need a single "currently open span" entry with no ending as part of the terminal state.)  The map would make it fast to ask "is there a hyperlink in this row" as you scroll and render.  And this approach would also keep things small and efficient if you don't have any hyperlinks at all.

If needed for further efficiency, the onscreen character cells could then just have a flag for "I'm part of a hyperlink" (and/or "I'm the start/end of a hyperlink"), and you can look up the URLs as needed.

> Note that just by VTE and gnome-terminal supporting this, you don't get too
> much benefits. You mentioned "less", that would need to be patched too.
> Ncurses and Slang libs probably wouldn't receive support in the foreseeable
> future. Not sure if apt-get folks would be open to this idea.

I'll happily write the patches for apt and less myself.  (For less, it just needs to pass through that escape sequence and know that it doesn't take up any width.)

> Also not sure
> how to detect whether the terminal emulator supports this feature, and/or
> what escape sequence should we invent that is safely ignored by all other
> terminal emulators.

Yeah, that seems like the most annoying part of this.  For detection, ideally you'd use either $TERM or terminfo.  That proves challenging though, since gnome-terminal uses existing values of $TERM and doesn't have its own terminfo entry, and since as you noted below terminfo has stagnated.  Detection based on $VTE_VERSION won't suffice here, either, because we want this to work portably to other terminals, as well as across remote connections.  And something querying the terminal and requesting a report back won't work either, since that doesn't support run-once-and-done applications that don't want to delay for a response.

Given that gnome-terminal and multiple other terminals already uses OSC 7 for "set current directory" (so that opening a new tab keeps the same working directory), that seems like a sensible pattern to follow.  So I think you'd want to use "OSC Ps ; Pt BEL" with a new Ps value, such as 8, and the URL as Pt; that sequence with an empty Pt would act as an "end tag".  gnome-terminal already uses OSC 7 for "set current directory".  xterm and some other terminals already ignore that.  Current gnome-terminal doesn't, though.

This does seem trivial to encode in terminfo, though, if we thought terminfo would accept it: somenewcap=\E]8;%p1%s\007

In any case, it seems worth implementing this first, which would allow applications to experiment with it, and worrying about detection methods in parallel.

> Alas, there's a heavy resistance wagainst innovation in the terminal emulator
> world, even in much smaller things (or maybe _because_ they are less
> important?). E.g. ncurses/terminfo's maintainer still refused to make mouse
> clicks beyond 223 work by default in ncurses apps, and refused to introduce
> a terminfo entry denoting truecolor support (even despite many terminal
> emulators and many key apps already supporting truecolor). I really don't
> see a chance he'd introduce a flag to denote this new feature in terminfo,
> so we'd have to go our way.

Yeah.  Unfortunate. :(

> If we implement this feature, I'd like to at the very least:
> 
> - coordinate with iTerm2's and konsole's authors,

I'd suggest also coordinating with xterm, as the maintainer of http://invisible-island.net/xterm/ctlseqs/ctlseqs.html .  They could at least add a "reserved" entry (which they should also do for OSC 7 as currently used by gnome-terminal)

> - figure out how to make it generic enough (so that it's not necessarily a
> webpage where the link points to -- maybe allowing other namespaces in
> addition to file/ftp/http/https such as "manpage:bash(1)" is enough, not
> sure at this moment).

I think a URL with potentially arbitrary scheme would suffice.  People already use a "man:..." scheme for manpages in other places.

> From the users' point of view:
> 
> It would obviously be a great convenience feature in lots of situations. I
> can imagine "git diff" marking the commit IDs to links depending on the
> reporitory's configs, a file manager marking all filenames with "file:..."
> URLs for easy opening in a browser, etc.

Exactly.  I could even imagine making a fancy "ls" that links to files in a file manager, or just a subset of files such as images or archives; people have done hacks like that in interesting new terminal emulator prototypes, but this would make them supportable in a mainstream terminal.

> It could help with line-breaking problems such as
> https://midnight-commander.org/ticket/3229 (although just as the hard vs.
> soft EOL issue is not solved in Ncurses/Slang, this feature would probably
> not be implemented either in these libs either), and could even address the
> case when the URL is partially scrolled out in the application.

Nice idea; if the application running in the terminal knows about the full off-screen URL, it can do a better job than the terminal can.

> On the other hand, it would kinda go against the basics of terminal
> emulation that the only data is whatever's printed on the screen, there's no
> "hidden" contents behind. Not sure if breaking this philosophy is okay or
> not.

Terminals already don't follow that; they have extensive formatting information for text, not to mention things like the "alternate screen".  This just seems like an additional kind of formatting: "format this as a hyperlink to the specified URL".

> Note that Ubuntu patches gnome-terminal so that it recognizes
> "LP:number"-style bug numbers. Debian seems to use "#number" which is quite
> generic. You're free to add a handler for this, taking you to that Debian
> number, and just live with the false positives this gives you. I understand
> it's not ideal and not suitable for mainstream or distro inclusion, but
> might be a perfect quick hack for your personal needs.

Not looking for the quick hack here; I'd like to find the right general solution.

Comment 3 Egmont Koblinger 2017-03-08 22:02:39 UTC

(In reply to Josh Triplett from comment #2)

I was thinking about it all day :) More and more sounds like a cool feature, and also the technical challenge makes it more interesting to me than most other vte/gnome-terminal bugs :) If I had time, probably this is the next issue I'd pick right now. Hope this feeling of mine won't change until I actually have time (which, unfortunately, will probably not happen in the near future).

Random thoughts (some are really insignificant):


** User-facing questions

You mentioned OSC 8 as a possibility, with an empty argument to terminate the hyperlink. OSC 4, 5, 10-19 have their resetting counterparts a hundred above, instead of taking an empty argument. Following that pattern, OSC 108 for terminating might be more consistent. However, I'd personally rather go with your proposal: no different syntax or different OSC number, just the empty parameter (perhaps allowing the omission of the semicolon) to terminate the link. Also let's note that OSC ("operating system command") indeed sounds the right set of escape sequences to go with.

As you correctly pointed out, VTE is pretty much the only terminal emulator that prints garbage upon receiving an unknown OSC (bug 403130) so there's no real compatibility issue as long as we're the first to implement this feature :D

Do we want to highlight when not hovered? I'd say yes, probably with some light dotted underline. Otherwise, as opposed to URLs (where the displayed text follows a pattern like "http://blahblah") or numbers (which again follows a pattern, plus the extra info in the right-click menu is a really unimportant convenience feature), there would be absolutely nothing suggesting that the given text is a hyperlink, and whatever is printed by enhanced apps (such as patched apt) would remain pretty much undiscoverable.

Do we want to highlight (differently) on hover? I'd say yes, perhaps with a normal underline just as we do with "http://blahblah" text currently.

Do we want to automatically join adjacent links that are pointing to the same address, yet weren't printed in a single OSC 8 run? E.g. you print "foo" as a hyperlink pointing to "http://example.com" and much later, after drawing lots of other things elsewhere, you print "bar" next to it as a different link to the same address. This question only makes sense if we highlight on hover: do we highlight either "foo" or "bar" separately, or the entire "foobar"? I'd say let's automatically join them (although this is not what browsers do). This is much easier to implement (see later). Also this allows screen drawing libraries to keep their current screen drawing optimization strategies (if they ever implement this feature).

Do we want to allow multiple, overlapping links? E.g. a git commit ID could be two hyperlinks at once, one pointing to "http://git.example.com/browse/repo/?id=xxxxxxx", and another one could point to "shell:git show xxxxxxx" which would open a new gnome-terminal window. The right-click menu could then show both and let you choose. I'd say let's _not_ do this. This would overcomplicate things a lot for marginal benefits which can be achieved by printing additional words (links) by the app, and would also raise questions like what to do on right click. Also web browsers don't have anything like this. (Also, the "shell:" scheme sounds extremely dangerous, let's not have it.)

What if a cell belongs to an explicit link as well as an autodetected one at the same time? As mentioned in my previous comment, a cool new possibility introduced by this feature is to allow a text viewer to properly linkify a URL that's partially scrolled out. As such, the explicit link has to have priority (probably completely suppressing the other one).

Do we worry about malicious uses when the text printed is actually a URL and is a link to a different URL? I don't think we should worry, browsers don't either. If opening an unintended webpage can have bad consequences (e.g. data loss) then that site already suffers from CSRF-like bugs that can be exploited anyways. Also, it's a perfectly legitimate use case, as just mentioned with partially scrolled-out URLs.

What to do with unrecognized schemes? Following what browsers do, the simplest is if vte just doesn't care, and gnome-terminal probably brings up a popup box when ctrl-clicked. Knowing in advance not to treat them as links would complicate the design a lot for hardly any benefits (gnome-terminal would need to register either the recognized patterns (regexes?) or a callback validator, ouch).

Do we want to show the target of the URL on hover (or Ctrl+hover), e.g. in a bottom corner as browsers do? It raises too many questions like what color, font etc. to use there, how to trim, should we hardwire these or have an API, or should vte ask gnome-terminal to do it... I'd say let's not do this in the first round, but keep the idea open for further brainstorming.


** Technical questions

In the CellAttr object we're already using 63 out of 64 bits. Since we need some external info here (the link target itself), I don't want to waste the last bit on this feature, even though I'd prefer to render these cells specially (with the dotted underline). Presence/absence of a hyperlink would solely be denoted by the existence/lack of hyperlink data.

In the "streams" (once the link is scrolled out), at this moment I'm in favor of inlining the link target in attr_stream. There's absolutely no need to have fixed-length records there or even to align them. Each current record would be followed by two bytes (link length as unsigned short), followed by the link itself (most of the time just two 0 bytes for the length and no link - compresses great). It's much easier to read the URL if its length is known in advance, rather than being 0-terminated. One drawback of this approach is that if some attribute changes within the link, the same link is stored again. This sounds a really nontypical scenario to me (not a scenario to optimize for), and compression would probably do a reasonably good job filtering out these duplicates. That is, I'm happy to live with this drawback (I find it better than introducing a 4th stream for the URLs which wouldn't have this issue, but would have other drawbacks as mentioned in my earlier comment).

In the "ring" (the onscreen contents) I'd create a pool of currently onscreen URLs, each having a 2-byte ID which ID would be stored for each cell (obviously 0 meaning the cell is non-link). Whenever an OSC 8 is encountered, the URL's ID would be looked up, or a new one created. As subsequent characters are printed, the ID is already known and no lookup is required, this ID would be stored just as any other attribute. Garbage collection could happen whenver a line is scrolled out (or on every Nth line getting scrolled out), or if we're short of free IDs, or maybe whenever a new ID is requested.

Let's not forget that malicious apps should not be able to cause DoS, OOM, out of disk... Right now the memory fingerprint is pretty low, and disk usage is also pretty well limited by the number of scrollback lines. This new feature would allow an attacker to place magnitudes more data in the scrollback files. There's no sane way to prevent this, but I'd have a safety cap both on the length of each URL, and on the max number of different target URLs the onscreen contents can have. Both limits would be under 64k so that 2-byte short integers can be used for the length and ID. (If there are too many different URLs on the onscreen contents, encountering yet another in an OSC 8 would treat it as if it was an empty, terminating OSC 8.) Let's say the limit would be 2000 for both. This means vte potentially using up to 4MB of RAM (plus a bit of growing memory fragmentation - unless we allocate for the longest supported URL all the time and never free), this is not that bad at all. What's way worse is what can happen to the file descriptors, not sure if there's anything we can do there.


> How about an ordered list of spans (row and start/stop column), together
> with a map from row number to first span in that row?

Due to the nature of terminal emulation (randomly positioning the cursor, overwriting cells in the middle of a previous link, inserting/removing characters in a way that shifts the rest of the line, etc.) I'd rather avoid thinking in terms of "spans" as much as possible. In the "ring", which represents the active onscreen contents, it'd be a nightmare to maintain it.

It's okay to translate to spans for the "streams" that represent the scrollback buffer (that's what we already do there with the continuous runs of identical attributes).


> I'll happily write the patches for apt and less myself.  (For less, it just
> needs to pass through that escape sequence and know that it doesn't take up
> any width.)

It's more complicated with "less", you'd have to emit the starting sequence even if you only start displaying the link text from its middle, etc. This is what "less -R" already does with the colors, so it shouldn't be that hard.


> > how to detect whether the terminal emulator supports this feature
> 
> In any case, it seems worth implementing this first, which would allow
> applications to experiment with it, and worrying about detection methods in
> parallel.

Yeah I guess so.

Comment 4 Egmont Koblinger 2017-03-08 22:18:21 UTC

typo:

Do we want to allow multiple, overlapping links? [...] and would also raise questions like what to do on ***ctrl***-click.

Comment 5 Egmont Koblinger 2017-03-08 22:19:30 UTC

Christian, I'd really love to hear your opinion :)

Comment 6 Josh Triplett 2017-03-09 01:46:53 UTC

(In reply to Egmont Koblinger from comment #3)
> (In reply to Josh Triplett from comment #2)
> 
> I was thinking about it all day :) More and more sounds like a cool feature,
> and also the technical challenge makes it more interesting to me than most
> other vte/gnome-terminal bugs :) If I had time, probably this is the next
> issue I'd pick right now. Hope this feeling of mine won't change until I
> actually have time (which, unfortunately, will probably not happen in the
> near future).
> 
> Random thoughts (some are really insignificant):
> 
> 
> ** User-facing questions
> 
> You mentioned OSC 8 as a possibility, with an empty argument to terminate
> the hyperlink. OSC 4, 5, 10-19 have their resetting counterparts a hundred
> above, instead of taking an empty argument. Following that pattern, OSC 108
> for terminating might be more consistent. However, I'd personally rather go
> with your proposal: no different syntax or different OSC number, just the
> empty parameter (perhaps allowing the omission of the semicolon) to
> terminate the link. Also let's note that OSC ("operating system command")
> indeed sounds the right set of escape sequences to go with.

Omitting the semicolon would introduce an inconsistency with the rest of the
OSC family of sequences, and might complicate someone's parser; I think I'd
rather specify it more strictly.

> As you correctly pointed out, VTE is pretty much the only terminal emulator
> that prints garbage upon receiving an unknown OSC (bug 403130) so there's no
> real compatibility issue as long as we're the first to implement this
> feature :D

Sounds reasonable.

> Do we want to highlight when not hovered? I'd say yes, probably with some
> light dotted underline. Otherwise, as opposed to URLs (where the displayed
> text follows a pattern like "http://blahblah") or numbers (which again
> follows a pattern, plus the extra info in the right-click menu is a really
> unimportant convenience feature), there would be absolutely nothing
> suggesting that the given text is a hyperlink, and whatever is printed by
> enhanced apps (such as patched apt) would remain pretty much undiscoverable.

As long as doing this doesn't conflict with app formatting (for instance, the
app may want to emit its own color or formatting codes for its hyperlinks).
But sure, a subtle underline seems reasonable.

> Do we want to highlight (differently) on hover? I'd say yes, perhaps with a
> normal underline just as we do with "http://blahblah" text currently.

Yes, I think it should work exactly like the existing hyperlink support on hover.

> Do we want to automatically join adjacent links that are pointing to the
> same address, yet weren't printed in a single OSC 8 run? E.g. you print
> "foo" as a hyperlink pointing to "http://example.com" and much later, after
> drawing lots of other things elsewhere, you print "bar" next to it as a
> different link to the same address. This question only makes sense if we
> highlight on hover: do we highlight either "foo" or "bar" separately, or the
> entire "foobar"? I'd say let's automatically join them (although this is not
> what browsers do). This is much easier to implement (see later). Also this
> allows screen drawing libraries to keep their current screen drawing
> optimization strategies (if they ever implement this feature).

I agree: you don't need to distinguish between adjacent links with identical
target URLs.

> Do we want to allow multiple, overlapping links? E.g. a git commit ID could
> be two hyperlinks at once, one pointing to
> "http://git.example.com/browse/repo/?id=xxxxxxx", and another one could
> point to "shell:git show xxxxxxx" which would open a new gnome-terminal
> window. The right-click menu could then show both and let you choose. I'd
> say let's _not_ do this. This would overcomplicate things a lot for marginal
> benefits which can be achieved by printing additional words (links) by the
> app, and would also raise questions like what to do on right click. Also web
> browsers don't have anything like this.

Right; any given cell can only hyperlink to one URL, and specifying another
overrides the first.  (Combining that with the previous point, that does mean
if you print some hyperlinked text and then go back and print more text in the
middle with either a different hyperlink or no hyperlink, you now have two
separate regions for the previous hyperlink.)

> (Also, the "shell:" scheme sounds
> extremely dangerous, let's not have it.)

Agreed.

> What if a cell belongs to an explicit link as well as an autodetected one at
> the same time? As mentioned in my previous comment, a cool new possibility
> introduced by this feature is to allow a text viewer to properly linkify a
> URL that's partially scrolled out. As such, the explicit link has to have
> priority (probably completely suppressing the other one).

Agreed; suppress any autodetected link if the application provides an explicit
one.

> Do we worry about malicious uses when the text printed is actually a URL and
> is a link to a different URL? I don't think we should worry, browsers don't
> either. If opening an unintended webpage can have bad consequences (e.g.
> data loss) then that site already suffers from CSRF-like bugs that can be
> exploited anyways. Also, it's a perfectly legitimate use case, as just
> mentioned with partially scrolled-out URLs.

I don't think you should worry about that, and in particular I don't think
gnome-terminal should attempt to second-guess the application.

> What to do with unrecognized schemes? Following what browsers do, the
> simplest is if vte just doesn't care, and gnome-terminal probably brings up
> a popup box when ctrl-clicked. Knowing in advance not to treat them as links
> would complicate the design a lot for hardly any benefits (gnome-terminal
> would need to register either the recognized patterns (regexes?) or a
> callback validator, ouch).

I'd definitely agree that you should just treat all hyperlinks as hyperlinks,
and not try to un-linkify those without a sensible target.  The tooltip and
context menu (see below) could provide a message about the unhandled scheme
instead, and ctrl-clicking can then just do nothing.

> Do we want to show the target of the URL on hover (or Ctrl+hover), e.g. in a
> bottom corner as browsers do? It raises too many questions like what color,
> font etc. to use there, how to trim, should we hardwire these or have an
> API, or should vte ask gnome-terminal to do it... I'd say let's not do this
> in the first round, but keep the idea open for further brainstorming.

I would suggest showing it as a standard tooltip on hover (using the standard
tooltip color and formatting), and showing it in the context menu.

> ** Technical questions
> 
> In the CellAttr object we're already using 63 out of 64 bits. Since we need
> some external info here (the link target itself), I don't want to waste the
> last bit on this feature, even though I'd prefer to render these cells
> specially (with the dotted underline). Presence/absence of a hyperlink would
> solely be denoted by the existence/lack of hyperlink data.

You know the internals best, so it's your call.  You'd need the link target to
know whether to join adjacent linked cells anyway, if you don't use a
span-based approach, and spans *would* make it harder to apply overwrites in
the middle.

> In the "streams" (once the link is scrolled out), at this moment I'm in
> favor of inlining the link target in attr_stream. There's absolutely no need
> to have fixed-length records there or even to align them. Each current
> record would be followed by two bytes (link length as unsigned short),
> followed by the link itself (most of the time just two 0 bytes for the
> length and no link - compresses great). It's much easier to read the URL if
> its length is known in advance, rather than being 0-terminated. One drawback
> of this approach is that if some attribute changes within the link, the same
> link is stored again. This sounds a really nontypical scenario to me (not a
> scenario to optimize for), and compression would probably do a reasonably
> good job filtering out these duplicates. That is, I'm happy to live with
> this drawback (I find it better than introducing a 4th stream for the URLs
> which wouldn't have this issue, but would have other drawbacks as mentioned
> in my earlier comment).

Given that you don't need to worry about any cross-version compatibility in the
streams format, that seems fine.  (If you needed to worry about cross-version
compatibility, I'd suggest using the standard variable-length integer encoding:
1 or more bytes consisting of 7 bits of number and the high bit indicating more
bytes to follow.)

> In the "ring" (the onscreen contents) I'd create a pool of currently
> onscreen URLs, each having a 2-byte ID which ID would be stored for each
> cell (obviously 0 meaning the cell is non-link). Whenever an OSC 8 is
> encountered, the URL's ID would be looked up, or a new one created. As
> subsequent characters are printed, the ID is already known and no lookup is
> required, this ID would be stored just as any other attribute. Garbage
> collection could happen whenver a line is scrolled out (or on every Nth line
> getting scrolled out), or if we're short of free IDs, or maybe whenever a
> new ID is requested.

That sounds fine for a first pass.  You may want to profile fast terminal
rendering/scrolling/output to see if this change has an impact; if it doesn't,
no sense prematurely optimizing.

> > How about an ordered list of spans (row and start/stop column), together
> > with a map from row number to first span in that row?
> 
> Due to the nature of terminal emulation (randomly positioning the cursor,
> overwriting cells in the middle of a previous link, inserting/removing
> characters in a way that shifts the rest of the line, etc.) I'd rather avoid
> thinking in terms of "spans" as much as possible. In the "ring", which
> represents the active onscreen contents, it'd be a nightmare to maintain it.

Fair enough.  A pointer for each active cell doesn't seem *that* awful.

> > I'll happily write the patches for apt and less myself.  (For less, it just
> > needs to pass through that escape sequence and know that it doesn't take up
> > any width.)
> 
> It's more complicated with "less", you'd have to emit the starting sequence
> even if you only start displaying the link text from its middle, etc. This
> is what "less -R" already does with the colors, so it shouldn't be that hard.

Right; I'd planned to leverage the existing "less -R" support.

> > > how to detect whether the terminal emulator supports this feature
> > 
> > In any case, it seems worth implementing this first, which would allow
> > applications to experiment with it, and worrying about detection methods in
> > parallel.
> 
> Yeah I guess so.

Comment 7 Egmont Koblinger 2017-03-09 09:08:08 UTC

(In reply to Josh Triplett from comment #6)

I love seeing how much we're on the same page :)

> As long as doing this doesn't conflict with app formatting (for instance, the
> app may want to emit its own color or formatting codes for its hyperlinks).
> But sure, a subtle underline seems reasonable.

Obviously I meant to apply this dotted underline on top of all other app formatting, not instead of. Even the explicit underline (\e[4m) is drawn at a different height than the auto-link underline so that wouldn't conflict either.

> I would suggest showing it as a standard tooltip on hover (using the standard
> tooltip color and formatting), and showing it in the context menu.

Standard tooltip didn't occur to me, sounds good. In the context menu we'd need to truncate long URLs, otherwise the entire menu could look awful.

> Given that you don't need to worry about any cross-version compatibility in
> the
> streams format, that seems fine.

Yup, no cross-version and no cross-architecture (no need to worry about byte order either).

> (If you needed to worry about cross-version
> compatibility, I'd suggest using the standard variable-length integer
> encoding:
> 1 or more bytes consisting of 7 bits of number and the high bit indicating
> more
> bytes to follow.)

I've seen this encoding and I like it (reminds me of UTF-8 -- or is it actually the same? :) whatever) but it didn't occur to me to use it here. Indeed it's simpler to go with a fixed length in our case.

Comment 8 Egmont Koblinger 2017-03-09 09:33:46 UTC

I've started wondering whether the dotted underline is actually a good idea. Isn't it better if the app decides how it wants for format hyperlinks?

Anyway, we could always introduce a profile prefs for this.

Comment 9 Egmont Koblinger 2017-03-09 20:44:12 UTC

What to do on \e[@ (insert blank character) if a hyperlink is active, shall the new cell be hyperlinked or not? See bug 779817 :-)

Comment 10 Josh Triplett 2017-03-10 03:01:23 UTC

(In reply to Egmont Koblinger from comment #9)
> What to do on \e[@ (insert blank character) if a hyperlink is active, shall
> the new cell be hyperlinked or not? See bug 779817 :-)

..."meh". :)

Doesn't seem particularly critical.  Gut reaction: if you have a hyperlink "open" and haven't closed it, and you then emit a movement escape and print more characters, the hyperlink should remain "open", so the new characters hyperlink to the same target.  So, if you "insert" a character, the same thing should happen.

Conversely, if you don't have a hyperlink open, and you move to a location within an existing hyperlink and "insert" a character, the new character shouldn't have a hyperlink.

In my opinion, formatting depends on the current state, not on the state of the moved-to location.  Which seems consistent with VTE's current behavior.

Comment 11 Egmont Koblinger 2017-03-10 09:55:11 UTC

A few boring technical details for my future self:

- Not sure if we ever need to walk backwards in attr_stream. If we do, we can store the URL's length both before and after the URL. For 0-length this could be optimized to be stored just once, although that requires an extra branch, and probably 4 zeros compresses almost as good as 0 zeros.

- In the rare case when a frozen (stream) row is thawn and put back into the ring (this can only happen when the widget is enlarged), we need to allocate IDs for all the links encountered there, just as if we freshly received those links via escape sequences. Okay, it's easy.

- When a row from the stream is read solely in order to be displayed (the user has dragged back the scrollbar) (can't remember: is this also called "thawing"?), I really don't want to allocate IDs for the hyperlinks we see there. This would make garbage collection of them quite a bit harder (keep the union of ring-links and onscreen-links), plus waste memory for something that's already stored somewhere.

- We could give them the fake ID of -1 so that we see it's a URL and needs to be dotted-underlined, and read the row again to get the actual URL only when needed, that is, on a ctrl-click or right-click.

- Even better, we could assign them the fake IDs -1 and -2 alternating, so that within a row we can already see where a hyperlink terminates (even if it's adjacent to another hyperlink). This is useful for the solid underlining on hover.

- Stitching together links that span multiple rows (even at the boundary of the ring and stream) for hover underlining is going to be a painful tedious task. At this moment I can't think of any magic big hammer that would conveniently, automagically do this for us. Preferably we should also have some kind of caching (we should check how the regex highlighting is done, IIRC we have something like this there) so that we don't do CPU-expensive stuff when the mouse is moved to the next charcell of the same hyperlink.

Comment 12 Egmont Koblinger 2017-03-10 09:59:06 UTC

typo: probably 4 zeros compresses almost as good as *2* zeros.

Comment 13 Egmont Koblinger 2017-03-12 22:06:17 UTC

The issue that a malicious app can fill up /tmp still somewhat bothers me.

Currently if a malicious app wants to place as much data as possible in vte's file descriptors, it would output a few combining characters for each cell (we support at most 10 combining characters per cell, they are all at most 3 bytes each in UTF-8, so that's (along with the base char) at most 34 bytes uncompressed, and presumably they compress quite well), and a 16 byte record in attr_stream (8 for the offset and 8 for the attributes). That's 50 uncompressed bytes per cell in the worst case.

There's no standardized maximum for the URL length, but plenty of sites claim 2000-ish as the de facto maximum, and 2083 as the maximum supported by Internet Explorer.

Assuming we want to be able to store a URL of this length for each cell, scrollback data might consume up to 2140-ish bytes per cell in the worst case. That's a roughly 40x increase.

Given the defaults of gnome-terminal: a 80 wide terminal and 10'000 lines of scrollback, a malicious app can currently write about 40 MB to /tmp (that's a safe upper estimate I believe because it doesn't count for compression yet), whereas with the hyperlink feature it would grow to 1.6 GB (minus compression, the worst case ratio depends on whether we store the URI-encoded version or the decoded one, I don't know which one has the 2083-char limit in IE, let's not go into such details now). Or e.g. 160 wide terminal and 50'000 lines of scrollback => the previous numbers 10-fold.

I believe this is a problem, I believe it's safe to assume that /tmp has a free 40 or so megabytes on most systems, but not necessarily several gigabytes.

I was thinking about what we could do here. At this moment I have two ideas, and obviously I'm open for new ones.

I definitely wouldn't want to introduce much shorter limits, or limits that could locally turn into a bottleneck. E.g. for a monitoring utility I can easily imagine that it would output a screenful of lines, each containing multiple URLs that are each extremely long, and that's okay.

I also don't want to change the Prefs UI from specifying the number of lines to specifying the amount of memory, that's so user-unfriendly.

Idea #1: Throttle the writing speed to attr_stream. Each time a row is written with an unusually large amount of data going to attr_stream (e.g. more than 50 bytes per cell in average), wait a tiny little bit. This would remain unnoticeable at interactive use, yet slow down significantly the speed at which a malicious app fills up /tmp. An advantage is that it also offers a nice protection if the scrollback size is unlimited. A disadvantage is that eventually (if the malicious behavior remains unnoticed) it could still write tons of data.

Idea #2: Regularly check the size of attr_stream (e.g. fstat's st_blocks), and if it becomes larger than a reasonable safety cap (e.g. width * scrollback_lines * 50 bytes) then drop a few lines from the tail of the scrollback buffer, temporarily making it shorter than desired.

Note again that I'm talking about the worst case scenario when an attacker can send arbitrary data to gnome-terminal, and I do not want to introduce visible limitations for the normal regular usage patterns.

Comment 14 Josh Triplett 2017-03-13 03:28:31 UTC

I think it's exceedingly unlikely that legitimate applications would output many URLs per line with thousands of characters per URL.  I'd go with idea #2, but I'd suggest dropping just the URLs rather than complete lines of scrollback.  That way, you preserve the content, but in extreme cases discard the URLs and no longer treat some of the oldest links in scrollback as links.

Comment 15 Egmont Koblinger 2017-03-13 09:07:07 UTC

(In reply to Josh Triplett from comment #14)

> but I'd suggest dropping just the URLs rather than complete lines of
> scrollback.

If the links are stored in attr_stream (my current candidate design) then this is pretty much impossible to implement.

It's easy to implement though if we introduce a 4th stream for the links (as mentioned earlier) which is not that bad of a design after all. Maybe I'll change my mind and go for that.

Comment 16 Egmont Koblinger 2017-03-13 23:23:15 UTC

Created attachment 347891 [details] [review]
Demo implementation (vte bits)

Comment 17 Egmont Koblinger 2017-03-13 23:23:38 UTC

Created attachment 347892 [details] [review]
Demo implementation (gnome-terminal bits)

Comment 18 Egmont Koblinger 2017-03-13 23:36:09 UTC

I've created a quick demo implementation of this feature.

It's truly a resource-hog since for each cell there's always a 256 byte buffer allocated for the URL, there's no smart memory management at all. Similarly, the records of /tmp's attr_stream have also grown by 256 bytes each. The URLs here are zero-padded and hence compress quite well. Still, ugh...

I do not intend to build the proper solution on this. My only intent right now was to showcase the feature, and allow to progress with patches to less, apt and other apps.

Josh, let us know please if you have created such patches (I'm especially interested in less), please post links to mainstream bugreports, or even draft patches themselves if you feel like it's too early to contact mainstream.

In the current vte/g-t patch:

- OSC 8 as discussed above
- hyperlinks are underdotted
- no solid underline on mouseover (will be cumbersome to implement)
- URL length limited to 255 bytes
- no URI-decoding performed
- ctrl+click opens this hyperlink (and has priority over auto-links)
- right-click menu entries that show, open or copy the link
- of course no limit against malicious overuse of this feature

Comment 19 Egmont Koblinger 2017-03-13 23:45:46 UTC

Just for fun:

function osc8ls() {
  for i in *; do echo -e "\e]8;file://$PWD/$i\a$i\e]8;\a"; done
}

Comment 20 Josh Triplett 2017-03-14 06:23:25 UTC

I built libvte and gnome-terminal with that patch, and tried out the escape sequence (manually and with your sample osc8ls function), but didn't seem to get any hyperlinks.

Comment 21 Egmont Koblinger 2017-03-14 07:17:19 UTC

You should quit all gnome-terminal windows at once (otherwise the old process is asked to open a new window), or start up in two steps specifying an app-id as shown at https://wiki.gnome.org/Apps/Terminal/Debugging (without the gdb bits).

If still doesn't work, first try ./src/testvte within vte's source, it should underline and print the URL to stderr on a right-click. Does this work?

Comment 22 Egmont Koblinger 2017-03-14 07:21:07 UTC

... and in case it _still_ doesn't work: Does osc8ls print "garbage" (literal "8;file:///blah") or does it work like "ls -1 --color=never"?

Comment 23 Egmont Koblinger 2017-03-14 07:22:27 UTC

One more TODO that I have totally forgotten about so far: Change the mouse pointer shape on hover.

Comment 24 Josh Triplett 2017-03-14 08:12:39 UTC

(In reply to Egmont Koblinger from comment #21)
> You should quit all gnome-terminal windows at once (otherwise the old
> process is asked to open a new window), or start up in two steps specifying
> an app-id as shown at https://wiki.gnome.org/Apps/Terminal/Debugging
> (without the gdb bits).

I made sure I was running the new gnome-terminal.

> If still doesn't work, first try ./src/testvte within vte's source, it
> should underline and print the URL to stderr on a right-click. Does this
> work?

It prints:

Extra regex didn't match
Not hyperlink

(In reply to Egmont Koblinger from comment #22)
> ... and in case it _still_ doesn't work: Does osc8ls print "garbage"
> (literal "8;file:///blah") or does it work like "ls -1 --color=never"?

It doesn't print garbage; it just shows the filenames, like ls -1 does.

Comment 25 Egmont Koblinger 2017-03-14 09:43:37 UTC

Created attachment 347904 [details] [review]
Demo implementation (vte bits, v2)

Fix: \e[m or \e[0m no longer terminates a hyperlink.

(It's unlikely to fix the bug why this patch doesn't work at all for Josh.)

Comment 26 Egmont Koblinger 2017-03-15 08:20:05 UTC

Josh, you're doing everything fine, I have no idea why it doesn't work you, it works fine for me. Must be a nasty bug in my patch, like a memory overrun. Would you mind helping me with debugging? Are you okay doing that in private email (I don't want to pollute this bugzilla page)?

Comment 27 Josh Triplett 2017-03-15 10:52:27 UTC

(In reply to Egmont Koblinger from comment #26)
> Josh, you're doing everything fine, I have no idea why it doesn't work you,
> it works fine for me. Must be a nasty bug in my patch, like a memory
> overrun. Would you mind helping me with debugging? Are you okay doing that
> in private email (I don't want to pollute this bugzilla page)?

Sure, happy to do so via email, or coordinate a time to do so interactively via IRC.

Comment 28 Josh Triplett 2017-03-15 21:33:56 UTC

(In reply to Josh Triplett from comment #27)
> (In reply to Egmont Koblinger from comment #26)
> > Josh, you're doing everything fine, I have no idea why it doesn't work you,
> > it works fine for me. Must be a nasty bug in my patch, like a memory
> > overrun. Would you mind helping me with debugging? Are you okay doing that
> > in private email (I don't want to pollute this bugzilla page)?
> 
> Sure, happy to do so via email, or coordinate a time to do so interactively
> via IRC.

I received your previous mail, and sent you a reply; however, it doesn't look like that reply went through, since you followed up saying that you hadn't seen a reply.  (Check your spam folder perhaps?)

I bounced you a second copy of that reply.

Comment 29 Egmont Koblinger 2017-03-16 20:59:17 UTC

Note: It's working now for Josh, he was hit by bug 780163.

Comment 30 Egmont Koblinger 2017-03-16 23:51:12 UTC

A few more cases where this feature could become useful (subject to the relevant apps adding support too):

- tmux and URLs that are partially scrolled out vertically (geez we haven't even fixed that in vte yet: bug 754936)

- tmux and vertical splits (panes next to each other)

- formatted man pages which wrap URLs by adding a hyphen and a huge whitespace margin on both sides

Comment 31 Egmont Koblinger 2017-03-17 00:08:04 UTC

iTerm2 already has such a feature request:
https://gitlab.com/gnachman/iterm2/issues/5158

Comment 32 George Nachman 2017-03-17 05:57:47 UTC

>  different syntax or different OSC number, just the empty parameter (perhaps allowing the omission of the semicolon)

This is a bad idea. This is sometimes used as a synonym for OSC 0.

Comment 33 Egmont Koblinger 2017-03-17 08:23:04 UTC

>>  different syntax or different OSC number, just the empty parameter (perhaps
>> allowing the omission of the semicolon)
> This is a bad idea. This is sometimes used as a synonym for OSC 0.

Do you mean OSC 8 is a synonym of OSC 0? Or what exactly? The bits you quoted from me confuse me and I'm not sure what exactly you're referring to. Also, do you have any references?

We can go for OSC 9 then. I like though that the number 8 looks a bit like a link :)

1337 is the standard for iTerm2-extensions, right? I'd like to make this a bit more popular feature and so I'd prefer to have its own dedicated number. (I'm adding another relevant comment to iTerm2's tracker.)

Comment 34 Egmont Koblinger 2017-03-17 21:32:16 UTC

Just a few minutes ago I actually first used "osc8ls" and then ctrl+click to quickly check which picture is the one I'm looking for in a particular directory, this feature saved me perhaps like 5 seconds, yay! :D

As a followup feature, gnome-terminal could actually show a preview of the file's contents (e.g. a thumbnail of the picture) as a tooltip.

---

In the mean time I'm still heavily thinking about the data structure to use in the streams. All the ideas so far have downsides that I really don't like.

- Inlining the links in attr_stream has two major drawbacks:

- If there's an attr change within a link, the URL is stored again (and again...). This becomes a real problem if the link text is CJK, since there there's tons of back-n-forth between single and double wide characters, each being an attr change. (We should actually test how good zlib compresses in this case.)

- We can't drop the links from the tail of the stream, in case we encounter way too many of them. We'd need to drop all the attributes, but since some attributes (like width of TAB or CJK) are essential for the text, we'd need to drop text as well.

- Having a link_stream in the way I was thinking about it so far (that is: it would go hand in hand with attr_stream, records of row_stream containing pointers to both) would double row_stream's size and would probably make rewrapping (or perhaps even normal data processing) noticeably slower. (The current demo patch does make vte 2-3x slower than before! It's important that the final versrion doesn't.)

My current idea is:

Introduce a 4th stream that contains links only. Pointers to them would not go to row_stream though but to attr_stream only when there's actually a link starting. Multiple adjacent attr_stream records that point to the same URL would be optimized to point to the same record in link_stream.

(It would also be possible to go for some brand new kind of data structure instead of a stream, like a file-backed hashmap with refcounting, or sqlite, or anything like these. I don't want to go in this direction, I'd much rather reuse existing components of vte. Plus, no new dependencies. If glib happens to have anything like this then that's great, but I really doubt it.)

Comment 35 George Nachman 2017-03-17 23:33:59 UTC

> Do you mean OSC 8 is a synonym of OSC 0? Or what exactly? The bits you quoted from me confuse me and I'm not sure what exactly you're referring to. Also, do you have any references?

OSC 8 is fine. I thought you were originally proposing an OSC without a number (\e ] ; blah \a)

Comment 36 Egmont Koblinger 2017-03-18 09:20:58 UTC

(In reply to George Nachman from comment #35)

> I thought you were originally proposing an OSC without a number
> (\e ] ; blah \a)

No way :D

Comment 37 Egmont Koblinger 2017-03-18 22:50:49 UTC

At this moment I'm pretty sure I'll go for inlining the URLs in attr_stream. This feels to be the one that requires the least change to the code and introduces the least additional resources. Moreover:

>   - If there's an attr change within a link, the URL is stored again (and
> again...). This becomes a real problem if the link text is CJK, since there
> there's tons of back-n-forth between single and double wide characters, each
> being an attr change. (We should actually test how good zlib compresses in
> this case.)

I've tested it now. (Actually ran "gzip -1" but that's pretty much the same as zlib's "-1" mode, apart from the headers, right?)

I took a 64k uncompressed, unenctrypted segment of attr_stream (using pre-compression, pre-encryption vte version 0.38), and added a few URLs between record boundaries here and there. Then added a 200-char long real-world URL somewhere, and interestingly, the gzipped size also grew by exactly 200 bytes. (I hoped for something better here, but it's irrelevant now.)

Then I added this very same URL again, 10 more times, one by one, after skipping an 8-byte block every time. During these, the gzipped size grew by between 2-13 bytes in each step, 5.7 bytes per step in average. That is, the 2000 byte increase in uncompressed size resulted in 57 byte increase after compression.

I repeated the same experiment with a much shorter URL, namely the URL of this bug (49 bytes), inserting at a different position of the file. At the first insertion the gzipped size grew by 9 bytes, and then by 4 bytes in average on the next ten repetitions.

In other words, gzip's "-1" mode already does a darn good job in finding nearby duplicates, we can absolutely safely rely on this. CJK links, where the visible text has alternating single and double wide characters, is a non-issue.

The URL is obviously repeated when a 64k boundary is crossed in attr_stream, but that's relatively rare, again nothing to be worried about.

Dropping from the tail (in case of malicious overuse) is still going to be uglier than in the other possible design, but it's not the use case we should build around.

Comment 38 Egmont Koblinger 2017-03-18 22:59:06 UTC

> At the first insertion the gzipped size grew by 9 bytes

Figured out I had already inserted this URL to the file. Makes sense now :)

Comment 39 Egmont Koblinger 2017-03-18 23:07:29 UTC

One more thing I'm pondering about:

Should the URL pool (id -> url mapping) be per-terminal or per-screen?

Per-terminal requires an ugly design. The ring is per-screen, so the pool should reside out of the ring, probably should be a new object. The ring needs to know its pool, and the pool needs to know both of its rings in order to GC. Quite ugly.

Per-screen is much easier to implement and the pool can mostly remain the private business of the ring. This one feels the right design to me. There's one tricky case though: Switching between the screens. This operation keeps the colors and other attributes, so it would make sense to keep the currently active URL as well. Its ID, however, obviously cannot be kept. It's quite easy to work it around, and I'm also fine with just closing the OSC 8 when switching between screens, but it's important not to forget about this corner case and not to let the current attributes live on with an unmodified active link ID.

Comment 40 Josh Triplett 2017-03-18 23:11:40 UTC

(In reply to Egmont Koblinger from comment #39)
> I'm also fine with just closing the OSC 8 when switching between screens

Closing an OSC 8 link when switching screens seems like a bug.  Switch tabs at the wrong time while an app outputs text, and only half the link gets linked.

Comment 41 Egmont Koblinger 2017-03-18 23:13:52 UTC

Don't worry :) By screens I mean the "normal" vs. "alternate" (e.g. the one "less" switches to) screen of a particular terminal.

Comment 42 Josh Triplett 2017-03-18 23:17:43 UTC

(In reply to Egmont Koblinger from comment #41)
> Don't worry :) By screens I mean the "normal" vs. "alternate" (e.g. the one
> "less" switches to) screen of a particular terminal.

Oh, I see.  Yeah, if the application switches to the alternate screen, closing the open link seems like less of a problem.

Comment 43 Egmont Koblinger 2017-03-19 00:17:27 UTC

See also https://www.mail-archive.com/kragen-tol@canonical.org/msg00304.html.

Comment 44 Egmont Koblinger 2017-03-19 09:47:20 UTC

Note: that's just some random guy using his mailing list kinda like a blog. He has nothing to do with Canonical (Ubuntu).

I love the use case he brings up, and I really dislike his choice of escape sequence.

Comment 45 Egmont Koblinger 2017-03-19 12:32:47 UTC

Created attachment 348257 [details]
Test file

A test file with tons of links. Just cat it and play with it :)

(Note: underscores disappear from gnome-terminal's context menu because they're interpreted as mnemonics. Ignore this problem for now.)

Comment 46 Egmont Koblinger 2017-03-19 13:33:27 UTC

I guess I'm gonna go for allowing ~4000 links at a time (per screen).

On one hand: As seen in the demo patch, the 8-byte limit for VteCellAttr is practically not needed, it's easy to extend it, and the most convenient way of implementing this whole feature is to extend it with the URL attribute. That field will need to be overloaded, though. In streams it'll contain the length of the URL, whereas in the ring it'll be a unique at a time ID.

Following the general recommendation and IE's limit, I'll limit URLs to 2083 bytes, which requires 12 bits to encode their length. Wishing to have the same structure and bit layout in the rings, these 12 bits give us ~4090 possible IDs (there might be a few special reserved values).

On the other hand: I was thinking about a quite extreme, yet valid use where there are tons of URLs. One I could think of is Midnight Commander >= 4.8.15 with brief file listing and 9 columns per panel. This, with a fullscreen terminal, on my particular setup and my favorite font, lists ~1100 files at a time. (Of course we'd still need to implement the URL-ify feature in mc, that's another story.) So 4000 feels like a reasonable upper limit. If someone goes more extreme than this and requests, we can add an additional bit or two, but I find it unlikely to happen.

For the time being, it means that VteCellAttr will grow from 8 to 10 bytes, but we'll have a spare of 5 bits without further growing the size.

Comment 47 Egmont Koblinger 2017-03-20 07:11:18 UTC

Geez, iterm2's developer has just implemented this feature.

I got super excited so I want to do this as fast as I can, but it'll take a week or two. Then we can move forward with sending out feature requests to various apps.

Comment 48 Christian Persch 2017-03-20 21:41:00 UTC

I have only skimmed this bug since it's so big already, so I apologise if the following remarks have already been addressed.

* First let me say that I've wanted this for a long time, too :-)

* For the storage problem, bug 729204 (SIXEL support) also has the requirement of storing (even more vast) quantities of data. If possible, the design should accomodate both in just one extra FD, and applying resource limits (settable by API) to them as one.

* Enlarging the cell size by another 64bit wouldn't be too bad, IMHO.

* The solution to the 'malicious use' problem should be the resource limits (possibly with a callout that allows the containing app (gnome-terminal) to ask the user if use exceeds some limits).

* As for the concern about the amount of extra data this could use even in the legimitate use case (e.g. in a linkifying ls(1)), one way to reduce storage need would be to additionally introduce OSC sequences for setting a 'base URI' and then have another OSC sequence for specifying URIs that are relative to that (current) base URI. (And then also perhaps one variant that takes file paths instead of URIs, so that saves the 'file://' part.)

* In order for this to gain widespread adoption (e.g. in coreutils), I think we need to add these sequences as capabilities to terminfo, and therefore need to solve the terminfo problem (refer bug 778958.)

* While this seems fine for non- and semi-interactive programmes (ls, find, etc), I'm not sure this works as well for fully interactive programmes (ncurses/slang based, and things like w3m/elinks). Those replace characters at specific positions all the time, and there probably isn't a good place to emit these OSC sequences. They however do know where on screen the 'links' are, so IMHO an extension of the mouse protocol would be better suited for them, ie on receiving a mouse move event they could reply with a sequence that tells us whether or not there's link at this position, and a way for vte to query the app for the link URI.

Comment 49 Egmont Koblinger 2017-03-20 22:16:10 UTC

(In reply to Christian Persch from comment #48)

> * First let me say that I've wanted this for a long time, too :-)

Yay! I was about to ask you whether you're conceptually happy with this feature, just didn't want to bother you before the 3.24 release :-)

> * For the storage problem, bug 729204 (SIXEL support)

I haven't taken a closer look at that one yet.

> * Enlarging the cell size by another 64bit wouldn't be too bad, IMHO.

I have just done some speed experiments. The refactoring that gets rid of the union and the VteInt* structures (and hence the 64bit limit) has no performance implications. Adding an extra 2 or 4 bytes (with attribute packed) slows down cat'ing my standard test file from 4.9s to 5.9s-ish. Adding an extra 8 bytes slows it down to 5.2s-ish only (memory operations obviously become faster than with 8+2 or 8+4 bytes, although the 64k buffer of the stream gets filled up quicker and so there's more compression and more disk operations going on).

I might end up going for two parallel structures: 16 bytes for in-memory, and 10 bytes packed for the stream. Or just 16 bytes. Or packed 10 bytes everywhere, the 4.9 -> 5.9 slowdown isn't terribly bad either. Whatever. It's pretty unimportant and we can always fine-tune later on.

> * The solution to the 'malicious use' problem should be the resource limits
> (possibly with a callout that allows the containing app (gnome-terminal) to
> ask the user if use exceeds some limits).

I don't want to overcomplicate, and I don't want to implement and maintain UI boxes that practically never show up to the user and hence practically never tested.

I'm thinking about a really poor man's throttling: record the timestamp when the 1st URL is written out, cumulate the overall length of all written URLs in a variable, and when a certain overall URL length (e.g. 1000 bytes) is reached check the clock again, and insert a bit of sleep if required to ensure that a certain amount of time (e.g. 0.01s) has elapsed. We can improve on this later on, but it's hopefully good enough for a start. I'm not familiar with sophisticated throttling algorithms and won't have time to study them (although I'd love to learn about this topic).

> additionally introduce OSC sequences for setting a
> 'base URI' and then have another OSC sequence for specifying URIs that are
> relative to that (current) base URI. (And then also perhaps one variant that
> takes file paths instead of URIs, so that saves the 'file://' part.)

Let's not do these, let's keep the interface (escape sequences) as simple (and by the way: as stateless) as possible. Whatever extension one would introduce would need to be supported by all terminal emulators that care about this feature (iterm2 and vte for now, hopefully more to join soon), and all other kinds of apps that need to parse these sequences (less, tmux/screen, ...).

> * While this seems fine for non- and semi-interactive programmes (ls, find,
> etc), I'm not sure this works as well for fully interactive programmes
> (ncurses/slang based, and things like w3m/elinks). Those replace characters
> at specific positions all the time, and there probably isn't a good place to
> emit these OSC sequences.

I tend to think about this new feature as something analoguous to the existing attributes like colors, bold, italic etc. It's just one more of them. These apps move the cursor around crazily and update tiny bits of data here and there, but always have to take care of picking the right color and attributes. If they can do it properly, doing the same with the URLs shouldn't cause any problem for them.

> They however do know where on screen the 'links'
> are, so IMHO an extension of the mouse protocol would be better suited for
> them, ie on receiving a mouse move event they could reply with a sequence
> that tells us whether or not there's link at this position, and a way for
> vte to query the app for the link URI.

I don't like the asynchronous nature of this approach, let alone that we can't underline properly etc. I'd much rather stay with the current design (and don't forget: iterm2 has already implemented this), apps telling us upfront where links are and where they point to.

Comment 50 Egmont Koblinger 2017-03-20 22:32:19 UTC

About URI encoding:

This concept has always been somewhat unclear to me. Usually it's the container that knows how to escape the data it embeds, e.g. in a C source file some chars need to be prefixed with a \, in HTML there's the &...; notation, and so on. URI is different: here the URI itself wishes to encode itself in a way that's safe to embed in pretty much every context.

Special chars in filenames do need to be %-escaped, otherwise there's no way to denote a filename containing a '\a' or '\e' character (OSC terminators). In turn, a literal '%' in a filename must be presented as "%25" in OSC 8 - just as with OSC 6 and OSC 7, done by vte-2.91.sh's __vte_urlencode(). Utilities like smart "ls", "find" etc. will also need to perform this escaping.

As seen in the test file I've attached, currently things Just Work in patched vte/g-t: If OSC 8 contains a file://...%25... then it opens the text file that has a '%' in its name. Also, web URLs work with raw UTF-8 as well as properly URI-encoded UTF-8 characters. If vte did some kind of encoding/decoding, gnome-terminal would need to undo that before opening the URI.

** As such, I believe the right thing for us to do is _not_ to any URI encoding/decoding whatsoever, just remember them and pass them to the container app unchanged. **

What we might do in a future version is URI-decode them solely for display purposes, e.g. if we show the target URI on the UI, so that we reconstruct e.g. the letter "Á" from "%C3%81".

Something I'm unsure about: What to do, vs. what does vte actually do if raw 8-bit bytes are encountered which are either not valid UTF-8, or the terminal's encoding is not UTF-8? Is there maybe some conversion going on that shouldn't happen? E.g. with OSC 7, if I emit it with a raw Latin-1 accent, I'm _not_ taken to that Latin-1 directory in a new tab. Could be a bug. I won't care about it for this feature, I'm just copy-pasting the parsing bits in vteseq from OSC 6/7. If there's a bug with these, we'll address later on in a separate bugreport. Anyways, it's just one of the reasons why raw 8-bit bytes should always be URI-encoded.

Comment 51 George Nachman 2017-03-20 22:54:15 UTC

What character set do we expect URIs to be in? The character set of the session in general? UTF-8? ASCII? I've been assuming UTF-8 because that's the way of the future, but perhaps I'm too optimistic. It should be specified, anyway.

Comment 52 Egmont Koblinger 2017-03-20 23:15:26 UTC

> What character set do we expect URIs to be in?

Ideally, all the bytes with the high bit set should be %-escaped before they even arrive to the terminal emulator.

URLs of http(s) pages are always encoded in UTF-8, and then URI-encoded. E.g. the wiki page of the letter Á should appear within an OSC 8 as https://en.wikipedia.org/wiki/%C3%81

For local files, with the Unix filename semantics (filenames are sequence of bytes, interpretation is up to whoever displays them) again I think the right thing to do for the app that prints them is to percent-escape each byte that has the high bit set, and not care about encodings at all. This should be enough to open the referred file. If you want to display its name, however, I believe you should be going with the terminal's encoding.

If bytes with high bits occur within OSC 8: I don't know what's right behavior. Ideally this shouldn't occur, but it surely will. Maybe the desired behavior differs for the "http://" and "file://" schemes, leave them as-is at "file://" so that blindly copying a filename byte-by-byte keeps working, but convert to UTF-8 for "http://" so that a visual letter Á actually takes you to its wiki page??? Really not sure at all :) I don't care too much, and I definitely won't implement scheme-depending conversions. It's a combination of two discouraged things: non-UTF8 as charset, as well as forgetting to properly URI-encode the link. I'd guess completely dropping such links is also okay. Let whoever prints OSC 8 fix its code to properly URI-encode.

Comment 53 Josh Triplett 2017-03-20 23:16:10 UTC

(In reply to Christian Persch from comment #48)
> * As for the concern about the amount of extra data this could use even in
> the legimitate use case (e.g. in a linkifying ls(1)), one way to reduce
> storage need would be to additionally introduce OSC sequences for setting a
> 'base URI' and then have another OSC sequence for specifying URIs that are
> relative to that (current) base URI. (And then also perhaps one variant that
> takes file paths instead of URIs, so that saves the 'file://' part.)

This seems like it adds a complex degree of statefulness, and compression ought to provide the same benefit.  (Compression wouldn't reduce the data transmitted between the application and the terminal, but that doesn't seem like as big of an issue as storage and display.)

> * In order for this to gain widespread adoption (e.g. in coreutils), I think
> we need to add these sequences as capabilities to terminfo, and therefore
> need to solve the terminfo problem (refer bug 778958.)

Probably, though I suspect solving that may require building a viable replacement.

> * While this seems fine for non- and semi-interactive programmes (ls, find,
> etc), I'm not sure this works as well for fully interactive programmes
> (ncurses/slang based, and things like w3m/elinks). Those replace characters
> at specific positions all the time, and there probably isn't a good place to
> emit these OSC sequences.

Such applications could track URIs the same way they do formatting, and emit them appropriately.  As long as we follow the rule that adjacent cells linking to the same place form part of the same link, that should allow replacing a character in the middle of a link without any issue.

Comment 54 George Nachman 2017-03-20 23:28:57 UTC

I'd support saying that URIs will be interpreted as UTF-8. URL-encoding with shell is a PITA and it's nice to be able to write echo -e "\e]8;http://المغرب.icom.museum\a' and have it do the right thing.

Comment 55 Egmont Koblinger 2017-03-21 00:00:26 UTC

Well, it's not a typical use case to _manually_ type echo -e "\e]8;http://blahblah" to then click on it and visit a page, you'd rather type it straight away into a browser :)

IDN is something I totally forgot about, but the %-escaping also seems to work there, at least in Firefox and Chromium, also via the current gnome-terminal patch:

echo -e "\e]8;http://%d8%a7%d9%84%d9%85%d8%ba%d8%b1%d8%a8.icom.museum\a click me \e]8;\a"

And %-escaping is a piece of cake to implement in apps, as opposed to the idn/punycode stuff. So, still no reason for any app to ever output non-ascii bytes here.

That being said, it's indeed not a proble if raw UTF-8 also works.

Comment 56 George Nachman 2017-03-21 00:17:22 UTC

I was thinking of shell scripts, not adhoc echo commands. When you're banging out a hacky shell script it's great when things like this just work.

Comment 57 Egmont Koblinger 2017-03-21 22:10:31 UTC

Created attachment 348448 [details] [review]
VteCell refactoring, no more union with int

This tiny bit of refactoring gets rid of the 8-byte limit for VteCell, and does make sense on its own without the hyperlink feature. As such, I'm planning to submit it separately from the big feature implementation.

Comment 58 Egmont Koblinger 2017-03-21 22:21:11 UTC

Created attachment 348449 [details] [review]
Vte OSC 8 placeholder

Placeholder for OSC 8 so that vte swallows it, rather than printing garbage. Aimed for the 0-48 branch, 0.48.1 release.

In the mean time it's time to decide what the internal naming should be: link, hyperlink, anchor, uri, ...?

Anchor, as far as I understand, much rather refers to the visible text, so it's probably out of question. Uri I don't like for some reason, don't know why.

Searching for the link vs hyperlink difference, there's no clear consensus on forums. Some say a hyperlink is one that you can navigate (typically <a>), whereas a link is also something when an external resource is fetched in the background (<img>, <embed>, <link>). Others say hyperlink is a short for "hypertext link", and we don't know if the target is hypertext (mind you, the hypertext transfer protocol is also being used to serve pretty much everything, not just hypertext).

Both raise a question whether it can be mistaken for the autodetected URLs in vte, although those are called regexes (since vte doesn't require them having to do anything with web addresses). I wouldn't want to go for overlong names like "explicit_link" throughout the source to make it more obvious.

I like the conciseness of the word "link" but it has the disadvantage that it's harder to grep for, there are many false positives (around software linking as well as blinking cursor/text).

So cast your votes please: link or hyperlink? :)

Comment 59 Josh Triplett 2017-03-22 03:09:04 UTC

In a terminal app context, "link" can also mean many other things.  I'd suggest the unambiguous "hyperlink".

Comment 60 Egmont Koblinger 2017-03-24 00:27:04 UTC

Created attachment 348627 [details] [review]
vte, work in progress 1

Misusing bugzilla for backup purposes :D

This is a heavily work in progress version, in the direction of the final design I have in mind. Onscreen links are kept in a pool, and are written to attr_stream as agreed. Reading back from the stream is missing, if you scroll back the links point to an unknown target. Plus, no hover underlining and no throttling yet.

Don't resize the window vertically across a power of 2 boundary after having hyperlinks, it'll crash. The do_truncate branch is truly a nightmare (even though I fixed it 3.5 years ago in bug 708496). Extremely hard to understand, and is walking backwards in attr_stream which we can no longer afford (we'll need to walk backwards in row_stream and look up the attr_stream pointer from there). I'm sure I'll go nuts while fixing it :) I'm also wondering whether we can simplify by not truncating some records, but rather simply allowing occasionally to have a "fake" record as if the attributes changed while they actually didn't. Or if it'll just make it even more complicated...

Comment 61 Egmont Koblinger 2017-03-24 16:03:06 UTC

> I'm also wondering whether we can simplify by not truncating some records,
> but rather simply allowing occasionally to have a "fake" record as if the
> attributes changed while they actually didn't.

Nope, I'm not going to do this, this would conflict with the following idea (next comment).

Comment 62 Egmont Koblinger 2017-03-24 16:26:18 UTC

Something that's been spinning in my mind for quite a while is stitching together the URIs at line boundaries on mouseover, especially in the scrollback stream, let alone at the boundary of the stream and the ring. With the current design it's terribly cumbersome. When thawing a row from the stream for display purposes only (i.e. the scrollbar is dragged back), I don't want to carry the actual URIs (could be quite expensive), and don't want to place them in the ring either (expensive, gc becomes hard, plus then dragging back the scrollbar has an influence on the functionality when the max count of URIs is reached). And if we don't know the URI itself, how do we figure out if it's the same as the one in the prev/next row? Solveable, but complicated and ugly.

I'm also not a fan of carrying checksums of URIs due to the possibility of false positives.

What I figured out that would simplify things a lot:

Stitching won't happen in the display phase, but when writing to attr_stream. Each record will have an additional bit telling whether at the end of the run repsesented by that record the URI also terminates (the next cell will be guaranteed to have a different one, or none). This bit of information can easily be carried on to the cells when thawing a row, so cells of thawn rows will be one of "no link", "non-last cell of an anchor text", "last cell of an anchor text". Cells of the ring will still contain the link ID instad, without knowing if it's the last cell (since it's not feasible to maintain it as the contents are changing randomly). This is sufficient to easily locate an anchor's first and last cell, even across the ring-stream boundary. (By the way, this design is also much closer to the link_stream approach mentioned above, mostly from the displaying part of view. Will come handy whenever we refactor to that design, which is probably never.)

I love and hate at the same time how this feature requires so much thinking :) Gives me a feeling that the ring+stream design is overly complicated, although I have no idea at the moment how to simplify, maybe it just needs to be this complex.

Comment 63 Josh Triplett 2017-03-24 16:50:56 UTC

Won't that approach make it much harder to handle going back and adding new characters at the start or end of a link with the same URL?

I think you really do want "adjacent with the same URL" as the rule for "same link". And "same ID" works fine as a proxy for that.

However, I think you could get away with only underlining the part of a link on the same line on hover, if that makes this any easier.

Comment 64 Egmont Koblinger 2017-03-24 17:11:06 UTC

(In reply to Josh Triplett from comment #63)

> Won't that approach make it much harder to handle going back and adding new
> characters at the start or end of a link with the same URL?

VTE uses a completely different data structure for the onscreen contents (plus a few more lines) and the scrollback. The onscreen bits are stored directly in the so-called "ring" (each cell stored separately along with its attributes), whereas the scrollback contents are placed in a few "streams" in a totally different format. In particular, "attr_stream" stores the attribute and the final position for each continuous run of identical attributes.

<nit>More precisely, the ring object takes care of handling the stream, so when I say "ring" I much rather mean "ring but not stream".</nit>

The more troublesome bit with hyperlinks is the stream, i.e. the contents that have been scrolled out. The good side of that, though, is that it's read-only. There are no escape sequences to go back there and partially override the data there. Once written into the stream, the boundaries of hyperlinks are known and final. (Records are only written out when we actually see that that particular run of identical records has finished, that is, we already see a subsequent cell with different attributes.)

Sometimes we need to "thaw" a row with "do_truncate", that is, place back from the stream to the ring. This can happen when the window is enlarged. It's a PITA anyways (e.g. caused tons of troubles with the encryption), and is a PITA here as well (this is where the wip1 patch crashes).

> However, I think you could get away with only underlining the part of a link
> on the same line on hover, if that makes this any easier.

No, I'm not doing that :) I want the user experience to be flawless.

Comment 65 Egmont Koblinger 2017-03-25 00:22:42 UTC

(In reply to Josh Triplett from comment #63)

> However, I think you could get away with only underlining the part of a link
> on the same line on hover, if that makes this any easier.

Not only did I hit some unexpected walls again while trying to implement this, but the above comment of yours made me think:

Underlining on hover somewhat defeats the purpose of this feature (at least some of the great use cases we had in mind). It's automatically stitching together some bits, potentially missing quite a few cases, while one of the great aspects of this feature is to overcome the autodetection stuff when URLs are displayed.

E.g. think of a URL in a manpage wrapped into two lines (formatted with groff and friends, having a large white margin on both sides, viewed with "less -R"), or an app implementing its own windowing system (tmux, mcedit with multiple files etc.) and having a URL wrapped into multiple lines in these. Assume that these apps have implemented and properly use OSC 8 where they still print the visible URLs.

In these cases hover underlining would again only underline parts of these URLs, similarly to the current behavior. Ctrl+click would open them correctly, but the underlining would totally look silly and go against the overall intent of this feature.

If we implement hover underlining at all, I'm wondering whether instead of stitching together the adjacent cells, could it be a better approach to underline _all_ the visible cells that point to the same destination? At least this would result in a better experience in the above outlined cases (although sure could look silly in many other ones, e.g. if your PS1 contains your PWD as a hyperlink).

Another approach could be to remember which cells were printed within the same OSC 8 run, but this, as discussed and agreed earlier, would probably cause problems with apps that optimize their display updates. So I'm not too keen on this idea. Plus, implementation doesn't sounds too easy either.

So, I'll postpone this for a while. I'd like to get all the other bits ready, reviewed and committed by 0.49.1 (Apr 24), and maybe think about hover underlining later. Or just drop this idea entirely.

Comment 66 Egmont Koblinger 2017-03-25 00:35:03 UTC

bash prompt, just for fun:

PS1='\u@\h:\[\e]8;file://$(__vte_urlencode "$PWD")\a\]\w\[\e]8;\a\] \$ '

Comment 67 Josh Triplett 2017-03-25 17:47:23 UTC

(In reply to Egmont Koblinger from comment #65)
> E.g. think of a URL in a manpage wrapped into two lines (formatted with
> groff and friends, having a large white margin on both sides, viewed with
> "less -R"), or an app implementing its own windowing system (tmux, mcedit
> with multiple files etc.) and having a URL wrapped into multiple lines in
> these. Assume that these apps have implemented and properly use OSC 8 where
> they still print the visible URLs.
> 
> In these cases hover underlining would again only underline parts of these
> URLs, similarly to the current behavior. Ctrl+click would open them
> correctly, but the underlining would totally look silly and go against the
> overall intent of this feature.
> 
> If we implement hover underlining at all, I'm wondering whether instead of
> stitching together the adjacent cells, could it be a better approach to
> underline _all_ the visible cells that point to the same destination? At
> least this would result in a better experience in the above outlined cases
> (although sure could look silly in many other ones, e.g. if your PS1
> contains your PWD as a hyperlink).

Interesting notion.  Seems like it'd produce reasonable results in many cases, though it might also lead to confusion (by making a distant part of the screen change and draw attention).

It seems like we've mixed presentation and semantics a bit here.  While hover-underlining does seem like it'd make this feature more discoverable, browsers don't actually do that unless the site tells them to; by default, browsers format links using static formatting.

I agree that we should go with static formatting for now, and worry about hover-underlining later if at all, based on experience with this feature.

Thanks again for working on this!

Comment 68 Josh Triplett 2017-03-25 17:50:41 UTC

I experimented with using regular expressions to turn bug closures and CVEs in Debian changelogs into links, and it worked quite well.

I also started working on less -R, though that will require more work, as less will actually need to keep track of the URLs attached to text much like gnome-terminal does.

Comment 69 Egmont Koblinger 2017-03-25 20:39:29 UTC

(In reply to Josh Triplett from comment #67)

> It seems like we've mixed presentation and semantics a bit here.

I was thinking about a design where the escape sequence contains an ID as well to tell the emulator which cells belong to the same link. But that would heavily overcomplicate things (and not just for terminal emulators, but also for apps) for marginal benefits. So no, I definitely don't want this. (Let alone the compability issue with iterm2.)

> While
> hover-underlining does seem like it'd make this feature more discoverable,
> browsers don't actually do that unless the site tells them to; by default,
> browsers format links using static formatting.

The built-in no-css default of browsers is IMO irrelevant. What might matter is the typical practice across the web, and it looks to me that most of the time the formatting changes on mouseover (typically becomes underlined, or of a different color). Plus, the target appears in the bottom corner.

> I agree that we should go with static formatting for now, and worry about
> hover-underlining later if at all, based on experience with this feature.

Yup. (I'd like to come up with something for 0.49.2 or so, but the good thing is that we can modify it even in years to come without breaking the functionality.)

We definitely do have to do something with overlapping implicit and explicit links, so that the underlining (vte's suggestion to the user on what would happen on a ctrl+click) matches the actual behavior (what g-t actually does on a ctrl+click). E.g. it's not okay if the autodetected link is underlined on hover, and then the explicit one is opened.

I also probably wouldn't like if autodetected links and explicit links looked different on hover. One possible approach is to relax the normal underline to a dashed one for autolink hover, so that they look the same. (Mind you, browsers don't even automatically recognize and linkify non-anchored links, you have to manually copy-paste them.)

> Thanks again for working on this!

Thanks for the great feature request, as well as looking into less! (I've taken a quick look at it, doesn't look easy. E.g. it believes ANSI color-changing escape sequences start at ESC and the next [ is just an internal byte of it. At some other point it walks backwards in the buffer, and whether ESC or 'm' comes first tells whether it's in the middle of a recognized escape sequence. These need to be reworked heavily. Good luck!)

Comment 70 George Nachman 2017-03-25 21:21:57 UTC

It would be wise to allow extra parameters such as an ID, target=, other presentation settings, and anything unforeseen in the future. It's a little tricky to get more than one parameter into this sequence since URLs can contain any character. Semicolons would be the obvious delimiter but they're used. Space might be a good choice.

Comment 71 Egmont Koblinger 2017-03-25 22:47:15 UTC

(In reply to George Nachman from comment #70)

> It would be wise to allow extra parameters

Let's do it, then :)

> such as an ID, target=,

I'd rather the target not look like an "extra parameter" but the main thing without a name; required to be either the first or the last (whichever we agree on) component. This also makes it easier to use this sequence in ad-hoc scripts.

> Space might be a good choice.

Yup. Or the pipe symbol. Or stick with semicolon, but require a special symbol (e.g. '$') for all names (and disallow this to be the first character of the URI). E.g.:

  $id=7;$foo=bar;http://blah.example.com

Any preference from you guys? (Josh? Christian?)

---

As for the ID, for the purpose of hover underlining:

Simple utilities, such as "ls", should not be bothered coming up with a random ID each time, nor should subsequent runs of them auto-glue the corresponding URLs. As such, the simplest default case (i.e. no ID) should treat all OSC 8s to open a new one. Internally, the terminal emulator could automatically assign a unique ID (from a namespace that's unreachable via the escape sequence) each time it encounters such an OSC 8 without an ID.

Explicit IDs probably don't make that much sense on the normal screen, because it's out of the app's control whether it glues together with ones from earlier runs (or even from other apps) or not. It's not a problem though and probably the terminal emulator should just not care and obey them. They make the most of sense on the alternate screen where fullscreen apps run typically. E.g. editors could use the file offset where the URI begins as the ID. Or whatever's convenient for them.

I'd like the ID to be an arbitrary string (as opposed to, let's say, an int64). This potentially makes it much easier for app creators, since they don't have to manage an app-global pool for this. E.g. when tmux encounters a link with "id=123" inside its pane number 45, it could convert it to an "id=tmux-45-123" towards the host terminal emulator, without having to worry about conflicts with other panes.

Do these make sense?

---

Implementation in VTE (on top of the existing code) is quite trivial. It's just a simple parsing and adding a unique ID if missing when the OSC 8 is encountered, and parsing again to strip off the ID when giving the value to gnome-terminal. The ring pool, streams can remain unmodified and work with the concatenated (id, target) values.

Comment 72 Egmont Koblinger 2017-03-26 00:17:05 UTC

Created attachment 348725 [details] [review]
vte, work in progress 2

Needs tons of cleanup, but hopefully there are no more bugs or crashes.

Figured out that reworking the stream truncation is kinda impossible without me going mad, so I just duplicate the URI's length after the URI, this makes it possible (and easy) to walk backwards in attr_stream.

One day we might rewrite this entire ring / stream / last_attr craziness from scratch and come up with something nicer :)

Comment 73 Josh Triplett 2017-03-26 01:38:37 UTC

(In reply to Egmont Koblinger from comment #71)
> (In reply to George Nachman from comment #70)
> 
> > It would be wise to allow extra parameters
> 
> Let's do it, then :)
> 
> > such as an ID, target=,
> 
> I'd rather the target not look like an "extra parameter" but the main thing
> without a name; required to be either the first or the last (whichever we
> agree on) component. This also makes it easier to use this sequence in
> ad-hoc scripts.

If you want to allow extra parameters, they need to come before the URL; the URL needs to come last, because depending on the scheme, it could contain almost any character.  

> > Space might be a good choice.
> 
> Yup. Or the pipe symbol. Or stick with semicolon, but require a special
> symbol (e.g. '$') for all names (and disallow this to be the first character
> of the URI). E.g.:
> 
>   $id=7;$foo=bar;http://blah.example.com
> 
> Any preference from you guys? (Josh? Christian?)

To reserve the possibility of future extension, and make parsing unambiguous, I'd suggest saying that the first character of the URL can't be ';'.  So, a normal URI escape looks like OSC 8 ; URI \a , and a URI escape with parameters looks like OSC 8 ;; params ; URI \a .  Since we can define "params" arbitrarily, you can easily just define it to not contain any semicolons; for instance, key=value,key2=value2.

That said, I also don't think we should define any parameters yet.  But reserving a spot for them seems fine for forward-compatibility.

> As for the ID, for the purpose of hover underlining:
> 
> Simple utilities, such as "ls", should not be bothered coming up with a
> random ID each time, nor should subsequent runs of them auto-glue the
> corresponding URLs. As such, the simplest default case (i.e. no ID) should
> treat all OSC 8s to open a new one. Internally, the terminal emulator could
> automatically assign a unique ID (from a namespace that's unreachable via
> the escape sequence) each time it encounters such an OSC 8 without an ID.
> 
> Explicit IDs probably don't make that much sense on the normal screen,
> because it's out of the app's control whether it glues together with ones
> from earlier runs (or even from other apps) or not. It's not a problem
> though and probably the terminal emulator should just not care and obey
> them. They make the most of sense on the alternate screen where fullscreen
> apps run typically. E.g. editors could use the file offset where the URI
> begins as the ID. Or whatever's convenient for them.
> 
> I'd like the ID to be an arbitrary string (as opposed to, let's say, an
> int64). This potentially makes it much easier for app creators, since they
> don't have to manage an app-global pool for this. E.g. when tmux encounters
> a link with "id=123" inside its pane number 45, it could convert it to an
> "id=tmux-45-123" towards the host terminal emulator, without having to worry
> about conflicts with other panes.
> 
> Do these make sense?

I don't really think the concept of an ID makes sense; I think it'll become one of those weird, complex corner cases that nobody uses but terminals have to support.  Especially if the only useful purpose it serves is to determine how much to underline on hover.

I'd instead just suggest adding a couple of heuristics for hover underlining (e.g. underline two links with identical URLs if only whitespace appears between them, which solves the word-wrap problem).  If you get those heuristics wrong, it really doesn't cause any significant problem, as the point where the user hovers will still get underlined.

Comment 74 George Nachman 2017-03-26 16:22:26 UTC

> To reserve the possibility of future extension, and make parsing unambiguous, I'd suggest saying that the first character of the URL can't be ';'.  So, a normal URI escape looks like OSC 8 ; URI \a , and a URI escape with parameters looks like OSC 8 ;; params ; URI \a .  Since we can define "params" arbitrarily, you can easily just define it to not contain any semicolons; for instance, key=value,key2=value2.

I like this idea. The params would need to not have semicolons anyway. It says you must always specify parameters, even if there are none. I'll update iTerm2 to use this if there are no objections.

Comment 75 Josh Triplett 2017-03-26 16:55:11 UTC

(In reply to George Nachman from comment #74)
> It says you must always specify parameters, even if there are none.

What do you mean by this?

Comment 76 George Nachman 2017-03-26 20:12:46 UTC

> What do you mean by this?

Just a restatement of what you wrote. The grammar is:

 OSC 8 ; parameters ; url ST

where `parameters` may be empty.

Comment 77 Egmont Koblinger 2017-03-26 20:25:16 UTC

(In reply to Josh Triplett from comment #73)

> If you want to allow extra parameters, they need to come before the URL; the
> URL needs to come last, because depending on the scheme, it could contain
> almost any character.

URI-escaping is scheme-agnostic, isn't it? That means URIs cannot contain space (they need to be escaped to %20), so space is actually a viable choice.

> I'd suggest saying that the first character of the URL can't be
> ';'.  So, a normal URI escape looks like OSC 8 ; URI \a , and a URI escape
> with parameters looks like OSC 8 ;; params ; URI \a .

This is an absolutely unusual pattern. If we choose this approach, the two cases (one or two semicolons at the beginning) should be swapped. That way the lack of params is just the lack of params: OSC 8 ; params ; URI \a, or OSC 8 ; ; URI \a.

> Since we can define
> "params" arbitrarily, you can easily just define it to not contain any
> semicolons; for instance, key=value,key2=value2.

As I was further thinking about it, I get to like George's proposal with the spaces more and more.

Thinking about what presentational attributes might appear in the future, one example that occurred to me (based on common practices on the web) is to change color or other attributes on hover. And if we add support to that, I'd denote the new colors and attributes using the numbers from the ANSI escape sequence. E.g. hoverattrs=1;3;38;5;255. This contains semicolons, having to replace them with something else adds another bit of complexity. This might be one weak reason to go for space as separator.

As weird as space sounded to me in the first place, I just realized this what HTML pages also have. And since we're copying a feature of the web, it could be a fun thing to go with their syntax. I also changed my mind with the URI itself not having to have a name, I actually don't mind if it's just another attribute, but to follow HTML, I'd name it "href" instead of "target". E.g.:
  OSC 8 ; h r e f = URI space i d = ID \a

Making the href syntactically optional also allows in the future to extend OSC 8 for things that aren't links, yet want to take advantage of the additional infrastructure we add here (e.g. specify which cells belong together for underlining on hover, even if they are not links). But I don't want to go in this direction now.

> I don't really think the concept of an ID makes sense; I think it'll become
> one of those weird, complex corner cases that nobody uses but terminals have
> to support.  Especially if the only useful purpose it serves is to determine
> how much to underline on hover.

Well, tough question... I understand your point and also share some of your concerns. It'll damn easy to implement in terminal emulators though, and apps are free to choose whether they take advantage of this feature or not.

With the semantics of ID that I outlined, the simplest case (omitting the ID) is good for pretty much every simple utility that prints text, and just blindly adding an ID=1 everywhere is good enough in the first round for most fullscreen apps (the emulator will underline all cells that point to the same destination.) App developers are then free to decide if and when to improve their software with proper ID support.

[Note: in order for the ID to work this way, and in order for the ID to make sense anyway, cells that have the same ID _and_ the same target URI should be underlined on hover. Same ID but different target should be treated as the same as different ID.]

> I'd instead just suggest adding a couple of heuristics for hover underlining
> (e.g. underline two links with identical URLs if only whitespace appears
> between them, which solves the word-wrap problem).  If you get those
> heuristics wrong, it really doesn't cause any significant problem, as the
> point where the user hovers will still get underlined.

It's an implementation private business of VTE, but even stitching together adjacent cells of the same URI turned out to be very complex due to the complex underlying data structures we have. I definitely don't want to go for something even more complicated heuristics.

Implementing IDs, on the other hand, will be quite simple.

(In reply to George Nachman from comment #74)

> I like this idea. The params would need to not have semicolons anyway. It
> says you must always specify parameters, even if there are none. I'll update
> iTerm2 to use this if there are no objections.

Please don't rush with the implementation until we come to an agreement :)

We have two proposals now that I'd be okay with (I've somewhat changed both of them):

- George's, with s/target/href: space-separated list of params, like:

  \e]8;href=http://blah\a
  \e]8;id=123 href=http://blah foo=bar\a

- Josh's, with swapping ; and ;; :

  \e]8;;http://blah\a
  \e]8;id=123,foo=bar;http://blah\a

Out of these two, I cast my vote for George's.

Comment 78 Christian Persch 2017-03-26 20:47:34 UTC

My preference in comment 77 would be the 'Josh' form, because it's more like how other sequences are structured.

(In reply to George Nachman from comment #54)
> I'd support saying that URIs will be interpreted as UTF-8. URL-encoding with
> shell is a PITA and it's nice to be able to write echo -e
> "\e]8;http://المغرب.icom.museum\a' and have it do the right thing.

Agreed. 

However if the terminal is running in non-UTF-8, the spec should make it clear that you need to URL-encode any bytes where the (automatic in vte's case) conversion from terminal encoding to UTF-8 will change the URI. Plus, we should reject any URI containing control characters (0x01..0x1F, 0x7F..0x9F), ie. those need to be URL-encoded.

Oh, and another point, which I forgot to make in comment 48, is that while in theory URIs can be of arbitrary length, that shouldn't stop us from imposing a reasonable length limit (say, 1k or 2k bytes).

Comment 79 Egmont Koblinger 2017-03-26 20:48:18 UTC

(In reply to George Nachman from comment #56)
> I was thinking of shell scripts, not adhoc echo commands. When you're
> banging out a hacky shell script it's great when things like this just work.

Just FYI: With arbitrary order of parameters and space as a separator, such hacky scripts that are incorrectly lazy to URI-escape will fail at filenames containing spaces. The other recommendation doesn't suffer from this "problem". Not sure if we should care about it :)

Comment 80 Egmont Koblinger 2017-03-26 20:53:27 UTC

(In reply to Christian Persch from comment #78)

> However if the terminal is running in non-UTF-8, the spec should make it
> clear that you need to URL-encode any bytes where the (automatic in vte's
> case) conversion from terminal encoding to UTF-8 will change the URI. Plus,
> we should reject any URI containing control characters (0x01..0x1F,
> 0x7F..0x9F), ie. those need to be URL-encoded.

I guess it's okay if the specs simply says "the behavior is undefined if you don't properly URI-escape".

> Oh, and another point, which I forgot to make in comment 48, is that while
> in theory URIs can be of arbitrary length, that shouldn't stop us from
> imposing a reasonable length limit (say, 1k or 2k bytes).

Definitely. I think George went for 2083 bytes in iTerm2, and so did I in my recent wip2 patch.

(No clue though how we handle if we keep getting bytes inside an OSC 8 without a terminator... do we have a safeguard there somewhere? Or will just VTE blow up?)

Comment 81 Josh Triplett 2017-03-26 23:54:07 UTC

I'm indifferent about how you handle parameters, except that I'd prefer not to have to parse for an "href" parameter to find the URI in things like less.

Comment 82 Egmont Koblinger 2017-03-27 00:03:11 UTC

Created attachment 348756 [details] [review]
vte, work in progress 3

This version introduces IDs (remembers them, stores them in the stream... but doesn't yet use them for anything).

The escape sequence I went for is the "swapped Josh", it's not necessarily a final desicion, but seems like we have 2 votes for that, and I'm also okay with that choice.

The test file needs to be search-n-replaced ]8; -> ]8;; (I'll attach an updated version with additional test cases once we agree on the escape sequence).

Comment 83 Egmont Koblinger 2017-03-27 20:58:05 UTC

Created attachment 348833 [details]
Test file, v2

Updated test file.

Switched to the OSC 8 ; [params] ; URI \a syntax, and added a few new test cases (including using the "id" parameter).

Comment 84 Egmont Koblinger 2017-03-27 21:02:00 UTC

Created attachment 348835 [details] [review]
vte, work in progress 4

Here's the newest work-in-progress version. It uses the "id" parameter (or lack thereof) to figure out which cells belong together semantically, and underlines them on hover.

Please test, especially with the newly added test cases at the end of the updated test file. I'm curious if you guys like the user-facing behavior (I do).

TODO:
- massive code cleanup
- don't underline regex matches where there's an explicit hyperlink
- change mouse cursor shape to hand
- throttling

Comment 85 Egmont Koblinger 2017-03-28 22:50:15 UTC

Created attachment 348912 [details] [review]
vte, work in progress 5

Changes:
- don't underline regex matches where there's an explicit hyperlink
- change mouse cursor shape to hand (spider in debug mode :))
- tooltip showing the destination URI
- use reverse colors for the stream's contents in debug mode
- various fixes

TODO:
- make underlining more similar (both the code and the behavior) to regexes:
  - remove underlining if the mouse pointer is auto-hidden
  - update if the terminal contents change but the mouse doesn't move
- massive code cleanup
- throttling

Comment 86 Egmont Koblinger 2017-03-28 23:37:00 UTC

Hi guys,

i) Are we okay with settling with the

  OSC 8 ; params ; uri ST

format, whereas params looks like key1=value1,key2=value2 (and as such, lack of params is denoted by two semicolons next to each other)?

ii) Are we fine with my proposal for the "id" parameter?

To summarize: If an "id" is present, cells that share the same (id, uri) pair are underlined together on hover. If "id" is missing or empty, cells that were printed in the same OSC 8 run are underlined together on hover. (This latter one can easily be implemented e.g. by auto-generating a unique id each time an OSC 8 without an id is encountered.)

Comment 87 Josh Triplett 2017-03-29 03:03:03 UTC

(In reply to Egmont Koblinger from comment #86)
> Hi guys,
> 
> i) Are we okay with settling with the
> 
>   OSC 8 ; params ; uri ST
> 
> format, whereas params looks like key1=value1,key2=value2 (and as such, lack
> of params is denoted by two semicolons next to each other)?

Sounds good to me.

> ii) Are we fine with my proposal for the "id" parameter?
> 
> To summarize: If an "id" is present, cells that share the same (id, uri)
> pair are underlined together on hover.

This part seems fine, as a means of allowing underlining of discontiguous URIs.

> If "id" is missing or empty, cells
> that were printed in the same OSC 8 run are underlined together on hover.
> (This latter one can easily be implemented e.g. by auto-generating a unique
> id each time an OSC 8 without an id is encountered.)

That still seems painful for any app that redraws the screen without rewriting the entire URL.  Is it *that* bad to highlight all adjacent cells with the same URI, even if not printed with the same OSC 8?

(Using IDs or any other means to track things in scrolled-off-the-screen text seems fine, as you can't edit that text.)

Comment 88 Egmont Koblinger 2017-03-29 12:14:34 UTC

(In reply to Josh Triplett from comment #87)

> That still seems painful for any app

We have different examples in our minds when it comes to "any app", and sure it's difficult to foresee what "any app" will require :)

> that redraws the screen without
> rewriting the entire URL.  Is it *that* bad to highlight all adjacent cells
> with the same URI, even if not printed with the same OSC 8?

It is quite tricky in gnome-terminal, but that shouldn't be the concern when designing the behavior. If agreed that desirable, I'll go ahead and implement it.

The thing is, as I can see it now, it wouldn't offer a solution for some of the corner cases, and would heavily complicates the life of some apps.

A corner case in "less" where this approach wouldn't give the desired result: You view a file in 80 columns, in "chop long lines" mode. The file has an OSC 8 hyperlink in line 1, between columns 50-100, and another one to the same target starting at line 1 column 150, and going across a newline to line 2 column 20. What appears onscreen is the begining part of the first anchor text, and the trailing part of the second anchor. With auto-stitching, gnome-terminal would underline them together (until you resize the window to 101+ columns, from then on it will, as expected, not join them).

What I had in mind with the design of "id" was the apps on the two ends of the complexity scale: damn simple utilities that just print stuff, and fullscreen apps that have an exact internal idea of how the screen should look like. "less" is in a gray zone in between. Especially seeing the existence of the utterly brain-damaged and useless "less -r", it clearly doesn't have an internal picture of what's shown, it just uses magic hacks that make simple "less" and "less -R" work reasonably well.

"less" needs to do some mangling, suffling of the file contents. E.g. at the very least, when horizontally scrolled, it has to print all preceding ANSI escapes at the beginning of the line. Or perhaps keep track of the attributes and optimize away the ones that cancel out each other (not sure if it does it). The question is whether it's a fair requirement to mangle the "id" parameter of OSC 8. There are a few possibilities to go with:

- Without mangling the "id", if you can make it always print an entire row and just one row in a single run, you'd get the behavior you mentioned in the last paragraph of your comment 63: underlining would be restricted to a physical row.

- Adding an "id=1" to every hyperlink will result in a bit weird, but at least consistent behavior that we've discussed before, i.e. all hyperlink cells that have the same target will always be underlined together, but at least this behavior doesn't suffer from the inconsistency mentioned above. I don't think it's _that_ hard to implement it.

- And I don't think it's _that_ hard to implement the proper behavior either. Once the IDs are mangled, instead of chosing a fixed value you could go with the file offset, or (row,col) pair as the ID at the position where the OSC 8 escape sequence occurs. That would make the behavior perfect.

It doesn't have to be you yourself doing these, e.g. you might go for a simple and mostly okay approach as a proof of concept, and someone (e.g. less's developer(s)) implementing the proper one later on.

---

On the other hand, let's assume that auto-stitching behavior is available, and let's imagine we add OSC 8 support to a terminal multiplexer. Let's imagine that the terminal multiplexer wishes to provide you the exact same behavior as the raw terminal emulator itself, with regard to hover underlining.

Let's assume it can have panes next to each other (such as tmux with ^B %). [Going even further, although it may not be necessary for this discussion, we can even imagine that it supports overlapping panes (I've no clue if tmux or any other similar app supports this, but there's no intrinsic reason that prevents it).]

Let's look at the following onscreen contents, where all the 'x' letters are hyperlinks to the same destination, with auto-gluing as the desired behavior (no explicit "id"), and '|' is the separator between the panes:

leftpane xxxxxx  |rightpane
xx               |blah

and then at some point the gap is also filled with hyperlink:

leftpane xxxxxxxx|rightpane
xx               |blah

There's no way VTE could automatically make a distinction between these two cases, and figure out that it's semantically two separate hyperlinks in the first case, and one wrapped hyperlink in the second. As such, the multiplexer needs to explicitly help out the emulator via the ID, at least in the second case. That is, upon every modification of a cell, it needs to check its surroundings and update their hyperlink IDs, and potentially repaint plenty of surrounding cells which only changed their ID. Sounds like terribly cumbersome and painful to implement.

Another approach a terminal multiplexer could do is to track cursor motion, and underline itself. This again has severe issues. Again terribly complex to implement, potentially generate tons of network traffic, its asynchronous nature causes user-visible delays in underlining, etc.

Or it can just give up, which results in a degraded user experience inside the multiplexer wrt. hyperlinks.

---

To summarize: I believe auto-stitching makes it somewhat easier to come up with an almost perfect (but still buggy) support to "less -R", but it does not make it any easier to come up with a perfect support (which is not that hard either), and makes it magnitudes more complex to add perfect support to terminal multiplexers.

Comment 89 George Nachman 2017-03-30 04:52:36 UTC

I think it's clearly correct to optionally use id= to tie hyperlinked cells together. Any program that owns the screen and knows what's where can take advantage of it for nice behavior when there are split panes. I also believe that almost nobody will use it, but I'm going to add support for it anyway.

I am not going to try to keep an index from id to location in the buffer. I'm just going to search a few lines up and down for hyperlinked text with the right id, and I'll revisit that decision later if needed.

> i) Are we okay with settling with the
> 
>   OSC 8 ; params ; uri ST

LGTM

> If "id" is missing or empty, cells
> that were printed in the same OSC 8 run are underlined together on hover.
> (This latter one can easily be implemented e.g. by auto-generating a unique
> id each time an OSC 8 without an id is encountered.)

Seems like a missing id might more usefully mean "do your best to stitch adjacent cells" so that an implementer could avoid all the complexity of IDs if they so choose and still end up with something useful if bits of the screen get redrawn here and there. That doesn't mean restricting hover highlight to a single row. By all means search the neighborhood for likely candidates (iTerm2 hover-highlights across soft-wrapped lines as well as "soft boundaries" such as those formed by | characters in vim split panes).

Comment 90 Egmont Koblinger 2017-03-30 09:24:34 UTC

(In reply to George Nachman from comment #89)

> I am not going to try to keep an index from id to location in the buffer.

I cannot comment on this as I don't know iTerm2's internals :)

> >   OSC 8 ; params ; uri ST
> 
> LGTM

Okay, let's say then that we've agreed. Could you please update iTerm2 before too many people start using the old one? :)

As for the URI terminator, you currently require a semicolon, and so did I, so I think it's the simplest/cleanest to require two semicolons from now on, i.e.
  OSC 8 ; ; ST
Does this sound reasonable?

> Seems like a missing id might more usefully mean "do your best to stitch
> adjacent cells"

Most likely okay. Let me think whether I can foresee any problem with just having in the specification that lack of ID is "do your best", up to the terminal emulator. One thing is, an intermediate layer (e.g. a multiplexer) might interpret its internal app's behavior according to one particular semantics and explicitly implement that semantics towards the host emukator, making it somewhat different from running the same up directly in the emulator without the multiplexer. I think it's acceptable.

I'm planning to create a mini-webpage gist (similarly to xvilka's truecolor one) because it's practically impossible to dig through this thread and figure out what this feature has become. I'll mention there how iTerm2 and VTE are different in this "do your best".

Nitpicking: shall the "id" string (and all future keys; the keyword itself, not the value) be lowercase, uppercase or case insensitive?? /me votes for lowercase. And of course case sensitive comparison for the id values.

Comment 91 Josh Triplett 2017-03-30 15:48:14 UTC

(In reply to Egmont Koblinger from comment #90)
> Nitpicking: shall the "id" string (and all future keys; the keyword itself,
> not the value) be lowercase, uppercase or case insensitive?? /me votes for
> lowercase. And of course case sensitive comparison for the id values.

Yes, lowercase "id", and exact-byte-match comparison for id values.

Comment 92 George Nachman 2017-03-31 04:53:16 UTC

What character set are id's in? I would suggest ASCII for this and all parameters' keys and values in the params section. If we need to venture outside ASCII (e.g., for a hyperlink's "alt" text) we can always require a base64-encoded value.

What separates parameters in the params section? I need to know this to tell where the id's value ends if there is a second parameter. Perhaps space would be a workable choice here.

Comment 93 George Nachman 2017-03-31 05:23:06 UTC

If two links differ only in an unrecognized parameter should that keep them from highlighting together? I lean towards yes because that implies different intent. All the HTML5 attributes on <a> that we could foreseeably use would benefit from acting like a difference in ID.

Comment 94 George Nachman 2017-03-31 05:29:49 UTC

Commit 037cb5c adds the latest updates, using space delimited params and any difference in params implies no hover-highlighting together.

Comment 95 Egmont Koblinger 2017-03-31 05:52:37 UTC

> What character set are id's in? I would suggest ASCII for this and all
> parameters' keys and values in the params section. If we need to venture
> outside ASCII (e.g., for a hyperlink's "alt" text) we can always require a
> base64-encoded value.

Yup, absolutely! (Or URI-encoding to be consistent with the link itself :)) I'd not define the exact encoding now but rather whenever actually needed (and it might be different for each parameter, as it's a best fit for that). For "id" specifically, I'd just go with simple strcmp() comparison and not care about it at all. Mind you, I can't tell off the top of my head whether VTE performs a local_charset -> utf-8 conversion beforehand, and luckily, I can afford not to care :-).

> What separates parameters in the params section? I need to know this to tell
> where the id's value ends if there is a second parameter. Perhaps space
> would be a workable choice here.

The recommendation above was the comma ',' character. Another somewhat reasonable choice could be the colon ':' because it's (optionally) the 2nd level separator in ANSI color codes (e.g. 1;3;38:5:255). I'm not that keen on space, since as per Christian's comment 78 it's atypical as a separator in escape sequences.

> If two links differ only in an unrecognized parameter should that keep them
> from highlighting together? I lean towards yes because that implies
> different intent. All the HTML5 attributes on <a> that we could foreseeably
> use would benefit from acting like a difference in ID.

I guess I need to check what the HTML5 <a> attributes are :)

I'm not a fan of this approach. Apart from the <a> attributes you might have forgotten about the CSS styling possibilities. E.g. you might specify different fg/bg hover colors for each character. You can have a separate normal fg/bg color now for each cell in the terminal, and that's still a single link, so then why not be able to do this for hover colors as well.

If we say a difference in _any_ attribute causes separate underlining then 1) we have not introduced the "id" parameter specifically, the string "id" itself has no special meaning, and 2) we've already given some semantics to _all_ the possible attributes, undermining future extendability.

The "id" tag is meant to be used by complex and smart apps, if they ever introduce another parameter where they want links to be split, they can also express it by choosing a different "id".

Comment 96 Egmont Koblinger 2017-03-31 05:56:37 UTC

(In reply to George Nachman from comment #94)
> Commit 037cb5c adds the latest updates, using space delimited params and any
> difference in params implies no hover-highlighting together.

Please give us a day, or at least a few hours to discuss such question before changing your code :D

I don't like either of these, see the previous mid-air-collided answer.

Comment 97 Egmont Koblinger 2017-03-31 10:15:40 UTC

Choosing ':' as the parameter delimiter would have the funny advantage that if we ever introduce a parameter that takes colors and other attributes (e.g. for hover, or visited link) in the ANSI notation, we could "escape" the forbidden characters by omitting the upper dot, e.g. "1;3;38:5:255" could become "1,3,38.5.255". Hack hack hack, but visually funny, and probably not worse than any other kind of escaping.

I give 0.50 votes for ':', 0.49 votes for ',' and 0.01 for ' '.

Comment 98 Josh Triplett 2017-03-31 17:51:09 UTC

Not a fan of using space-separated parameters.  I originally would have suggested ',', but ':' seems fine too.

I'd suggest that ID should not specify a character set; just require that it not contain control characters, semicolon, or the parameter separator.  And when comparing it, do byte comparisons.  (However, you should set a length limit.)

If two links have the same ID, they should get treated as the same link for hover purposes, no matter what other attributes they have.  If two links don't have an ID, then use the URL as the ID.  I don't think we should take any other parameters into account.

Comment 99 Egmont Koblinger 2017-03-31 19:36:32 UTC

Christian, do you have a preference on ',' versus ':' for the parameter separator? George?

Comment 100 George Nachman 2017-03-31 20:22:21 UTC

> do you have a preference on ',' versus ':' for the parameter separator? 

: would be my choice for consistency with SGR.

> I'd suggest that ID should not specify a character set; just require that it not contain control characters, semicolon, or the parameter separator.

If ID does not have a character set then you can't very well search for a ':' or a ';' in it. UCS-4, for example, could contain byte matching one these ASCII characters as part of the four bytes making up a code point. It would be convenient if the parameters and URL of OSC 8 had the same character set as the terminal session in general.

> Could you please update iTerm2 before too many people start using the old one?
> Please give us a day, or at least a few hours to discuss such question before changing your code :D

Pick one :)

> The "id" tag is meant to be used by complex and smart apps, if they ever introduce another parameter where they want links to be split, they can also express it by choosing a different "id".

We can always require the id parameter be set when some future parameter (for example, target=) is supplied, and define what the behavior should be a if a single (url, id) has conflicting _other parameter_ values. I'm fine with segmenting hover only on id as long as we keep this in mind when adding new params.

Comment 101 Christian Persch 2017-03-31 20:29:55 UTC

(In reply to George Nachman from comment #100)
> > do you have a preference on ',' versus ':' for the parameter separator? 
> 
> : would be my choice for consistency with SGR.

I absolutely concur :-)

> > I'd suggest that ID should not specify a character set; just require that it not contain control characters, semicolon, or the parameter separator.

On the contrary, I'd prefer to be very strict and say that params ids must be purely [a-z] and param values after the '=' only contain [a-z0-9] and perhaps a few punctuation chars like [-._/] .

Comment 102 Egmont Koblinger 2017-03-31 20:44:21 UTC

> > Could you please update iTerm2 before too many people start using the old one?
> > Please give us a day, or at least a few hours to discuss such question before changing your code :D
> 
> Pick one :)

Josh's suggestion had then already included ',' as the separator, and in comments 89-90 I thought you agreed to that as well. I've just now realized that those comments indeed did lack this piece of information, it was there in earlier ones only. Hence the misunderstanding :)

Seems that ':' won.

Comment 103 Egmont Koblinger 2017-03-31 20:59:25 UTC

(In reply to George Nachman from comment #100)

> If ID does not have a character set then you can't very well search for a
> ':' or a ';' in it. UCS-4

By character set I only have ASCII-compatible ones in my mind :) I guess the entire escape sequence story would break big time with ASCII-incompatible ones like EBCDIC.

Does any terminal emulator anywhere use UCS-4, like, entirely in its data stream, e.g. the OSC "\e]" would be the 4-byte UCS-4 "\e" followed by the 4-byte UCS-4 "]"?

Obviously noone should be that stupid to try to embed UCS-4 in a UTF-8 data flow and expect the receiver to look for the terminator character keeping this in mind :) I'd just hide these under a "common sense" factor.

> It would be convenient if the parameters and URL of OSC 8 had the same
> character set as the terminal session in general.

I guess you can go for that.

In VTE I still don't know if the parameters are converted from the session's encoding to UTF-8 or not, and whether invalid UTF-8 is rejected/mangled inside the sequence if the session is in UTF-8 or not. It would take like a minute to try it out, but I don't want to do that because I just don't care :-)

E.g. a hyperlink is printed in UTF-8 mode with the UTF-8 id of "fúbár" (7 bytes) and then the encoding is switched to Latin-1 and then another hyperlink has the Latin-1 "fúbár" (5 bytes) as the id, will these two be joined? Honesly I don't care and I don't want to spend any time on this issue :) 

> We can always require the id parameter be set when some future parameter
> (for example, target=) is supplied,

You mean not any future parameter, but certain specific future parameters? Okay then. Sure you might one day introduce a "foobar" parameter which, by its nature, has to require an "id" to be set as well, and then of course make it a technical requirement.

But, what I find more likely, if a future parameter does not have anything to do with the semantics of which cells belong together to the same link then that parameter should be entirely left out of the "id" business.

(In reply to Christian Persch from comment #101)

> On the contrary, I'd prefer to be very strict and say that params ids must
> be purely [a-z] and param values after the '=' only contain [a-z0-9] and
> perhaps a few punctuation chars like [-._/] .

Maybe saying all bytes in the 32..126 range except for ';' and ':'?

Comment 104 Egmont Koblinger 2017-03-31 21:08:36 UTC

Created attachment 349081 [details]
Test file, v3

Switched to ':' as the parameter separator.

Added a few more tests, e.g. international domain names.

Comment 105 Egmont Koblinger 2017-03-31 21:13:58 UTC

Created attachment 349082 [details] [review]
vte, work in progress 6

Changes:
- Uses colon as parameter separator
- Tooltip shows URI-decoded and de-IDN-ed URI as well, if different
- A few small cleanups

(Goes on top of the "no more union" patch, as all previous ones, I forgot to mention it.)

Comment 106 Josh Triplett 2017-03-31 21:34:24 UTC

(In reply to Christian Persch from comment #101)
> On the contrary, I'd prefer to be very strict and say that params ids must
> be purely [a-z] and param values after the '=' only contain [a-z0-9] and
> perhaps a few punctuation chars like [-._/] .

Restricting parameters to ASCII 32..126 minus the significant punctuation seems fine, since people can always base64.  UCS-2 or UCS-4 would definitely not be acceptable; my intention was to just prohibit 0-31 and 127, to keep it parseable and avoid control characters.  That would allow ASCII, UTF-8, latin-1, or anything else vaguely sensible.  For IDs, something narrower doesn't seem like a problem, though.

(In reply to Egmont Koblinger from comment #103)
> E.g. a hyperlink is printed in UTF-8 mode with the UTF-8 id of "fúbár" (7
> bytes) and then the encoding is switched to Latin-1 and then another
> hyperlink has the Latin-1 "fúbár" (5 bytes) as the id, will these two be
> joined? Honesly I don't care and I don't want to spend any time on this
> issue :) 

That was the kind of case I had in mind when saying to compare bytes only.  Ignore the encoding, and just require the bytes to match exactly.

> Maybe saying all bytes in the 32..126 range except for ';' and ':'?

Yes, exactly.

Comment 107 Egmont Koblinger 2017-04-01 10:51:00 UTC

My two biggest pieces of work in VTE so far (rewrap 336238 and encryption 738601) both ended up at 111 bugzilla comments. This one will definitely join them in the top three. I'd like this one to end at 111 too, so no more comments please :D

(JK)

Comment 108 George Nachman 2017-04-01 20:27:30 UTC

> Does any terminal emulator anywhere use UCS-4, like, entirely in its data stream, e.g. the OSC "\e]" would be the 4-byte UCS-4 "\e" followed by the 4-byte UCS-4 "]"?


Some folks do use this, but I think it's rare. I'm sure they're used to suffering, so I'm content to ignore it.

I've updated iTerm2 to use a colon separator and segment highlighting on (id, url) equality if id is specified or url equality alone otherwise.

Comment 109 Egmont Koblinger 2017-04-01 22:50:29 UTC

Created attachment 349129 [details] [review]
vte, work in progress 7

This version now goes on top of these two patches:
- "no more union" from this bug
- the mouse pointer cleanup from bug 780785

Changes from the previous patch:
- mouse pointer is now hopefully always correct (i.e. even if the contents change under the mouse)

TODO:
- massive code cleanup
- throttling

Please start using it intensively and report any bugs or usability wishes :)

I'm planning to do the massive code cleanup next, so that we can review and submit as soon as possible. Throttling will come afterwards (or maybe in parallel with the reviewing process).

Comment 110 Egmont Koblinger 2017-04-03 21:58:08 UTC

Created attachment 349203 [details] [review]
vte, work in progress 8

Changes:
- Tooltip moved to gnome-terminal (requires updated g-t patch)
- Some cleanups, still plenty to go

Goes on top of two other patches, just like wip7 did.

Comment 111 Egmont Koblinger 2017-04-03 22:00:03 UTC

Created attachment 349204 [details] [review]
gnome-terminal, work in progress 8

Changes:
- Tooltip handling moved here from vte
- Various cleanups

Comment 112 Egmont Koblinger 2017-04-06 23:58:41 UTC

Review of attachment 348449 [details] [review]:

"osc8 placeholder" submitted to master & vte-0-48

Comment 113 Egmont Koblinger 2017-04-06 23:59:23 UTC

Review of attachment 348449 [details] [review]:

"osc8 placeholder" submitted to master & vte-0-48

Comment 114 Egmont Koblinger 2017-04-07 00:01:20 UTC

Created attachment 349422 [details] [review]
vte, work in progress 9

Changes:
- Bounding box passed as a signal parameter
- Optimized updates (only repaint what's necessary)
- Demo test file included in the patch
- Cleanups

Comment 115 Egmont Koblinger 2017-04-07 00:01:58 UTC

Created attachment 349423 [details] [review]
gnome-terminal, work in progress 9

Minor cleanups.

Comment 116 Egmont Koblinger 2017-04-07 00:16:37 UTC

Don't use wip9, it crashes quickly.

Comment 117 Egmont Koblinger 2017-04-07 08:09:10 UTC

Created attachment 349430 [details] [review]
vte, work in progress 9a

crash fixed

Comment 118 Egmont Koblinger 2017-04-07 11:41:10 UTC

Just for fun, updated (for the new escape sequence syntax, plus osc8ls properly uri-escapes now):

list files in the current directory, all formatted as hyperlink:

osc8ls() {
  for i in *; do
    echo -ne "\e]8;;"
    __vte_urlencode "file://${PWD%/}/$i"
    echo -e "\a$i\e]8;;\a"
  done
}

bash prompt with the current directory as a hyperlink:

PS1='\u@\h:\[\e]8;;file://$(__vte_urlencode "$PWD")\a\]\w\[\e]8;;\a\] \$ '

Comment 119 Egmont Koblinger 2017-04-08 14:48:55 UTC

Created attachment 349524 [details] [review]
Show file preview in tooltip (demo)

Demo: Show a preview of some popular file formats (pictures, pdf...) of file:/// targets in the tooltip.

Goes on top of the gnome-terminal wip9 patch.

Quick summary on how to try it:

1. Vte: apply the two prerequisite patches (see comment 109) and then wip9a. Compile, install.

2. Gnome-terminal: apply wip9 and then on top of that this new patch. Install gnome-desktop3-devel / libgnome-desktop-3-dev / whatever it's called in your distro. Re-run autogen.sh, compile, install.

3. Define osc8ls as per the previous comment, cd /usr/share/backgrounds (or wherever), run osc8ls, mouseover, enjoy! :)

Comment 120 Egmont Koblinger 2017-04-08 19:23:45 UTC

@all

What to do with URIs like file://some-other-host/path/to/file.txt ?

Firefox visually strips off that "some-other-host" and shows the local file. Chromium leaves it there but shows the local file. Gnome-terminal, when ctrl+clicking on such a regex or hyperlink, also happily opens the local file (via gtk_show_uri()).

In most of the software, incl. GNOME in general, it's usually fine not to care about the hostname component. It's hardly ever present and is of no practical use, since these systems don't have a concept analogous to ssh.

In terminal emulators, however, it's really bad and misleading if after logging in to a remote computer with ssh/telnet, output of "osc8ls" or similar commands that print such remote filenames are interpreted by the terminal emulator as local ones.

Here I believe the terminal emulator MUST reject opening their local counterpart, of showing a preview thereof.

Freedesktop file-uri-spec 1.0 [1] says:

When generating a file: uri the hostname part, if nonempty, should be
whatever is returned from gethostname(). This means that the name is
canonical for all users on the same machine, so that you can easily
see if the referenced file is on the current machine. Note that
"localhost" or an empty hostname needs to be handled specially, always
meaning the host the uri is being interpreted on.

RFC 8089 «The "file" URI Scheme» [2] says:

The "host" is the fully qualified domain name of the system on which
the file is accessible. This allows a client on another system to
know that it cannot access the file system, or perhaps that it needs
to use some other local mechanism to access the file.

As a special case, the "file-auth" rule can match the string
"localhost" that is interpreted as "the machine from which the URI is
being interpreted," exactly as if no authority were present. [...]

They slightly disagree, since gethostname() returns the short hostname on many systems. For practical reasons, it should not be expected from utilities that print hyperlinks to go into the business of hostname resolution, so I find freedesktop's approach to make more sense. I'd personally say utilities should preferably to for gethostname(), but could also use the env var $HOSTNAME if that's significantly more convenient (or cheaper, e.g. in shell scripts) for them.

Terminal emulators might go a bit loose and might, if they want to, accept variations (e.g. both short and fqdn) of the current hosts's hostname. But I think it's fine if they go for gethostname() only. According to both specs, the empty string or the literal string "localhost" should always be treated as matching.

The point is: in the presence of any other hostname, I believe the terminal emulator should really-really-really refuse to open the local counterpart of the file. Furthermore, all utilities should be strongly advised to fill out the hostname with the actual real value.

What do you guys think?

For g-t, since gtk_show_uri() doesn't do that, and neither does the current demo previewing code, we'd need to manually add that check. It's really not hard to do, although in the mean time probably we should file bugs against Gtk+, it should take care of this for us.

(And while we're at it, we should address the closely related bug 709055 as well.)

[1] https://www.freedesktop.org/wiki/Specifications/file-uri-spec/
[2] https://tools.ietf.org/html/rfc8089

Comment 121 George Nachman 2017-04-08 21:26:04 UTC

In commit 6b1bf27, iTerm2 prompts for a username and then downloads the file with scp when you open a file: url with a non-localhost non-empty host part.

Comment 122 Egmont Koblinger 2017-04-10 21:40:55 UTC

Created attachment 349625 [details] [review]
gnome-terminal, work in progress 10

Changes:
- Reject file://some-other-host/... (even for regex matches; I'm planning to submit this part as a preceding separate commit).
- Don't show both the unescaped and original escaped URI, it was so lame (user-unfriendly) especially with accented letters. Show the unescaped only.

Comment 123 Egmont Koblinger 2017-04-10 21:43:57 UTC

Created attachment 349626 [details] [review]
Show file preview in tooltip (demo), work in progress 10

Changes:
- Updated to go on top of gnome-terminal patch wip10
- Doesn't show preview for file://some-other-host/blahblah
- Doesn't regenerate the tooltip on every pixel movement of the mouse

Still generates the thumbnails synchronously, and as such, I intent this as a demo only, not (yet) for mainstream inclusion.

Comment 124 Egmont Koblinger 2017-04-10 21:50:10 UTC

Created attachment 349629 [details]
Screenshot of file preview in tooltip

Screenshots are always cool, so here's one :)

Comment 125 Egmont Koblinger 2017-04-11 06:34:32 UTC

The little tools updated to include the hostname:

list files in the current directory, all formatted as hyperlink:

osc8ls() {
  for i in *; do
    echo -ne "\e]8;;"
    __vte_urlencode "file://$HOSTNAME${PWD%/}/$i"
    echo -e "\a$i\e]8;;\a"
  done
}

bash prompt with the current directory as a hyperlink:

PS1='\u@\h:\[\e]8;;file://$(__vte_urlencode "$HOSTNAME$PWD")\a\]\w\[\e]8;;\a\] \$ '

Comment 126 Egmont Koblinger 2017-04-11 23:02:59 UTC

Created attachment 349698 [details] [review]
ls --hyperlink (coreutils 8.27) draft patch

This draft patch implements a new
  ls --hyperlink=always/auto/never
cmdline option in coreutils-8.27.

Most people/distros alias ls to 'ls --color=auto', a '--hyperlink=auto' can be added there as well.

The patch is clearly in draft state. I do not intend to clean it up and make it ready for upstream inclusion, it'll be their task if they like the idea.

Comment 127 Egmont Koblinger 2017-04-12 23:57:36 UTC

Created attachment 349753 [details] [review]
less -R (487) draft patch

Here's a demo patch in extremely draft state against less-487.

The patch is a terrible hack, the only thing I can guarantee is that it has survived 5 minutes of testing for me. Use at your own risk :P

As with "ls", I probably won't work on this any further, I'll leave it up to the maintainer to implement it properly (or Josh, if you feel like...).

Right now "less" considers the ESC byte itself as an ANSI opener (without even requiring the following '['), now this can conflict with both the opening ESC ] 8 and the closing ESC \ of the hyperlink escape sequence. To make it even worse, for each char it walks backwards trying to figure out if it's in the middle of an escape sequence which is not only a nightmare to implement (probably not even possible withouth slight bugs), but also results in an overall O(n^2) algorithm for no sane reason. It needs a thorough rewrite of this part, I'm not going to do this.

As mentioned on less's homepage with bug ref number 294 marked as "Probably will not fix", less resets the color at each newline of the input file. Although this results in a different look than cat'ing the file, it's a quite reasonable behavior for a file viewer. Carrying the colors across newlines would potentially make it magnitudes slower. E.g. when you jump to the end of the file, it'd still need to scan and parse the entire file to keep track of the colors. So, obviously, with this patch now the same goes for hyperlinks as well, which luckily simplifies our lives a bit.

On Unixes, less doesn't know/care anything at all about the meaning of the numbers inside the ANSI escape sequences. If the beginning of a line is not visible (either scrolled out at the top in folding mode, or scrolled out at the left in chopping mode), it just emits all the ANSI escape sequences from that undisplayed part at the beginning of the line, not optimizing anything at all. ANSI colors and attributes can add up to each other, hyperlinks cannot, they wipe out the previous value. Accordingly, the code should be optimized (it is not currently) to only print the last hyperlink before starting to print the contents. (There's an MSDOS code which does care about the actual colors and attributes. Needless to say, I didn't touch that.)

less is inconsistent in folding mode whether it prints wrapped lines in a single run (good for copy-pasting and for hyperlink hover underlining) or in separate consecutive steps (bad for both of these). The hyperlink hover underlining issue could be solved by adding/mangling the "id" attribute, but instead, probably it's better to just make sure to print entire lines of the input file (or "paragraphs" as we often call it in VTE) in a single run even if they span multiple physical lines. This would fix copy-pasting as well as hyperlink underlining without having to mangle the id.

Comment 128 Egmont Koblinger 2017-04-13 09:24:30 UTC

One more not-so-important thing missing from the VTE patch: HTML copy-pasting.

It'll probably need a variant of thawing that either places all the hyperlinks of a row in the hyperlink pool (rather than just the one under the cursor / click location), or allocates memory elsewhere for them (memory fragmentation is not a concern here).

The code around cellattr_to_html() would probably need to be refactored as well. For attrs like bold, italic, colors etc. it's okay to close and reopen their tags at any time, it has no side effects. The hyperlink, however, shouldn't be closed and reopened when some other attribute changes as that would result in undesired behavior (multiple separate links with the same target) where you paste it.

As such, the "<a>" tag has to be the outmost one, and kept open when other attributes change.

Instead of the current cellattr_to_html design, I'm thinking about a method that takes two VteCellAttr* parameters and is responsible for emitting the HTML tags denoting the difference, i.e. first close the ones that appear in attr1 but not in attr2, and then (in reverse order) open the ones that newly appear in attr2.

Comment 129 Behdad Esfahbod 2017-04-14 13:52:01 UTC

(In reply to Egmont Koblinger from comment #57)
> Created attachment 348448 [details] [review] [review]
> VteCell refactoring, no more union with int
> 
> This tiny bit of refactoring gets rid of the 8-byte limit for VteCell, and
> does make sense on its own without the hyperlink feature. As such, I'm
> planning to submit it separately from the big feature implementation.

Please commit this already!!!

Comment 130 Behdad Esfahbod 2017-04-14 13:55:02 UTC

(In reply to Egmont Koblinger from comment #60)
> Created attachment 348627 [details] [review] [review]
> vte, work in progress 1
> 
> Misusing bugzilla for backup purposes :D
> 
> This is a heavily work in progress version, in the direction of the final
> design I have in mind. Onscreen links are kept in a pool, and are written to
> attr_stream as agreed. Reading back from the stream is missing, if you
> scroll back the links point to an unknown target. Plus, no hover underlining
> and no throttling yet.

Reading the design up to this commit, my only concern is, I'd be rather pissed if I scroll way back and try to open a link and it's gone... So, as long as your throttling does not limit total number of links in an unlimited-scrollback scenario, sounds good to me.  But limiting total to ~4k might go against that.  I don't have better suggestions right now.  Still thinking about it.

Comment 131 Egmont Koblinger 2017-04-14 13:58:14 UTC

Yup it's 4k for the ring (circular buffer, not yet in stream) at a time. Unlimited for the scrollback stream.

Comment 132 Egmont Koblinger 2017-04-14 14:01:15 UTC

Actually I could get rid of this 4k limit. There's no point in bit-saving anymore. Instead of 12 bits, I could allocate an entire 4 bytes for the hyperlink index. Then it'd still be limited by the number of onscreen cells, which is not _that_ much more (e.g. 16k-ish with my screen and preferred font size).

Comment 133 Behdad Esfahbod 2017-04-14 14:02:42 UTC

(In reply to Egmont Koblinger from comment #64)

> Sometimes we need to "thaw" a row with "do_truncate", that is, place back
> from the stream to the ring. This can happen when the window is enlarged.
> It's a PITA anyways (e.g. caused tons of troubles with the encryption), and
> is a PITA here as well (this is where the wip1 patch crashes).

What if we make ring hold 256 lines exactly, and never thaw.  vte can limit height to 256 lines as well if you prefer to keep things simple.  Substitute 1024 instead of 256 if you wish.  I like that actually, simplifies things a lot.

Comment 134 Behdad Esfahbod 2017-04-14 14:05:31 UTC

(In reply to Egmont Koblinger from comment #65)

> In these cases hover underlining would again only underline parts of these
> URLs, similarly to the current behavior. Ctrl+click would open them
> correctly, but the underlining would totally look silly and go against the
> overall intent of this feature.

Or just don't hover-underline.  If there's already dotted-underline, then hovering should only show tooltip, no need to full underline if that makes it easier.  But if you are solving the hover problem for regular printed URLs as well, that's a different issue.

Comment 135 Egmont Koblinger 2017-04-14 21:24:55 UTC

(In reply to Behdad Esfahbod from comment #133)

> What if we make ring hold 256 lines exactly, and never thaw.  vte can limit
> height to 256 lines as well if you prefer to keep things simple.  Substitute
> 1024 instead of 256 if you wish.  I like that actually, simplifies things a
> lot.

Let's keep this in mind for a possible future refactoring. It would indeed simplify things a lot, we could remove the thawing/do_truncate code both from the ring and from the stream encryption.

Keep in mind, however, that it'd significantly increase the memory footprint of a VTE obbject.

Thawing ended up not being the pain point of the hyperlink feature, once I added the link's length after the link as well so that we can walk backwards. As such, I'd not do this refactoring now.

(In reply to Behdad Esfahbod from comment #134)

> Or just don't hover-underline.  If there's already dotted-underline, then
> hovering should only show tooltip, no need to full underline if that makes
> it easier.  But if you are solving the hover problem for regular printed
> URLs as well, that's a different issue.

This has been later solved by introducing the "id" link parameter. Hover underlining I believe is a must for a decent user experience, and is implemented now.

(There's a confusion of terminology here. Initially I used the word "id" for the position in the hyperlink pool in the ring for a particular hyperlink, an implementation detail. As we later introduced the "id" parameter to the OSC 8 sequence, I consistently renamed the other one to "idx" [index] to avoid confusion. Preceding bugzilla comments still use the old terminology.)

Comment 136 Behdad Esfahbod 2017-04-14 21:32:43 UTC

Now that you refreshed your vte skills, time to make "highlight all search results" happen? ;) Oh, and fix multiple search results in one line. And searching across lines :D.

Comment 137 Behdad Esfahbod 2017-04-14 21:33:30 UTC

Copying what I wrote to Egmont in email, just for the record:

Also, I think it's fine combining attr and text streams.  In that case, we can just write ANSI sequences instead of attrs, and just reinterpret that to display... The only reason I separated them was because I wanted to have text stream in plain UTF-8 such that searching, saving to file, etc are faster.  But that's certainly optimizing for the wrong thing.  And our current search implementation does not use this property anyway!

Comment 138 Behdad Esfahbod 2017-04-14 21:40:16 UTC

Review of attachment 349430 [details] [review]:

Overall, not too bad!  But yeah, wish we had a simpler ring :).

::: src/vterowdata.h
@@ +171,3 @@
+_attrcpy (void *dst, void *src)
+{
+	memcpy(dst, src, 8);

Ummmmmm, Please, sizeof(VteCellAttr) or something, not magic number.

Comment 139 Egmont Koblinger 2017-04-14 21:44:03 UTC

(In reply to Behdad Esfahbod from comment #136)

> Now that you refreshed your vte skills, time to [...]

I'll really have neither time nor motivation for these in the foreseeable future, sorry.

> Also, I think it's fine combining attr and text streams. [...]

Copying my email answer, just for the record:

One problem with a single text + ansi attr stream is you can't random access... I mean you'd get the text, but you'd have no clue about the current attributes. I can't really see how it could be solved there.

What I'd rather be thinking about, since we have compression anyways, is to just place VteCells in the stream. Hyperlinks would still need special treatment though, and probably combining accents too. And we'd still need to keep row_stream. Plus, the uncompressed data we'd store would grow by a lot, and hence the compressed one would presumably grow too. Sounds to me that it might be slightly a better/simpler design to begin with, but unlikely to be worth it to refactor now.

Comment 140 Egmont Koblinger 2017-04-14 21:53:37 UTC

(In reply to Behdad Esfahbod from comment #138)

> +_attrcpy (void *dst, void *src)
> +{
> +	memcpy(dst, src, 8);
> 
> Ummmmmm, Please, sizeof(VteCellAttr) or something, not magic number.

I have comments at VteCellAttr and VteStreamCellAttr, I'll add it here too. This "8" is the common prefix of the two structures (both are bigger than this). Factoring the common bits out to another struct would require adding an additional level throughout the entire codebase when accessing any attribute. Does C/C++ have mixins in addition to structs and unions? I don't think so, but I guess that's what we'd need here. I could do an ugly fragile magic with offsetof, relying on the name of the field following the common ones, but it sounds worse than an "8" with a comment. Any other nicer solution that I'm missing?

Comment 141 Behdad Esfahbod 2017-04-14 21:55:35 UTC

(In reply to Egmont Koblinger from comment #140)
> (In reply to Behdad Esfahbod from comment #138)
> 
> > +_attrcpy (void *dst, void *src)
> > +{
> > +	memcpy(dst, src, 8);
> > 
> > Ummmmmm, Please, sizeof(VteCellAttr) or something, not magic number.
> 
> I have comments at VteCellAttr and VteStreamCellAttr, I'll add it here too.
> This "8" is the common prefix of the two structures (both are bigger than
> this). Factoring the common bits out to another struct would require adding
> an additional level throughout the entire codebase when accessing any
> attribute. Does C/C++ have mixins in addition to structs and unions?

Plain inheritance?!

> I don't
> think so, but I guess that's what we'd need here. I could do an ugly fragile
> magic with offsetof, relying on the name of the field following the common
> ones, but it sounds worse than an "8" with a comment. Any other nicer
> solution that I'm missing?

Comment 142 Egmont Koblinger 2017-04-14 22:02:15 UTC

Review of attachment 348448 [details] [review]:

.

Comment 143 Egmont Koblinger 2017-04-14 22:05:41 UTC

(In reply to Behdad Esfahbod from comment #129)

> Please commit this already!!!

Done.

(In reply to Behdad Esfahbod from comment #141)

> Plain inheritance?!

Haha, indeed. Looks like I'm an old-fashioned C guy :)

Gotta check if we're C++ enough that it's allowed there, and if it copes with exactly one of the derived structs being attribute packed (even if at a future point the common bits grow to some silly size like maybe 9 bytes)...

Comment 144 Josh Triplett 2017-04-14 22:27:35 UTC

If inheritance doesn't work, then at a minimum, a static const name for the common prefix size would help.

Comment 145 Egmont Koblinger 2017-04-15 13:03:23 UTC

Created attachment 349890 [details] [review]
less -R (487) draft patch v2

Fixes search across hyperlink boundaries. All comments from comment 127 still apply.

Comment 146 Christian Persch 2017-04-16 19:21:12 UTC

(In reply to Egmont Koblinger from comment #103)
> In VTE I still don't know if the parameters are converted from the session's
> encoding to UTF-8 or not, and whether invalid UTF-8 is rejected/mangled
> inside the sequence if the session is in UTF-8 or not. It would take like a
> minute to try it out, but I don't want to do that because I just don't care
> :-)

In ::process_incoming(), the data received from the pty is converted to UTF-32 before being fed to the matcher, with invalid input being replaced by U+FFFD REPLACEMENT CHARACTER.

Comment 147 Egmont Koblinger 2017-04-16 23:05:53 UTC

(In reply to Behdad Esfahbod from comment #141)

> Plain inheritance?!

I've given up on this. I keep getting warnings along the lines of "ignoring packed attribute because of unpacked non-POD field", and error about the brace-initialized brace_cell: "could not convert ‘{0, [...]}’ from ‘<brace-enclosed initializer list>’ to ‘VteCellAttr’.

The latter one seems to be resolved by the C++17, at least according to http://stackoverflow.com/questions/16983539/why-can-i-not-brace-initialize-a-struct-derived-from-another-struct.

It's just too complicated of a maze for me to spend hours figuring out if a nicer solution exists. But if you show me the code that works, I'll add it :)

In the mean time, could you please explain the comment "Ordered by most commonly changed attributes, to optimize the compact representation" (commit c00499d) to me? Looks like we had a totally different "storage" then, and I guess this comment is no longer relevant, right?

(In reply to Josh Triplett from comment #144)

> at a minimum, a static const name for the common prefix size would help.

Yup, good point, I've added a #define, plus compile-time assertions on the "offsetof" the next attribute (which required removing bitpacking from them).

Comment 148 Egmont Koblinger 2017-04-19 20:19:51 UTC

Created attachment 350092 [details] [review]
vte, work in progress 11

Changes:
- Remove the 4k limit on number of hyperlinks in the ring at a time
- Remove bitpacking of hyperlink fields (at least for now)
- Refactor that magic "8" constant as discussed
- Added "offsetof" static assertions too
- Added local PNGs to the test file

Let's see what we can do with attribute packed (bug 781499) on top of this one...

Comment 149 Egmont Koblinger 2017-04-24 00:07:08 UTC

Created attachment 350276 [details] [review]
vte, work in progress 12

Changes:
- main API method renamed to vte_terminal_hyperlink_check_event()
- new API methods to allow/disallow the hyperlink feature, defaulting to disabled
- use boxed type in the signal

Hopefully API-wise this is the final version.

Comment 150 Egmont Koblinger 2017-04-24 00:07:50 UTC

Created attachment 350277 [details] [review]
gnome-terminal, work in progress 12

Update to the new API

Comment 151 Egmont Koblinger 2017-04-24 16:32:07 UTC

Created attachment 350312 [details] [review]
vte, work in progress 13

Changes:
- store GString* rather than GString in the GArray
- bitmap at GCing
- regular quote marks in debug msgs
- hyperlink_idx_t
- misc cleanups

Comment 152 Egmont Koblinger 2017-04-24 23:04:52 UTC

Created attachment 350345 [details] [review]
vte, work in progress 14

Changes:
- GC after processing some incoming bytes
- misc cleanups

Comment 153 Egmont Koblinger 2017-04-24 23:57:44 UTC

Comment on attachment 350345 [details] [review]
vte, work in progress 14

Committed to master.

Comment 154 Egmont Koblinger 2017-04-24 23:58:05 UTC

Comment on attachment 350277 [details] [review]
gnome-terminal, work in progress 12

Committed to master.

Comment 155 Egmont Koblinger 2017-04-25 00:04:19 UTC

Since this thread became enormous, I've created a separate page summarizing this feature:

https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9f3cb5feda

Comment 156 Egmont Koblinger 2017-10-20 11:34:19 UTC

Created attachment 361932 [details] [review]
less -R (520) draft patch

Comment 157 Egmont Koblinger 2017-10-20 11:35:23 UTC

Comment on attachment 349698 [details] [review]
ls --hyperlink (coreutils 8.27) draft patch

"ls --hyperlink" patch is obsolete now, a proper implementation is available in official coreutils-8.28.

Comment 158 Christian Persch 2017-12-31 12:08:15 UTC

Comment on attachment 349890 [details] [review]
less -R (487) draft patch v2

(foreign patch, marking as reviewed to get it off the unreviewed patches list)

Comment 159 Christian Persch 2017-12-31 12:10:06 UTC

Comment on attachment 349626 [details] [review]
Show file preview in tooltip (demo), work in progress 10

>+  priv->tooltip_text = NULL;  // FIXME is this needed?
>+  priv->tooltip_icon = NULL;  // FIXME is this needed?

It's not; the Private struct is 0-initialised.

>+    // FIXME This step should be asynchronous !!!

Yes. Marking needs-work because of this.

Comment 160 Christian Persch 2017-12-31 12:14:47 UTC

Comment on attachment 361932 [details] [review]
less -R (520) draft patch

(foreign patch, marking as reviewed to get it off the unreviewed patches list)

Comment 161 GNOME Infrastructure Team 2021-06-10 21:15:03 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/gnome-terminal/-/issues/7736.