GNOME Bugzilla – Bug 730632
implement UAX29-like word boundary detection for double-click select-by-word
Last modified: 2021-06-10 14:52:45 UTC
Double clicking selects too much, making it quite useless, and is no longer configurable. When double clicking, all punctuation characters count as word chars. This makes it very hard to select e.g. a method name from source code, since you'll end up having much more in the buffer than you wanted. E.g. consider this line of source code: vte_same_class(VteTerminal *terminal, glong acol, glong arow, I can't select the word "vte_same_class" or "terminal" with double click, since I'll end up having "vte_same_class(VteTerminal" or "*terminal," in the buffer instead. This makes double click quite useless, I have to select with a single click, precisely marking both the start and the end. It would be much easier the other way around. E.g. if selecting stopped at parentheses but I wanted to select "vte_same_class(VteTerminal", I could easily do it by double-clicking anywhere under "vte_same_class" and releasing anywhere under "VteTerminal". So I believe stopping at most punctuation characters would heavily improve usability. This is also the way most applications behave (like browsers, gedit etc.) They stop at most punctuation chars. (I'm not sure if there's a Unicode standard for this, or if there's a Gtk+ method we could use.) (Or, the word-chars feature could be resurrected, probably with two sets of characters, one to add to word-chars, one to remove from there.)
I thought we fixed this years ago... Yeah, punctuation chars should never form a "word".
> Yeah, punctuation chars should never form a "word". Comment 0 clearly expects that underscore should be included in the word-chars, and underscore (U+005F) is punctuation too (category Pc). And including other punctuation has been specifically requested, too, in bug 700217 (U+00B7 MIDDLE DOT is category Po), and makes good sense, IMO. The default word-chars in g-t used to be '-A-Za-z0-9,./?%&#:_=+@~'; there's plenty of punctuation in there. The difference between this and what we do now is we now also include !"$'()*;<>^`[\]{|} So it's only the parenthesis that's the problem here? We could make word-chars include all punctuation with category != Ps, Pe, (and maybe Pi, Pf) in that case. That would leave only !"$'*;<>^`\| extra. And if those still are problematic, we could exclude them specifically by codepoint, but not by category.
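A rough sketch of the category-based rule proposed above, using Python's standard `unicodedata` module rather than vte's actual code. The function name `is_word_char` is hypothetical. Treating symbols (categories S*) as word chars is an assumption, inferred from the fact that the leftover characters listed above include `$`, `<`, `>`, `^`, `` ` `` and `|`, which are symbols rather than punctuation.

```python
import unicodedata

# Bracket-like and quote-like punctuation categories to exclude:
# Ps (open), Pe (close), Pi (initial quote), Pf (final quote).
EXCLUDED_CATEGORIES = {"Ps", "Pe", "Pi", "Pf"}


def is_word_char(ch):
    """Hypothetical classifier: letters, numbers, punctuation and symbols
    count as word chars, except the excluded bracket/quote categories."""
    category = unicodedata.category(ch)
    if category in EXCLUDED_CATEGORIES:
        return False
    # First letter of the category: L = letter, N = number,
    # P = punctuation, S = symbol (assumption, see the note above).
    return category[0] in "LNPS"


# Which of the ASCII characters discussed above would remain word chars?
candidates = '!"$\'()*;<>^`[\\]{|}'
remaining = "".join(ch for ch in candidates if is_word_char(ch))
print(remaining)
```

Under these assumptions the parentheses, brackets and braces drop out, while underscore (Pc) and MIDDLE DOT (Po) stay word chars.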
For my particular use case, underscore would preferably be included, but other people might have different preferences. Erring on the side that something is included when it shouldn't be is a very bad user experience. Erring on the other side, when something is not a word char while you'd expect it to be is not that bad, you just have to extend the selection. The Catalan middle dot request makes sense, but then it should be duplicated for all other apps (I've quickly checked gedit and firefox, it's not a word character there). Is the simple apostrophe a word char? I'd expect it to be when it's part of an English word, but not when it's used to 'quote' something. There's just no good solution. I don't think we should reinvent the wheel. Either we should copy a good reference behavior (does Gtk have a textarea-like widget where it implements something? We could copy that - I don't mind losing underscore as a word char if that's the price for simplicity), or we should make it configurable. (Really, why was the feature dropped? :( )
So we probably should get the word boundaries using the algo from UAX29 [http://www.unicode.org/reports/tr29/#Word_Boundaries]. The setting was removed because it's hard to understand what it does, and what format the setting exactly takes, and most importantly because we shouldn't make something configurable instead of doing the right thing by default. Behdad: do you know if there are any implementations of the word boundary algo from UAX29 in gnome? Ideally as API, but copypasting would be ok as well. Our impl could be simplified a bit by omitting the RTL (hebrew) stuff since we don't do RTL.
(In reply to comment #4) > So we probably should get the word boundaries using the algo from UAX29 > [http://www.unicode.org/reports/tr29/#Word_Boundaries]. > > The setting was removed because it's hard to understand what it does Really? I think it's obvious. If it's not obvious for someone, they're free not to touch that setting ;) > what format the setting exactly takes, I think the only special character was '-' to denote the interval. Now that alphanumeric characters are word chars by default, I think it's safe to remove this feature and leave it simply a set of explicitly listed non-alphanumeric characters. > and most importantly because we > shouldn't make something configurable instead of doing the right thing by > default. I generally totally agree with this approach! The main question is, however, if there is one single "right thing" here. I don't think so. Even UAX29 says "implementations may override (tailor) the results to meet the requirements of different environments or particular languages". Not sure if it refers to human languages or programming languages, probably both. Depending on the task you're doing in the terminal, different doubleclicking boundaries might be preferable. E.g. if you're programming, you probably want '_' to be a word-char. If you're doing tons of dpkg packaging, you probably don't want this, since '_' is a field separator there in package filenames. It's a complicated game trying to get the best out of such contradicting requirements (flexibility, code simplicity, UI simplicity etc.), but here I'd really prefer to see the word-chars setting being restored (maybe with different semantics if required). Having their personal preference set up might boost someone's productivity, and a different set might cause constant frustration.
Something like UAX29 is in Pango's break.c. But you don't want to start from that. My suggestion is, do whatever you can based on the GeneralCategory and special-casing of ASCII punctuation.
I think anything bracket-like should be excluded, not just parenthesis -- I'm hitting this with URLs surrounded by <> (And I just want to select them, not launch them). And although it may be a special case, I'm using a text-mode password manager which expects a password surrounded by │ line-drawing characters to be double-clickable to select. I also kind of agree that it would be nice to restore the word-chars setting because the specific different characters might depend on different fields of use.
(In reply to comment #2) > So it's only the parenthesis that's the problem here? Not just parenthesis. I'd personally prefer to use double-click in reasonably formatted C source code to copy-paste identifiers. For that, *.,->;" are also characters that should be excluded. > We could make word-chars > include all punctuation with category != Ps, Pe, (and maybe Pi, Pf) in that > case. That would leave only !"$'*;<>^`\| extra. And the comma as well, maybe a few more. > And if those still are > problematic, we could exclude them specifically by codepoint, but not by > category. Define problematic? What is problematic for one user is not for another. This is a place where people want different behaviors. ChPe, please allow me to bring back the word-chars feature. Otherwise I'm afraid it'll be a constant source of user dissatisfaction and conflicting feature requests.
GtkTextEntry and GtkTextView / gedit have managed to do without a setting for this, so I really don't see the need for making this configurable, instead of making it do the right thing™. I would be open to just hardcoding some exceptions for ASCII chars to get the old default behaviour (comment 2) until I can get to fix this properly. Also relevant info is in bug 100487 and in zvt commits https://git.gnome.org/browse/archive/libzvt/commit/?id=2318d7122d6e060833536c92996d2299dbe7377c and https://git.gnome.org/browse/archive/libzvt/commit/?id=2edb67174dac0da9798a82e2f51befde87e1d551 .
(In reply to comment #9) > GtkTextEntry and GtkTextView / gedit have managed to do without a setting for > this, The whole refactoring started from addressing the middle dot issue. GtkTextEntry / GtkTextView / gedit get the middle dot wrong, so we probably shouldn't point to them as a good reference. > so I really don't see the need for making this configurable, instead of > making it do the right thing™. There is no single right thing™ here, even the UAX29 standard clearly says so. > I would be open to just hardcoding some > exceptions for ASCII chars to get the old default behaviour (comment 2) until I > can get to fix this properly. In my opinion, the only way to fix this properly and keep our users satisfied is to let them tailor wordchars to their personal taste. I wouldn't feel good submitting my favorite setting into mainstream vte, nor do I wish to carry this one single patch that I always have to apply to my vte. A reasonable built-in default, and a way to override in dconf (without a UI) would be acceptable for me. (It would rhyme with cursor blinking and probably a few others that are also only settable from dconf.) I understand the generic approach of making things "just work" and reducing the number of options, but it shouldn't be done beyond all boundaries; and we shouldn't force the terminal to always be like all other apps, it'll never be like them. What's next? Remove encodings because anything other than UTF-8 is broken? Remove bs/del compatibility settings? Remove colors and just force to use the ones from the Gtk+ theme? Remove the block/ibeam/underscore cursor setting? Double click behavior is something users prefer to tailor to their personal taste, rather than something with one single global worldwide good solution. Couldn't we please just accept this? :/
Created attachment 281786 [details] [review] Quick fix I'm about to commit this quick fix. Without this, double clicking is pretty much unusable, it almost always selects too much and then you have to strip off those extra characters after pasting – you're better off if you use single click right away to highlight (that is, avoid this feature altogether). With this hotfix, double click behaves quite reasonably, and pretty close to other Gtk apps. Unfortunately UAX29 won't be easy to implement. I'm not planning to work on it. Rules that look behind or ahead by one character suck and would require quite some refactoring. Rule WB4 sucks big time and would require very complicated code.
lgtm.
What about WB11, WB12? "Do not break within sequences, such as “3.2” or “3,456.789”." And make sure that e.g. 192.36.148.17 would not break.
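To illustrate why WB11/WB12 need look-behind and look-ahead, here is a minimal sketch of the idea only. It is not UAX29 and not vte code: the separator set `".,"` and the function names are assumptions made for this example. A separator is kept inside the word only when it has a digit on both sides.

```python
NUMERIC_SEPARATORS = ".,"  # assumed subset; UAX29 defines larger classes


def is_joined(text, i):
    """True if text[i] stays inside a word under this simplified rule."""
    ch = text[i]
    if ch.isalnum():
        return True
    if ch in NUMERIC_SEPARATORS:
        # Look one character behind and one ahead: keep the separator
        # only when it sits between two digits (as in "3.2").
        return (0 < i < len(text) - 1
                and text[i - 1].isdigit()
                and text[i + 1].isdigit())
    return False


def select_numeric_aware(text, pos):
    """Expand a selection from index pos in both directions."""
    if not is_joined(text, pos):
        return ""
    start = pos
    while start > 0 and is_joined(text, start - 1):
        start -= 1
    end = pos + 1
    while end < len(text) and is_joined(text, end):
        end += 1
    return text[start:end]


print(select_numeric_aware("ping 192.36.148.17 now", 7))
print(select_numeric_aware("cost 3,456.789.", 6))
print(select_numeric_aware("foo.bar", 1))
```

The trailing full stop after "3,456.789" and the dot in "foo.bar" are not selected, because neither is flanked by digits on both sides.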
Patches welcome :) Seriously: The tough part is to implement looking at the characters before/after the ones where you'd break. Actually, the whole current concept of character "classes" needs to be completely reworked to something totally different. Once you're done with WB11/12, you're pretty much done with the rest too (except for WB4).
This is a mildly obscure use case, but the password manager YAPET uses │ (a unicode vertical bar) for the edge of the password, making it easy to select via double-clicking -- at least, with older VTE. The new patch still seems to consider that part of the word, unfortunately.
*** Bug 738815 has been marked as a duplicate of this bug. ***
I second the comment #10 and strongly vote for early reverting of the setting removal (while working on reasonable default). Ability to override the default via dconf can be sufficient as typical terminal (contrary to the vast majority of desktop apps) user (admin, developer) is not afraid of the amount of options and tweak tools but aims at optimizing his productivity and gets frustrated when a killer feature is dropped.
+1 Please give 'select-by-word characters' us back. It's an essential setting.
+1, configurable select-by-word was essential for me also.
I think Egmont's patch in comment #11 still breaks colon, if that is in PC class.
Are the classes locale-dependent?
I weep for the loss of the select-by-word characters feature of gnome-terminal. Can anyone point me at another terminal emulator that allows me to customize it as I prefer?
select-by-word characters *must* be user-configurable. I don't care if it is in an obscure conf program or in the gnome-terminal preference dialog (as before). What hubris removing this option that was there forever thinking one person knows what is better for the users than the users do. Anyone who is remotely a power user or developer is going to be majorly miffed at this bug. There is no "right" choice of characters. No standard can decide this for anyone. The current choice of characters is useless. I can't even select shell commands apart from my shellprompt, or individual entries from a grep foo *log without getting tons of extra garbage I must strip.
Here's another firm opinion for bringing back the setting: https://bugzilla.redhat.com/show_bug.cgi?id=1165244#c3
Created attachment 296392 [details] [review] vte: bring back word chars I attach patches that bring back word characters support. The patches are highly based on the code that's been removed; however, I've modified/simplified it. Based on user feedback so far, it seems that most people prefer a behavior where the word chars are the alphanumeric characters, plus a small (varying) set of punctuation. So basically bringing back what gnome-terminal used to have (apart from having to specify A-Za-z0-9, but not the accented letters, which was weird anyway) and also what konsole offers right now seems to be the right approach. This approach, even without touching the setting, already has the improvement over current vte/g-t that weird symbols (e.g. Unicode quotation marks in some utilities' error messages; line drawing chars of mc) are not considered word chars. For the unlikely case that a character is alphanumeric, yet the user doesn't want to make it a word char, I designed the feature in a way that the value is now called "exceptions", any character listed there inverts its behavior. Since I'm not planning to change this behavior anymore (word char is: alphanumeric characters XOR what's specified in the settings) and existing characters don't change their alphanumeric property either, I think it's totally safe and won't break for anyone over time. I've removed the special handling of the dash character, intervals can't be specified anymore. I don't think it would make sense, given how these symbols' Unicode codepoints are scattered all around, and just made the code/UX/documentation unnecessarily complicated.
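The "alphanumeric XOR listed as exception" rule described above can be sketched in a few lines. This is an illustration in Python, not the patch itself; the function name and the sample exception sets are hypothetical.

```python
def is_word_char(ch, exceptions):
    """Word char = alphanumeric XOR listed in the exceptions string.

    Listing a punctuation character turns it into a word char; listing an
    alphanumeric character turns it into a non-word char.
    """
    return ch.isalnum() != (ch in exceptions)


sample_exceptions = "-_./"  # hypothetical set, for illustration only

print(is_word_char("a", sample_exceptions))        # alphanumeric, not listed
print(is_word_char("_", sample_exceptions))        # punctuation, listed
print(is_word_char("(", sample_exceptions))        # punctuation, not listed
print(is_word_char("a", sample_exceptions + "a"))  # alphanumeric, listed
```

With this rule, Unicode quotation marks and line drawing characters are not word chars unless someone lists them explicitly.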
Created attachment 296393 [details] [review] g-t: bring back word chars
I was working on a vte patch, too; I named this 'word-stop-chars' but maybe 'word-char-exceptions' is better... I'll take either this or my patch before the next release.
(In reply to comment #27) > I'll take either this or my patch before the next release. I'm glad to hear it :) I've expressed my opinion in private mail, probably here too (not sure) that specifying the "stop chars" is, I believe, the wrong approach. The useful default is to stop most of the time, except for a few well known characters. Let's list the exceptions, not the normal behavior. If you make it a list of punctuation chars which should be word chars, it'll be a list of 10-20 characters; chars that you're familiar with and do care about and are likely easily entered from the keyboard (e.g. remember the Catalan middle-dot issue, I'm sure the Catalan keyboard has that symbol, so it's very easy to enter for those who want it to be part of the word, while likely hard to enter for those who don't). If you make it a list of punctuation chars which should be stop chars, it'll be a neverending hassle of individually listing dozens or even hundreds of chars that you don't care about (e.g. all the line drawing chars, all kinds of Unicode quotes, parentheses etc.). It's quite a hassle to even type or copy-paste these characters there, and I bet you'd have to open the dialog and extend your setting a lot of times before you'd settle with something that works relatively well - until, of course, the next such character appears. So I say let's go - either for (alphanumeric XOR listed-as-exception) which my patch implements - or for (alphanumeric OR listed-as-word-char) which might be easier to explain, but loses the ability to remove an alphanumeric character, which I have no clue about whether anyone would ever want to do.
Created attachment 296404 [details] [review] vte: bring back word chars, v2 Just a git update for convenience, after chpe has committed half of my previous patch ;)
Created attachment 296429 [details] [review] vte: bring back word chars, v3 git merge again :) Seems that the current set of word char exceptions: -,.;/?%&#_=+@~ is a quite good choice. What I'm totally uncertain about is whether it's nicer if gnome-terminal knows about this set and vte defaults to the empty set instead; or if we make it vte's default too.
Created attachment 296430 [details] [review] g-t: bring back word chars, v3 git merge, to have something that works on top of current git head. I couldn't figure out how to g_settings_bind() with "ms" so I changed it back to "s" for now; I'm sure you (@chpe) know the right way to address this.
As commented in https://bugzilla.redhat.com/show_bug.cgi?id=1165244, I'm having word boundary problems in locale "en_US.UTF-8" with gcc error messages like: implicit declaration of function ‘foo’ There are other typographic quote characters that shouldn't be considered word characters. I would appreciate it if the word boundary configuration option could be brought back.
Took me ages to figure out how to set the value for that "ms" type, so for further reference, here it is:
dconf write /org/gnome/terminal/legacy/profiles:/:${Profile_ID}/word-char-exceptions '@ms "-,.;/?%&#_=+@~·"'
and to set to null:
dconf write /org/gnome/terminal/legacy/profiles:/:${Profile_ID}/word-char-exceptions '@ms nothing'
or remove the entry:
dconf reset /org/gnome/terminal/legacy/profiles:/:${Profile_ID}/word-char-exceptions
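For anyone scripting this, a small Python sketch of how the "ms" (maybe-string) value and the `dconf write` argument list could be assembled. The helper names and the profile ID used below are hypothetical placeholders; the key path and the `@ms` syntax are taken from the commands above. Building an argument list instead of a shell string avoids a second layer of shell quoting.

```python
def gvariant_maybe_string(value):
    """Format a Python str or None as GVariant text of type 'ms'."""
    if value is None:
        return "@ms nothing"
    # Escape backslashes first, then double quotes, for the quoted string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'@ms "{escaped}"'


def dconf_write_argv(profile_id, exceptions):
    """Build the argument list for `dconf write` (it is not executed here)."""
    key = ("/org/gnome/terminal/legacy/profiles:/:"
           f"{profile_id}/word-char-exceptions")
    return ["dconf", "write", key, gvariant_maybe_string(exceptions)]


# "PROFILE-ID" is a placeholder, not a real profile identifier.
argv = dconf_write_argv("PROFILE-ID", "-,.;/?%&#_=+@~\u00b7")
print(argv)
print(gvariant_maybe_string(None))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a system where `dconf` is installed.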
Created attachment 299474 [details] [review] g-t: bring back word chars, v4 I figured out how to handle the "ms" format. I'm planning to improve the patch in a few days (see the FIXME inside), but until then, here it is in its current state.
Created attachment 299556 [details] [review] g-t: bring back word chars, v5 Fix the issue with the previous patch. I hope there are no problems with this one :)
*** Bug 746046 has been marked as a duplicate of this bug. ***
Just for the record: vte-0.39.92 brings back the word char support, and changes the default to only contain alphanumeric characters plus a few other exceptions (but no large sets of Unicode punctuation characters). There is no UI in gnome-terminal (yet?), see comment 33 for how to set the value from command line, or comment 35 for a patch to g-t.
Review of attachment 299556 [details] [review]: I quite deliberately used "ms" instead of "s" so that a NULL could be passed through from settings to the API. Copying the default set to g-t and using that in place of NULL goes against that...
But then how would this feature look from the UI's point of view? A checkbox whether to use some undisclosed default or an explicitly entered value, plus the entry field next to that? Doesn't sound user-friendly to me.
Perhaps a comboboxentry with a "Default" entry being translated to NULL ?
- It's way too complicated UI for such a small feature. - It offers no way just to add or remove one single character. You have to figure out somehow the default set (and g-t offers no help), get it into the text entry somehow and then you can modify it. Or you have to begin constructing the set from scratch. In all other terminals where this option is available, it's just one text entry field initially populated with the default. G-t used to have this too. I can't see what was wrong with it or how anything more complicated could be more usable.
Hi, I ran into this issue after updating from gnome 3.14 to 3.16 - the selection behaviour changed with the removal of ':'. This makes it more annoying to select some types of things that appear very often in terminal windows, a few examples: IPv6 addresses: fdf0:abfb:1182:1::51c URLs: https://www.gnome.org/ All of the above said - I actually wouldn't mind making the select by word have *a lot fewer* special characters included in the list. To give an example, the IPv6 address with netmask: fdf0:abfb:1182:1::51c/64 or directory path /some/path/to/a/complicated-file_name The "double-click and drag" multi-word selection mode is really annoying on this - because the / is a "word" character, you can't select the ip address sans the netmask - or only the file_name portion of the file path rather than the whole thing. Compare this to the double-click-and-drag in your web browser (I'm using Firefox), which does allow you to choose whether to include the ipv6 netmask, and to select only a few components of the file path rather than the whole thing.
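The effect described above can be reproduced with a toy model of double-click selection: expand left and right from the clicked position while characters are alphanumeric or in a configured extra set. This is only a sketch; `select_word` and its `extra` parameter are hypothetical and do not mirror vte's internals.

```python
def select_word(text, pos, extra):
    """Return the run of word chars around index pos.

    A character is a word char if it is alphanumeric or listed in `extra`.
    """
    def is_word(ch):
        return ch.isalnum() or ch in extra

    if not is_word(text[pos]):
        return ""
    start, end = pos, pos + 1
    while start > 0 and is_word(text[start - 1]):
        start -= 1
    while end < len(text) and is_word(text[end]):
        end += 1
    return text[start:end]


line = "addr fdf0:abfb:1182:1::51c/64 up"
print(select_word(line, 6, ":/"))  # ':' and '/' both word chars
print(select_word(line, 6, ":"))   # only ':' is a word char
print(select_word(line, 6, ""))    # alphanumerics only
```

With both `:` and `/` in the set the netmask comes along; with only `:` the selection stops at the address; with neither, only the first group is selected.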
Thanks for your input! Every use case is different. You don't want the '/' to be a word-char, while I for one do. That's why the set is configurable (again). You can remove the '/' character for yourself, see comment 33 on how to do that. I hope we'll also bring back the UI for 3.18, and we might revise the default set. I began wondering what if we started off with the empty default (that is, alphanumeric chars only) and let everyone append to it if they wish. But it's similar to arguing which color scheme or font size should be the default...
(In reply to Egmont Koblinger from comment #41) > - It's way too complicated UI for such a small feature. > - It offers no way just to add or remove one single character. You have to > figure out somehow the default set (and g-t offers no help), get it into the > text entry somehow and then you can modify it. Or you have to begin > constructing the set from scratch. > > In all other terminals where this option is available, it's just one text > entry field initially populated with the default. G-t used to have this > too. I can't see what was wrong with it or how anything more complicated > could be more usable. Totally agree with this.
Hi Egmont et al. I finally got a newer OS installed and was keen to see this bug fixed. However, I'm having trouble figuring something out: How do I get rid of # (pound) from the word-chars set? That is, I want # to be a stop char and NOT get selected by a double-click. Comment #33 seems to indicate how to ADD stop chars using word-char-exceptions. I want to REMOVE stop chars. How do I do that? I thought these fixes fixed that? I thought the idea was that now it was fully user configurable? Also, dconf list /org/gnome/terminal/legacy/profiles:/:$foo/ shows me word-chars is still an option, but word-char-exceptions isn't listed (I guess it won't be until I write to it). My question is, is word-chars still used or is that leftover data from older versions? Its contents seem to be completely ignored. How do I delete that key (or do I just ignore it)? Finally, when changing the dconf setting, does one need to reboot or restart gnome-terminal to see the effect? Thanks!
(In reply to Trevor Cordes from comment #45) > Comment #33 seems to indicate how to ADD stop chars Nope, it shows you how to SET a concrete list of chars, replacing the previous list. Just execute that comment without the # sign in its parameter. > is word-chars still used I doubt so. > How do I delete that key (or do I just ignore it)? Comment 33 shows an example command how to remove a key. Use common sense to apply it for another key. Or, of course you can just ignore it. > does one need to reboot or restart gnome-terminal to see the effect? Try it out :)
Trevor, Where do you see it fixed? I've just checked this out and still the same on Ubuntu Xenial.
It's fixed in Fedora 22 (updates as of Dec 1), and so in Fedora 23 also. Ubuntu, you'll have to check your vte/g-t versions (vte 0.28 and gnome-terminal 3.16 should be enough). I did a little write-up just now for my local UNIX user group. I include step-by-step instructions. I think this will make things a lot clearer for non-programmers just trying to solve this problem: http://www.muug.mb.ca/pipermail/roundtable/2015-December/004433.html (Thanks again, Egmont; I was just getting hung up on the terminology and idiom.)
Word char support was missing from vte-0.38/gnome-terminal-3.14. The next release, vte-0.40/gnome-terminal-3.16 brings it back, although there's no UI, you have to use dconf as per previous comments here as of now.
As others have said, please just put back the simple, plain-vanilla system which gnome-terminal used to have, where the UI would show you the current setting (whether the default or not) which you could just edit in place. All this hackery with dconf and undocumented defaults is a nightmare for users. What was wrong with the old way?
So, how can we see the default list of characters? AFAIK, that's not possible (and I haven't found where the default list is set in the source code yet). The only solutions I've seen show how to set the complete list (not add a character to the existing list), or reset to the default value. But what is the actual default value?
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/vte/-/issues/2100.