Bug 354587 – selection unit should be more dependent on the language tokenizer




Summary:	selection unit should be more dependent on the language tokenizer


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	General
Version:	1.4.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:
Whiteboard:

Duplicates:	340949 500515 580495 724865
Depends on:	97545
Blocks:

Reported:	2006-09-06 10:06 UTC by Samium Gromoff
Modified:	2014-12-25 20:05 UTC

See Also:
GNOME target:	---
GNOME version:	2.13/2.14

Description Samium Gromoff 2006-09-06 10:06:43 UTC

The plain gtktextview widget does not provide a means to control how selection units are defined; instead it has hard-wired defaults. These might be fitting for plain text (which, in fact, some might find arguable), but they do not work as well for structured text, which is what gtksourceview is designed to represent.

For example -- let's say we have a language definition which specifies hexadecimal numbers as its tokens. When you double-click most language tokens (except things like, say, line comments, which you'd prefer to be treated like plain text) you expect the token to be treated as a selection unit, and hence selected as a whole.

However, gtktextview's hard-wired plain-text defaults will split the token into pieces, and you will have 0x0aea0fed treated as six tokens.
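
The six pieces can be reproduced with a rough approximation of the behaviour described, assuming the default breaks at every transition between a run of digits and a run of letters. This is only a sketch of the symptom, not Pango's actual algorithm, and `split_plaintext` is a hypothetical helper.

```python
import re

def split_plaintext(text):
    """Split text into runs of digits and runs of letters.

    Assumption: approximates the plain-text default described above by
    breaking at every digit/letter transition. Not Pango's real algorithm.
    """
    return re.findall(r"[0-9]+|[A-Za-z]+", text)

pieces = split_plaintext("0x0aea0fed")
print(pieces)       # ['0', 'x', '0', 'aea', '0', 'fed']
print(len(pieces))  # 6
```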

The proposal is to:
      - extend the language spec with selection unit semantics, so as to distinguish between language objects meant to be treated as selection units (keywords, various numbers, strings(?), etc.) and those not meant to be (line and block comments, something else(?))
      - create a hidden-from-user mechanism to manage the widget's text selection units.
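
As a minimal sketch of what the two points above could mean in practice, assume some tokenizer yields `(start, end, kind)` tuples and the language spec marks certain kinds as selection units. The names `SELECTION_UNIT_KINDS` and `select_unit` are hypothetical and do not correspond to any gtksourceview API.

```python
# Hypothetical token kinds that a language spec could mark as selection units.
SELECTION_UNIT_KINDS = {"keyword", "number", "identifier"}

def select_unit(tokens, offset, fallback):
    """Return the (start, end) range to select for a double click at offset.

    tokens:   list of (start, end, kind) tuples from some tokenizer (assumed).
    fallback: function offset -> (start, end) giving plain-text behaviour.
    """
    for start, end, kind in tokens:
        if start <= offset < end:
            if kind in SELECTION_UNIT_KINDS:
                return (start, end)  # the whole token is one selection unit
            break                    # e.g. a comment: use plain-text rules
    return fallback(offset)

text = "x = 0x0aea0fed  // note"
tokens = [(0, 1, "identifier"), (4, 14, "number"), (16, 23, "comment")]
plain = lambda offset: (offset, offset + 1)  # trivial stand-in fallback

print(text[slice(*select_unit(tokens, 7, plain))])  # 0x0aea0fed
print(select_unit(tokens, 20, plain))               # (20, 21)
```

A click inside the number selects the whole token, while a click inside the comment is handed to the fallback, which is where the existing plain-text rules would apply.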

Comment 1 Adam Dingle 2006-10-04 19:36:06 UTC

I agree that this would be a useful enhancement.  I use the Anjuta IDE, which uses gtksourceview as an editor component; double clicking on C identifiers in Anjuta currently does not select the entire identifier, which is too bad.

Comment 2 Samium Gromoff 2006-11-08 10:49:57 UTC

Now that I have thought about it, it appears that GtkTextView would have to be extended with a full-blown regexp matcher of its own to be able to provide the necessary functionality to GtkSourceView.

Either this, or specification of selection units on a case-by-case basis, which I assume would be severely suboptimal.

Comment 3 Paolo Maggi 2007-01-10 13:54:30 UTC

*** Bug 340949 has been marked as a duplicate of this bug. ***

Comment 4 Andy Owen 2009-06-21 11:35:27 UTC

This sounds like something that could be built into the syntax highlighting definitions for a language. For example, for every language I use, if a "word" were delimited by whitespace or by characters of a different colour, this would give the correct behaviour. But it also depends on what you expect for an identifier like "this_is_an_identifier". Personally, I consider that to be one word (since it is one token), but I believe some people think it is 4.

Comment 5 Ralf Ebert 2010-08-08 00:39:23 UTC

There is a plug-in that makes this configurable:
http://code.google.com/p/gedit-click-config/
http://mail.gnome.org/archives/gedit-list/2010-August/msg00004.html

The general approach looks nice, but the UI is a bit super-ultra-configurable. 

Maybe the backend bits of this plug-in could be moved to the core/the syntax highlighting definitions, with the UI being a separate plug-in.

Comment 6 Ignacio Casal Quinteiro (nacho) 2013-04-02 18:09:41 UTC

*** Bug 500515 has been marked as a duplicate of this bug. ***

Comment 7 Adam Dingle 2013-04-30 19:02:45 UTC

Here's a proposal for solving this longstanding annoyance.

I propose the following simple extension to .lang files to allow them to specify word units.  Today, .lang files contain <context> elements with regular expressions that define various elements for syntax highlighting.  Let's define a new context class "word", and say that if a .lang file contains a <context> element with this class then that context defines language-specific word units.  We'll use these units for selecting on double click, moving by ctrl+arrow and any other word-related operations.  Word units will be recognized anywhere in a document except inside any context that has the existing class "string", where we'll fall back on the existing word boundary mechanism.  I think we should recognize word units even inside comments, which often contain identifiers or other code fragments.

Today most word-related operations including word selection and moving by words are implemented in GtkTextView, which in turn calls GtkTextIter to look for word boundaries.  GtkTextIter in turn asks Pango where word boundaries lie.

We can add two new signals is_word_start() and is_word_end() to GtkTextBuffer.  GtkTextIter will now call these signals to find out if a position is a word boundary.  The default implementation of these signals will call Pango.  GtkSourceBuffer can override these signals with its own implementation of word boundaries.  If a document's .lang file doesn't specify any context with class="word", then the signals will just fall back on the default signal handlers which call Pango.
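
A minimal sketch of this override-with-fallback idea, in plain Python rather than GObject signals. Everything here is an assumption for illustration: `TextBuffer` and `SourceBuffer` are stand-ins for GtkTextBuffer and GtkSourceBuffer, the default rule (letters and digits only) is a stand-in for what Pango would answer, and `word_regex` stands in for a "word" context from a .lang file.

```python
import re

class TextBuffer:
    """Stand-in for GtkTextBuffer with default word-boundary handlers."""

    def __init__(self, text):
        self.text = text

    def _default_word_char(self, ch):
        # Assumption: a crude substitute for asking Pango.
        return ch.isalnum()

    def is_word_start(self, pos):
        if pos >= len(self.text) or not self._default_word_char(self.text[pos]):
            return False
        return pos == 0 or not self._default_word_char(self.text[pos - 1])

    def is_word_end(self, pos):
        if pos == 0 or not self._default_word_char(self.text[pos - 1]):
            return False
        return pos == len(self.text) or not self._default_word_char(self.text[pos])

class SourceBuffer(TextBuffer):
    """Stand-in for GtkSourceBuffer overriding the boundary handlers."""

    def __init__(self, text, word_regex=None):
        super().__init__(text)
        self.word_regex = re.compile(word_regex) if word_regex else None

    def _spans(self):
        return [m.span() for m in self.word_regex.finditer(self.text)]

    def is_word_start(self, pos):
        if self.word_regex is None:      # no "word" context: use the default
            return super().is_word_start(pos)
        return any(start == pos for start, _ in self._spans())

    def is_word_end(self, pos):
        if self.word_regex is None:
            return super().is_word_end(pos)
        return any(end == pos for _, end in self._spans())

plain = TextBuffer("foo_bar = 1")
c_like = SourceBuffer("foo_bar = 1", r"[A-Za-z_][A-Za-z0-9_]*")
print(plain.is_word_end(3))   # True: the default breaks before "_"
print(c_like.is_word_end(3))  # False: "foo_bar" is one word
print(c_like.is_word_end(7))  # True
```

The sketch rescans the whole text on every query and ignores the "string" context exception; a real implementation would presumably reuse the highlighting engine's existing context information instead.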

Anyway, that's my proposed design.  Feedback is welcome.

Comment 8 Paolo Borelli 2013-04-30 19:55:51 UTC

The part about lang files makes sense, even if I'd probably go with a word boundary regex (similar to the one that can be configured in gnome-terminal).


The hard part is changing gtktextview (or even better pango), because doing it in gtksourceview is too much of a hack in my opinion.


There are various mail threads in the past about this subject. For instance

https://mail.gnome.org/archives/gtk-devel-list/2006-April/msg00164.html


I also remember that Behdad was open to the idea of exposing a proper API in pango, though these days there is not much work done on pango at all :(

Comment 9 Adam Dingle 2013-04-30 22:20:42 UTC

Thanks for the feedback, Paolo.  Today, neither GtkTextView nor Pango has any knowledge of lang files, and that seems right to me - they describe various languages, which is what GtkSourceView is all about.  So to make GtkTextView or Pango know about those files would seem like a greater hack in my opinion.  If you agree, then only GtkSourceView will know where the word boundaries should be.  I see a few possible approaches:

1. As I suggested above, have signals is_word_start and is_word_end which GtkTextView can call to find out where words start and end.

2. Have signals such as move_forward_word, move_backward_word, select_word and so on which GtkTextView can call to perform those operations.  GtkSourceView can reimplement them using its own understanding of word boundaries.

3. Instead of using signal callbacks, GtkSourceView could reimplement operations by handling key and mouse events itself.  For example, when the user double clicks, GtkSourceView could implement its own select word operation without calling GtkTextView to do that.

I lean toward (1) because I think (2) and (3) will take code that already exists in GtkTextView and duplicate it in GtkSourceView to some extent.  I also worry that (3) would be fragile since if anyone used a GtkTextIter to iterate by words through the document, that would use different word boundaries than those used by GtkSourceView.

Comment 10 Adam Dingle 2013-04-30 22:25:51 UTC

I guess a fourth possible approach would be to have GtkSourceView pass a word boundary regex down to GtkTextView somehow.  Then GtkTextView would need to use regular expressions (which maybe it doesn't today), but wouldn't need to know about lang files.  The problem with this is that we might want word boundaries to be context-dependent - for example, perhaps inside strings they should be different from outside.  But if they need to be the same everywhere in a source file, I could live with that, and making them configurable at all would still certainly be a huge improvement over what we have today.

Comment 11 José Aliste 2013-04-30 23:34:17 UTC

I agree with Adam that implementing word selection for lang files does not belong in pango... maybe in GtkTextView to some extent. I would go for 3) since this would give you more flexibility, with no need to modify Gtk+. Afterwards, when you have a working implementation, we could see which modifications would allow for better code reuse. I don't know, it is just my opinion :)

Comment 12 José Aliste 2013-05-01 02:10:32 UTC

Oh... I will have to shoot myself. :) After reading the Unicode spec, I believe that some of the fixes for this bug should go into pango. See http://www.unicode.org/reports/tr29/ for more info where, for instance, it says we should not break between numbers and letters when they are adjacent. So these more basic things would already improve the situation a lot. Then of course, we need more tailored rules for different lang files, but fixing the basic stuff in pango would benefit the whole stack :)

Comment 13 José Aliste 2013-05-01 03:25:55 UTC

I just found the relevant pango bug. See https://bugzilla.gnome.org/show_bug.cgi?id=97545

Comment 14 Paolo Borelli 2013-05-01 07:46:00 UTC

I did not mean that gtk or pango would need to be aware of lang files. I meant that we would call a set_word_boundary API on textview, and this in turn would call a pango API on each pango layout.

The API could even set this on a text tag so that we could have different boundaries for different contexts, though I am not sure if that is actually a good idea or if it makes the user experience unpredictable and annoying...

Comment 15 Adam Dingle 2013-05-01 08:26:43 UTC

Ah.  Paolo, did you mean that GtkSourceView would call set_word_boundary once for each word boundary in the document?  Or that it would call set_word_boundary just once, passing a regular expression or pattern that would allow GtkTextView to figure out where the boundaries are?

It sounds like you may have meant the former.  If so, then every time the buffer changes presumably we'd need to scan over it and set all the word boundaries again, which I worry would be slow.  Or maybe we could be smart and only set word boundaries in the area of the buffer that actually changed, but I think that may be more complicated.

Comment 16 Paolo Borelli 2013-05-01 09:38:26 UTC

I meant just once, or at worst only on the text tags surrounding specific contexts if we really want different boundaries for code and comments.

Comment 17 Adam Dingle 2013-05-01 12:47:07 UTC

I've thought about this more and I now lean toward the following approach. GtkSourceView can pass GtkTextView a set of characters to be used for grouping words. For the C language, for example, this set will look like this:

A-Za-z0-9_$

We can use regular expression notation to define the set, but this is only a set of characters - not a full-blown regular expression.

GtkTextView will consider any cluster of characters in that set to be a single word. Outside such clusters, it will continue to use Pango to choose word boundaries. So, for example, a comment in a C source file could contain Japanese text and word breaking would still work fine there.

I like this approach because it is ultra simple, doesn't require word boundary callbacks, and is a big improvement over what we have today. This is also pretty much how gnome-terminal handles word selection, by the way (in its "Select-by-word characters" preference).

One more detail: for languages such as Python 3 which allow identifiers to contain various classes of Unicode characters, I think the .lang files for those languages should probably exclude Asian characters when defining character sets for word breaking in this way. (If they included them, then word breaking would fail in comments in Asian languages.) So then if an identifier consisted of, say, several Chinese words (probably not a common case) and the user clicked on it then they'd select only one of those words. That would be probably be fine, and is how gedit works today anyway.
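
A minimal sketch of the character-set approach, assuming the set is given as the inside of a regex character class as in the C example above. `word_range` and its `fallback` parameter are hypothetical names; the fallback stands in for Pango's natural-language word breaking, which is not modelled here.

```python
import re

def word_range(text, pos, word_chars, fallback):
    """Return the (start, end) range of the word at pos.

    word_chars: the inside of a regex character class, e.g. "A-Za-z0-9_$".
    fallback:   function (text, pos) -> (start, end) used outside clusters;
                stands in for Pango's word breaking (assumption).
    """
    cluster = re.compile("[" + word_chars + "]+")
    for match in cluster.finditer(text):
        if match.start() <= pos < match.end():
            return match.span()   # pos is inside a cluster: one single word
    return fallback(text, pos)    # outside any cluster: defer to the default

C_WORD_CHARS = "A-Za-z0-9_$"
single_char = lambda text, pos: (pos, pos + 1)  # trivial stand-in fallback

line = "total_$sum += 0x0aea0fed;"
print(word_range(line, 3, C_WORD_CHARS, single_char))   # (0, 10)
print(word_range(line, 15, C_WORD_CHARS, single_char))  # (14, 24)
print(word_range(line, 11, C_WORD_CHARS, single_char))  # (11, 12)
```

Because the set is only a character class, each language needs just one short string, and text made of characters outside the set (such as Japanese in a comment) never reaches the cluster logic.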

Comment 18 José Aliste 2013-05-03 05:18:45 UTC

Adam, for what it's worth, I just pushed a patch to pango that fixes the "123foo" is two words thing. For the more complicated features you want here, it might be good to ping Behdad. I was discussing with him today on IRC and, as always, he is very responsive and aware of the issues described in this bug (and also willing to support some new API in pango that would allow us to fix this bug).

Comment 19 Adam Dingle 2013-05-03 06:44:56 UTC

Thanks, José.  It's good to hear that Behdad is aware of this bug and potentially willing to help (I'm adding him to the cc list).  I wasn't actually thinking we'd make any changes to Pango in addressing this bug, however.   Pango knows about different human languages and how to break words in them, but changing those rules to adapt to various computer languages seems like something that can easily be handled outside Pango.  But I don't feel strongly about this; if the Gtk and/or Pango developers think this belongs at the Pango layer then that's fine.  Paolo, do you have an opinion about that?  I don't think it will be a large change in any case.

Comment 20 Behdad Esfahbod 2013-07-11 19:01:52 UTC

Eventually I would like to add API to Pango that would allow customizing word boundaries.  But we're far from that actually happening.  For now, I think a solution in GtkTextView may be preferable.

Comment 21 camille 2013-11-20 23:20:12 UTC

Hey guys, just wanted to let you know that there is a $200 bounty on this issue at Bountysource: https://www.bountysource.com/issues/1072090.

Comment 23 Garrett Regier 2014-02-21 10:29:03 UTC

*** Bug 724865 has been marked as a duplicate of this bug. ***

Comment 24 Sébastien Wilmet 2014-04-10 16:05:16 UTC

The word boundaries in GtkTextView are based only on natural-language words. I've filed bug #727972 for having the same behavior as in Vim, i.e. having word boundaries also for non-natural-language words. With that bug fixed, maybe overriding the GtkTextView word boundaries in GtkSourceView is no longer needed.

Comment 25 Sébastien Wilmet 2014-04-29 19:59:18 UTC

I've made good progress. See bug #562767 and:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries

The custom word boundaries are implemented like in Vim.
I proposed to implement this behavior directly in GtkTextView (see bug #530405), but the advantage with a virtual function in GtkTextBuffer is that GtkSourceView can have a completely different implementation in the future if we want, it is more flexible. But GtkTextView-only users won't benefit from the enhanced word boundaries.

Comment 26 Leslie P. Polzer 2014-06-07 10:32:53 UTC

What's the status of this?

Please correct me if I'm wrong, but it would seem that Adam Dingle in Comment #17 (which is my favored approach) suggested an approach that is different from the patch posted in Comment #25.

Comment 27 Sébastien Wilmet 2014-06-07 14:01:38 UTC

The patches simply need a review. Matthias Clasen already made some comments on the patches in GTK+. He proposed another solution that I don't like, and I've explained why.

Note that with my solution the underscore is added to the group of characters for "normal" words (i.e. not the punctuation). This can be easily extended to support what Adam proposed, except that A-Za-z0-9 is already part of the characters for "normal" words (since the implementation uses the natural-language word boundaries of Pango).

Comment 28 Sébastien Wilmet 2014-12-17 11:41:02 UTC

*** Bug 580495 has been marked as a duplicate of this bug. ***

Comment 29 Sébastien Wilmet 2014-12-24 15:29:05 UTC

Finished:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries-2

The ::extend-selection signal has been added to GtkTextView (see bug #111503).

The custom word boundaries implemented in GtkSourceView are roughly the same boundaries as in Vim. The boundaries are generic (not specific to a particular language) and are normally suitable for a wide range of languages, including natural languages and programming languages.

I don't think customization is needed (FWIW in Vim I haven't seen an option for word boundaries), but if it would be really useful for a specific language, more flexibility can be added later.
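
As a rough illustration of Vim-like boundaries, the sketch below assumes Vim's default notion of a word: a run of letters, digits and underscores, or a run of other non-blank characters, separated by white space. It is not the merged GtkSourceView code, and `char_class` and `vim_like_words` are hypothetical names.

```python
def char_class(ch):
    """Classify a character: 0 = blank, 1 = word character, 2 = other."""
    if ch.isspace():
        return 0
    if ch.isalnum() or ch == "_":
        return 1
    return 2

def vim_like_words(text):
    """Split text into words at every change of character class."""
    words = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or char_class(text[i]) != char_class(text[start]):
            if char_class(text[start]) != 0:  # drop runs of white space
                words.append(text[start:i])
            start = i
    return words

print(vim_like_words("this_is_an_identifier += 0x0aea0fed;"))
# ['this_is_an_identifier', '+=', '0x0aea0fed', ';']
```

With these generic classes, both the underscore identifier and the hexadecimal number from the original report come out as single words without any language-specific configuration.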

Comment 30 Sébastien Wilmet 2014-12-25 20:05:43 UTC

Merged! The most important commits (on the GtkSourceView side):
https://git.gnome.org/browse/gtksourceview/commit/?id=93f42228976286e2ecea01866d65c662af921732
https://git.gnome.org/browse/gtksourceview/commit/?id=6023c13c95d7240bac29705ed9768c3306ef2450
https://git.gnome.org/browse/gtksourceview/commit/?id=565f4105771888e5375fbd0109dd2b9ca32779da
https://git.gnome.org/browse/gtksourceview/commit/?id=1e184c170fb2611eb37f011a781b4b0149f6d4eb

Ignacio said it's a good enough solution. So we can consider this bug as fixed. If for some specific use cases the word boundaries are not suitable, another bug can be opened or this one can be reopened, to implement a more flexible solution where customization is possible (e.g. set whether the underscore is part of a word, or add other special characters).