Bug 562767 – Overridable boundaries for word movements and selection


Summary:	Overridable boundaries for word movements and selection


Status:	RESOLVED DUPLICATE of bug 111503

Product:	gtk+
Classification:	Platform
Component:	Widget: GtkTextView
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-bugs
QA Contact:	gtk-bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-11-30 14:42 UTC by jaywalkie
Modified:	2014-12-17 11:25 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Attachments
textbuffer: move_iter virtual function (14.64 KB, patch) 2014-04-29 19:50 UTC, Sébastien Wilmet	reviewed
textiter: set/get boundary type (3.34 KB, patch) 2014-10-09 19:32 UTC, Sébastien Wilmet	none
textiter: more generic implementation for "visible" funcs (3.08 KB, patch) 2014-10-19 16:37 UTC, Sébastien Wilmet	none
textiter: more generic implementation for starts/ends/inside word (3.50 KB, patch) 2014-10-19 16:37 UTC, Sébastien Wilmet	reviewed
Add GtkTextBuffer::move_iter vfunc (7.28 KB, patch) 2014-10-19 16:37 UTC, Sébastien Wilmet	none
textiter: set/get boundary type (7.01 KB, patch) 2014-10-19 16:37 UTC, Sébastien Wilmet	none

Description jaywalkie 2008-11-30 14:42:29 UTC

If GtkTextView emitted a signal when the user double-clicks to select a word, applications (e.g. gtksourceview) could decide what exactly a word is. The current behavior is not suitable for customized purposes, such as selecting a C identifier or an SSN. Instead of directly calling extend_selection in gtktextview.c, it should emit a signal. Then, if the signal handler returns false, or if there is no handler, it should go ahead and do whatever it does now.
I think this would be very useful.

Comment 1 Sébastien Wilmet 2014-04-29 19:50:44 UTC

Created attachment 275446
textbuffer: move_iter virtual function

The purpose is to be able to override word boundaries for word movements
(Ctrl+arrow) and word selection (double click). The virtual function
uses an enum for the move type, so the function can be reused for e.g.
sentences.

The implementation has two word boundary types (word start and word
end). This is needed for word movements: Ctrl+right goes to the next word
end, Ctrl+left goes to the previous word start. With a single word
boundary type, it would be more difficult to know if a word boundary is
a word start or a word end (the contents must be analyzed in that case,
for example if the previous char is a space, the iter is probably at a
word start).

Unfortunately, having two word boundary types is less convenient for
selecting a word with double click. That's why convenience functions are
written in gtktextbuffer, to work on a single word boundary type.
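
For illustration only (this is not the attached patch): a minimal sketch of the design described above, written against a plain C string instead of GtkTextBuffer so that it is self-contained. All names here (MoveType, BufferClass, default_move_iter, word_bounds_at) are hypothetical, and the default implementation assumes that a word is a run of alphanumerics or underscores, as a stand-in for the Pango word breaks that GTK really uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical move types; the enum leaves room for e.g. sentences. */
typedef enum {
    MOVE_FORWARD_WORD_END,
    MOVE_BACKWARD_WORD_START
} MoveType;

/* Hypothetical "class" with a single overridable function (the vfunc). */
typedef struct {
    /* Moves *pos inside text; returns true only if *pos changed. */
    bool (*move_iter)(const char *text, size_t *pos, MoveType type);
} BufferClass;

static bool is_word_char(char c) {
    return isalnum((unsigned char)c) || c == '_';
}

/* Assumed default: words are runs of alphanumerics/underscores. */
static bool default_move_iter(const char *text, size_t *pos, MoveType type) {
    size_t len = strlen(text);
    size_t p = *pos;
    assert(p <= len);

    if (type == MOVE_FORWARD_WORD_END) {
        while (p < len && !is_word_char(text[p])) p++;   /* skip separators */
        if (p == len) return false;                      /* no word ahead */
        while (p < len && is_word_char(text[p])) p++;    /* skip the word */
    } else {
        while (p > 0 && !is_word_char(text[p - 1])) p--; /* skip separators */
        if (p == 0) return false;                        /* no word behind */
        while (p > 0 && is_word_char(text[p - 1])) p--;  /* skip the word */
    }
    *pos = p;
    return true;
}

/* Convenience helper: derive the bounds of the word containing pos from the
 * two boundary types. Returns false if pos is not inside (or at the start
 * of) a word; a position exactly at a word end is not treated as inside. */
static bool word_bounds_at(const BufferClass *klass, const char *text,
                           size_t pos, size_t *start, size_t *end) {
    size_t e = pos;
    if (!klass->move_iter(text, &e, MOVE_FORWARD_WORD_END))
        return false;                 /* no word at or after pos */
    size_t s = e;
    if (!klass->move_iter(text, &s, MOVE_BACKWARD_WORD_START))
        return false;
    if (s > pos)
        return false;                 /* pos lies in separators before the word */
    *start = s;
    *end = e;
    return true;
}
```

A subclass would replace the move_iter pointer with its own function; word_bounds_at() then picks up the custom boundaries without any change, which is the point of routing both movement and selection through one overridable function.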

Comment 2 Sébastien Wilmet 2014-04-29 19:53:31 UTC

An example of custom word boundaries implemented in GtkSourceView, for the above patch with the move_iter() virtual function:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries

I don't think a signal is needed, a virtual function should be enough.

Comment 3 Matthias Clasen 2014-05-02 16:55:09 UTC

Review of attachment 275446:

In principle, I like the vfunc approach to customization here. It solves a real problem.

More detailed comments:

This patch introduces new (private) gtktextbuffer api and uses it in gtktextview. This means that we now have two different ways of moving by words: one that's implemented in gtktextiter, and another (customizable) one in gtktextbuffer? I think we should make the textiter api work by calling the textbuffer api, and leave the textview unchanged.

Do you expect the movement type enum to grow? I could see some need to customize moving by paragraphs.

Finally, it would be really good to document how move_iter implementations are expected to behave, e.g. never move in the opposite direction, be idempotent (or not?!). And when that's documented, there should be tests for the default implementation to verify the behaviour.

Comment 4 Sébastien Wilmet 2014-05-03 20:45:25 UTC

> I think we should make the textiter api work by calling the textbuffer api

The documentation of gtk_text_iter_starts_word() and gtk_text_iter_ends_word() talks about "natural-language word" and says that word breaks are determined by Pango.

A spell checker relies on the Pango word boundaries. If a spell checker uses the custom word boundaries, it will probably highlight punctuation, for example, as misspelled words.

So I think it's better to keep word-related functions as is in GtkTextIter.

> Do you expect the movement type enum to grow ?

Not in the near future. In GtkSourceView there are plans only for word boundaries FWIW.

Comment 5 Matthias Clasen 2014-05-05 23:24:40 UTC

(In reply to comment #4)
> > I think we should make the textiter api work by calling the textbuffer api
> 
> The documentation of gtk_text_iter_starts_word() and gtk_text_iter_ends_word()
> talks about "natural-language word" and says that word breaks are determined
> by Pango.
> 
> A spell checker relies on the Pango word boundaries. If a spell checker uses
> the custom word boundaries, it will probably highlight punctuation, for
> example, as misspelled words.
> 
> So I think it's better to keep word-related functions as is in GtkTextIter.

But you are not going to use a spell-checker anyway on a buffer where custom word boundaries are relevant, like, say, source code.

I would consider updating the documentation to say:

 * By default, word breaks are determined by Pango and should be correct for
 * nearly any human language (if not, the correct fix would be to the Pango
 * word break algorithms). Since GTK+ 3.14, the word break determination can
 * be customized via the #GtkTextBuffer::move-iter vfunc.

Comment 6 Sébastien Wilmet 2014-05-06 11:38:07 UTC

There are at least two different use cases for word boundaries:
- for a spell checker.
- for word movements (ctrl+arrow) and word selection (double click).

In source code it is useful to spell check the comments. Or in a LaTeX document there is a mix of "code" and text.

The syntax highlighting engine of GtkSourceView defines "no-spell-check" regions in the buffer, so the spell checker runs only where it is relevant. But where it is relevant, only natural-language words must be spell checked. If the custom word boundaries are used for the spell checker, punctuation or other characters may be wrongly highlighted. And it would be difficult to apply the "no-spell-check" tag inside a comment to avoid spell checking the special characters.

For word movements (ctrl+arrow), my idea is to implement the same behavior as the 'e' and 'b' Vim commands. So a group of special characters is taken as a word.

Comment 7 Matthias Clasen 2014-05-06 17:43:14 UTC

If you have those regions already, I'd suggest to use them for determining the word boundary algorithm as well: when in a comment, use pango, else do whatever works for the programming language at hand.

Comment 8 Sébastien Wilmet 2014-05-06 19:56:43 UTC

It would fix the spell checker, indeed. But the word movements and selection in comments would be inconsistent with the rest of the code. A comment can contain a variable name, special characters like a line ----- to delimit a section, and so on.

The Unicode spec [1] explains that word boundaries can be tailored for certain features. A spell checker has different requirements than word movements, word selection or "whole word search".

Pango implements word boundaries for natural-language words, so it is suitable for a spell checker. But it is not suitable for word movements and selection. The current behavior is broken in my opinion. For example:

> abcd ---- efgh

- If I double click on "abcd", "abcd" is selected, OK.
- If I double click on "----", " ---- " is selected (with the spaces). It would be better to select "----" without the spaces so it is consistent with a natural-language word.
- If the cursor is after the 'd' and I press Ctrl+right, the cursor is moved after the 'h'. It would be better to move after "----".
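
To make the desired behavior concrete, here is a minimal sketch on a plain C string, in the spirit of the Vim-like grouping mentioned in comment 6: characters are split into three classes, and a "word" for movement and selection purposes is a run of characters of the same non-space class. The names (CharClass, classify, select_run_at, forward_run_end) are hypothetical, the classification is ASCII-only, and positions are offsets between characters; this is not how GtkTextView or Pango is implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { CLASS_SPACE, CLASS_WORD, CLASS_PUNCT } CharClass;

/* ASCII-only classification; a real implementation would use Unicode data. */
static CharClass classify(char c) {
    unsigned char u = (unsigned char)c;
    if (isspace(u)) return CLASS_SPACE;
    if (isalnum(u) || c == '_') return CLASS_WORD;
    return CLASS_PUNCT;
}

/* Double click: select the run of same-class characters starting at pos.
 * Returns false when the click lands on whitespace or at the end of text. */
static bool select_run_at(const char *text, size_t pos,
                          size_t *start, size_t *end) {
    size_t len = strlen(text);
    assert(pos <= len);
    if (pos == len) return false;

    CharClass cls = classify(text[pos]);
    if (cls == CLASS_SPACE) return false;

    size_t s = pos, e = pos;
    while (s > 0 && classify(text[s - 1]) == cls) s--;
    while (e < len && classify(text[e]) == cls) e++;
    *start = s;
    *end = e;
    return true;
}

/* Ctrl+right: skip whitespace, then move to the end of the next run. */
static bool forward_run_end(const char *text, size_t *pos) {
    size_t len = strlen(text);
    size_t p = *pos;
    assert(p <= len);

    while (p < len && classify(text[p]) == CLASS_SPACE) p++;
    if (p == len) return false;       /* nothing but whitespace ahead */

    CharClass cls = classify(text[p]);
    while (p < len && classify(text[p]) == cls) p++;
    *pos = p;
    return true;
}
```

On "abcd ---- efgh", select_run_at() with a position inside "----" yields offsets 5 to 9, i.e. the dashes without the surrounding spaces, and forward_run_end() starting at offset 4 (after the 'd') stops at offset 9, right after "----".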

It would be nice to fix this behavior in GtkTextView directly, see bug #530405 (I'll add a comment there). The solution here with the vfunc is more flexible, but it adds an API (the vfunc), and GtkTextView itself doesn't benefit from a better behavior.

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries

Comment 9 Sébastien Wilmet 2014-08-21 15:16:40 UTC

(In reply to comment #8)
> Pango implements word boundaries for natural-language words, so it is suitable
> for a spell checker. But it is not suitable for word movements and selection.

Another idea is to provide an API in Pango to retrieve word boundaries suitable for word movements and word selection. Then GtkTextView and GtkEntry can use this new Pango API.

Or if flexibility is preferable, the vfunc is the other solution. But as I explained, the GtkTextIter functions are suitable for spell checkers but not for word movements and word selection. Two different types of word boundaries are required.

Currently, with the GtkTextIter functions (word boundaries at natural-language words), it is easy to derive the word boundaries used for word movements and word selection. Doing the reverse is not easy. So I think the GtkTextIter functions should not be overridable.

Any thoughts on this?

Comment 10 Sébastien Wilmet 2014-08-21 17:42:21 UTC

In bug #111503 there is a discussion about consistency across applications. And I agree that the same behavior should be present in GtkEntry, GtkTextView and GtkSourceView.

Comment 11 Sébastien Wilmet 2014-10-09 19:32:00 UTC

Created attachment 288173
textiter: set/get boundary type

This way, all the word/sentence-related GtkTextIter functions can be
reused for the custom word boundaries, while the Pango boundaries remain
available for spell checkers.

This commit is just an idea, it's far from finished.

Comment 12 Sébastien Wilmet 2014-10-19 16:37:06 UTC

Created attachment 288854
textiter: more generic implementation for "visible" funcs

For gtk_text_iter_forward_visible_word_end(),
backward_visible_word_start() etc, use a more generic implementation
that is based on other public functions.

The purpose is to make gtk_text_iter_forward_word_end() and
gtk_text_iter_backward_word_start() overridable, all the other
word-related GtkTextIter functions should be based on them.

For the cursor_position functions the change is not really needed, but
for code simplification it's better to keep only move_iter_visible().

The unit tests already have good coverage for this code, and
there are no regressions.

Comment 13 Sébastien Wilmet 2014-10-19 16:37:17 UTC

Created attachment 288855
textiter: more generic implementation for starts/ends/inside word

To use only gtk_text_iter_forward_word_end() and backward_word_start()
which will be overridable.

Comment 14 Sébastien Wilmet 2014-10-19 16:37:22 UTC

Created attachment 288856
Add GtkTextBuffer::move_iter vfunc

So that gtk_text_iter_forward_word_end() and backward_word_start() are
overridable.

Comment 15 Sébastien Wilmet 2014-10-19 16:37:27 UTC

Created attachment 288857
textiter: set/get boundary type

This way, all the word/sentence-related GtkTextIter functions can be
reused for custom boundaries, while the Pango boundaries remain
available for e.g. spell checkers.

With the GtkTextBuffer::move_iter vfunc, all boundary types are
overridable. For example GtkSourceBuffer can override word boundaries
used for cursor movements so it better fits source code.

Comment 16 Sébastien Wilmet 2014-10-19 16:41:24 UTC

Still one thing to do: update the documentation of the gtk_text_iter word-related functions.

The above 4 patches come with the GtkSourceView part:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries

All the gtk_text_iter word functions are already well covered by unit tests.

Comment 17 Sébastien Wilmet 2014-10-19 17:04:09 UTC

For the record, there was also a discussion on the mailing list:
https://mail.gnome.org/archives/gtk-devel-list/2014-September/msg00019.html

Also, the GtkTextMoveType enum can be extended for sentences. For example, a triple click in C code could select the whole scope. But it can be done later.

Comment 18 Matthias Clasen 2014-11-27 23:08:47 UTC

Review of attachment 288855:

Are you sure that it is a great idea to base these predicates on iter motion functions ? Performance-wise this might be quite a step backwards - the motion functions might end up running over the whole buffer if no suitable boundaries are present.

Comment 19 Sébastien Wilmet 2014-11-28 13:51:07 UTC

Mmm, indeed with a buffer content like "--\n--\n--\nblah", from offset 0 gtk_text_iter_forward_word_end() goes to the end of the buffer.

The idea was to have only one vfunc, so the API addition is minimal.

A solution is to add a 'limit' GtkTextIter parameter to the vfunc, so we can search only on the current line for example. Currently test_log_attrs() searches only on the current line. For some languages it is not 100% correct, since a word can span multiple lines:

> A line boundary is usually a word boundary, but there are exceptions such as
> a word containing a SHY (soft hyphen): it will break across lines, yet is a
> single word.

http://www.unicode.org/reports/tr29/#Word_Boundaries

But taking only the current line should be good enough, and is not a regression from the current GtkTextIter implementation. (In the future the 'limit' parameter can be changed to take the exceptions into account, for example by taking two lines instead of one, but that means pango_get_log_attrs() must be called with the two lines at once, not line by line.)
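
A minimal sketch of the 'limit' idea, again on a plain C string rather than with GtkTextIter. The function names (line_end, forward_word_end_limited) are hypothetical, and a word is assumed to be a run of ASCII alphanumerics, which only approximates what Pango reports. Unlike the real gtk_text_iter_forward_word_end(), this toy version leaves the position unchanged when no word end is found before the limit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Offset of the end of the line containing pos (the '\n' is excluded). */
static size_t line_end(const char *text, size_t pos) {
    size_t len = strlen(text);
    assert(pos <= len);
    while (pos < len && text[pos] != '\n') pos++;
    return pos;
}

/* Forward word end that never scans past limit. Returns false, leaving
 * *pos unchanged, if no word end is found in [*pos, limit). */
static bool forward_word_end_limited(const char *text, size_t *pos,
                                     size_t limit) {
    size_t p = *pos;
    assert(limit <= strlen(text));

    while (p < limit && !isalnum((unsigned char)text[p])) p++;  /* separators */
    if (p >= limit) return false;                               /* gave up early */
    while (p < limit && isalnum((unsigned char)text[p])) p++;   /* the word */
    *pos = p;
    return true;
}
```

With the buffer "--\n--\n--\nblah" and a start offset of 0, passing the buffer length as the limit scans through every line and stops after "blah" at offset 13. Passing line_end(text, 0) as the limit makes the search stop after two characters and report that there is no word end on the current line, so the cost of a predicate built on it is bounded by the line length.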

Comment 20 Matthias Clasen 2014-11-29 03:15:45 UTC

sounds like a good plan to me

Comment 21 Sébastien Wilmet 2014-12-04 19:30:29 UTC

I'm not sure it's a good idea to add a boundary type state to GtkTextIter. GTK_TEXT_BOUNDARY_TYPE_NATURAL_LANGUAGE and GTK_TEXT_BOUNDARY_TYPE_FOR_CURSOR_MOVEMENTS make some sense for _word_ boundaries, but don't make sense for characters, cursor positions and lines.

Overriding word boundaries for keybindings like Ctrl+left and Ctrl+right is already possible by overriding the GtkTextView::move-cursor and GtkTextView::delete-from-cursor signals.

What is currently not easily feasible is to customize the double click and triple click behaviors.

For the triple click, GtkTextView selects the "display line", by using functions like gtk_text_view_backward_display_line_start(). It doesn't make sense to be able to override those functions just to be able to customize the triple click. So a signal or vfunc is needed. We could have an enum with DOUBLE_CLICK and TRIPLE_CLICK, so that the double click is also more customizable. The above patches assume that the double click selects a word. But what if a text editor wants a totally different behavior, like selecting the display line for double click and the "buffer line" for triple click?

So I think it's better to have a signal or vfunc specifically for the double and triple click, without assumptions about their behavior. Having that is sufficient, and is less invasive than adding a boundary type state to GtkTextIter.
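
As a rough sketch of such a hook (not the patch from any bug, and not GTK API): a vfunc that receives the click type and computes the selection itself, with a fallback to the default behavior when it declines. Everything here is hypothetical, including the names ClickType, ViewClass, extend_selection and view_handle_click. The text is a plain C string, so "line" means a buffer line; display lines cannot be modeled in this toy version.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { CLICK_DOUBLE, CLICK_TRIPLE } ClickType;

/* Hypothetical view "class" with one overridable function (the vfunc). */
typedef struct {
    /* Sets [*start, *end) and returns true, or returns false to request
     * the default behavior. May be NULL. */
    bool (*extend_selection)(const char *text, size_t pos, ClickType type,
                             size_t *start, size_t *end);
} ViewClass;

/* Bounds of the line containing pos, without the trailing '\n'. */
static void line_bounds(const char *text, size_t pos,
                        size_t *start, size_t *end) {
    size_t len = strlen(text), s = pos, e = pos;
    while (s > 0 && text[s - 1] != '\n') s--;
    while (e < len && text[e] != '\n') e++;
    *start = s;
    *end = e;
}

/* Assumed default: double click selects an alphanumeric run, triple click
 * selects the line. */
static void default_extend_selection(const char *text, size_t pos,
                                     ClickType type,
                                     size_t *start, size_t *end) {
    if (type == CLICK_TRIPLE) {
        line_bounds(text, pos, start, end);
        return;
    }
    size_t len = strlen(text), s = pos, e = pos;
    while (s > 0 && isalnum((unsigned char)text[s - 1])) s--;
    while (e < len && isalnum((unsigned char)text[e])) e++;
    *start = s;
    *end = e;
}

/* Example override: double click selects the line, triple click selects
 * the whole text. */
static bool line_then_buffer(const char *text, size_t pos, ClickType type,
                             size_t *start, size_t *end) {
    switch (type) {
    case CLICK_DOUBLE:
        line_bounds(text, pos, start, end);
        return true;
    case CLICK_TRIPLE:
        *start = 0;
        *end = strlen(text);
        return true;
    }
    return false;   /* unknown click type: let the default handle it */
}

/* Dispatcher: try the override first, then fall back to the default. */
static void view_handle_click(const ViewClass *klass, const char *text,
                              size_t pos, ClickType type,
                              size_t *start, size_t *end) {
    assert(pos <= strlen(text));
    if (klass->extend_selection != NULL &&
        klass->extend_selection(text, pos, type, start, end))
        return;
    default_extend_selection(text, pos, type, start, end);
}
```

Because the hook is keyed on the click type rather than on a boundary type, the override makes no assumption that a double click means "word"; the iterator functions stay untouched.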

Comment 22 Sébastien Wilmet 2014-12-13 20:20:38 UTC

(In reply to comment #21)
> So I think it's better to have a signal or vfunc specifically for the double
> and triple click, without assumptions about their behavior. Having that is
> sufficient, and is less invasive than adding a boundary type state to
> GtkTextIter.

See the patch at bug #111503 for that.

Comment 23 Sébastien Wilmet 2014-12-17 11:25:03 UTC


*** This bug has been marked as a duplicate of bug 111503 ***