GNOME Bugzilla – Bug 562767
Overridable boundaries for word movements and selection
Last modified: 2014-12-17 11:25:03 UTC
If GtkTextView emitted a signal when the user intends to select a word by double-clicking on it, applications (e.g. GtkSourceView) could decide what exactly a word is. The current behavior is not suitable for customized purposes, such as selecting a C identifier or an SSN. Instead of calling extend_selection directly in gtktextview.c, GtkTextView should emit a signal. Then, if the signal handler returns false, or if there is no handler, it should go ahead and do whatever it does now. I think this would be very useful.
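The "ask a handler first, fall back to the built-in behavior" flow requested above can be sketched in plain C without GTK. This is only an illustration: SelectWordHandler, select_word(), default_select_word() and select_c_identifier() are hypothetical names, not GTK API, and the text is a plain ASCII string with pos assumed to be a valid offset into it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler type (not a GTK API): returns true and fills
 * [*start, *end) if it handled the selection, false to decline. */
typedef bool (*SelectWordHandler)(const char *text, size_t pos,
                                  size_t *start, size_t *end);

/* Stand-in for the built-in behavior: the run of alphanumerics around pos. */
static void default_select_word(const char *text, size_t pos,
                                size_t *start, size_t *end)
{
    size_t s = pos, e = pos;
    while (s > 0 && isalnum((unsigned char)text[s - 1])) s--;
    while (text[e] != '\0' && isalnum((unsigned char)text[e])) e++;
    *start = s;
    *end = e;
}

/* Ask the handler first; fall back when it is absent or declines. */
static void select_word(const char *text, size_t pos, SelectWordHandler handler,
                        size_t *start, size_t *end)
{
    assert(text != NULL && start != NULL && end != NULL);
    if (handler != NULL && handler(text, pos, start, end))
        return;
    default_select_word(text, pos, start, end);
}

static bool is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Example handler: selects a C identifier, declines anywhere else. */
static bool select_c_identifier(const char *text, size_t pos,
                                size_t *start, size_t *end)
{
    if (!is_ident_char(text[pos]))
        return false;
    size_t s = pos, e = pos;
    while (s > 0 && is_ident_char(text[s - 1])) s--;
    while (is_ident_char(text[e])) e++;
    *start = s;
    *end = e;
    return true;
}
```

With the text "foo_bar = 1" and a click at offset 1, passing no handler selects "foo", while passing select_c_identifier selects "foo_bar". A click on the '=' makes the handler decline, so the default runs.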
Created attachment 275446
textbuffer: move_iter virtual function

The purpose is to be able to override word boundaries for word movements (Ctrl+arrow) and word selection (double click). The virtual function uses an enum for the move type, so the function can be reused for e.g. sentences.

The implementation has two word boundary types (word start and word end). This is needed for word movements: Ctrl+right goes to the next word end, Ctrl+left goes to the previous word start. With a single word boundary type, it would be more difficult to know whether a word boundary is a word start or a word end (the contents would have to be analyzed in that case; for example, if the previous char is a space, the iter is probably at a word start).

Unfortunately, having two word boundary types is less convenient for selecting a word with a double click. That's why convenience functions are written in gtktextbuffer, to work on a single word boundary type.
An example of custom word boundaries implemented in GtkSourceView, for the above patch with the move_iter() virtual function:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries

I don't think a signal is needed; a virtual function should be enough.
Review of attachment 275446:

In principle, I like the vfunc approach to customization here. It solves a real problem.

More detailed comments:

This patch introduces new (private) gtktextbuffer API and uses that in gtktextview. This means that we now have two different ways of moving by words: one that's implemented in gtktextiter, and another (customizable) one in gtktextbuffer? I think we should make the textiter API work by calling the textbuffer API, and leave the textview unchanged.

Do you expect the movement type enum to grow? I could see some need to customize moving by paragraphs.

Finally, it would be really good to document how move_iter implementations are expected to behave, e.g. never move in the opposite direction, be idempotent (or not?!). And when that's documented, there should be tests for the default implementation to verify the behaviour.
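One of the expectations mentioned above, "never move in the opposite direction", could be checked with a test along these lines. This is a sketch in plain C on an ASCII string, not the GTK test suite: MoveFunc, respects_direction() and the two movement functions are hypothetical stand-ins for a move_iter implementation. Idempotency is deliberately not checked, since the review leaves that question open.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical movement function: takes an offset, returns the new offset. */
typedef size_t (*MoveFunc)(const char *text, size_t pos);

/* Stand-in default: words are runs of alphanumerics. */
static size_t forward_word_end(const char *text, size_t pos)
{
    while (text[pos] != '\0' && !isalnum((unsigned char)text[pos])) pos++;
    while (text[pos] != '\0' && isalnum((unsigned char)text[pos])) pos++;
    return pos;
}

static size_t backward_word_start(const char *text, size_t pos)
{
    while (pos > 0 && !isalnum((unsigned char)text[pos - 1])) pos--;
    while (pos > 0 && isalnum((unsigned char)text[pos - 1])) pos--;
    return pos;
}

/* Deliberately wrong implementation: always jumps back to offset 0. */
static size_t broken_forward(const char *text, size_t pos)
{
    (void)text;
    (void)pos;
    return 0;
}

/* True if, from every offset, forward never goes backwards (or past the
 * end of the text) and backward never goes forwards. */
static bool respects_direction(const char *text, MoveFunc forward,
                               MoveFunc backward)
{
    assert(text != NULL);
    size_t len = strlen(text);
    for (size_t pos = 0; pos <= len; pos++) {
        size_t f = forward(text, pos);
        size_t b = backward(text, pos);
        if (f < pos || f > len) return false;
        if (b > pos) return false;
    }
    return true;
}
```

The stand-in default passes the check on "ab -- cd", while broken_forward fails it as soon as the starting offset is greater than 0.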
> I think we should make the textiter api work by calling the textbuffer api

The documentation of gtk_text_iter_starts_word() and gtk_text_iter_ends_word() talks about "natural-language word" and says that word breaks are determined by Pango.

A spell checker relies on the Pango word boundaries. If a spell checker uses the custom word boundaries, it will probably highlight e.g. punctuation as misspelled words.

So I think it's better to keep the word-related functions as is in GtkTextIter.

> Do you expect the movement type enum to grow ?

Not in the near future. In GtkSourceView there are plans only for word boundaries, FWIW.
(In reply to comment #4)
> > I think we should make the textiter api work by calling the textbuffer api
>
> The documentation of gtk_text_iter_starts_word() and gtk_text_iter_ends_word()
> talk about "natural-language word" and that word breaks are determined by
> Pango.
>
> A spell checker relies on the Pango word boundaries. If a spell checker uses
> the custom word boundaries, it will probably highlight misspelled words for
> e.g. punctuations.
>
> So I think it's better to keep word-related functions as is in GtkTextIter.

But you are not going to use a spell checker anyway on a buffer where custom word boundaries are relevant, like, say, source code.

I would consider updating the documentation to say

 * By default, word breaks are determined by Pango and should be correct for
 * nearly any human language (if not, the correct fix would be to the Pango
 * word break algorithms). Since GTK+ 3.14, the word break determination can
 * be customized via the #GtkTextBuffer::move-iter vfunc.
There are at least two different use cases for word boundaries:
- for a spell checker.
- for word movements (ctrl+arrow) and word selection (double click).

In source code it is useful to spell check the comments. Or in a LaTeX document there is a mix of "code" and text. The syntax highlighting engine of GtkSourceView defines "no-spell-check" regions in the buffer, so the spell checker runs only where it is relevant. But where it is relevant, only natural-language words must be spell checked. If the custom word boundaries are used for the spell checker, punctuation or other characters may be wrongly highlighted. And it would be difficult to apply the "no-spell-check" tag inside a comment to avoid spell checking the special characters.

For word movements (ctrl+arrow), my idea is to implement the same behavior as the 'e' and 'b' Vim commands, so a group of special characters is taken as a word.
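A rough sketch of such Vim-like word movements, in plain C on an ASCII string rather than on a GtkTextBuffer. It assumes three character classes (whitespace, word characters, everything else) and treats any run of one non-space class as a word. CharClass, MoveType and move_iter() are illustrative names only, and offsets are cursor positions between characters, so "word end" means the offset just after the last character of the word.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { CLASS_SPACE, CLASS_WORD, CLASS_OTHER } CharClass;
typedef enum { MOVE_FORWARD_WORD_END, MOVE_BACKWARD_WORD_START } MoveType;

/* Letters, digits and '_' form one class; other non-blank chars another. */
static CharClass char_class(char c)
{
    unsigned char u = (unsigned char)c;
    if (isspace(u)) return CLASS_SPACE;
    if (isalnum(u) || c == '_') return CLASS_WORD;
    return CLASS_OTHER;
}

/* Returns the new cursor offset; pos must be within [0, strlen(text)]. */
static size_t move_iter(const char *text, size_t pos, MoveType type)
{
    assert(text != NULL);
    if (type == MOVE_FORWARD_WORD_END) {
        /* Skip blanks, then consume one run of a single class. */
        while (text[pos] != '\0' && char_class(text[pos]) == CLASS_SPACE)
            pos++;
        if (text[pos] == '\0')
            return pos;
        CharClass c = char_class(text[pos]);
        while (text[pos] != '\0' && char_class(text[pos]) == c)
            pos++;
        return pos;
    }
    /* MOVE_BACKWARD_WORD_START: same idea, scanning to the left. */
    while (pos > 0 && char_class(text[pos - 1]) == CLASS_SPACE)
        pos--;
    if (pos == 0)
        return 0;
    CharClass c = char_class(text[pos - 1]);
    while (pos > 0 && char_class(text[pos - 1]) == c)
        pos--;
    return pos;
}
```

On "abcd ---- efgh", moving forward from offset 4 (just after the 'd') stops at offset 9, just after "----", instead of skipping to the end of "efgh".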
If you have those regions already, I'd suggest using them to determine the word boundary algorithm as well: when in a comment, use Pango; otherwise do whatever works for the programming language at hand.
It would fix the spell checker, indeed. But the word movements and selection in comments would be inconsistent with the rest of the code. A comment can contain a variable name, special characters like a line ----- to delimit a section, and so on.

The Unicode spec [1] explains that word boundaries can be tailored for certain features. A spell checker has different requirements than word movements, word selection or "whole word search".

Pango implements word boundaries for natural-language words, so it is suitable for a spell checker. But it is not suitable for word movements and selection. The current behavior is broken in my opinion. For example:

> abcd ---- efgh

- If I double click on "abcd", "abcd" is selected, OK.
- If I double click on "----", " ---- " is selected (with the spaces). It would be better to select "----" without the spaces, so it is consistent with a natural-language word.
- If the cursor is after the 'd' and I press Ctrl+right, the cursor is moved after the 'h'. It would be better to move after "----".

It would be nice to fix this behavior in GtkTextView directly, see bug #530405 (I'll add a comment there). The solution here with the vfunc is more flexible, but it adds an API (the vfunc), and GtkTextView itself doesn't benefit from the better behavior.

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries
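The double-click behavior wished for in the "abcd ---- efgh" example can be illustrated with a small plain-C sketch. It is not the GtkTextView code: select_word_at() and char_class() are hypothetical helpers on an ASCII string, and the selection is simply the run of characters that share the class of the clicked character (so a click on a blank selects only the blanks).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { CLASS_SPACE, CLASS_WORD, CLASS_OTHER } CharClass;

static CharClass char_class(char c)
{
    unsigned char u = (unsigned char)c;
    if (isspace(u)) return CLASS_SPACE;
    if (isalnum(u) || c == '_') return CLASS_WORD;
    return CLASS_OTHER;
}

/* Selects [*start, *end): the run of same-class characters around the
 * clicked character. Assumes pos < strlen(text). */
static void select_word_at(const char *text, size_t pos,
                           size_t *start, size_t *end)
{
    assert(text != NULL && start != NULL && end != NULL);
    CharClass c = char_class(text[pos]);
    size_t s = pos, e = pos;
    while (s > 0 && char_class(text[s - 1]) == c) s--;
    while (text[e] != '\0' && char_class(text[e]) == c) e++;
    *start = s;
    *end = e;
}
```

With this rule, a click inside "----" selects offsets 5 to 9, i.e. "----" without the surrounding spaces, and a click inside "abcd" still selects offsets 0 to 4.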
(In reply to comment #8)
> Pango implements word boundaries for natural-language words, so it is suitable
> for a spell checker. But it is not suitable for word movements and selection.

Another idea is to provide an API in Pango to retrieve word boundaries suitable for word movements and word selection. Then GtkTextView and GtkEntry can use this new Pango API.

Or, if flexibility is preferable, the vfunc is the other solution. But as I explained, the GtkTextIter functions are suitable for spell checkers but not for word movements and word selection. Two different types of word boundaries are required. Currently, with the GtkTextIter functions (word boundaries at natural-language words), it is easy to derive the word boundaries used for word movements and word selection. Doing the reverse is not easily possible. So I think the GtkTextIter functions should not be overridable.

Any thoughts on this?
In bug #111503 there is a discussion about consistency across applications. And I agree that the same behavior should be present in GtkEntry, GtkTextView and GtkSourceView.
Created attachment 288173
textiter: set/get boundary type

This way, all the words/sentences-related GtkTextIter functions can be reused for the custom word boundaries, and still be able to use the Pango boundaries for spell checkers.

This commit is just an idea, it's far from finished.
Created attachment 288854
textiter: more generic implementation for "visible" funcs

For gtk_text_iter_forward_visible_word_end(), backward_visible_word_start() etc., use a more generic implementation that is based on other public functions.

The purpose is to make gtk_text_iter_forward_word_end() and gtk_text_iter_backward_word_start() overridable; all the other word-related GtkTextIter functions should be based on them.

For the cursor_position functions the change is not really needed, but for code simplification it's better to keep only move_iter_visible().

The unit tests already have good coverage for this code, and there are no regressions.
Created attachment 288855
textiter: more generic implementation for starts/ends/inside word

The goal is to use only gtk_text_iter_forward_word_end() and backward_word_start(), which will be overridable.
Created attachment 288856
Add GtkTextBuffer::move_iter vfunc

So that gtk_text_iter_forward_word_end() and backward_word_start() are overridable.
Created attachment 288857
textiter: set/get boundary type

This way, all the words/sentences-related GtkTextIter functions can be reused for custom boundaries, and still be able to use the Pango boundaries for e.g. spell checkers.

With the GtkTextBuffer::move_iter vfunc, all boundary types are overridable. For example, GtkSourceBuffer can override the word boundaries used for cursor movements so that they better fit source code.
Still one thing to do: update the documentation of the gtk_text_iter word-related functions. The above 4 patches come with the GtkSourceView part: https://git.gnome.org/browse/gtksourceview/log/?h=wip/custom-word-boundaries All the gtk_text_iter word functions are already well covered by unit tests.
For the record, there was also a discussion on the mailing list:
https://mail.gnome.org/archives/gtk-devel-list/2014-September/msg00019.html

Also, the GtkTextMoveType enum can be extended for sentences. For example, a triple click in C code could select the whole scope. But it can be done later.
Review of attachment 288855:

Are you sure that it is a great idea to base these predicates on iter motion functions? Performance-wise this might be quite a step backwards: the motion functions might end up running over the whole buffer if no suitable boundaries are present.
Mmm, indeed with a buffer content like "--\n--\n--\nblah", from offset 0 gtk_text_iter_forward_word_end() goes to the end of the buffer.

The idea was to have only one vfunc, so the API addition is minimal. A solution is to add a 'limit' GtkTextIter parameter to the vfunc, so we can search only on the current line, for example. Currently test_log_attrs() searches only on the current line. For some languages it is not 100% correct, since a word can span multiple lines:

> A line boundary is usually a word boundary, but there are exceptions such as
> a word containing a SHY (soft hyphen): it will break across lines, yet is a
> single word.

http://www.unicode.org/reports/tr29/#Word_Boundaries

But taking only the current line should be good enough, and is not a regression from the current GtkTextIter implementation. In the future the 'limit' parameter can be changed to take the exceptions into account, for example by taking two lines instead of one, but that means pango_get_log_attrs() must be called with the two lines at once, not line by line.
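To illustrate the 'limit' idea, here is a minimal plain-C sketch on an ASCII string, not the actual vfunc. The signature of forward_word_end() and the line_end() helper are assumptions, and natural-language words are approximated by runs of alphanumerics. The search never looks at or beyond the limit offset, and the position is left untouched when no word end is found.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Offset of the next '\n' at or after pos, or of the end of the text. */
static size_t line_end(const char *text, size_t pos)
{
    while (text[pos] != '\0' && text[pos] != '\n')
        pos++;
    return pos;
}

/* Moves *pos to the next word end located before limit. Returns false
 * and leaves *pos untouched if there is none. limit must not exceed the
 * length of text. */
static bool forward_word_end(const char *text, size_t *pos, size_t limit)
{
    assert(text != NULL && pos != NULL);
    size_t i = *pos;
    while (i < limit && !isalnum((unsigned char)text[i]))
        i++;
    if (i >= limit)
        return false;   /* no word starts before the limit */
    while (i < limit && isalnum((unsigned char)text[i]))
        i++;
    *pos = i;
    return true;
}
```

For "--\n--\n--\nblah" and a start offset of 0, passing the buffer length as the limit walks all the way to offset 13, whereas passing line_end(text, 0) stops after inspecting only the first line and reports that no word end was found.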
sounds like a good plan to me
I'm not sure it's a good idea to add a boundary type state to GtkTextIter. GTK_TEXT_BOUNDARY_TYPE_NATURAL_LANGUAGE and GTK_TEXT_BOUNDARY_TYPE_FOR_CURSOR_MOVEMENTS make some sense for _word_ boundaries, but don't make sense for characters, cursor positions and lines.

Overriding word boundaries for keybindings like Ctrl+left and Ctrl+right is already possible by overriding the GtkTextView::move-cursor and GtkTextView::delete-from-cursor signals.

What is currently not easily feasible is to customize the double click and triple click behaviors. For the triple click, GtkTextView selects the "display line", by using functions like gtk_text_view_backward_display_line_start(). It doesn't make sense to be able to override those functions just to be able to customize the triple click. So a signal or vfunc is needed. We could have an enum with DOUBLE_CLICK and TRIPLE_CLICK, so that the double click is also more customizable. The above patches assume that the double click selects a word. But what if a text editor wants a totally different behavior, like selecting the display line for double click and the "buffer line" for triple click?

So I think it's better to have a signal or vfunc specifically for the double and triple click, without assumptions about their behavior. Having that is sufficient, and is less invasive than adding a boundary type state to GtkTextIter.
(In reply to comment #21)
> So I think it's better to have a signal or vfunc specifically for the double
> and triple click, without assumptions about their behavior. Having that is
> sufficient, and is less invasive than adding a boundary type state to
> GtkTextIter.

See the patch at bug #111503 for that.
*** This bug has been marked as a duplicate of bug 111503 ***