Bug 63398 – gtk_text_iter_start_word () works wrong with




Summary:	gtk_text_iter_start_word () works wrong with


Status:	RESOLVED DUPLICATE of bug 97545

Product:	gtk+
Classification:	Platform
Component:	Widget: GtkTextView
Version:	1.3.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-bugs
QA Contact:	gtk-bugs

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2001-10-30 18:35 UTC by Vitaly Tishkov
Modified:	2011-02-04 16:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Vitaly Tishkov 2001-10-30 18:35:13 UTC

Section 5.15 of the Unicode Standard, version 3.0, says that a break should be
inserted between letters and non-letters when determining word boundaries. Gtk+
handles digits correctly but handles other non-letters (such as punctuation and
whitespace) incorrectly.

Test case:

#include <stdio.h>   /* printf */
#include <string.h>  /* strlen */
#include <gtk/gtk.h>

void test (const gchar* line, gint pos);

int
main (int argc, char** argv)
{
    gtk_init (&argc, &argv);
    test ("word1word2", 3);
    test ("word1word2", 4);
    test ("word1word2", 5);
    test ("word;word2", 3);
    test ("word;word2", 4);
    test ("word;word2", 5);
    test ("word\tword2", 3);
    test ("word\tword2", 4);
    test ("word\tword2", 5);
    return 0;
}

/* Report whether the iterator at character offset pos starts a word. */
void test (const gchar* line, gint pos)
{
    GtkTextIter iter;
    GtkTextBuffer* buffer;
    gboolean res;

    buffer = gtk_text_buffer_new (NULL);
    gtk_text_buffer_set_text (buffer, line, strlen (line));

    gtk_text_buffer_get_iter_at_offset (buffer, &iter, pos);
    res = gtk_text_iter_starts_word (&iter);

    printf ("line = '%s', pos = %d, res = %d\n", line, pos, res);

    /* Release the buffer created for this test. */
    g_object_unref (buffer);
}

Output:
line = 'word1word2', pos = 3, res = 0
line = 'word1word2', pos = 4, res = 1
line = 'word1word2', pos = 5, res = 1
line = 'word;word2', pos = 3, res = 0
line = 'word;word2', pos = 4, res = 0
line = 'word;word2', pos = 5, res = 1
line = 'word	word2', pos = 3, res = 0
line = 'word	word2', pos = 4, res = 0
line = 'word	word2', pos = 5, res = 1

I assume gtk_text_iter_starts_word() should return TRUE for these test
cases:
line = 'word;word2', pos = 4, res = 0
line = 'word	word2', pos = 4, res = 0

Comment 1 Matthias Clasen 2002-04-05 13:34:32 UTC

Move open bugs from milestones 2.0.[012] --> 2.0.3, since 2.0.2 is already out.

Comment 2 Owen Taylor 2002-06-14 15:31:03 UTC

Move GtkTextView 2.0.4 bugs to 2.0.5

Comment 3 Matthias Clasen 2002-08-08 07:57:52 UTC

It seems that Pango's idea of words is a bit different from the Unicode idea. Unicode
seems to simply cut a text into a sequence of words by inserting word boundaries at
certain places, i.e. it doesn't distinguish between word start and end. Pango, on the
other hand, splits a text into a sequence of "letter-words", "number-words" and
non-words, i.e. the whitespace and the semicolon end up not being in any word, and
thus do not start a word.

I don't know which approach makes more sense, but the difference may be important
for tasks like word counting...

Does XPointer have words, too? If so, it might be interesting to see what approach
they take.

Should this bug be moved to Pango?
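
To make the two readings in this comment concrete, here is a rough ASCII-only sketch. It is not Pango's code and not the Unicode algorithm: the names toy_is_boundary and toy_starts_word and the three character classes are made up for illustration, and real segmentation uses many more classes. Under these assumptions, the "boundary" reading reports a break at offset 4 of "word;word2", while the "runs of letters or digits" reading does not let that offset start a word.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy character classes (assumption: ASCII only, three classes). */
typedef enum { CLASS_OTHER, CLASS_LETTER, CLASS_DIGIT } CharClass;

CharClass classify(char c)
{
    if (isalpha((unsigned char)c)) return CLASS_LETTER;
    if (isdigit((unsigned char)c)) return CLASS_DIGIT;
    return CLASS_OTHER;
}

/* "Boundary" reading: every change of class is a break, so the
 * offsets just before and just after ';' or a tab both count. */
int toy_is_boundary(const char *text, size_t pos)
{
    size_t len;

    assert(text != NULL);
    len = strlen(text);
    assert(pos <= len);

    if (pos == 0 || pos == len)
        return 1;
    return classify(text[pos - 1]) != classify(text[pos]);
}

/* "Runs" reading: only runs of letters or runs of digits are words,
 * so an offset holding punctuation or whitespace never starts one. */
int toy_starts_word(const char *text, size_t pos)
{
    size_t len;
    CharClass cls;

    assert(text != NULL);
    len = strlen(text);
    if (pos >= len)
        return 0;

    cls = classify(text[pos]);
    if (cls == CLASS_OTHER)
        return 0;
    return pos == 0 || classify(text[pos - 1]) != cls;
}
```

For the strings in the test case, toy_starts_word gives the same 0/1 pattern as the reported output, and toy_is_boundary additionally flags offset 4 in the ';' and tab cases, which is the result the report asks for.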

Comment 4 Matthias Clasen 2002-11-21 19:00:14 UTC

Moving bugs from older 2.0.x milestones to 2.0.10.

Comment 5 Luis Villa 2002-12-05 21:54:13 UTC

Hrm, I seem to recall other similar bugs. Matthias, it may be worth
searching for 'Unicode' in the entire gtk bug product?

Comment 6 Owen Taylor 2002-12-16 20:48:04 UTC

Going to mark this as a duplicate of the big "fix pango_default_break"
bug, though I haven't analyzed the details here.

Distinguishing between word start and end is generally important
for correct cursor navigation; perhaps we need something for
words like we have currently for sentences:


  /* There are two ways to divide sentences. The first assigns all
   * intersentence whitespace/control/format chars to some sentence,
   * so all chars are in some sentence; is_sentence_boundary denotes
   * the boundaries there. The second way doesn't assign
   * between-sentence spaces, etc. to any sentence, so
   * is_sentence_start/is_sentence_end mark the boundaries of those
   * sentences.
   */
  guint is_sentence_boundary : 1;
  guint is_sentence_start : 1;  /* first character in a sentence */
  guint is_sentence_end : 1;    /* first non-sentence char after a sentence */
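
As a rough illustration of what "something for words" along the same lines could look like, here is a sketch that carries both views at once. ToyWordAttr, toy_class and toy_word_attrs are hypothetical names invented for this example, the classification is ASCII-only, and none of this is taken from Pango's sources.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-offset flags, mirroring the sentence flags above. */
typedef struct {
    unsigned int is_word_boundary : 1; /* edge between any two segments */
    unsigned int is_word_start : 1;    /* first character in a word */
    unsigned int is_word_end : 1;      /* first offset after a word */
} ToyWordAttr;

/* Toy classes: 0 = non-word, 1 = letter, 2 = digit (ASCII only). */
static int toy_class(char c)
{
    if (isalpha((unsigned char)c)) return 1;
    if (isdigit((unsigned char)c)) return 2;
    return 0;
}

/* Fill attrs[0..len] for text; the caller must supply
 * strlen(text) + 1 entries, one per offset including the end. */
void toy_word_attrs(const char *text, ToyWordAttr *attrs)
{
    size_t len, i;

    assert(text != NULL && attrs != NULL);
    len = strlen(text);

    for (i = 0; i <= len; i++) {
        int prev = (i > 0) ? toy_class(text[i - 1]) : 0;
        int cur = (i < len) ? toy_class(text[i]) : 0;

        /* First way: every class change is a boundary. */
        attrs[i].is_word_boundary = (i == 0 || i == len || prev != cur);
        /* Second way: non-word characters belong to no word. */
        attrs[i].is_word_start = (cur != 0 && prev != cur);
        attrs[i].is_word_end = (prev != 0 && prev != cur);
    }
}
```

With flags like these, cursor navigation could use the start/end pair while a caller that wants every break (as in the original report) could use the boundary flag.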


*** This bug has been marked as a duplicate of 97545 ***