Bug 63398 - gtk_text_iter_start_word () works wrong with
Status: RESOLVED DUPLICATE of bug 97545
Product: gtk+
Classification: Platform
Component: Widget: GtkTextView
Version: 1.3.x
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: gtk-bugs
QA Contact: gtk-bugs
Depends on:
Blocks:
 
 
Reported: 2001-10-30 18:35 UTC by Vitaly Tishkov
Modified: 2011-02-04 16:09 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Vitaly Tishkov 2001-10-30 18:35:13 UTC
Section 5.15 of the Unicode Standard, version 3.0, says that a break should be
inserted between letters and non-letters when determining word boundaries. Gtk+
handles numbers correctly but handles other non-letters (such as ';' and tab)
incorrectly.

Test case:

#include <stdio.h>   /* printf */
#include <string.h>  /* strlen */
#include <gtk/gtk.h>

void test (const gchar* line, gint pos);

int
main (int argc, char** argv)
{
    gtk_init (&argc, &argv);
    /* Letter/digit transitions */
    test ("word1word2", 3);
    test ("word1word2", 4);
    test ("word1word2", 5);
    /* Letter/punctuation transitions */
    test ("word;word2", 3);
    test ("word;word2", 4);
    test ("word;word2", 5);
    /* Letter/whitespace transitions */
    test ("word\tword2", 3);
    test ("word\tword2", 4);
    test ("word\tword2", 5);
    return 0;
}

/* Print whether the character at offset pos in line starts a word. */
void test (const gchar* line, gint pos)
{
    GtkTextIter iter;
    GtkTextBuffer* buffer;
    gboolean res;

    buffer = gtk_text_buffer_new (NULL);
    gtk_text_buffer_set_text (buffer, line, strlen (line));

    gtk_text_buffer_get_iter_at_offset (buffer, &iter, pos);
    res = gtk_text_iter_starts_word (&iter);

    printf ("line = '%s', pos = %d, res = %d\n", line, pos, res);
    g_object_unref (buffer);
}

Output:
line = 'word1word2', pos = 3, res = 0
line = 'word1word2', pos = 4, res = 1
line = 'word1word2', pos = 5, res = 1
line = 'word;word2', pos = 3, res = 0
line = 'word;word2', pos = 4, res = 0
line = 'word;word2', pos = 5, res = 1
line = 'word	word2', pos = 3, res = 0
line = 'word	word2', pos = 4, res = 0
line = 'word	word2', pos = 5, res = 1

I would expect gtk_text_iter_starts_word() to return TRUE in these two test
cases, which currently return FALSE:
line = 'word;word2', pos = 4, res = 0
line = 'word	word2', pos = 4, res = 0
Comment 1 Matthias Clasen 2002-04-05 13:34:32 UTC
Move open bugs from milestones 2.0.[012] -> 2.0.3, since 2.0.2 is already out.
Comment 2 Owen Taylor 2002-06-14 15:31:03 UTC
Move GtkTextView 2.0.4 bugs to 2.0.5
Comment 3 Matthias Clasen 2002-08-08 07:57:52 UTC
It seems that Pango's idea of words differs somewhat from Unicode's. Unicode
seems to simply cut a text into a sequence of words by inserting word
boundaries at certain places, i.e. it doesn't distinguish between word start
and word end. Pango, on the other hand, splits a text into a sequence of
"letter-words", "number-words" and non-words, i.e. the whitespace and the
semicolon end up not being in any word, and thus do not start a word.

I don't know which approach makes more sense, but the difference may be
important for tasks like word counting...

Does XPointer have words, too?
If so, it might be interesting to see what approach they take.

Should this bug be moved to Pango?
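
To make the word-counting difference concrete, here is a minimal sketch of the two models described in this comment. It is a toy restricted to ASCII, not Pango's or Unicode's actual algorithm, and the names toy_classify, toy_count_segments and toy_count_words are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy character classes (ASCII only); an assumption for illustration. */
enum toy_class { TOY_OTHER, TOY_LETTER, TOY_NUMBER };

static enum toy_class
toy_classify (char c)
{
    if (isalpha ((unsigned char) c)) return TOY_LETTER;
    if (isdigit ((unsigned char) c)) return TOY_NUMBER;
    return TOY_OTHER;
}

/* Boundary model: every change of class starts a new segment,
 * so every character belongs to some segment. */
static int
toy_count_segments (const char *s)
{
    int n = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (i == 0 || toy_classify (s[i - 1]) != toy_classify (s[i]))
            n++;
    return n;
}

/* Start/end model: only runs of letters or runs of digits are words;
 * punctuation and whitespace belong to no word. */
static int
toy_count_words (const char *s)
{
    int n = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        enum toy_class c = toy_classify (s[i]);
        if (c != TOY_OTHER && (i == 0 || toy_classify (s[i - 1]) != c))
            n++;
    }
    return n;
}
```

Under these assumptions, "word;word2" yields four segments in the boundary model ("word", ";", "word", "2") but only three words in the start/end model, while "word1word2" yields four in both.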
Comment 4 Matthias Clasen 2002-11-21 19:00:14 UTC
Moving bugs from older 2.0.x milestones to 2.0.10.
Comment 5 Luis Villa 2002-12-05 21:54:13 UTC
Hrm, I seem to recall other similar bugs, Matthias; it may be worth
searching for 'Unicode' in the entire gtk bug product?
Comment 6 Owen Taylor 2002-12-16 20:48:04 UTC
Going to mark this as a duplicate of the big "fix pango_default_break"
bug, though I haven't analyzed the details here.

Distinguishing between word start and end is generally important
for correct cursor navigation; perhaps we need something for
words like we have currently for sentences:


  /* There are two ways to divide sentences. The first assigns all
   * intersentence whitespace/control/format chars to some sentence,
   * so all chars are in some sentence; is_sentence_boundary denotes
   * the boundaries there. The second way doesn't assign
   * between-sentence spaces, etc. to any sentence, so
   * is_sentence_start/is_sentence_end mark the boundaries of those
   * sentences.
   */
  guint is_sentence_boundary : 1;
  guint is_sentence_start : 1;  /* first character in a sentence */
  guint is_sentence_end : 1;    /* first non-sentence char after a sentence */
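
As a rough illustration of what an analogous set of flags for words might look like, here is a sketch that mirrors the sentence flags quoted above. The ToyWordAttr struct and the toy_word_class and toy_fill_word_attrs helpers are hypothetical, handle ASCII only, and are not taken from Pango.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word flags, modelled on the sentence flags quoted above. */
typedef struct {
    unsigned is_word_boundary : 1; /* class changes here (all chars in some segment) */
    unsigned is_word_start : 1;    /* first character in a word */
    unsigned is_word_end : 1;      /* first non-word char after a word */
} ToyWordAttr;

/* 0 = not part of any word, 1 = letter, 2 = digit (ASCII only). */
static int
toy_word_class (char c)
{
    if (isalpha ((unsigned char) c)) return 1;
    if (isdigit ((unsigned char) c)) return 2;
    return 0;
}

/* Fill one entry per position; attrs needs room for strlen (s) + 1
 * entries because the position after the last char can end a word. */
static void
toy_fill_word_attrs (const char *s, ToyWordAttr *attrs)
{
    size_t len = strlen (s);
    for (size_t i = 0; i <= len; i++) {
        int prev = (i > 0) ? toy_word_class (s[i - 1]) : 0;
        int cur = (i < len) ? toy_word_class (s[i]) : 0;

        attrs[i].is_word_boundary = (i == 0 || i == len || prev != cur);
        attrs[i].is_word_start = (cur != 0 && prev != cur);
        attrs[i].is_word_end = (prev != 0 && prev != cur);
    }
}
```

With this toy model, position 4 of "word;word2" (the semicolon) is a word boundary and a word end but not a word start, and position 5 is a word start. A caller wanting the behaviour expected in the description would test is_word_boundary, while cursor navigation would use is_word_start and is_word_end.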


*** This bug has been marked as a duplicate of 97545 ***