GNOME Bugzilla – Bug 63398
gtk_text_iter_start_word () works wrong with
Last modified: 2011-02-04 16:09:49 UTC
Section 5.15 of the Unicode Standard v. 3.0 says that a break should be inserted between letters and non-letters when determining word boundaries. Gtk+ handles the letter/digit case correctly, but it handles other non-letters (punctuation, whitespace) incorrectly.

Test case:

#include <stdio.h>
#include <string.h>
#include <gtk/gtk.h>

void test (const gchar* line, gint pos);

int main (int argc, char** argv)
{
  gtk_init (&argc, &argv);

  test ("word1word2", 3);
  test ("word1word2", 4);
  test ("word1word2", 5);
  test ("word;word2", 3);
  test ("word;word2", 4);
  test ("word;word2", 5);
  test ("word\tword2", 3);
  test ("word\tword2", 4);
  test ("word\tword2", 5);

  return 0;
}

/* Print whether the iterator at character offset pos starts a word. */
void test (const gchar* line, gint pos)
{
  GtkTextIter iter;
  GtkTextBuffer* buffer;
  gboolean res;

  buffer = gtk_text_buffer_new (NULL);
  gtk_text_buffer_set_text (buffer, line, strlen (line));
  gtk_text_buffer_get_iter_at_offset (buffer, &iter, pos);
  res = gtk_text_iter_starts_word (&iter);
  printf ("line = '%s', pos = %d, res = %d\n", line, pos, res);
  g_object_unref (buffer);
}

Output:

line = 'word1word2', pos = 3, res = 0
line = 'word1word2', pos = 4, res = 1
line = 'word1word2', pos = 5, res = 1
line = 'word;word2', pos = 3, res = 0
line = 'word;word2', pos = 4, res = 0
line = 'word;word2', pos = 5, res = 1
line = 'word word2', pos = 3, res = 0
line = 'word word2', pos = 4, res = 0
line = 'word word2', pos = 5, res = 1

I assume gtk_text_iter_starts_word() should return TRUE for these test cases:

line = 'word;word2', pos = 4, res = 0
line = 'word word2', pos = 4, res = 0
Move open bugs from milestones 2.0.[012] -> 2.0.3, since 2.0.2 is already out.
Move GtkTextView 2.0.4 bugs to 2.0.5
It seems that Pango's idea of words is a bit different from the Unicode idea. Unicode seems to simply cut a text into a sequence of words by inserting word boundaries at certain places, i.e. it doesn't distinguish between word start and word end. Pango, on the other hand, splits a text into a sequence of "letter-words", "number-words" and non-words, i.e. the whitespace and the semicolon end up not being in any word, and thus do not start a word.

I don't know which approach makes more sense, but the difference may be important for tasks like word counting...

Does XPointer have words, too? If so, it might be interesting to see what approach they take.

Should this bug be moved to Pango?
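To make the difference concrete, here is a minimal sketch of the two models in plain C. It is only an illustration under loud assumptions: it classifies ASCII characters with <ctype.h> rather than Unicode properties, and the names classify(), is_boundary() and starts_word() are hypothetical. It is not Pango's algorithm and not the Unicode rules, just the simplest thing that reproduces the output reported above.

```c
#include <ctype.h>
#include <stddef.h>

/* Illustrative character classes (ASCII only, an assumption of this sketch). */
typedef enum { CLASS_LETTER, CLASS_DIGIT, CLASS_OTHER } CharClass;

static CharClass classify(char c)
{
    unsigned char u = (unsigned char) c;
    if (isalpha(u)) return CLASS_LETTER;
    if (isdigit(u)) return CLASS_DIGIT;
    return CLASS_OTHER;
}

/* "Boundary" model: every position where the character class changes is a
 * boundary, so every character belongs to some segment. */
static int is_boundary(const char *s, size_t len, size_t pos)
{
    if (pos == 0 || pos >= len) return 1;
    return classify(s[pos - 1]) != classify(s[pos]);
}

/* "Start/end" model: only runs of letters or runs of digits are words.
 * Punctuation and whitespace belong to no word, so they never start one. */
static int starts_word(const char *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    if (classify(s[pos]) == CLASS_OTHER) return 0;
    return pos == 0 || classify(s[pos - 1]) != classify(s[pos]);
}
```

For "word;word2" at offset 4 (the semicolon), is_boundary() reports a boundary while starts_word() does not, which is exactly the gap between what the reporter expected and what gtk_text_iter_starts_word() returned.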
Moving bugs from older 2.0.x milestones to 2.0.10.
Hrm, I seem to recall other similar bugs. Matias, it may be worth searching for 'Unicode' in the entire gtk bug product?
Going to mark this as a duplicate of the big "fix pango_default_break" bug, though I haven't analyzed the details here.

Distinguishing between word start and end is generally important for correct cursor navigation; perhaps we need something for words like we have currently for sentences:

  /* There are two ways to divide sentences. The first assigns all
   * intersentence whitespace/control/format chars to some sentence,
   * so all chars are in some sentence; is_sentence_boundary denotes
   * the boundaries there. The second way doesn't assign
   * between-sentence spaces, etc. to any sentence, so
   * is_sentence_start/is_sentence_end mark the boundaries of those
   * sentences.
   */
  guint is_sentence_boundary : 1;
  guint is_sentence_start : 1;    /* first character in a sentence */
  guint is_sentence_end : 1;      /* first non-sentence char after a sentence */

*** This bug has been marked as a duplicate of 97545 ***
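As a rough illustration of why separate start and end flags matter for cursor navigation, here is a small self-contained sketch in plain C. Everything in it is an assumption made for illustration: the WordAttr struct, compute_word_attrs(), forward_word_end() and backward_word_start() are hypothetical names, classification is ASCII-only via isalnum(), and letters and digits are lumped into one word for brevity. It is not Pango's implementation.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical per-position flags, modelled on the sentence flags above. */
typedef struct {
    unsigned int is_word_start : 1; /* first character in a word */
    unsigned int is_word_end   : 1; /* first non-word char after a word */
} WordAttr;

/* Assumption: a "word character" is any ASCII letter or digit. */
static int in_word(char c)
{
    return isalnum((unsigned char) c) != 0;
}

/* Fill attrs[0..len]; the caller must provide room for len + 1 entries. */
static void compute_word_attrs(const char *s, size_t len, WordAttr *attrs)
{
    for (size_t i = 0; i <= len; i++) {
        int prev = i > 0 && in_word(s[i - 1]);
        int cur = i < len && in_word(s[i]);
        attrs[i].is_word_start = !prev && cur;
        attrs[i].is_word_end = prev && !cur;
    }
}

/* Move right to the next word end, or to len if there is none. */
static size_t forward_word_end(const WordAttr *attrs, size_t len, size_t pos)
{
    while (pos < len) {
        pos++;
        if (attrs[pos].is_word_end) return pos;
    }
    return len;
}

/* Move left to the previous word start, or to 0 if there is none. */
static size_t backward_word_start(const WordAttr *attrs, size_t pos)
{
    while (pos > 0) {
        pos--;
        if (attrs[pos].is_word_start) return pos;
    }
    return 0;
}
```

With "word;word2", moving forward from offset 0 stops at offset 4 (just after "word"), while moving backward from the end stops at offset 5 (just before "word2"). A single boundary flag could not tell these two kinds of stop apart, since the separator sits between them.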