GNOME Bugzilla – Bug 620572
pango-layout->set_text causes the size of the text in chars to increase
Last modified: 2010-06-04 21:06:48 UTC
in PangoLayout->set_text, each invalid byte in the text is replaced by -1 (which generates an unknown glyph to be generate). The problem is that many chars are represented by multiple bytes in UTF-8. In my case, the text has \uFFFF and \uFFFE (which use 3 bytes in UTF-8). So, everytime \uFFFF or \uFFFE occurs in the text pango inserts 3 unknown glyph causing the length of the text in chars to be bigger by 2. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=314620 Here is where I believe the problem is, pango-layout.c set_text(): ... /* validate it, and replace invalid bytes with '?' */ start = layout->text; for (;;) { gboolean valid; valid = g_utf8_validate (start, -1, (const char **)&end); if (!*end) break; /* Replace invalid bytes with -1. The -1 will be converted to * ((gunichar) -1) by glib, and that in turn yields a glyph value of * ((PangoGlyph) -1) by PANGO_GET_UNKNOWN_GLYPH(-1), * and that's PANGO_GLYPH_INVALID_INPUT. */ if (!valid) *end++ = -1; start = end; } ------------------- I believe the error is in these lines: if (!valid) *end++ = -1; It should have been (I think): if (!valid) { *end = -1; end = g_utf8_find_next_char (end, -1) } I did not test this.
This code path is supposed to be emergency fallback - if there's a programming bug that results in invalid unicode in a PangoLayout show something to the user rather than an empty layout. But if you know what invalid text you are putting into a PangoLayout, then please avoid doing so. That said, it wouldn't be ridiculous to recognize sequences that are Unicode non-characters encoded in UTF-8 and show a single invalid glyph for them. The proposed patch is definitely wrong though - if someone, say, tries to put, say, iso-8859-1 encoded text into a PangoLayout, we shouldn't be collapsing sequences of bytes that aren't UTF-8 start bytes into a single glyph.
(In reply to comment #1) > This code path is supposed to be emergency fallback - if there's a programming > bug that results in invalid unicode in a PangoLayout show something to the user > rather than an empty layout. That is a good thing, but changing the length of string causes problems for the users. Here is a test case: #include <gtk/gtk.h> #include <string.h> int main (int argc, char** argv) { const char* text = "a\357\277\276b\0"; // a\uFFFE\b GtkWidget *window, *label; gtk_init_check (&argc, &argv); window = gtk_window_new (GTK_WINDOW_TOPLEVEL); label = gtk_label_new (text); gtk_container_add (GTK_CONTAINER (window), label); PangoLayout* layout = gtk_widget_create_pango_layout (label, text); const char * text2 = pango_layout_get_text(layout); printf("text length in chars %i\n", g_utf8_strlen (text, -1)); printf("text length in bytes %i\n", strlen (text)); printf("text2 length in chars %i\n", g_utf8_strlen (text2, -1)); printf("text2 length in bytes %i\n", strlen (text2)); g_object_unref(layout); g_signal_connect (G_OBJECT (window), "destroy", G_CALLBACK (gtk_main_quit), NULL); g_signal_connect (G_OBJECT (window), "delete_event",G_CALLBACK (gtk_main_quit), NULL); gtk_widget_show_all (window); gtk_main(); } It is g_utf8_strlen (text2, -1) != g_utf8_strlen (text, -1) that is causing problems for me. > But if you know what invalid text you are putting > into a PangoLayout, then please avoid doing so. Yes, I can do that, and probably I'll have to. But it will double the work (because pango is still going to validate the string over again). Bad for performance. > That said, it wouldn't be ridiculous to recognize sequences that are Unicode > non-characters encoded in UTF-8 and show a single invalid glyph for them. Run the snippet above, it shows 3 empty boxes (one for each byte). The text only includes one Unicode code point that is invalid. > The proposed patch is definitely wrong though - if someone, say, > tries to put, say, > iso-8859-1 encoded text into a PangoLayout, we shouldn't be collapsing > sequences of bytes that aren't UTF-8 start bytes into a single glyph. Right, the patch is wrong...it only changes the first byte and leaves garbage in the other two. What about this case: Suppose the text is: a\uFFFE\uFFFE\uFFFE\b The current code will display 9 boxes. With the patch it will show 1 box (?) Besides, can we fix the "g_utf8_strlen (text2, -1) != g_utf8_strlen (text, -1)" without breaking "strlen (text) == strlen (text2)" ?
\uFFFE is invalid Unicode. Garbage in, garbage out. Pango is being really nice with you not crashing.
> Run the snippet above, it shows 3 empty boxes (one for each byte). The text > only includes one Unicode code point that is invalid. If it's invalid, it's not a Unicode code point, be it one or more. Owen, the code must be the way it is because we can't change the byte length since attribute offsets work in that space. So, every byte of invalid data needs to have a byte corresponding to it. The only way to get away with it is to introduce 2byte and 3byte and 4byte "this is invalid" sequences internal to Pango. I'm already being attacked on for relying on glib being friendly to -1 bytes in its utf-8 routines. Don't want to invent more pseudo-utf8.
(In reply to comment #3) > \uFFFE is invalid Unicode. Not really... it is not a valid character but it valid code point and very useful in some cases. > Garbage in, garbage out. > Pango is being really nice with you not crashing. Thanks ;-) <gr> You can close the bug as WONT FIX - but the bug is there (how can you change the size of text in chars and say it is not a bug?)
(In reply to comment #5) > (In reply to comment #3) > > \uFFFE is invalid Unicode. > > Not really... it is not a valid character but it valid code point and very > useful in some cases. Valid code point as in it's a valid number you mean? > > Garbage in, garbage out. > > Pango is being really nice with you not crashing. > > Thanks ;-) > <gr> > > You can close the bug as WONT FIX - but the bug is there (how can you change > the size of text in chars and say it is not a bug?) Docs for pango_layout_set_text() explain that text has to be "a valid UTF-8 string". And I already explained why it can't be changed. Fix your code!