Bug 620572 – pango-layout->set_text causes the size of the text in chars to increase

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 620572 - pango-layout->set_text causes the size of the text in chars to increase


Summary:	pango-layout->set_text causes the size of the text in chars to increase


Status:	RESOLVED NOTABUG

Product:	pango
Classification:	Platform
Component:	general
Version:	1.24.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-06-04 13:57 UTC by Felipe Heidrich
Modified:	2010-06-04 21:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Felipe Heidrich 2010-06-04 13:57:37 UTC

in PangoLayout->set_text, each invalid byte in the text is replaced by -1 (which generates an unknown glyph to be generate). The problem is that many chars are represented by multiple bytes in UTF-8. In my case, the text has \uFFFF and \uFFFE (which use 3 bytes in UTF-8). So, everytime \uFFFF or \uFFFE occurs in the text pango inserts 3 unknown glyph causing the length of the text in chars to be bigger by 2.

See
https://bugs.eclipse.org/bugs/show_bug.cgi?id=314620

Here is where I believe the problem is, pango-layout.c set_text():
...
  /* validate it, and replace invalid bytes with '?'
   */
  start = layout->text;
  for (;;) {
    gboolean valid;

    valid = g_utf8_validate (start, -1, (const char **)&end);

    if (!*end)
      break;

    /* Replace invalid bytes with -1.  The -1 will be converted to
     * ((gunichar) -1) by glib, and that in turn yields a glyph value of
     * ((PangoGlyph) -1) by PANGO_GET_UNKNOWN_GLYPH(-1),
     * and that's PANGO_GLYPH_INVALID_INPUT.
     */
    if (!valid)
      *end++ = -1;

    start = end;
  }

-------------------
I believe the error is in these lines: 
    if (!valid)
      *end++ = -1;


It should have been (I think):
    if (!valid) {
      *end = -1;
      end = g_utf8_find_next_char (end, -1) 
    }

I did not test this.

Comment 1 Owen Taylor 2010-06-04 15:28:41 UTC

This code path is supposed to be emergency fallback - if there's a programming bug that results in invalid unicode in a PangoLayout show something to the user rather than an empty layout. But if you know what invalid text you are putting into a PangoLayout, then please avoid doing so.

That said, it wouldn't be ridiculous to recognize sequences that are Unicode non-characters encoded in UTF-8 and show a single invalid glyph for them. The proposed patch is definitely wrong though - if someone, say, tries to put, say, iso-8859-1 encoded text into a PangoLayout, we shouldn't be collapsing sequences of bytes that aren't UTF-8 start bytes into a single glyph.

Comment 2 Felipe Heidrich 2010-06-04 18:19:05 UTC

(In reply to comment #1)
> This code path is supposed to be emergency fallback - if there's a programming
> bug that results in invalid unicode in a PangoLayout show something to the user
> rather than an empty layout. 

That is a good thing, but changing the length of string causes problems for the users.

Here is a test case:
#include <gtk/gtk.h>
#include <string.h>

int main (int argc, char** argv) {
	const char* text = "a\357\277\276b\0"; // a\uFFFE\b

	GtkWidget *window, *label;
	
	gtk_init_check (&argc, &argv);
	window  = gtk_window_new (GTK_WINDOW_TOPLEVEL);
	label = gtk_label_new  (text);
	
	gtk_container_add (GTK_CONTAINER (window), label);
	PangoLayout* layout = gtk_widget_create_pango_layout (label, text);
	const char * text2 = pango_layout_get_text(layout);
	
	printf("text length in chars %i\n", g_utf8_strlen (text, -1));
	printf("text length in bytes %i\n", strlen (text));

	printf("text2 length in chars %i\n", g_utf8_strlen (text2, -1));
	printf("text2 length in bytes %i\n", strlen (text2));
	g_object_unref(layout);
	
		
	g_signal_connect (G_OBJECT (window), "destroy", G_CALLBACK (gtk_main_quit), NULL);
	g_signal_connect (G_OBJECT (window), "delete_event",G_CALLBACK (gtk_main_quit), NULL);
	gtk_widget_show_all (window);
	gtk_main();
}

It is g_utf8_strlen (text2, -1) != g_utf8_strlen (text, -1) that is causing problems for me.

> But if you know what invalid text you are putting
> into a PangoLayout, then please avoid doing so.

Yes, I can do that, and probably I'll have to. But it will double the work (because pango is still going to validate the string over again). Bad for performance.
 
> That said, it wouldn't be ridiculous to recognize sequences that are Unicode
> non-characters encoded in UTF-8 and show a single invalid glyph for them.

Run the snippet above, it shows 3 empty boxes (one for each byte). The text only includes one Unicode code point that is invalid.

> The proposed patch is definitely wrong though - if someone, say, 
> tries to put, say,
> iso-8859-1 encoded text into a PangoLayout, we shouldn't be collapsing
> sequences of bytes that aren't UTF-8 start bytes into a single glyph.

Right, the patch is wrong...it only changes the first byte and leaves garbage in the other two.

What about this case:
Suppose the text is: a\uFFFE\uFFFE\uFFFE\b

The current code will display 9 boxes.
With the patch it will show 1 box (?)

Besides, can we fix the "g_utf8_strlen (text2, -1) != g_utf8_strlen (text, -1)" without breaking "strlen (text) == strlen (text2)" ?

Comment 3 Behdad Esfahbod 2010-06-04 18:33:56 UTC

\uFFFE is invalid Unicode.  Garbage in, garbage out.  Pango is being really nice with you not crashing.

Comment 4 Behdad Esfahbod 2010-06-04 18:43:45 UTC

> Run the snippet above, it shows 3 empty boxes (one for each byte). The text
> only includes one Unicode code point that is invalid.

If it's invalid, it's not a Unicode code point, be it one or more.

Owen, the code must be the way it is because we can't change the byte length since attribute offsets work in that space.  So, every byte of invalid data needs to have a byte corresponding to it.  The only way to get away with it is to introduce 2byte and 3byte and 4byte "this is invalid" sequences internal to Pango.  I'm already being attacked on for relying on glib being friendly to -1 bytes in its utf-8 routines.  Don't want to invent more pseudo-utf8.

Comment 5 Felipe Heidrich 2010-06-04 19:47:56 UTC

(In reply to comment #3)
> \uFFFE is invalid Unicode. 

Not really... it is not a valid character but it valid code point and very useful in some cases.

> Garbage in, garbage out.  
> Pango is being really nice with you not crashing.

Thanks ;-)
<gr>

You can close the bug as WONT FIX - but the bug is there (how can you change the size of text in chars and say it is not a bug?)

Comment 6 Behdad Esfahbod 2010-06-04 21:06:48 UTC

(In reply to comment #5)
> (In reply to comment #3)
> > \uFFFE is invalid Unicode. 
> 
> Not really... it is not a valid character but it valid code point and very
> useful in some cases.

Valid code point as in it's a valid number you mean?


> > Garbage in, garbage out.  
> > Pango is being really nice with you not crashing.
> 
> Thanks ;-)
> <gr>
> 
> You can close the bug as WONT FIX - but the bug is there (how can you change
> the size of text in chars and say it is not a bug?)

Docs for pango_layout_set_text() explain that text has to be "a valid UTF-8 string".

And I already explained why it can't be changed.

Fix your code!