Bug 331995 – Better handling invalid UTF-8 input

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 331995 - Better handling invalid UTF-8 input


Summary:	Better handling invalid UTF-8 input


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	1.10.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2006-02-21 08:39 UTC by Arnaud Charlet
Modified:	2007-09-21 10:37 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Arnaud Charlet 2006-02-21 08:39:29 UTC

Please describe the problem:
When assertions are disabled, it seems logical to remove the following
check in pango_layout_set_text, see proposed patch:

--- pango/pango-layout.c.old  2006-02-21 09:36:34.000000000 +0100
+++ pango/pango-layout.c      2006-02-21 09:37:04.000000000 +0100
@@ -823,7 +823,11 @@ pango_layout_set_text (PangoLayout *layo

   old_text = layout->text;

+#ifdef G_DISABLE_ASSERT
+  if (length == -1)
+#else
   if (length != 0)
+#endif
     {
       if (!g_utf8_validate (text, length, &end))
        g_warning ("Invalid UTF-8 string passed to pango_layout_set_text()");

Steps to reproduce:


Actual results:


Expected results:


Does this happen every time?


Other information:

Comment 1 Behdad Esfahbod 2006-02-21 09:26:48 UTC

A better patch would be:

#ifndef G_DISABLE_ASSERT
   if (length != 0)
     {
       ...
     }
#else
    if (length < 0)
      length = strlen (text);
#endif


But I'm not sure not validating the input with non-debugging builds is a good idea.  On the other hand, what we are currently doing is silently truncating the text at the first invalid sequence, which is probably worse than just misrendering it.  But if we decide to do that, it would make sense to not truncate with debugging either.  So, just warn and proceed.  Of course we assume that input is valid in the rest of the stack, but 1) this check only affects PangoLayout anyway, the lower level API doesn't have anything like this.  2) I cannot immediately think about anything that behaves *really bad* with invalid input.  And by letting invalid input in, we can work on how to correctly present an "invalid input" glyph...  So, I'm going to make a change here.  Patch follows.

Comment 2 Behdad Esfahbod 2006-02-21 10:22:27 UTC

I made it to not truncate the text, and only validate if asserts are not disabled.  Also changed the docs to emphasize that the input text should be valid.  Note that this is no change in the semantics of Pango public API.  Just changed the mode of failure.

2006-02-21  Behdad Esfahbod  <behdad@gnome.org>

        * pango/fonts.c, pango/glyphstring.c, pango/pango-fontmap.c,
        pango/pango-ot-buffer.c, pango/pangocairo-font.c, pango/pangoft2.c,
        pango/pangoxft-font.c, pango/shape.c: Change g_critical to g_warning.
        We already handle them gracefully.

        Bug 331994 – --disable-debug removes G_DISABLE_CAST_CHECKS
        Patch from charlet@act-europe.fr

        * configure.in: Do not lose PANGO_DEBUG_FLAGS when reassigning.

        Bug 331995 – pango_layout_set_text optimization
        Patch from charlet@act-europe.fr

        * pango/pango-layout.c: Do not validate input text if asserts are
        disabled.  Moreover, do not truncate input text on invalid sequence.

        Bug 331996 – avoid crashes in win32 font handling
        Patch from charlet@act-europe.fr

        * pango/pangofc-fontmap.c, pango/pangowin32-fontmap.c,
        pango/pangowin32.c: if (!font) return NULL in a number of places.

Comment 3 Owen Taylor 2006-02-21 16:59:22 UTC

In my (pre-adding-validation) experience if you get invalid input into the 
internals of Pango, things go south in extremely bad ways that produce
impossible to figure out stack traces. Note that Pango assumes all
over the place that it can use g_utf8_next_char() and so forth, and
these functions can and will crash on invalid input.

Unless validation really is noticeable performance-wise, I think it would be
better to always validate at input. But even if not, I'd really, really
strongly advise against: "And by letting invalid input in, we can work on 
how to correctly present an "invalid input" glyph." - the only conceivably
OK way of doing an "invalid input" glyph would be to transform the input
at validation time ... note that you have to keep the same byte offsets
or things get completely screwed up, so you are limited to doing things
likely turning invalid input bytes into '?' characters.

Comment 4 Behdad Esfahbod 2006-02-21 23:19:16 UTC

Thanks Owen.  One can argue that it should be left to the caller of PangoLayout API to make sure input is valid.  So I really don't like the "validate and crop" solution.

About letting invalid input in, you are absolutely right, everything breaks in bad ways..., so I'm fine with editing the input at set_text time to make it valid.  One needs to make sure byte offsets remain the same.  One solution is to insert '?' chars for each byte of a broken sequence, another 'pragmatic' solution is to insert possibly over-long '?' chars for each broken sequence.  I'll code something like this soon.

Comment 5 Behdad Esfahbod 2006-02-26 21:18:29 UTC

Ok, I changed pango_layout_set_text to insert '?' for every invalid byte for now.  Seems to work fine here.  Can you a couple eyes:

/**
 * pango_layout_set_text:
 * @layout: a #PangoLayout
 * @text: a valid UTF-8 string
 * @length: maximum length of @text, in bytes. -1 indicates that
 *          the string is nul-terminated and the length should be
 *          calculated.  The text will also be truncated on
 *          encountaring a nul-termination even when @length is
 *          positive.
 * 
 * Sets the text of the layout.
 **/
void
pango_layout_set_text (PangoLayout *layout,
		       const char  *text,
		       int          length)
{
  char *old_text, *start, *end;
  
  g_return_if_fail (layout != NULL);
  g_return_if_fail (length == 0 || text != NULL);

  old_text = layout->text;
  
  if (length < 0)
    layout->text = g_strdup (text);
  else if (length > 0)
    /* This is not exactly what we want.  We don't need the padding...
     */
    layout->text = g_strndup (text, length);
  else
    layout->text = g_malloc0 (1);

  layout->length = strlen (layout->text);

  /* validate it, and replace invalid bytes with '?'
   */
  start = layout->text;
  for (;;) {
    gboolean valid;
    
    valid = g_utf8_validate (start, -1, &end);

    if (!*end)
      break;

    if (!valid)
      *end++ = '?';

    start = end;
  }

  if (start != layout->text)
    /* TODO: Write out the beginning excerpt of text? */
    g_warning ("Invalid UTF-8 string passed to pango_layout_set_text()");

  layout->n_chars = g_utf8_strlen (layout->text, -1);

  pango_layout_clear_lines (layout);
  
  if (old_text)
    g_free (old_text);
}

Comment 6 Behdad Esfahbod 2006-02-26 21:21:41 UTC

parse_markup still fails on invalid input so you get an empty layout there...

Comment 7 Behdad Esfahbod 2006-02-26 21:32:35 UTC

What we should do is to duplicate the string and make it "valid" before passing to parse_markup...

Comment 8 Behdad Esfahbod 2006-02-26 22:01:34 UTC

I'm not quite satisfied with using '?'.  Here is the only other alternative I can think of ATM: Use -1 instead of '?'.

Unicode and UTF-8 are designed such that -1 is never a valid codepoint (in UTF-8, UTF-16, UTF-32).  g_utf8_next_char skips one byte on -1, and g_utf8_get_char returns -1.  If I do that change right now, I will see a hexbox for FFFFFF.  We can better handle it later when we decide on how to allocate glyph numbers for meta-glyphs.

I know it's a hack, but one that works quite good.  We can document glib utf8 functions as doing such too.  Matthias pointed that this way the string is not technically valid anymore.  I see that as a feature not a bug.  If the input string is not valid, I don't mind returning an invalid string in pango_layout_get_text.  Just that this one is different in invalid positions, it's safe to most of UTF-8 handlers, and specifically to glib's.

Strong objections against this?

Comment 9 Behdad Esfahbod 2006-02-26 22:02:54 UTC

To be exact, what I see is a hexbox for EFFFFF, not FFFFFF.

Comment 10 Arnaud Charlet 2007-09-21 10:37:41 UTC

Note that the initial issue of this PR has been addressed as far as I can see.
There may be additional improvements, but since nothing happened since
one year and a half, I'd say it's reasonable to close this PR at this time,
and open a new PR on possible other enhancements if someone feels this is
a good idea.

Arno