GNOME Bugzilla – Bug 331995
Better handling invalid UTF-8 input
Last modified: 2007-09-21 10:37:41 UTC
Please describe the problem: When assertions are disabled, it seems logical to remove the following check in pango_layout_set_text, see proposed patch: --- pango/pango-layout.c.old 2006-02-21 09:36:34.000000000 +0100 +++ pango/pango-layout.c 2006-02-21 09:37:04.000000000 +0100 @@ -823,7 +823,11 @@ pango_layout_set_text (PangoLayout *layo old_text = layout->text; +#ifdef G_DISABLE_ASSERT + if (length == -1) +#else if (length != 0) +#endif { if (!g_utf8_validate (text, length, &end)) g_warning ("Invalid UTF-8 string passed to pango_layout_set_text()"); Steps to reproduce: Actual results: Expected results: Does this happen every time? Other information:
A better patch would be: #ifndef G_DISABLE_ASSERT if (length != 0) { ... } #else if (length < 0) length = strlen (text); #endif But I'm not sure not validating the input with non-debugging builds is a good idea. On the other hand, what we are currently doing is silently truncating the text at the first invalid sequence, which is probably worse than just misrendering it. But if we decide to do that, it would make sense to not truncate with debugging either. So, just warn and proceed. Of course we assume that input is valid in the rest of the stack, but 1) this check only affects PangoLayout anyway, the lower level API doesn't have anything like this. 2) I cannot immediately think about anything that behaves *really bad* with invalid input. And by letting invalid input in, we can work on how to correctly present an "invalid input" glyph... So, I'm going to make a change here. Patch follows.
I made it to not truncate the text, and only validate if asserts are not disabled. Also changed the docs to emphasize that the input text should be valid. Note that this is no change in the semantics of Pango public API. Just changed the mode of failure. 2006-02-21 Behdad Esfahbod <behdad@gnome.org> * pango/fonts.c, pango/glyphstring.c, pango/pango-fontmap.c, pango/pango-ot-buffer.c, pango/pangocairo-font.c, pango/pangoft2.c, pango/pangoxft-font.c, pango/shape.c: Change g_critical to g_warning. We already handle them gracefully. Bug 331994 – --disable-debug removes G_DISABLE_CAST_CHECKS Patch from charlet@act-europe.fr * configure.in: Do not lose PANGO_DEBUG_FLAGS when reassigning. Bug 331995 – pango_layout_set_text optimization Patch from charlet@act-europe.fr * pango/pango-layout.c: Do not validate input text if asserts are disabled. Moreover, do not truncate input text on invalid sequence. Bug 331996 – avoid crashes in win32 font handling Patch from charlet@act-europe.fr * pango/pangofc-fontmap.c, pango/pangowin32-fontmap.c, pango/pangowin32.c: if (!font) return NULL in a number of places.
In my (pre-adding-validation) experience if you get invalid input into the internals of Pango, things go south in extremely bad ways that produce impossible to figure out stack traces. Note that Pango assumes all over the place that it can use g_utf8_next_char() and so forth, and these functions can and will crash on invalid input. Unless validation really is noticeable performance-wise, I think it would be better to always validate at input. But even if not, I'd really, really strongly advise against: "And by letting invalid input in, we can work on how to correctly present an "invalid input" glyph." - the only conceivably OK way of doing an "invalid input" glyph would be to transform the input at validation time ... note that you have to keep the same byte offsets or things get completely screwed up, so you are limited to doing things likely turning invalid input bytes into '?' characters.
Thanks Owen. One can argue that it should be left to the caller of PangoLayout API to make sure input is valid. So I really don't like the "validate and crop" solution. About letting invalid input in, you are absolutely right, everything breaks in bad ways..., so I'm fine with editing the input at set_text time to make it valid. One needs to make sure byte offsets remain the same. One solution is to insert '?' chars for each byte of a broken sequence, another 'pragmatic' solution is to insert possibly over-long '?' chars for each broken sequence. I'll code something like this soon.
Ok, I changed pango_layout_set_text to insert '?' for every invalid byte for now. Seems to work fine here. Can you a couple eyes: /** * pango_layout_set_text: * @layout: a #PangoLayout * @text: a valid UTF-8 string * @length: maximum length of @text, in bytes. -1 indicates that * the string is nul-terminated and the length should be * calculated. The text will also be truncated on * encountaring a nul-termination even when @length is * positive. * * Sets the text of the layout. **/ void pango_layout_set_text (PangoLayout *layout, const char *text, int length) { char *old_text, *start, *end; g_return_if_fail (layout != NULL); g_return_if_fail (length == 0 || text != NULL); old_text = layout->text; if (length < 0) layout->text = g_strdup (text); else if (length > 0) /* This is not exactly what we want. We don't need the padding... */ layout->text = g_strndup (text, length); else layout->text = g_malloc0 (1); layout->length = strlen (layout->text); /* validate it, and replace invalid bytes with '?' */ start = layout->text; for (;;) { gboolean valid; valid = g_utf8_validate (start, -1, &end); if (!*end) break; if (!valid) *end++ = '?'; start = end; } if (start != layout->text) /* TODO: Write out the beginning excerpt of text? */ g_warning ("Invalid UTF-8 string passed to pango_layout_set_text()"); layout->n_chars = g_utf8_strlen (layout->text, -1); pango_layout_clear_lines (layout); if (old_text) g_free (old_text); }
parse_markup still fails on invalid input so you get an empty layout there...
What we should do is to duplicate the string and make it "valid" before passing to parse_markup...
I'm not quite satisfied with using '?'. Here is the only other alternative I can think of ATM: Use -1 instead of '?'. Unicode and UTF-8 are designed such that -1 is never a valid codepoint (in UTF-8, UTF-16, UTF-32). g_utf8_next_char skips one byte on -1, and g_utf8_get_char returns -1. If I do that change right now, I will see a hexbox for FFFFFF. We can better handle it later when we decide on how to allocate glyph numbers for meta-glyphs. I know it's a hack, but one that works quite good. We can document glib utf8 functions as doing such too. Matthias pointed that this way the string is not technically valid anymore. I see that as a feature not a bug. If the input string is not valid, I don't mind returning an invalid string in pango_layout_get_text. Just that this one is different in invalid positions, it's safe to most of UTF-8 handlers, and specifically to glib's. Strong objections against this?
To be exact, what I see is a hexbox for EFFFFF, not FFFFFF.
Note that the initial issue of this PR has been addressed as far as I can see. There may be additional improvements, but since nothing happened since one year and a half, I'd say it's reasonable to close this PR at this time, and open a new PR on possible other enhancements if someone feels this is a good idea. Arno