GNOME Bugzilla – Bug 555285
g_utf8_validate() forbids embedded NUL
Last modified: 2008-10-08 23:47:36 UTC
Please describe the problem: g_utf8_validate() returns FALSE if the string contains a NUL. The surrounding comments give no reason for this, and U+0000 is a valid character whose UTF-8 representation is the single byte 0x00 (it is part of ASCII). All other Unicode codepoints are allowed.
Expected results: g_utf8_validate() should allow NULs in a string.
Right. I was surprised when I noticed this too. But it's too much to change. It's all a mess... In Pango I work around it by doing:

  start = layout->text;
  for (;;) {
    gboolean valid;

    valid = g_utf8_validate (start, -1, (const char **)&end);

    if (!*end)
      break;

    /* Replace invalid bytes with -1.  The -1 will be converted to
     * ((gunichar) -1) by glib, and that in turn yields a glyph value of
     * ((PangoGlyph) -1) by PANGO_GET_UNKNOWN_GLYPH(-1),
     * and that's PANGO_GLYPH_INVALID_INPUT.
     */
    if (!valid)
      *end++ = -1;

    start = end;
  }

So I simply ignore errors caused by NUL bytes.
Unfortunately, it can't always be worked around that simply. Replacing bytes in the original string either corrupts data or causes it to be no longer UTF-8. I think there are several possible solutions to fix the mess. I started a discussion on the mailing list. http://article.gmane.org/gmane.comp.gnome.gtk%2B.devel.general/16024
Whatever your opinion on this, it cannot be changed in glib at this point. Write your own UTF-8 validation function if you need one that accepts NUL.
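For anyone following that suggestion, here is a minimal sketch of what such a function might look like. It is not GLib code: the name utf8_validate_allow_nul is hypothetical, and the function takes an explicit byte length because a NUL can no longer serve as the terminator. It rejects overlong encodings, surrogates, truncated sequences and values above U+10FFFF, but makes no attempt to match every detail of g_utf8_validate().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of GLib: checks that exactly `len` bytes
 * starting at `s` are well-formed UTF-8, treating U+0000 (the single byte
 * 0x00) as an ordinary character. */
bool
utf8_validate_allow_nul (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];
      size_t n;           /* number of continuation bytes expected */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point allowed for this length */

      if (c < 0x80)
        {
          i++;            /* ASCII, including 0x00 */
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte or invalid lead byte */

      if (len - i <= n)
        return false;     /* sequence is cut off by the end of the buffer */

      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false; /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and out-of-range values. */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

      i += n + 1;
    }

  return true;
}
```

With this sketch, utf8_validate_allow_nul ((const unsigned char *) "a\0b", 3) returns true, while the overlong two-byte form 0xC0 0x80 (the "modified UTF-8" encoding of NUL) is still rejected.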