GNOME Bugzilla – Bug 501997
g_utf8_normalize() returns NULL on invalid string
Last modified: 2018-05-24 11:10:27 UTC
Documentation Section: glib/glib-Unicode-Manipulation.html#g-utf8-normalize The documentation says nothing about what happens if the string is not valid UTF-8. Correct version: It should state that if the string is not valid UTF-8, NULL will be returned. Other information:
commit b6ad8a7ac9331257d1405d5e360b868f37a698d5
Author: hasselmm <hasselmm@5bbd4a9e-d125-0410-bf1d-f987e7eefc80>
Date: Thu Dec 6 10:22:13 2007 +0000

    * glib/gunidecomp.c: Mention g_utf8_normalize() returns NULL on invalid string. (#501997)

    git-svn-id: svn+ssh://svn.gnome.org/svn/glib/trunk@6058 5bbd4a9e-d125-0410-bf1d-f987e7eefc80
That is not correct. If the string is not valid utf8, it might also crash, since it iterates over the string using g_utf8_next_char(). The string must be valid utf8. @str: a UTF-8 encoded string.
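To illustrate the crash risk described above, here is a minimal standalone sketch in C of what can go wrong when an iterator trusts the lead byte of each character. It does not use GLib: skip_len() and walk_end_offset() are hypothetical names, and the skip lengths are a simplified stand-in for the table that g_utf8_next_char() consults, not a copy of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how far a lead-byte-driven iterator advances.
 * Simplified assumption; GLib's real skip table differs in detail. */
static size_t
skip_len (unsigned char lead)
{
  if (lead < 0x80)
    return 1;                 /* ASCII */
  if ((lead & 0xE0) == 0xC0)
    return 2;                 /* claims a 2-byte sequence */
  if ((lead & 0xF0) == 0xE0)
    return 3;                 /* claims a 3-byte sequence */
  if ((lead & 0xF8) == 0xF0)
    return 4;                 /* claims a 4-byte sequence */
  return 1;                   /* continuation or invalid byte */
}

/* Walk len bytes of buf the way an unchecked iterator would and return
 * the offset where the walk stops. A result greater than len means a
 * NUL-terminated loop would have jumped over the terminator and kept
 * reading memory that does not belong to the string. */
size_t
walk_end_offset (const char *buf, size_t len)
{
  size_t i = 0;

  assert (buf != NULL);
  while (i < len)
    i += skip_len ((unsigned char) buf[i]);
  return i;
}
```

For the four bytes "abc\xE2" (a truncated three-byte sequence) the walk stops at offset 6, two bytes beyond the data and past the position where a terminating NUL would sit. For well-formed input the walk stops exactly at len.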
So you must check "unsafe" data with g_utf8_validate(). When it comes to documentation, g_utf8_next_char() states: "Before using this macro, use g_utf8_validate()". Could or should this be mentioned for all the functions that take UTF-8 input that MUST be valid?
Owen: Well, in that case also the documentation for g_ucs4_to_utf8 is wrong: Returns: a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. In that case, items_read will be set to the position of the first invalid input character. I used the documentation of g_ucs4_to_utf8 to verify validity of Stian's claim.
Duh, didn't see the _g_utf8_normalize_wc call. Crap.
Especially if you widen the scope to include all of Pango and GTK+, there are lots and lots and lots of functions that require *valid* UTF-8 data, and only a few (like g_utf8_to_ucs4(), g_utf8_validate(), and a few others) that are safe on invalid data. So I think adding text every place a valid string is required is a bad idea. (Generally, validation greatly increases the cost and complexity of working with a UTF-8 string, which is why we have the concept that you validate at the interfaces where you read data in and not throughout the code.) It would be cool if we could linkify "UTF-8 String" everywhere in the docs to some generic text about getting, validating, and manipulating UTF-8 strings, but that would require hacking up gtk-doc or a *ton* of manual editing and noise in the inline docs.
(In reply to comment #6)
> It would be cool if we could linkify "UTF-8 String" everywhere in the docs
> to some generic text about getting, validating, and manipulating UTF-8
> strings, but that would require hacking up gtk-doc or a *ton* of manual
> editing and noise in the inline docs.
Well, that's easy to achieve with a script in the spirit of gtkdoc-fixxref.
$ python gtkdoc-glossary glossary2.txt glossary2.html > glossary3.html
glossary2.txt:
UTF-8 string: Bla bla g_utf8_validate() bla bla <lots of text> because this "must" wrap and so on and so on blub blab foo'
GTK+: The GIMP Toolkit
Created attachment 100432: gtkdoc-glossary script
Created attachment 100433: Sample input
Created attachment 100434: Sample output
Mathias, I've created a RFE for gtk-doc as Bug 502191.
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/116.