Bug 501997 – g_utf8_normalize() returns NULL on invalid string




Summary:	g_utf8_normalize() returns NULL on invalid string


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	docs
Version:	unspecified
Hardware:	Other All

Importance:	Normal minor
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-12-06 09:31 UTC by Stian Skjelstad
Modified:	2018-05-24 11:10 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gtkdoc-glossary script (661 bytes, text/plain) 2007-12-06 17:34 UTC, Mathias Hasselmann (IRC: tbf)
Sample input (231 bytes, text/html) 2007-12-06 17:34 UTC, Mathias Hasselmann (IRC: tbf)
Sample output (455 bytes, text/html) 2007-12-06 17:34 UTC, Mathias Hasselmann (IRC: tbf)

Description Stian Skjelstad 2007-12-06 09:31:16 UTC

Documentation 
Section: glib/glib-Unicode-Manipulation.html#g-utf8-normalize
Nothing is said about what happens if the string is not valid UTF-8.

Correct version:
It should state that NULL is returned if the string is not valid UTF-8.

Other information:

Comment 1 Mathias Hasselmann (IRC: tbf) 2007-12-06 10:33:56 UTC

commit b6ad8a7ac9331257d1405d5e360b868f37a698d5
Author: hasselmm <hasselmm@5bbd4a9e-d125-0410-bf1d-f987e7eefc80>
Date:   Thu Dec 6 10:22:13 2007 +0000

    * glib/gunidecomp.c: Mention g_utf8_normalize()
    returns NULL on invalid string. (#501997)
    
    
    git-svn-id: svn+ssh://svn.gnome.org/svn/glib/trunk@6058 5bbd4a9e-d125-0410-bf1d-f987e7eefc80

Comment 2 Owen Taylor 2007-12-06 14:40:33 UTC

That is not correct. If the string is not valid utf8, it might also crash,
since it iterates over the string using g_utf8_next_char(). 

The string must be valid utf8.
 
 @str: a UTF-8 encoded string.
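
To illustrate the point above, here is a small Python model of stepping through a NUL-terminated buffer by lead byte alone, the way a skip-table macro such as g_utf8_next_char() does. It is only a sketch: `utf8_skip` and `first_out_of_bounds_step` are hypothetical names, the skip values are modelled on GLib's table rather than copied from it, and the function reports the offset where C code would read past the buffer instead of actually doing so.

```python
from typing import Optional


def utf8_skip(lead_byte: int) -> int:
    """Sequence length implied by the lead byte alone, with no validity check."""
    if lead_byte < 0xC0:
        return 1  # ASCII, or a stray continuation byte
    if lead_byte < 0xE0:
        return 2
    if lead_byte < 0xF0:
        return 3
    if lead_byte < 0xF8:
        return 4
    if lead_byte < 0xFC:
        return 5
    if lead_byte < 0xFE:
        return 6
    return 1


def first_out_of_bounds_step(text: bytes) -> Optional[int]:
    """Return the offset at which iteration leaves the buffer, or None."""
    buf = text + b"\x00"  # model a C-style NUL-terminated string
    p = 0
    while True:
        if p >= len(buf):
            return p  # C code would dereference memory past the allocation
        if buf[p] == 0:
            return None  # reached the terminator normally
        p += utf8_skip(buf[p])


print(first_out_of_bounds_step("héllo".encode("utf-8")))
print(first_out_of_bounds_step(b"abc\xe2"))  # truncated 3-byte sequence
```

For valid input the walk always lands on the terminator. For the truncated sequence the final step jumps over the terminator, so the next read would be outside the string.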

Comment 3 Stian Skjelstad 2007-12-06 14:47:08 UTC

So you must check "unsafe" data with g_utf8_validate.

When it comes to documentation:
The documentation for g_utf8_next_char() states:
<quote> Before using this macro, use g_utf8_validate()</quote>

Could/Should this be mentioned for all the functions that take utf8 input that MUST be valid?
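
As a rough analogue of "validate first, then normalize", the following Python sketch checks untrusted bytes once and only then normalizes them. It is not GLib code: `normalize_untrusted` is a hypothetical helper, strict decoding stands in for g_utf8_validate(), and `unicodedata.normalize` stands in for g_utf8_normalize(). Returning None for invalid input is a choice made by this wrapper, not behaviour promised by the normalization step.

```python
import unicodedata
from typing import Optional


def normalize_untrusted(raw: bytes) -> Optional[str]:
    """Validate raw bytes as UTF-8, then return their NFC form."""
    try:
        text = raw.decode("utf-8")  # strict decoding rejects invalid sequences
    except UnicodeDecodeError:
        return None  # the caller decides how to handle bad input
    return unicodedata.normalize("NFC", text)


print(normalize_untrusted(b"e\xcc\x81"))  # "e" followed by a combining acute accent
print(normalize_untrusted(b"\xff"))       # not valid UTF-8
```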

Comment 4 Mathias Hasselmann (IRC: tbf) 2007-12-06 15:20:29 UTC

Owen: Well, in that case also the documentation for g_ucs4_to_utf8 is wrong:

    Returns: a pointer to a newly allocated UTF-8 string. This value must be freed with g_free(). If an error occurs, NULL will be returned and error set. In that case, items_read will be set to the position of the first invalid input character.

I used the documentation of g_ucs4_to_utf8 to verify validity of Stian's claim.

Comment 5 Mathias Hasselmann (IRC: tbf) 2007-12-06 15:23:29 UTC

Duh, didn't see the _g_utf8_normalize_wc call. Crap.

Comment 6 Owen Taylor 2007-12-06 16:06:38 UTC

Especially if you widen the scope to include all of Pango and GTK+, there are
lots and lots and lots of functions that require *valid* UTF-8 data, and only
a few (like g_utf8_to_ucs4(), g_utf8_validate(), a few others) that are safe
on invalid data. So I think adding text every place a valid string is
required is a bad idea.

(Generally, validation greatly increases the cost and complexity of working with a
UTF-8 string, which is why we have the concept that you validate at the interfaces
where you read data in and not throughout the code.)

It would be cool if we could linkify "UTF-8 String" everywhere in the docs
to some generic text about getting, validating, and manipulating UTF-8
strings, but that would require hacking up gtk-doc or a *ton* of manual
editing and noise in the inline docs.

Comment 7 Mathias Hasselmann (IRC: tbf) 2007-12-06 17:33:23 UTC

(In reply to comment #6)
> It would be cool if we could linkify "UTF-8 String" everywhere in the docs
> to some generic text about getting, validating, and manipulating UTF-8
> strings, but that would require hacking up gtk-doc or a *ton* of manual
> editing and noise in the inline docs.

Well, that's easy to achieve with a script in the spirit of gtkdoc-fixxref.

$ python gtkdoc-glossary glossary2.txt glossary2.html > glossary3.html

glossary2.txt:

UTF-8 string: Bla bla g_utf8_validate() bla bla <lots of text> because this "must" wrap and so on and so on blub blab foo'
GTK+: The GIMP Toolkit
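
The attached script is not reproduced in this thread, so the following is only a guess at how such a post-processing step could work, assuming the "term: definition" format shown above. The function names and the choice of `<abbr title="...">` as the markup are assumptions; the real gtkdoc-glossary script may do something different.

```python
import html
import re


def parse_glossary(text: str) -> dict:
    """Parse 'term: definition' lines into a dictionary."""
    entries = {}
    for line in text.splitlines():
        term, sep, definition = line.partition(":")
        if sep and term.strip():
            entries[term.strip()] = definition.strip()
    return entries


def annotate(page: str, glossary: dict) -> str:
    """Wrap glossary terms found in text nodes; leave tags untouched."""
    if not glossary:
        return page
    # Try longer terms first so they win over shorter overlapping ones.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    lookup = {term.lower(): text for term, text in glossary.items()}

    def wrap(match):
        definition = html.escape(lookup[match.group(0).lower()], quote=True)
        return '<abbr title="%s">%s</abbr>' % (definition, match.group(0))

    # Splitting on a capturing group puts the tags at odd indices.
    parts = re.split(r"(<[^>]*>)", page)
    return "".join(
        part if index % 2 else pattern.sub(wrap, part)
        for index, part in enumerate(parts)
    )


glossary = parse_glossary("GTK+: The GIMP Toolkit\n")
print(annotate('<p title="GTK+">Uses GTK+ here</p>', glossary))
```

Matching only in text nodes keeps attribute values and tag names intact, which matters when a term such as "GTK+" also appears inside a tag.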

Comment 8 Mathias Hasselmann (IRC: tbf) 2007-12-06 17:34:06 UTC

Created attachment 100432
gtkdoc-glossary script

Comment 9 Mathias Hasselmann (IRC: tbf) 2007-12-06 17:34:32 UTC

Created attachment 100433
Sample input

Comment 10 Mathias Hasselmann (IRC: tbf) 2007-12-06 17:34:51 UTC

Created attachment 100434
Sample output

Comment 11 Stefan Sauer (gstreamer, gtkdoc dev) 2007-12-06 20:35:58 UTC

Mathias, I've created an RFE for gtk-doc as Bug 502191.

Comment 12 GNOME Infrastructure Team 2018-05-24 11:10:27 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/116.