Bug 780095 – g_utf8_get_char_validated() stopping at nul byte even for length specified buffers

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 780095 - g_utf8_get_char_validated() stopping at nul byte even for length specified buffers


Summary:	g_utf8_get_char_validated() stopping at nul byte even for length specified bu...


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-03-15 13:52 UTC by Mike Fleetwood
Modified:	2017-06-21 10:39 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gutf8: Clarify return values from g_utf8_get_char_extended() (2.16 KB, patch) 2017-03-17 12:31 UTC, Philip Withnall	committed	Details \| Review
gutf8: Clarify documentation for g_utf8_get_char_validated() (1.26 KB, patch) 2017-03-17 12:31 UTC, Philip Withnall	committed	Details \| Review
gutf8: Fix g_utf8_get_char_validated() with given length limits (3.39 KB, patch) 2017-03-17 12:31 UTC, Philip Withnall	needs-work	Details \| Review
gutf8: Fix documentation for g_utf8_get_char_validated() length limits (3.63 KB, patch) 2017-03-21 11:25 UTC, Philip Withnall	none	Details \| Review
gutf8: Fix documentation for g_utf8_get_char_validated() length limits (3.67 KB, patch) 2017-03-28 10:18 UTC, Philip Withnall	committed	Details \| Review

Description Mike Fleetwood 2017-03-15 13:52:22 UTC

This test program shows the issue:

#include <glib.h>
#include <stdio.h>
int main(void)
{
	gchar * buf = "\xC0\x00_45678";
	gunichar uc = g_utf8_get_char_validated(buf, 8);
	printf("uc=%d\n", (signed int)uc);
	return 0;
}

The code reports -2 (partial UTF-8 character) after stopping at the nul
byte, even though the length of the buffer has been specified.  I think
it should return -1 (invalid UTF-8 character) when the second max_len
parameter is positive and large enough to discover the invalid nul byte
in the middle of any sized multi-byte UTF-8 character.


Also the description of the max_len parameter [1] is confusing or wrong.
    max_len
    the maximum number of bytes to read, or -1, for no maximum or if p
    is nul-terminated
-1 can't both mean "no maximum" and "nul-terminated".  The code actually
stops when it encounters a nul byte.  I think the parameter should be
described as:
    the maximum number of bytes to read, or -1 if p is nul-terminated


Thanks,
Mike

[1] https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-get-char-validated

Comment 1 Philip Withnall 2017-03-17 12:31:42 UTC

Created attachment 348172 [details] [review]
gutf8: Clarify return values from g_utf8_get_char_extended()

It’s hard to remember what the difference is between -1 and -2, so give
them names.

This introduces no functional changes.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 2 Philip Withnall 2017-03-17 12:31:47 UTC

Created attachment 348173 [details] [review]
gutf8: Clarify documentation for g_utf8_get_char_validated()

There is no such thing as ‘no maximum’ when reading a string. It’s got
to end somewhere.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 3 Philip Withnall 2017-03-17 12:31:53 UTC

Created attachment 348174 [details] [review]
gutf8: Fix g_utf8_get_char_validated() with given length limits

If g_utf8_get_char_validated() encountered a nul byte in the middle of a
string of given longer length, it would return -2, indicating a partial
gunichar — but that’s not correct, since there were more input bytes
after the nul. Instead, -1 should be returned to indicate an invalid
UTF-8 encoding.

Add unit tests for this bug, and for g_utf8_get_char_validated()’s
behaviour in general.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 4 Philip Withnall 2017-03-17 12:39:42 UTC

Review of attachment 348174 [details] [review]:

Unfortunately, this makes the glib/tests/convert.c tests fail, so something’s obviously not quite right. I’ll try and find time to debug this later.

Comment 5 Matthias Clasen 2017-03-19 00:10:09 UTC

Review of attachment 348172 [details] [review]:

obvious improvement

Comment 6 Matthias Clasen 2017-03-19 00:10:43 UTC

Review of attachment 348173 [details] [review]:

sure

Comment 7 Philip Withnall 2017-03-20 11:09:21 UTC

Pushed the first two patches now that 2.52.0 is out:

5052770 gutf8: Fix g_utf8_get_char_validated() with given length limits
69b4c72 gutf8: Clarify documentation for g_utf8_get_char_validated()

Comment 8 Philip Withnall 2017-03-20 11:25:29 UTC

Taking a look at the convert.c unit test failures, it seems that g_utf8_to_ucs4() relies on the current behaviour of g_utf8_get_char_extended() (which provides the functionality for g_utf8_get_char_validated()), in that when g_utf8_to_ucs4() is called with a length of -1 for a nul-terminated string, it always passes max_len of 6 to g_utf8_get_char_extended(). It seems g_utf8_get_char_extended() treats max_len in the unconventional way: read up to the next nul byte *or* up to max_len.

I think changing that behaviour is going to be an API break; since bits of GLib depend on it, it’s obviously the way the API is intended to be used.

Unless anyone has any objections, I’ll patch the documentation for g_utf8_get_char_extended() instead.

Comment 9 Mike Fleetwood 2017-03-20 13:23:13 UTC

(In reply to Philip Withnall from comment #8)
> I think changing that behaviour is going to be an API break; since bits of
> GLib depend on it, it’s obviously the way the API is intended to be used.
>
> Unless anyone has any objections, I’ll patch the documentation for
> g_utf8_get_char_extended() instead.

Did you mean to say g_utf8_get_char_validated() here?
As g_utf8_char_extended() is an internal function where as
g_utf8_char_validated() is part of the documented API.

Comment 10 Philip Withnall 2017-03-20 13:57:34 UTC

(In reply to Mike Fleetwood from comment #9)
> (In reply to Philip Withnall from comment #8)
> > I think changing that behaviour is going to be an API break; since bits of
> > GLib depend on it, it’s obviously the way the API is intended to be used.
> >
> > Unless anyone has any objections, I’ll patch the documentation for
> > g_utf8_get_char_extended() instead.
> 
> Did you mean to say g_utf8_get_char_validated() here?
> As g_utf8_char_extended() is an internal function where as
> g_utf8_char_validated() is part of the documented API.

Whoops, yes, I meant g_utf8_get_char_validated().

Comment 11 Mike Fleetwood 2017-03-20 15:17:21 UTC

In that case I would like to object please.

It would make g_utf8_get_char_validate() a bit useless for reading and
validating UTF-8 characters.  Since nul is a valid single byte UTF-8
character, length delimited buffers have to be used.  To work around
this bug, code has to read the first byte itself and determine how many
bytes to expect the UTF-8 character to be.  At that point your into
parsing characters yourself which is what g_utf8_get_char_validate() is
for.

Quoting from the documentation for g_utf8_get_char():
    If you are not sure that the bytes are complete valid Unicode
    characters, you should use g_utf8_get_char_validated() instead.

Comment 12 Philip Withnall 2017-03-21 11:15:11 UTC

(In reply to Mike Fleetwood from comment #11)
> It would make g_utf8_get_char_validate() a bit useless for reading and
> validating UTF-8 characters.

Many of the other GLib UTF-8 functions will stop on embedded nul bytes as well. As far as I can tell, GLib fairly commonly assumes that embedded nuls cannot be used in UTF-8 strings. This makes sense, given that GLib’s UTF-8 functions are mostly designed for handling strings for UIs, where embedded nuls make no sense.

For example, see the documentation for g_utf8_validate():
> Note that g_utf8_validate() returns FALSE if max_len is positive and any of the max_len bytes are nul.

I think that sentence is roughly what should be added to the documentation for g_utf8_get_char_validated() to fix this bug.

If you can make a good case for adding a new function like g_utf8_get_char_validated() (but which handles nul bytes), it would be added separately. Breaking API for g_utf8_get_char_validated() is unlikely to happen.

Comment 13 Philip Withnall 2017-03-21 11:25:50 UTC

Created attachment 348395 [details] [review]
gutf8: Fix documentation for g_utf8_get_char_validated() length limits

If g_utf8_get_char_validated() encounters a nul byte in the middle of a
string of given longer length, it returns -2, indicating a partial
gunichar. That is not the obvious behaviour, but since
g_utf8_get_char_validated() has been API for a long time, the behaviour
cannot be changed.

Document it, and add some unit tests (for this behaviour and the other
behaviour of g_utf8_get_char_validated()).

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 14 Mike Fleetwood 2017-03-21 13:42:12 UTC

g_utf8_find_next_char() was recently fixed to successfully step past nul
chars in buffers [1][2] changing the behaviour of the API to match the
documentation.

[1] Bug 547200 - g_utf8_find_next_char() issues
[2] Fix a corner-case in g_utf8_find_next_char
    https://git.gnome.org/browse/glib/commit/?id=e0e652e

If I understand correctly your primary concern from comment #8, is that
changing g_utf8_get_char_extended() to fix g_utf8_get_char_validated()
breaks g_utf8_to_ucs4().  Is there a reason to NOT make an alternative
fix in g_utf8_get_char_validated() so that its behaviour matches the
documentation?

Comment 15 Allison Karlitskaya (desrt) 2017-03-23 11:40:05 UTC

I disagree with the premise of the bug.

max_len is *max* len.  If we discover that the string is in fact shorter than that (due to reading a nul) then we must behave as if max_len was specified as the length that we discovered.

From what I understand, that is the current behaviour:

  - nul found mid-sequence: return -2
  - hit max_len mid-sequence: return -2

I consider that to be consistent.

I can also imagine a case where we are handling strings read from some source which (for some reason) end up nul terminated as they pass across some API boundary.  We might still like to know that it is perhaps possible to read a bit more data and get a valid codepoint, so -2 is nice for that.

Comment 16 Allison Karlitskaya (desrt) 2017-03-23 11:42:15 UTC

(In reply to Philip Withnall from comment #8)
> Unless anyone has any objections, I’ll patch the documentation for
> g_utf8_get_char_extended() instead.

(In reply to Philip Withnall from comment #10) 
> Whoops, yes, I meant g_utf8_get_char_validated().

+1

Comment 17 Allison Karlitskaya (desrt) 2017-03-23 11:44:45 UTC

Review of attachment 348395 [details] [review]:

::: glib/gutf8.c
@@ +657,3 @@
+ *
+ * Note that g_utf8_get_char_validated() returns (gunichar)-2 if
+ * @max_len is positive and any of the @max_len bytes are nul.

No?!

It should only give -2 if the nul is found inside of the first multibyte sequence.

@max_len could be 1000 here, with nul at position 999, and according to your statement we would be obliged to return -2.

Comment 18 Philip Withnall 2017-03-28 10:18:27 UTC

Created attachment 348877 [details] [review]
gutf8: Fix documentation for g_utf8_get_char_validated() length limits

If g_utf8_get_char_validated() encounters a nul byte in the middle of a
string of given longer length, it returns -2, indicating a partial
gunichar. That is not the obvious behaviour, but since
g_utf8_get_char_validated() has been API for a long time, the behaviour
cannot be changed.

Document it, and add some unit tests (for this behaviour and the other
behaviour of g_utf8_get_char_validated()).

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Comment 19 Emmanuele Bassi (:ebassi) 2017-06-21 10:35:16 UTC

Review of attachment 348877 [details] [review]:

Looks good to me.

Comment 20 Philip Withnall 2017-06-21 10:39:30 UTC

Thanks for the review!

Attachment 348877 [details] pushed as 3e89b19 - gutf8: Fix documentation for g_utf8_get_char_validated() length limits