GNOME Bugzilla – Bug 780095
g_utf8_get_char_validated() stopping at nul byte even for length specified buffers
Last modified: 2017-06-21 10:39:35 UTC
This test program shows the issue: #include <glib.h> #include <stdio.h> int main(void) { gchar * buf = "\xC0\x00_45678"; gunichar uc = g_utf8_get_char_validated(buf, 8); printf("uc=%d\n", (signed int)uc); return 0; } The code reports -2 (partial UTF-8 character) after stopping at the nul byte, even though the length of the buffer has been specified. I think it should return -1 (invalid UTF-8 character) when the second max_len parameter is positive and large enough to discover the invalid nul byte in the middle of any sized multi-byte UTF-8 character. Also the description of the max_len parameter [1] is confusing or wrong. max_len the maximum number of bytes to read, or -1, for no maximum or if p is nul-terminated -1 can't both mean "no maximum" and "nul-terminated". The code actually stops when it encounters a nul byte. I think the parameter should be described as: the maximum number of bytes to read, or -1 if p is nul-terminated Thanks, Mike [1] https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-utf8-get-char-validated
Created attachment 348172 [details] [review] gutf8: Clarify return values from g_utf8_get_char_extended() It’s hard to remember what the difference is between -1 and -2, so give them names. This introduces no functional changes. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 348173 [details] [review] gutf8: Clarify documentation for g_utf8_get_char_validated() There is no such thing as ‘no maximum’ when reading a string. It’s got to end somewhere. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Created attachment 348174 [details] [review] gutf8: Fix g_utf8_get_char_validated() with given length limits If g_utf8_get_char_validated() encountered a nul byte in the middle of a string of given longer length, it would return -2, indicating a partial gunichar — but that’s not correct, since there were more input bytes after the nul. Instead, -1 should be returned to indicate an invalid UTF-8 encoding. Add unit tests for this bug, and for g_utf8_get_char_validated()’s behaviour in general. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 348174 [details] [review]: Unfortunately, this makes the glib/tests/convert.c tests fail, so something’s obviously not quite right. I’ll try and find time to debug this later.
Review of attachment 348172 [details] [review]: obvious improvement
Review of attachment 348173 [details] [review]: sure
Pushed the first two patches now that 2.52.0 is out: 5052770 gutf8: Fix g_utf8_get_char_validated() with given length limits 69b4c72 gutf8: Clarify documentation for g_utf8_get_char_validated()
Taking a look at the convert.c unit test failures, it seems that g_utf8_to_ucs4() relies on the current behaviour of g_utf8_get_char_extended() (which provides the functionality for g_utf8_get_char_validated()), in that when g_utf8_to_ucs4() is called with a length of -1 for a nul-terminated string, it always passes max_len of 6 to g_utf8_get_char_extended(). It seems g_utf8_get_char_extended() treats max_len in the unconventional way: read up to the next nul byte *or* up to max_len. I think changing that behaviour is going to be an API break; since bits of GLib depend on it, it’s obviously the way the API is intended to be used. Unless anyone has any objections, I’ll patch the documentation for g_utf8_get_char_extended() instead.
(In reply to Philip Withnall from comment #8) > I think changing that behaviour is going to be an API break; since bits of > GLib depend on it, it’s obviously the way the API is intended to be used. > > Unless anyone has any objections, I’ll patch the documentation for > g_utf8_get_char_extended() instead. Did you mean to say g_utf8_get_char_validated() here? As g_utf8_char_extended() is an internal function where as g_utf8_char_validated() is part of the documented API.
(In reply to Mike Fleetwood from comment #9) > (In reply to Philip Withnall from comment #8) > > I think changing that behaviour is going to be an API break; since bits of > > GLib depend on it, it’s obviously the way the API is intended to be used. > > > > Unless anyone has any objections, I’ll patch the documentation for > > g_utf8_get_char_extended() instead. > > Did you mean to say g_utf8_get_char_validated() here? > As g_utf8_char_extended() is an internal function where as > g_utf8_char_validated() is part of the documented API. Whoops, yes, I meant g_utf8_get_char_validated().
In that case I would like to object please. It would make g_utf8_get_char_validate() a bit useless for reading and validating UTF-8 characters. Since nul is a valid single byte UTF-8 character, length delimited buffers have to be used. To work around this bug, code has to read the first byte itself and determine how many bytes to expect the UTF-8 character to be. At that point your into parsing characters yourself which is what g_utf8_get_char_validate() is for. Quoting from the documentation for g_utf8_get_char(): If you are not sure that the bytes are complete valid Unicode characters, you should use g_utf8_get_char_validated() instead.
(In reply to Mike Fleetwood from comment #11) > It would make g_utf8_get_char_validate() a bit useless for reading and > validating UTF-8 characters. Many of the other GLib UTF-8 functions will stop on embedded nul bytes as well. As far as I can tell, GLib fairly commonly assumes that embedded nuls cannot be used in UTF-8 strings. This makes sense, given that GLib’s UTF-8 functions are mostly designed for handling strings for UIs, where embedded nuls make no sense. For example, see the documentation for g_utf8_validate(): > Note that g_utf8_validate() returns FALSE if max_len is positive and any of the max_len bytes are nul. I think that sentence is roughly what should be added to the documentation for g_utf8_get_char_validated() to fix this bug. If you can make a good case for adding a new function like g_utf8_get_char_validated() (but which handles nul bytes), it would be added separately. Breaking API for g_utf8_get_char_validated() is unlikely to happen.
Created attachment 348395 [details] [review] gutf8: Fix documentation for g_utf8_get_char_validated() length limits If g_utf8_get_char_validated() encounters a nul byte in the middle of a string of given longer length, it returns -2, indicating a partial gunichar. That is not the obvious behaviour, but since g_utf8_get_char_validated() has been API for a long time, the behaviour cannot be changed. Document it, and add some unit tests (for this behaviour and the other behaviour of g_utf8_get_char_validated()). Signed-off-by: Philip Withnall <withnall@endlessm.com>
g_utf8_find_next_char() was recently fixed to successfully step past nul chars in buffers [1][2] changing the behaviour of the API to match the documentation. [1] Bug 547200 - g_utf8_find_next_char() issues [2] Fix a corner-case in g_utf8_find_next_char https://git.gnome.org/browse/glib/commit/?id=e0e652e If I understand correctly your primary concern from comment #8, is that changing g_utf8_get_char_extended() to fix g_utf8_get_char_validated() breaks g_utf8_to_ucs4(). Is there a reason to NOT make an alternative fix in g_utf8_get_char_validated() so that its behaviour matches the documentation?
I disagree with the premise of the bug. max_len is *max* len. If we discover that the string is in fact shorter than that (due to reading a nul) then we must behave as if max_len was specified as the length that we discovered. From what I understand, that is the current behaviour: - nul found mid-sequence: return -2 - hit max_len mid-sequence: return -2 I consider that to be consistent. I can also imagine a case where we are handling strings read from some source which (for some reason) end up nul terminated as they pass across some API boundary. We might still like to know that it is perhaps possible to read a bit more data and get a valid codepoint, so -2 is nice for that.
(In reply to Philip Withnall from comment #8) > Unless anyone has any objections, I’ll patch the documentation for > g_utf8_get_char_extended() instead. (In reply to Philip Withnall from comment #10) > Whoops, yes, I meant g_utf8_get_char_validated(). +1
Review of attachment 348395 [details] [review]: ::: glib/gutf8.c @@ +657,3 @@ + * + * Note that g_utf8_get_char_validated() returns (gunichar)-2 if + * @max_len is positive and any of the @max_len bytes are nul. No?! It should only give -2 if the nul is found inside of the first multibyte sequence. @max_len could be 1000 here, with nul at position 999, and according to your statement we would be obliged to return -2.
Created attachment 348877 [details] [review] gutf8: Fix documentation for g_utf8_get_char_validated() length limits If g_utf8_get_char_validated() encounters a nul byte in the middle of a string of given longer length, it returns -2, indicating a partial gunichar. That is not the obvious behaviour, but since g_utf8_get_char_validated() has been API for a long time, the behaviour cannot be changed. Document it, and add some unit tests (for this behaviour and the other behaviour of g_utf8_get_char_validated()). Signed-off-by: Philip Withnall <withnall@endlessm.com>
Review of attachment 348877 [details] [review]: Looks good to me.
Thanks for the review! Attachment 348877 [details] pushed as 3e89b19 - gutf8: Fix documentation for g_utf8_get_char_validated() length limits