GNOME Bugzilla – Bug 792516
gconvert: More consistent handling of embedded NUL bytes
Last modified: 2018-02-01 14:11:38 UTC
The return values of functions g_locale_to_utf8() and g_filename_to_utf8() are widely expected to be NUL-terminated UTF-8 strings satisfying their default annotation of (type utf8). However, the paths in these functions that invoke iconv() to convert from encodings other than UTF-8 can produce output with embedded NUL bytes. The behavior of reverse conversion functions g_locale_from_utf8() and g_filename_from_utf8() on NUL bytes embedded in the specified-length input buffer is not documented, although these functions may, inconsistently, invalidate such inputs, while neither iconv() nor the Unicode standard forbid embedded NUL bytes.
Created attachment 366793 [details] [review] gconvert: Optimize UTF-8 conversions, fix output on error In the strdup_len() path, no need to repeat the work of g_utf8_validate() and search for the string-terminating nul byte. Also in strdup_len(), make the out parameter bytes_read receive the length of the valid (meaning also nul-free) part of the input string, as the documentation on g_{locale,filename}_{from,to}_utf8() says it does.
Created attachment 366794 [details] [review] gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions The character encoding conversion utility functions g_locale_to_utf8() and g_filename_to_utf8() had inconsistent behavior on producing strings with inner NUL bytes: in the all-UTF-8 strdup path, the input string validation prohibits embedded NULs, while g_convert(), using iconv(), can produce UTF-8 output with NUL bytes inside the output buffer. This, while valid UTF-8 per the Unicode standard, is not valid for the nul-terminated (type utf8) return value format that the *_to_utf8() functions are annotated with (as per discussion in bug 756128). Check the output of g_convert() for embedded NUL bytes, and if any are found, set the newly introduced error G_CONVERT_ERROR_EMBDEDDED_NUL. Also document the error set by g_{locale,filename}_{from,to}_utf8() when the input string contains nul bytes.
Created attachment 366798 [details] [review] Test embedded NULs in input of g_{locale,filename}_to_utf8() The tests exercise both g_strncpy() and g_convert() paths.
Review of attachment 366793 [details] [review]: ::: glib/gconvert.c @@ +826,2 @@ static gchar * strdup_len (const gchar *string, It would be good to add a brief documentation comment, along the lines of: ``` /* Validate @string as UTF-8. @len can be negative if @string is nul-terminated, or a non-negative value in bytes. If @string ends in an incomplete sequence, or contains any illegal sequences or nul codepoints, %FALSE will be returned. On success or error, if provided, @bytes_read will be set to the number of bytes in @string up to @len or the terminating nul byte. */ ``` @@ +829,2 @@ gsize *bytes_read, + gsize *bytes_written, Shouldn’t all the callers of strdup_len() be changed to pass these two parameters in the new order? Actually, you could just drop one of these parameters, since they’re both always set to the same thing.
Review of attachment 366794 [details] [review]: Nice commit message! (Although there’s a typo in `G_CONVERT_ERROR_EMBDEDDED_NUL` in it.) This looks good to me. I think all my comments are minor. ::: glib/gconvert.c @@ +856,3 @@ } +static gchar* Nitpick: Missing space before `*`. @@ +862,3 @@ + gsize *bytes_read, + gsize *bytes_written, + GError **error) Nitpick: The parameters should be aligned by the first letter of their name. So `error` needs to stay where it is, and all the others need to be indented one space further. i.e. ``` convert_to_utf8 (const gchar *opsysstring, gssize len, const gchar *charset, gsize *bytes_read, gsize *bytes_written, GError **error) ``` @@ +869,3 @@ + utf8 = g_convert (opsysstring, len, "UTF-8", charset, + bytes_read, &outbytes, error); + if (!utf8) Nitpick: It’s clearer to write `(utf8 == NULL)` so that it doesn’t look like utf8 is a boolean variable. @@ +916,3 @@ + * + * If the source encoding is not UTF-8 and the conversion output contains a nul + * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned. s/nul byte/nul character/ to make it a bit clearer we’re talking about UTF-8 here? (I realise that U+0000 is represented as a single zero byte.) @@ +918,3 @@ + * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned. + * If the source encoding is UTF-8, an embedded nul byte is treated with the + * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with s/#/%/ (see https://developer.gnome.org/gtk-doc-manual/unstable/documenting_syntax.html.en) @@ +964,3 @@ * the system codepage. + * + * The input string should not contain nul bytes even if the @len s/nul bytes/nul characters/, as above. @@ +966,3 @@ + * The input string should not contain nul bytes even if the @len + * argument is positive. A nul byte found inside the string may result + * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Use g_convert() to convert s/#/%/, as above. @@ +1178,3 @@ + * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with + * earlier versions of this library. Use g_convert() to produce output that + * may contain embedded nul bytes. Same changes needed as above. @@ +1226,3 @@ + * argument is positive. A nul byte found inside the string may result + * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Note that nul bytes are + * forbidden in all filename encodings that GLib is known to work with. Same changes needed as above.
Review of attachment 366798 [details] [review]: ::: glib/tests/convert.c @@ +695,3 @@ + +static void +test_locale_embedded_nul_utf8 (void) I’m trying to get GLib into the habit of having a brief comment above each test case which explains what the test case is testing. For example: ``` /* Test embedded nul characters in UTF-8 input to g_locale_to_utf8() result in an error. */ ``` @@ +703,3 @@ + setlocale (LC_ALL, ""); + g_setenv ("CHARSET", "UTF-8", TRUE); + g_assert (g_get_charset (NULL)); s/g_assert/g_assert_true/ to get a slightly more helpful error message on failure (and to avoid the assertion being compiled out when compiled with G_DISABLE_ASSERT). @@ +707,3 @@ + res = g_locale_to_utf8 ("ab\0c", 4, &bytes_read, NULL, &error); + + g_assert (res == NULL); Similarly, s/g_assert/g_assert_null/ @@ +708,3 @@ + + g_assert (res == NULL); + g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE); `error` is leaked after this. @@ +725,3 @@ + + g_assert (res == NULL); + g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL); Same comments as above. @@ +751,3 @@ + g_assert (res == NULL); + g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE); + g_assert_cmpuint (bytes_read, ==, 2); Same comments as above. @@ +766,3 @@ + + g_assert (res == NULL); + g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL); Same comments as above. @@ +803,3 @@ + g_test_add_func ("/conversion/filename-embedded-nul", test_filename_embedded_nul); + g_test_add_func ("/conversion/filename-embedded-nul/subprocess/utf8", test_filename_embedded_nul_utf8); + g_test_add_func ("/conversion/filename-embedded-nul/subprocess/iconv", test_filename_embedded_nul_iconv); It would probably also make sense to add a test for using g_convert() with embedded nuls, and that succeeding.
(In reply to Philip Withnall from comment #4) > Review of attachment 366793 [details] [review] [review]: > > ::: glib/gconvert.c > @@ +826,2 @@ > static gchar * > strdup_len (const gchar *string, > > It would be good to add a brief documentation comment Sure, will do. > @@ +829,2 @@ > gsize *bytes_read, > + gsize *bytes_written, > > Shouldn’t all the callers of strdup_len() be changed to pass these two > parameters in the new order? It is the order that the callers pass the parameters in :) It just didn't matter before, because strdup_len() always set the same value to both pointers, and with this patch, it no longer does.
Created attachment 366986 [details] [review] gconvert: Optimize UTF-8 conversions, fix output on error In the strdup_len() path, no need to do what g_utf8_validate() already does: locate the string-terminating nul byte. Also in strdup_len(), make the out parameter bytes_read receive the length of the valid (meaning also nul-free) part of the input string, as the documentation on g_{locale,filename}_{from,to}_utf8() says it does.
Created attachment 366987 [details] [review] gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions The character encoding conversion utility functions g_locale_to_utf8() and g_filename_to_utf8() had inconsistent behavior on producing strings with inner NUL bytes: in the all-UTF-8 strdup path, the input string validation prohibits embedded NULs, while g_convert(), using iconv(), can produce UTF-8 output with NUL bytes inside the output buffer. This, while valid UTF-8 per the Unicode standard, is not valid for the nul-terminated (type utf8) return value format that the *_to_utf8() functions are annotated with (as per discussion in bug 756128). Check the output of g_convert() for embedded NUL bytes, and if any are found, set the newly introduced error G_CONVERT_ERROR_EMBEDDED_NUL. Also document the error set by g_{locale,filename}_{from,to}_utf8() when the input string contains nul bytes.
Created attachment 366988 [details] [review] Test embedded NULs in input of g_{locale,filename}_to_utf8() The tests exercise both g_strncpy() and g_convert() paths.
Created attachment 366989 [details] [review] gconvert: Consistently validate inputs and outputs for embedded NULs String inputs to convenience conversion functions g_locale_from_utf8(), g_filename_from_utf8(), and g_filename_to_utf8(), are annotated for the bindings as NUL-terminated strings of (type utf8) or (type filename). There is also a len parameter that allows converting part of the string, but it is exposed to the bindings as a value independent from the string buffer. Absent any more sophisticated ways to annotate, the way to provide a safeguard against len argument values longer than the string length is to check that no nul is encountered within the first len bytes of the string. strdup_len() includes this check as part of UTF-8 validation, but g_convert() permits embedded nuls. For g_filename_from_utf8(), also check the output to prevent embedded NUL bytes. It's not safe to allow embedded NULs in a string that is going to be used as (type filename), and no known bytestring encoding for file names allows them.
Created attachment 366990 [details] [review] Test that g_convert() can handle embedded NUL bytes
Review of attachment 366986 [details] [review]: Great, thanks.
Review of attachment 366987 [details] [review]: ++
Review of attachment 366988 [details] [review]: ++
Review of attachment 366989 [details] [review]: ++
Review of attachment 366990 [details] [review]: One more check needed. ::: glib/tests/convert.c @@ +700,3 @@ + g_assert_cmpuint (bytes_read, ==, 4); + g_assert_cmpmem (res, bytes_written, "ab\0\xc3\xb6", 5); + g_free (res); Add in a `g_assert_no_error (error);` call somewhere.
Created attachment 367043 [details] [review] Test that g_convert() can handle embedded NUL bytes
Created attachment 367044 [details] [review] Fixed a style nitpick.
Created attachment 367046 [details] [review] Test that g_convert() can handle embedded NUL bytes Actually committing the style fix, restoring the patch title.
Review of attachment 367046 [details] [review]: Looks good, thanks.
All pushed to master, thanks. Attachment 366986 [details] pushed as 413605a - gconvert: Optimize UTF-8 conversions, fix output on error Attachment 366987 [details] pushed as 81cd815 - gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions Attachment 366988 [details] pushed as d584ff7 - Test embedded NULs in input of g_{locale,filename}_to_utf8() Attachment 366989 [details] pushed as f35a6a7 - gconvert: Consistently validate inputs and outputs for embedded NULs Attachment 367046 [details] pushed as 52f9891 - Test that g_convert() can handle embedded NUL bytes
*** Bug 738042 has been marked as a duplicate of this bug. ***