After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 792516 - gconvert: More consistent handling of embedded NUL bytes
gconvert: More consistent handling of embedded NUL bytes
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: i18n
2.55.x
Other All
: Normal normal
: ---
Assigned To: gtkdev
gtkdev
: 738042 (view as bug list)
Depends on:
Blocks: 756128
 
 
Reported: 2018-01-14 17:46 UTC by Mikhail Zabaluev
Modified: 2018-02-01 14:11 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
gconvert: Optimize UTF-8 conversions, fix output on error (1.82 KB, patch)
2018-01-14 17:48 UTC, Mikhail Zabaluev
none Details | Review
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions (7.53 KB, patch)
2018-01-14 17:48 UTC, Mikhail Zabaluev
none Details | Review
Test embedded NULs in input of g_{locale,filename}_to_utf8() (3.90 KB, patch)
2018-01-14 19:47 UTC, Mikhail Zabaluev
none Details | Review
gconvert: Optimize UTF-8 conversions, fix output on error (2.48 KB, patch)
2018-01-18 06:44 UTC, Mikhail Zabaluev
committed Details | Review
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions (9.80 KB, patch)
2018-01-18 06:44 UTC, Mikhail Zabaluev
committed Details | Review
Test embedded NULs in input of g_{locale,filename}_to_utf8() (4.48 KB, patch)
2018-01-18 06:44 UTC, Mikhail Zabaluev
committed Details | Review
gconvert: Consistently validate inputs and outputs for embedded NULs (16.93 KB, patch)
2018-01-18 06:45 UTC, Mikhail Zabaluev
committed Details | Review
Test that g_convert() can handle embedded NUL bytes (1.78 KB, patch)
2018-01-18 06:45 UTC, Mikhail Zabaluev
none Details | Review
Test that g_convert() can handle embedded NUL bytes (1.81 KB, patch)
2018-01-18 21:00 UTC, Mikhail Zabaluev
none Details | Review
Fixed a style nitpick. (1.81 KB, patch)
2018-01-18 21:16 UTC, Mikhail Zabaluev
none Details | Review
Test that g_convert() can handle embedded NUL bytes (1.81 KB, patch)
2018-01-18 21:24 UTC, Mikhail Zabaluev
committed Details | Review

Description Mikhail Zabaluev 2018-01-14 17:46:57 UTC
The return values of functions g_locale_to_utf8() and g_filename_to_utf8()
are widely expected to be NUL-terminated UTF-8 strings satisfying their default annotation of (type utf8). However, the paths in these functions that invoke iconv() to convert from encodings other than UTF-8 can produce output with embedded NUL bytes.

The behavior of reverse conversion functions g_locale_from_utf8() and g_filename_from_utf8() on NUL bytes embedded in the specified-length input buffer is not documented, although these functions may, inconsistently, invalidate such inputs, while neither iconv() nor the Unicode standard forbid embedded NUL bytes.
Comment 1 Mikhail Zabaluev 2018-01-14 17:48:31 UTC
Created attachment 366793 [details] [review]
gconvert: Optimize UTF-8 conversions, fix output on error

In the strdup_len() path, no need to repeat the work of
g_utf8_validate() and search for the string-terminating nul byte.

Also in strdup_len(), make the out parameter bytes_read receive the length
of the valid (meaning also nul-free) part of the input string, as the
documentation on g_{locale,filename}_{from,to}_utf8() says it does.
Comment 2 Mikhail Zabaluev 2018-01-14 17:48:43 UTC
Created attachment 366794 [details] [review]
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions

The character encoding conversion utility functions g_locale_to_utf8()
and g_filename_to_utf8() had inconsistent behavior on producing strings
with inner NUL bytes: in the all-UTF-8 strdup path, the input string
validation prohibits embedded NULs, while g_convert(), using iconv(),
can produce UTF-8 output with NUL bytes inside the output buffer.
This, while valid UTF-8 per the Unicode standard, is not valid for
the nul-terminated (type utf8) return value format that the *_to_utf8()
functions are annotated with (as per discussion in bug 756128).

Check the output of g_convert() for embedded NUL bytes, and if any
are found, set the newly introduced error
G_CONVERT_ERROR_EMBDEDDED_NUL.

Also document the error set by g_{locale,filename}_{from,to}_utf8()
when the input string contains nul bytes.
Comment 3 Mikhail Zabaluev 2018-01-14 19:47:42 UTC
Created attachment 366798 [details] [review]
Test embedded NULs in input of g_{locale,filename}_to_utf8()

The tests exercise both g_strncpy() and g_convert() paths.
Comment 4 Philip Withnall 2018-01-17 10:48:33 UTC
Review of attachment 366793 [details] [review]:

::: glib/gconvert.c
@@ +826,2 @@
 static gchar *
 strdup_len (const gchar *string,

It would be good to add a brief documentation comment, along the lines of:
```
/* Validate @string as UTF-8. @len can be negative if @string is nul-terminated, or a non-negative value in bytes. If @string ends in an incomplete sequence, or contains any illegal sequences or nul codepoints, %FALSE will be returned. On success or error, if provided, @bytes_read will be set to the number of bytes in @string up to @len or the terminating nul byte. */
```

@@ +829,2 @@
 	    gsize       *bytes_read,
+	    gsize       *bytes_written,

Shouldn’t all the callers of strdup_len() be changed to pass these two parameters in the new order?

Actually, you could just drop one of these parameters, since they’re both always set to the same thing.
Comment 5 Philip Withnall 2018-01-17 11:01:17 UTC
Review of attachment 366794 [details] [review]:

Nice commit message! (Although there’s a typo in `G_CONVERT_ERROR_EMBDEDDED_NUL` in it.)

This looks good to me. I think all my comments are minor.

::: glib/gconvert.c
@@ +856,3 @@
 }
 
+static gchar*

Nitpick: Missing space before `*`.

@@ +862,3 @@
+                 gsize       *bytes_read,
+                 gsize       *bytes_written,
+                 GError      **error)

Nitpick: The parameters should be aligned by the first letter of their name. So `error` needs to stay where it is, and all the others need to be indented one space further. i.e.

```
convert_to_utf8 (const gchar  *opsysstring,
                 gssize        len,
                 const gchar  *charset,
                 gsize        *bytes_read,
                 gsize        *bytes_written,
                 GError      **error)
```

@@ +869,3 @@
+  utf8 = g_convert (opsysstring, len, "UTF-8", charset,
+                    bytes_read, &outbytes, error);
+  if (!utf8)

Nitpick: It’s clearer to write `(utf8 == NULL)` so that it doesn’t look like utf8 is a boolean variable.

@@ +916,3 @@
+ *
+ * If the source encoding is not UTF-8 and the conversion output contains a nul
+ * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned.

s/nul byte/nul character/ to make it a bit clearer we’re talking about UTF-8 here? (I realise that U+0000 is represented as a single zero byte.)

@@ +918,3 @@
+ * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned.
+ * If the source encoding is UTF-8, an embedded nul byte is treated with the
+ * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with

s/#/%/ (see https://developer.gnome.org/gtk-doc-manual/unstable/documenting_syntax.html.en)

@@ +964,3 @@
  * the system codepage.
+ *
+ * The input string should not contain nul bytes even if the @len

s/nul bytes/nul characters/, as above.

@@ +966,3 @@
+ * The input string should not contain nul bytes even if the @len
+ * argument is positive. A nul byte found inside the string may result
+ * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Use g_convert() to convert

s/#/%/, as above.

@@ +1178,3 @@
+ * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with
+ * earlier versions of this library. Use g_convert() to produce output that
+ * may contain embedded nul bytes.

Same changes needed as above.

@@ +1226,3 @@
+ * argument is positive. A nul byte found inside the string may result
+ * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Note that nul bytes are
+ * forbidden in all filename encodings that GLib is known to work with.

Same changes needed as above.
Comment 6 Philip Withnall 2018-01-17 11:09:11 UTC
Review of attachment 366798 [details] [review]:

::: glib/tests/convert.c
@@ +695,3 @@
+
+static void
+test_locale_embedded_nul_utf8 (void)

I’m trying to get GLib into the habit of having a brief comment above each test case which explains what the test case is testing. For example:
```
/* Test embedded nul characters in UTF-8 input to g_locale_to_utf8() result in an error. */
```

@@ +703,3 @@
+  setlocale (LC_ALL, "");
+  g_setenv ("CHARSET", "UTF-8", TRUE);
+  g_assert (g_get_charset (NULL));

s/g_assert/g_assert_true/ to get a slightly more helpful error message on failure (and to avoid the assertion being compiled out when compiled with G_DISABLE_ASSERT).

@@ +707,3 @@
+  res = g_locale_to_utf8 ("ab\0c", 4, &bytes_read, NULL, &error);
+
+  g_assert (res == NULL);

Similarly, s/g_assert/g_assert_null/

@@ +708,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE);

`error` is leaked after this.

@@ +725,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL);

Same comments as above.

@@ +751,3 @@
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE);
+  g_assert_cmpuint (bytes_read, ==, 2);

Same comments as above.

@@ +766,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL);

Same comments as above.

@@ +803,3 @@
+  g_test_add_func ("/conversion/filename-embedded-nul", test_filename_embedded_nul);
+  g_test_add_func ("/conversion/filename-embedded-nul/subprocess/utf8", test_filename_embedded_nul_utf8);
+  g_test_add_func ("/conversion/filename-embedded-nul/subprocess/iconv", test_filename_embedded_nul_iconv);

It would probably also make sense to add a test for using g_convert() with embedded nuls, and that succeeding.
Comment 7 Mikhail Zabaluev 2018-01-17 20:32:02 UTC
(In reply to Philip Withnall from comment #4)
> Review of attachment 366793 [details] [review] [review]:
> 
> ::: glib/gconvert.c
> @@ +826,2 @@
>  static gchar *
>  strdup_len (const gchar *string,
> 
> It would be good to add a brief documentation comment

Sure, will do.

> @@ +829,2 @@
>  	    gsize       *bytes_read,
> +	    gsize       *bytes_written,
> 
> Shouldn’t all the callers of strdup_len() be changed to pass these two
> parameters in the new order?

It is the order that the callers pass the parameters in :)
It just didn't matter before, because strdup_len() always set the same value to both pointers, and with this patch, it no longer does.
Comment 8 Mikhail Zabaluev 2018-01-18 06:44:13 UTC
Created attachment 366986 [details] [review]
gconvert: Optimize UTF-8 conversions, fix output on error

In the strdup_len() path, no need to do what g_utf8_validate()
already does: locate the string-terminating nul byte.

Also in strdup_len(), make the out parameter bytes_read receive the length
of the valid (meaning also nul-free) part of the input string, as the
documentation on g_{locale,filename}_{from,to}_utf8() says it does.
Comment 9 Mikhail Zabaluev 2018-01-18 06:44:33 UTC
Created attachment 366987 [details] [review]
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions

The character encoding conversion utility functions g_locale_to_utf8()
and g_filename_to_utf8() had inconsistent behavior on producing strings
with inner NUL bytes: in the all-UTF-8 strdup path, the input string
validation prohibits embedded NULs, while g_convert(), using iconv(),
can produce UTF-8 output with NUL bytes inside the output buffer.
This, while valid UTF-8 per the Unicode standard, is not valid for
the nul-terminated (type utf8) return value format that the *_to_utf8()
functions are annotated with (as per discussion in bug 756128).

Check the output of g_convert() for embedded NUL bytes, and if any
are found, set the newly introduced error
G_CONVERT_ERROR_EMBEDDED_NUL.

Also document the error set by g_{locale,filename}_{from,to}_utf8()
when the input string contains nul bytes.
Comment 10 Mikhail Zabaluev 2018-01-18 06:44:45 UTC
Created attachment 366988 [details] [review]
Test embedded NULs in input of g_{locale,filename}_to_utf8()

The tests exercise both g_strncpy() and g_convert() paths.
Comment 11 Mikhail Zabaluev 2018-01-18 06:45:40 UTC
Created attachment 366989 [details] [review]
gconvert: Consistently validate inputs and outputs for embedded NULs

String inputs to convenience conversion functions g_locale_from_utf8(),
g_filename_from_utf8(), and g_filename_to_utf8(), are annotated for the
bindings as NUL-terminated strings of (type utf8) or (type filename).
There is also a len parameter that allows converting part of the string,
but it is exposed to the bindings as a value independent from the string
buffer. Absent any more sophisticated ways to annotate, the way to
provide a safeguard against len argument values longer than the
string length is to check that no nul is encountered within the first
len bytes of the string. strdup_len() includes this check as part of
UTF-8 validation, but g_convert() permits embedded nuls.

For g_filename_from_utf8(), also check the output to prevent embedded NUL
bytes. It's not safe to allow embedded NULs in a string that is going
to be used as (type filename), and no known bytestring encoding for
file names allows them.
Comment 12 Mikhail Zabaluev 2018-01-18 06:45:51 UTC
Created attachment 366990 [details] [review]
Test that g_convert() can handle embedded NUL bytes
Comment 13 Philip Withnall 2018-01-18 11:45:33 UTC
Review of attachment 366986 [details] [review]:

Great, thanks.
Comment 14 Philip Withnall 2018-01-18 11:48:33 UTC
Review of attachment 366987 [details] [review]:

++
Comment 15 Philip Withnall 2018-01-18 11:49:52 UTC
Review of attachment 366988 [details] [review]:

++
Comment 16 Philip Withnall 2018-01-18 11:58:42 UTC
Review of attachment 366989 [details] [review]:

++
Comment 17 Philip Withnall 2018-01-18 12:00:35 UTC
Review of attachment 366990 [details] [review]:

One more check needed.

::: glib/tests/convert.c
@@ +700,3 @@
+  g_assert_cmpuint (bytes_read, ==, 4);
+  g_assert_cmpmem (res, bytes_written, "ab\0\xc3\xb6", 5);
+  g_free (res);

Add in a `g_assert_no_error (error);` call somewhere.
Comment 18 Mikhail Zabaluev 2018-01-18 21:00:37 UTC
Created attachment 367043 [details] [review]
Test that g_convert() can handle embedded NUL bytes
Comment 19 Mikhail Zabaluev 2018-01-18 21:16:01 UTC
Created attachment 367044 [details] [review]
Fixed a style nitpick.
Comment 20 Mikhail Zabaluev 2018-01-18 21:24:20 UTC
Created attachment 367046 [details] [review]
Test that g_convert() can handle embedded NUL bytes

Actually committing the style fix, restoring the patch title.
Comment 21 Philip Withnall 2018-01-19 10:10:09 UTC
Review of attachment 367046 [details] [review]:

Looks good, thanks.
Comment 22 Philip Withnall 2018-01-19 12:12:33 UTC
All pushed to master, thanks.

Attachment 366986 [details] pushed as 413605a - gconvert: Optimize UTF-8 conversions, fix output on error
Attachment 366987 [details] pushed as 81cd815 - gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions
Attachment 366988 [details] pushed as d584ff7 - Test embedded NULs in input of g_{locale,filename}_to_utf8()
Attachment 366989 [details] pushed as f35a6a7 - gconvert: Consistently validate inputs and outputs for embedded NULs
Attachment 367046 [details] pushed as 52f9891 - Test that g_convert() can handle embedded NUL bytes
Comment 23 Philip Withnall 2018-02-01 14:11:38 UTC
*** Bug 738042 has been marked as a duplicate of this bug. ***