Bug 792516 – gconvert: More consistent handling of embedded NUL bytes

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 792516 - gconvert: More consistent handling of embedded NUL bytes


Summary:	gconvert: More consistent handling of embedded NUL bytes


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	i18n
Version:	2.55.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	738042 (view as bug list)
Depends on:
Blocks:	756128

Reported:	2018-01-14 17:46 UTC by Mikhail Zabaluev
Modified:	2018-02-01 14:11 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gconvert: Optimize UTF-8 conversions, fix output on error (1.82 KB, patch) 2018-01-14 17:48 UTC, Mikhail Zabaluev	none	Details \| Review
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions (7.53 KB, patch) 2018-01-14 17:48 UTC, Mikhail Zabaluev	none	Details \| Review
Test embedded NULs in input of g_{locale,filename}_to_utf8() (3.90 KB, patch) 2018-01-14 19:47 UTC, Mikhail Zabaluev	none	Details \| Review
gconvert: Optimize UTF-8 conversions, fix output on error (2.48 KB, patch) 2018-01-18 06:44 UTC, Mikhail Zabaluev	committed	Details \| Review
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions (9.80 KB, patch) 2018-01-18 06:44 UTC, Mikhail Zabaluev	committed	Details \| Review
Test embedded NULs in input of g_{locale,filename}_to_utf8() (4.48 KB, patch) 2018-01-18 06:44 UTC, Mikhail Zabaluev	committed	Details \| Review
gconvert: Consistently validate inputs and outputs for embedded NULs (16.93 KB, patch) 2018-01-18 06:45 UTC, Mikhail Zabaluev	committed	Details \| Review
Test that g_convert() can handle embedded NUL bytes (1.78 KB, patch) 2018-01-18 06:45 UTC, Mikhail Zabaluev	none	Details \| Review
Test that g_convert() can handle embedded NUL bytes (1.81 KB, patch) 2018-01-18 21:00 UTC, Mikhail Zabaluev	none	Details \| Review
Fixed a style nitpick. (1.81 KB, patch) 2018-01-18 21:16 UTC, Mikhail Zabaluev	none	Details \| Review
Test that g_convert() can handle embedded NUL bytes (1.81 KB, patch) 2018-01-18 21:24 UTC, Mikhail Zabaluev	committed	Details \| Review

Description Mikhail Zabaluev 2018-01-14 17:46:57 UTC

The return values of functions g_locale_to_utf8() and g_filename_to_utf8()
are widely expected to be NUL-terminated UTF-8 strings satisfying their default annotation of (type utf8). However, the paths in these functions that invoke iconv() to convert from encodings other than UTF-8 can produce output with embedded NUL bytes.

The behavior of reverse conversion functions g_locale_from_utf8() and g_filename_from_utf8() on NUL bytes embedded in the specified-length input buffer is not documented, although these functions may, inconsistently, invalidate such inputs, while neither iconv() nor the Unicode standard forbid embedded NUL bytes.

Comment 1 Mikhail Zabaluev 2018-01-14 17:48:31 UTC

Created attachment 366793 [details] [review]
gconvert: Optimize UTF-8 conversions, fix output on error

In the strdup_len() path, no need to repeat the work of
g_utf8_validate() and search for the string-terminating nul byte.

Also in strdup_len(), make the out parameter bytes_read receive the length
of the valid (meaning also nul-free) part of the input string, as the
documentation on g_{locale,filename}_{from,to}_utf8() says it does.

Comment 2 Mikhail Zabaluev 2018-01-14 17:48:43 UTC

Created attachment 366794 [details] [review]
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions

The character encoding conversion utility functions g_locale_to_utf8()
and g_filename_to_utf8() had inconsistent behavior on producing strings
with inner NUL bytes: in the all-UTF-8 strdup path, the input string
validation prohibits embedded NULs, while g_convert(), using iconv(),
can produce UTF-8 output with NUL bytes inside the output buffer.
This, while valid UTF-8 per the Unicode standard, is not valid for
the nul-terminated (type utf8) return value format that the *_to_utf8()
functions are annotated with (as per discussion in bug 756128).

Check the output of g_convert() for embedded NUL bytes, and if any
are found, set the newly introduced error
G_CONVERT_ERROR_EMBDEDDED_NUL.

Also document the error set by g_{locale,filename}_{from,to}_utf8()
when the input string contains nul bytes.

Comment 3 Mikhail Zabaluev 2018-01-14 19:47:42 UTC

Created attachment 366798 [details] [review]
Test embedded NULs in input of g_{locale,filename}_to_utf8()

The tests exercise both g_strncpy() and g_convert() paths.

Comment 4 Philip Withnall 2018-01-17 10:48:33 UTC

Review of attachment 366793 [details] [review]:

::: glib/gconvert.c
@@ +826,2 @@
 static gchar *
 strdup_len (const gchar *string,

It would be good to add a brief documentation comment, along the lines of:
```
/* Validate @string as UTF-8. @len can be negative if @string is nul-terminated, or a non-negative value in bytes. If @string ends in an incomplete sequence, or contains any illegal sequences or nul codepoints, %FALSE will be returned. On success or error, if provided, @bytes_read will be set to the number of bytes in @string up to @len or the terminating nul byte. */
```

@@ +829,2 @@
 	    gsize       *bytes_read,
+	    gsize       *bytes_written,

Shouldn’t all the callers of strdup_len() be changed to pass these two parameters in the new order?

Actually, you could just drop one of these parameters, since they’re both always set to the same thing.

Comment 5 Philip Withnall 2018-01-17 11:01:17 UTC

Review of attachment 366794 [details] [review]:

Nice commit message! (Although there’s a typo in `G_CONVERT_ERROR_EMBDEDDED_NUL` in it.)

This looks good to me. I think all my comments are minor.

::: glib/gconvert.c
@@ +856,3 @@
 }
 
+static gchar*

Nitpick: Missing space before `*`.

@@ +862,3 @@
+                 gsize       *bytes_read,
+                 gsize       *bytes_written,
+                 GError      **error)

Nitpick: The parameters should be aligned by the first letter of their name. So `error` needs to stay where it is, and all the others need to be indented one space further. i.e.

```
convert_to_utf8 (const gchar  *opsysstring,
                 gssize        len,
                 const gchar  *charset,
                 gsize        *bytes_read,
                 gsize        *bytes_written,
                 GError      **error)
```

@@ +869,3 @@
+  utf8 = g_convert (opsysstring, len, "UTF-8", charset,
+                    bytes_read, &outbytes, error);
+  if (!utf8)

Nitpick: It’s clearer to write `(utf8 == NULL)` so that it doesn’t look like utf8 is a boolean variable.

@@ +916,3 @@
+ *
+ * If the source encoding is not UTF-8 and the conversion output contains a nul
+ * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned.

s/nul byte/nul character/ to make it a bit clearer we’re talking about UTF-8 here? (I realise that U+0000 is represented as a single zero byte.)

@@ +918,3 @@
+ * byte, the error #G_CONVERT_ERROR_EMBEDDED_NUL is set and %NULL is returned.
+ * If the source encoding is UTF-8, an embedded nul byte is treated with the
+ * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with

s/#/%/ (see https://developer.gnome.org/gtk-doc-manual/unstable/documenting_syntax.html.en)

@@ +964,3 @@
  * the system codepage.
+ *
+ * The input string should not contain nul bytes even if the @len

s/nul bytes/nul characters/, as above.

@@ +966,3 @@
+ * The input string should not contain nul bytes even if the @len
+ * argument is positive. A nul byte found inside the string may result
+ * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Use g_convert() to convert

s/#/%/, as above.

@@ +1178,3 @@
+ * #G_CONVERT_ERROR_ILLEGAL_SEQUENCE error for backward compatibility with
+ * earlier versions of this library. Use g_convert() to produce output that
+ * may contain embedded nul bytes.

Same changes needed as above.

@@ +1226,3 @@
+ * argument is positive. A nul byte found inside the string may result
+ * in error #G_CONVERT_ERROR_ILLEGAL_SEQUENCE. Note that nul bytes are
+ * forbidden in all filename encodings that GLib is known to work with.

Same changes needed as above.

Comment 6 Philip Withnall 2018-01-17 11:09:11 UTC

Review of attachment 366798 [details] [review]:

::: glib/tests/convert.c
@@ +695,3 @@
+
+static void
+test_locale_embedded_nul_utf8 (void)

I’m trying to get GLib into the habit of having a brief comment above each test case which explains what the test case is testing. For example:
```
/* Test embedded nul characters in UTF-8 input to g_locale_to_utf8() result in an error. */
```

@@ +703,3 @@
+  setlocale (LC_ALL, "");
+  g_setenv ("CHARSET", "UTF-8", TRUE);
+  g_assert (g_get_charset (NULL));

s/g_assert/g_assert_true/ to get a slightly more helpful error message on failure (and to avoid the assertion being compiled out when compiled with G_DISABLE_ASSERT).

@@ +707,3 @@
+  res = g_locale_to_utf8 ("ab\0c", 4, &bytes_read, NULL, &error);
+
+  g_assert (res == NULL);

Similarly, s/g_assert/g_assert_null/

@@ +708,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE);

`error` is leaked after this.

@@ +725,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL);

Same comments as above.

@@ +751,3 @@
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE);
+  g_assert_cmpuint (bytes_read, ==, 2);

Same comments as above.

@@ +766,3 @@
+
+  g_assert (res == NULL);
+  g_assert_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_EMBEDDED_NUL);

Same comments as above.

@@ +803,3 @@
+  g_test_add_func ("/conversion/filename-embedded-nul", test_filename_embedded_nul);
+  g_test_add_func ("/conversion/filename-embedded-nul/subprocess/utf8", test_filename_embedded_nul_utf8);
+  g_test_add_func ("/conversion/filename-embedded-nul/subprocess/iconv", test_filename_embedded_nul_iconv);

It would probably also make sense to add a test for using g_convert() with embedded nuls, and that succeeding.

Comment 7 Mikhail Zabaluev 2018-01-17 20:32:02 UTC

(In reply to Philip Withnall from comment #4)
> Review of attachment 366793 [details] [review] [review]:
> 
> ::: glib/gconvert.c
> @@ +826,2 @@
>  static gchar *
>  strdup_len (const gchar *string,
> 
> It would be good to add a brief documentation comment

Sure, will do.

> @@ +829,2 @@
>  	    gsize       *bytes_read,
> +	    gsize       *bytes_written,
> 
> Shouldn’t all the callers of strdup_len() be changed to pass these two
> parameters in the new order?

It is the order that the callers pass the parameters in :)
It just didn't matter before, because strdup_len() always set the same value to both pointers, and with this patch, it no longer does.

Comment 8 Mikhail Zabaluev 2018-01-18 06:44:13 UTC

Created attachment 366986 [details] [review]
gconvert: Optimize UTF-8 conversions, fix output on error

In the strdup_len() path, no need to do what g_utf8_validate()
already does: locate the string-terminating nul byte.

Also in strdup_len(), make the out parameter bytes_read receive the length
of the valid (meaning also nul-free) part of the input string, as the
documentation on g_{locale,filename}_{from,to}_utf8() says it does.

Comment 9 Mikhail Zabaluev 2018-01-18 06:44:33 UTC

Created attachment 366987 [details] [review]
gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions

The character encoding conversion utility functions g_locale_to_utf8()
and g_filename_to_utf8() had inconsistent behavior on producing strings
with inner NUL bytes: in the all-UTF-8 strdup path, the input string
validation prohibits embedded NULs, while g_convert(), using iconv(),
can produce UTF-8 output with NUL bytes inside the output buffer.
This, while valid UTF-8 per the Unicode standard, is not valid for
the nul-terminated (type utf8) return value format that the *_to_utf8()
functions are annotated with (as per discussion in bug 756128).

Check the output of g_convert() for embedded NUL bytes, and if any
are found, set the newly introduced error
G_CONVERT_ERROR_EMBEDDED_NUL.

Also document the error set by g_{locale,filename}_{from,to}_utf8()
when the input string contains nul bytes.

Comment 10 Mikhail Zabaluev 2018-01-18 06:44:45 UTC

Created attachment 366988 [details] [review]
Test embedded NULs in input of g_{locale,filename}_to_utf8()

The tests exercise both g_strncpy() and g_convert() paths.

Comment 11 Mikhail Zabaluev 2018-01-18 06:45:40 UTC

Created attachment 366989 [details] [review]
gconvert: Consistently validate inputs and outputs for embedded NULs

String inputs to convenience conversion functions g_locale_from_utf8(),
g_filename_from_utf8(), and g_filename_to_utf8(), are annotated for the
bindings as NUL-terminated strings of (type utf8) or (type filename).
There is also a len parameter that allows converting part of the string,
but it is exposed to the bindings as a value independent from the string
buffer. Absent any more sophisticated ways to annotate, the way to
provide a safeguard against len argument values longer than the
string length is to check that no nul is encountered within the first
len bytes of the string. strdup_len() includes this check as part of
UTF-8 validation, but g_convert() permits embedded nuls.

For g_filename_from_utf8(), also check the output to prevent embedded NUL
bytes. It's not safe to allow embedded NULs in a string that is going
to be used as (type filename), and no known bytestring encoding for
file names allows them.

Comment 12 Mikhail Zabaluev 2018-01-18 06:45:51 UTC

Created attachment 366990 [details] [review]
Test that g_convert() can handle embedded NUL bytes

Comment 13 Philip Withnall 2018-01-18 11:45:33 UTC

Review of attachment 366986 [details] [review]:

Great, thanks.

Comment 14 Philip Withnall 2018-01-18 11:48:33 UTC

Review of attachment 366987 [details] [review]:

++

Comment 15 Philip Withnall 2018-01-18 11:49:52 UTC

Review of attachment 366988 [details] [review]:

++

Comment 16 Philip Withnall 2018-01-18 11:58:42 UTC

Review of attachment 366989 [details] [review]:

++

Comment 17 Philip Withnall 2018-01-18 12:00:35 UTC

Review of attachment 366990 [details] [review]:

One more check needed.

::: glib/tests/convert.c
@@ +700,3 @@
+  g_assert_cmpuint (bytes_read, ==, 4);
+  g_assert_cmpmem (res, bytes_written, "ab\0\xc3\xb6", 5);
+  g_free (res);

Add in a `g_assert_no_error (error);` call somewhere.

Comment 18 Mikhail Zabaluev 2018-01-18 21:00:37 UTC

Created attachment 367043 [details] [review]
Test that g_convert() can handle embedded NUL bytes

Comment 19 Mikhail Zabaluev 2018-01-18 21:16:01 UTC

Created attachment 367044 [details] [review]
Fixed a style nitpick.

Comment 20 Mikhail Zabaluev 2018-01-18 21:24:20 UTC

Created attachment 367046 [details] [review]
Test that g_convert() can handle embedded NUL bytes

Actually committing the style fix, restoring the patch title.

Comment 21 Philip Withnall 2018-01-19 10:10:09 UTC

Review of attachment 367046 [details] [review]:

Looks good, thanks.

Comment 22 Philip Withnall 2018-01-19 12:12:33 UTC

All pushed to master, thanks.

Attachment 366986 [details] pushed as 413605a - gconvert: Optimize UTF-8 conversions, fix output on error
Attachment 366987 [details] pushed as 81cd815 - gconvert: Tighten, document embedded NUL behavior of UTF-8 conversions
Attachment 366988 [details] pushed as d584ff7 - Test embedded NULs in input of g_{locale,filename}_to_utf8()
Attachment 366989 [details] pushed as f35a6a7 - gconvert: Consistently validate inputs and outputs for embedded NULs
Attachment 367046 [details] pushed as 52f9891 - Test that g_convert() can handle embedded NUL bytes

Comment 23 Philip Withnall 2018-02-01 14:11:38 UTC

*** Bug 738042 has been marked as a duplicate of this bug. ***