Bug 756128 – Fix up annotations in gconvert

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 756128 - Fix up annotations in gconvert


Summary:	Fix up annotations in gconvert


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	introspection
Version:	2.46.x
Hardware:	Other All

Importance:	Normal minor
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:	792516
Blocks:

Reported:	2015-10-06 15:35 UTC by Mikhail Zabaluev
Modified:	2018-02-09 13:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
gconvert: Skip the GIConv API from introspection (2.02 KB, patch) 2015-10-06 15:35 UTC, Mikhail Zabaluev	committed	Details \| Review
gconvert: Correctly annotate string types and output parameters (19.34 KB, patch) 2015-10-06 15:36 UTC, Mikhail Zabaluev	none	Details \| Review
gconvert: Correctly annotate string types and output parameters (16.59 KB, patch) 2018-01-05 22:48 UTC, Mikhail Zabaluev	none	Details \| Review
gconvert: Correctly annotate string types and output parameters (11.17 KB, patch) 2018-02-08 21:42 UTC, Mikhail Zabaluev	committed	Details \| Review
gconvert: g_filename_from_utf8() returns (type filename) (1.83 KB, patch) 2018-02-08 21:42 UTC, Mikhail Zabaluev	committed	Details \| Review

Description Mikhail Zabaluev 2015-10-06 15:35:52 UTC

The API in gconvert is poorly annotated and needs love.

Comment 1 Mikhail Zabaluev 2015-10-06 15:35:57 UTC

Created attachment 312740 [details] [review]
gconvert: Skip the GIConv API from introspection

It's ugly:
- The core method, g_iconv(), can't be annotated with good semantics.
- The error value of g_iconv_open() is not representable in today's
  introspection.

Comment 2 Mikhail Zabaluev 2015-10-06 15:36:03 UTC

Created attachment 312741 [details] [review]
gconvert: Correctly annotate string types and output parameters

Note that the iconv wrapper API works with byte arrays. It's wrong to
default to utf8 there, because iconv can read and produce strings with
interior nul characters which are not allowed in (type utf8).
The documentation in places was misleading about that, so that got
corrected as well.

Comment 3 David King 2016-01-11 09:33:06 UTC

Review of attachment 312741 [details] [review]:

::: glib/gconvert.c
@@ +356,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the

Isn't bytes_written also (out) (optional) in all cases? It seems that there are appropriate checks to handle it being NULL in all the functions here.

@@ +364,3 @@
  * 
  * Note that you should use g_iconv() for streaming conversions. 
+ * Despite the fact that @bytes_read can return information about partial

Mmm, typos. :-)

Comment 4 Philip Withnall 2017-09-11 19:46:28 UTC

Review of attachment 312740 [details] [review]:

Ouch, yes.

Comment 5 Philip Withnall 2017-09-11 19:47:36 UTC

Comment on attachment 312740 [details] [review]
gconvert: Skip the GIConv API from introspection

Attachment 312740 [details] pushed as 7759542 - gconvert: Skip the GIConv API from introspection

Comment 6 Philip Withnall 2017-09-11 19:56:54 UTC

Review of attachment 312741 [details] [review]:

::: glib/gconvert.c
@@ +356,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the

Indeed, it’s (optional).

@@ +511,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the output buffer (not

Also (optional).

@@ +593,3 @@
  *                less than @len if there were partial characters
  *                at the end of the input.
+ * @bytes_written: (out): the number of bytes stored in the output buffer (not

Also (optional).

@@ +885,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the

Also (optional).

@@ +929,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the output buffer (not

Also (optional).

@@ +980,3 @@
 /**
  * g_get_filename_charsets:
+ * @charsets: (out) (transfer container) (array zero-terminated=1):

Seems like the parameter is actually called filename_charsets in the function definition (but not the declaration). Would be good to make those consistent.

I also think it’s (transfer none) rather than (transfer container).

@@ +1122,3 @@
 /**
  * g_filename_to_utf8:
+ * @opsysstring: (array length=len) (element-type guint8): a string in the

Instead of (element-type guint8) this should be (type filename). (This change was already made in commit 41013a01f4.)

@@ +1136,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the output buffer (not

Also (optional).

@@ +1146,3 @@
  * the [current locale][setlocale].
  * 
+ * Returns: (array length=bytes_written) (element-type guint8) (transfer full):

This shouldn’t be (element-type guint8) since it’s a UTF-8 string — if adding (element-type) means this is no longer treated as (type utf8), I think this change is wrong (even if it does make the (array length) explicit).

Similarly for some of the other (element-type guint8) annotations on UTF-8 strings in this patch.

@@ +1210,3 @@
  *                 stored will the byte offset after the last valid
  *                 input sequence.
+ * @bytes_written: (out): the number of bytes stored in the output buffer (not

Also (optional).

Comment 7 Mikhail Zabaluev 2018-01-05 21:35:44 UTC

(In reply to Philip Withnall from comment #6)
> Review of attachment 312741 [details] [review] [review]:
> 
> @@ +1122,3 @@
>  /**
>   * g_filename_to_utf8:
> + * @opsysstring: (array length=len) (element-type guint8): a string in the
> 
> Instead of (element-type guint8) this should be (type filename). (This
> change was already made in commit 41013a01f4.)

Is there a way to annotate that the len parameter is always passed -1 instead of being exposed as a parameter by the bindings? This is needed if the opsysstring parameter is annotated as a NUL-terminated (type filename).
(skip) (default -1) would seem like it, if it can be made to work in all bindings; I can't find this combination in use anywhere in glib source tree, so I assume it's not supported currently.

> @@ +1146,3 @@
>   * the [current locale][setlocale].
>   * 
> + * Returns: (array length=bytes_written) (element-type guint8) (transfer
> full):
> 
> This shouldn’t be (element-type guint8) since it’s a UTF-8 string — if
> adding (element-type) means this is no longer treated as (type utf8), I
> think this change is wrong (even if it does make the (array length)
> explicit).
> 
> Similarly for some of the other (element-type guint8) annotations on UTF-8
> strings in this patch.

As I tried to explain in the commit's comment, iconv can produce strings with interior NUL bytes. This is not conformant to (type utf8) and may result in data loss or corruption in the code receiving the string, even though it's perfectly valid UTF-8 as per the Unicode standard.

The input parameters can indeed be annotated (type utf8), but this will narrow the set of possible inputs: the functions can also convert strings containing interior NULs. I believe iconv validates the input in all supported implementations (or at least it should by specification), so restricting the input with static typing or binding-imposed run time validation seems excessive.

Comment 8 Mikhail Zabaluev 2018-01-05 21:45:18 UTC

(In reply to Mikhail Zabaluev from comment #7)
> The input parameters can indeed be annotated (type utf8)

Correction: they could, if not for the issue with the len parameter as mentioned above.

Comment 9 Mikhail Zabaluev 2018-01-05 22:48:55 UTC

Created attachment 366399 [details] [review]
gconvert: Correctly annotate string types and output parameters

Note that the iconv wrapper API works with byte arrays. It's wrong to
default to utf8 there, because iconv can read and produce strings with
interior nul characters which are not allowed in (type utf8).
The documentation was misleading about that in some places, so that got
corrected as well.

Comment 10 Philip Withnall 2018-01-11 12:27:27 UTC

Review of attachment 366399 [details] [review]:

::: glib/gconvert.c
@@ +1125,3 @@
 /**
  * g_filename_to_utf8:
+ * @opsysstring:   (array length=len) (element-type guint8): a string in the

I think we use (type filename) in enough places which don’t have a len argument that your filename encoding basically cannot have embedded nuls if it’s going to be used with GLib.

However, since there is a len parameter here, we probably *should* bind it to opsysstring, so given your explanation, this change looks OK to me.

@@ +1149,3 @@
  * the [current locale][setlocale].
  * 
+ * Returns: (array length=bytes_written) (element-type guint8) (transfer full):

GLib assumes that UTF-8 encoded strings *don’t* contain embedded nuls, so I think this really should be (type utf8). To do otherwise would be misleading.

For example, g_utf8_validate() explicitly returns an error if called with (max_len > 0) on a string which contains an embedded nul.

I think in this case, the bytes_written argument is a convenience to avoid having to call strlen() on the return value. It’s not intended to support embedded nuls. (Unless you have evidence to the contrary?)

In the UTF-8 input path, for example (if get_filename_charset() returns true), I think g_filename_to_utf8() will return an error if called on an opsysstring with an embedded nul.

@@ +1173,3 @@
 /**
  * g_filename_from_utf8:
+ * @utf8string:    (array length=len) (element-type guint8):

Following the example of g_strndup(), I don’t think it’s necessary to mandatorily bind len to utf8string here. Just leave it as (type utf8), and the len argument can be used optionally to limit the input to less than strlen(utf8string).

Comment 11 Philip Withnall 2018-01-11 12:29:32 UTC

Aside from these annotations changes, I think it would help if g_filename_[to|from]_utf8() were clarified about their behaviour for embedded nuls. I think the right thing to do is disallow embedded nuls in the UTF-8, but from reading the code, I think this is only checked on one out of two code paths — and it’s not mentioned in the documentation at all.

Would you be able to put together another patch which tidies up handling of embedded nuls there, adds some tests, and mentions it in the documentation please?

Matthias, do you have any input here?

Comment 12 Mikhail Zabaluev 2018-01-12 07:36:19 UTC

(In reply to Philip Withnall from comment #10)
> Review of attachment 366399 [details] [review] [review]:
> 
> ::: glib/gconvert.c
> @@ +1125,3 @@
>  /**
>   * g_filename_to_utf8:
> + * @opsysstring:   (array length=len) (element-type guint8): a string in the
> 
> I think we use (type filename) in enough places which don’t have a len
> argument that your filename encoding basically cannot have embedded nuls if
> it’s going to be used with GLib.

Indeed. The convenience functions could be annotated with the constrained encoding types, at least on input. Should there be @len: (skip), then?

> @@ +1149,3 @@
>   * the [current locale][setlocale].
>   * 
> + * Returns: (array length=bytes_written) (element-type guint8) (transfer
> full):
> 
> GLib assumes that UTF-8 encoded strings *don’t* contain embedded nuls, so I
> think this really should be (type utf8). To do otherwise would be misleading.
> 
> For example, g_utf8_validate() explicitly returns an error if called with
> (max_len > 0) on a string which contains an embedded nul.
> 
> I think in this case, the bytes_written argument is a convenience to avoid
> having to call strlen() on the return value. It’s not intended to support
> embedded nuls. (Unless you have evidence to the contrary?)
> 
> In the UTF-8 input path, for example (if get_filename_charset() returns
> true), I think g_filename_to_utf8() will return an error if called on an
> opsysstring with an embedded nul.

This is true, but the g_convert() path might in theory return a string with embedded nuls. The fact that it never does is more the result of most (all?) known filesystems prohibiting nul characters in file names than any runtime check.

> @@ +1173,3 @@
>  /**
>   * g_filename_from_utf8:
> + * @utf8string:    (array length=len) (element-type guint8):
> 
> Following the example of g_strndup(), I don’t think it’s necessary to
> mandatorily bind len to utf8string here. Just leave it as (type utf8), and
> the len argument can be used optionally to limit the input to less than
> strlen(utf8string).

g_strndup() has safe-ish behavior when the len argument is larger than the length of the string. The g_convert() path would read past the end of the input buffer in this case, so we shouldn't expose len to the bindings. Should I skip the len argument too, then, or add a validating assertion on the input of g_filename_from_utf()?

Comment 13 Mikhail Zabaluev 2018-01-12 20:11:05 UTC

(In reply to Mikhail Zabaluev from comment #12)
> Indeed. The convenience functions could be annotated with the constrained
> encoding types, at least on input. Should there be @len: (skip), then?

Ignore this; I've been walking in circles.
Until there is a way to tell the bindings to pass -1, we can't skip the len argument.

Comment 14 Mikhail Zabaluev 2018-01-14 17:57:06 UTC

(In reply to Philip Withnall from comment #11)
> Aside from these annotations changes, I think it would help if
> g_filename_[to|from]_utf8() were clarified about their behaviour for
> embedded nuls. I think the right thing to do is disallow embedded nuls in
> the UTF-8, but from reading the code, I think this is only checked on one
> out of two code paths — and it’s not mentioned in the documentation at all.
> 
> Would you be able to put together another patch which tidies up handling of
> embedded nuls there, adds some tests, and mentions it in the documentation
> please?

I've filed bug 792516 with the proposed changes.

Comment 15 Matthias Clasen 2018-01-16 18:22:41 UTC

(In reply to Philip Withnall from comment #11)
> Aside from these annotations changes, I think it would help if
> g_filename_[to|from]_utf8() were clarified about their behaviour for
> embedded nuls. I think the right thing to do is disallow embedded nuls in
> the UTF-8, but from reading the code, I think this is only checked on one
> out of two code paths — and it’s not mentioned in the documentation at all.
> 
> Would you be able to put together another patch which tidies up handling of
> embedded nuls there, adds some tests, and mentions it in the documentation
> please?
> 
> Matthias, do you have any input here?

Not really, no. I'm fine with not allowing embedded nuls in input here.

And I think I'm also ok with a white lie for the output side and just claim that embedded nuls don't happen. We could add an extra paranoid mode that does a runtime check for this before returning

Comment 16 Philip Withnall 2018-01-17 11:10:54 UTC

(In reply to Mikhail Zabaluev from comment #14)
> (In reply to Philip Withnall from comment #11)
> > Aside from these annotations changes, I think it would help if
> > g_filename_[to|from]_utf8() were clarified about their behaviour for
> > embedded nuls. I think the right thing to do is disallow embedded nuls in
> > the UTF-8, but from reading the code, I think this is only checked on one
> > out of two code paths — and it’s not mentioned in the documentation at all.
> > 
> > Would you be able to put together another patch which tidies up handling of
> > embedded nuls there, adds some tests, and mentions it in the documentation
> > please?
> 
> I've filed bug 792516 with the proposed changes.

Thanks a lot, the patches there are good. I’ll come back to *this* bug once we’ve sorted those ones out.

Comment 17 Philip Withnall 2018-01-24 17:56:05 UTC

Mikhail, I think this patch now needs updating since the changes in bug #792516 to disallow embedded nuls in various places. If you still need to specify len as -1 (and the (default -1) annotation doesn’t work), then we could at least get the rest of the changes in and defer the len stuff to a gobject-introspection task for supporting an annotation for that (bug #558620).

Comment 18 Mikhail Zabaluev 2018-02-08 21:42:05 UTC

Created attachment 368167 [details] [review]
gconvert: Correctly annotate string types and output parameters

Note that the g_convert() API works with byte arrays. It's wrong to
default to utf8 there, because iconv can read and produce strings with
interior nul characters which are not allowed in (type utf8).
The documentation was misleading about that in some places, so that got
corrected as well.

Strings in the locale encoding are annotated as dynamic-length byte
arrays because they don't have any guaranteed format and can contain
nul bytes. For UTF-8 strings in g_*_{from,to}_utf8(), GLib assumes
no embedded nul bytes and the (type utf8) annotations on the UTF-8
parameters and return values remain as they were. Likewise for
(type filename).

Comment 19 Mikhail Zabaluev 2018-02-08 21:42:34 UTC

Created attachment 368168 [details] [review]
gconvert: g_filename_from_utf8() returns (type filename)

The existing array annotation is inconsistent with the other
conversion functions. Now that the implementation guarantees
no embedded NULs, the return value can be re-annotated.

Comment 20 Philip Withnall 2018-02-09 13:00:26 UTC

Review of attachment 368168 [details] [review]:

++

Comment 21 Philip Withnall 2018-02-09 13:05:09 UTC

Review of attachment 368167 [details] [review]:

++, thanks

Comment 22 Philip Withnall 2018-02-09 13:06:20 UTC

All pushed, thanks.

Attachment 368167 [details] pushed as 8a93e2d - gconvert: Correctly annotate string types and output parameters
Attachment 368168 [details] pushed as 7f1fd24 - gconvert: g_filename_from_utf8() returns (type filename)