GNOME Bugzilla – Bug 756128
Fix up annotations in gconvert
Last modified: 2018-02-09 13:06:30 UTC
The API in gconvert is poorly annotated and needs love.
Created attachment 312740 [details] [review] gconvert: Skip the GIConv API from introspection It's ugly: - The core method, g_iconv(), can't be annotated with good semantics. - The error value of g_iconv_open() is not representable in today's introspection.
Created attachment 312741 [details] [review] gconvert: Correctly annotate string types and output parameters Note that the iconv wrapper API works with byte arrays. It's wrong to default to utf8 there, because iconv can read and produce strings with interior nul characters which are not allowed in (type utf8). The documentation in places was misleading about that, so that got corrected as well.
Review of attachment 312741 [details] [review]: ::: glib/gconvert.c @@ +356,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the Isn't bytes_written also (out) (optional) in all cases? It seems that there are appropriate checks to handle it being NULL in all the functions here. @@ +364,3 @@ * * Note that you should use g_iconv() for streaming conversions. + * Despite the fact that @bytes_read can return information about partial Mmm, typos. :-)
Review of attachment 312740 [details] [review]: Ouch, yes.
Comment on attachment 312740 [details] [review] gconvert: Skip the GIConv API from introspection Attachment 312740 [details] pushed as 7759542 - gconvert: Skip the GIConv API from introspection
Review of attachment 312741 [details] [review]: ::: glib/gconvert.c @@ +356,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the Indeed, it’s (optional). @@ +511,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the output buffer (not Also (optional). @@ +593,3 @@ * less than @len if there were partial characters * at the end of the input. + * @bytes_written: (out): the number of bytes stored in the output buffer (not Also (optional). @@ +885,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the Also (optional). @@ +929,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the output buffer (not Also (optional). @@ +980,3 @@ /** * g_get_filename_charsets: + * @charsets: (out) (transfer container) (array zero-terminated=1): Seems like the parameter is actually called filename_charsets in the function definition (but not the declaration). Would be good to make those consistent. I also think it’s (transfer none) rather than (transfer container). @@ +1122,3 @@ /** * g_filename_to_utf8: + * @opsysstring: (array length=len) (element-type guint8): a string in the Instead of (element-type guint8) this should be (type filename). (This change was already made in commit 41013a01f4.) @@ +1136,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the output buffer (not Also (optional). @@ +1146,3 @@ * the [current locale][setlocale]. * + * Returns: (array length=bytes_written) (element-type guint8) (transfer full): This shouldn’t be (element-type guint8) since it’s a UTF-8 string — if adding (element-type) means this is no longer treated as (type utf8), I think this change is wrong (even if it does make the (array length) explicit). Similarly for some of the other (element-type guint8) annotations on UTF-8 strings in this patch. @@ +1210,3 @@ * stored will the byte offset after the last valid * input sequence. + * @bytes_written: (out): the number of bytes stored in the output buffer (not Also (optional).
(In reply to Philip Withnall from comment #6) > Review of attachment 312741 [details] [review] [review]: > > @@ +1122,3 @@ > /** > * g_filename_to_utf8: > + * @opsysstring: (array length=len) (element-type guint8): a string in the > > Instead of (element-type guint8) this should be (type filename). (This > change was already made in commit 41013a01f4.) Is there a way to annotate that the len parameter is always passed -1 instead of being exposed as a parameter by the bindings? This is needed if the opsysstring parameter is annotated as a NUL-terminated (type filename). (skip) (default -1) would seem like it, if it can be made to work in all bindings; I can't find this combination in use anywhere in glib source tree, so I assume it's not supported currently. > @@ +1146,3 @@ > * the [current locale][setlocale]. > * > + * Returns: (array length=bytes_written) (element-type guint8) (transfer > full): > > This shouldn’t be (element-type guint8) since it’s a UTF-8 string — if > adding (element-type) means this is no longer treated as (type utf8), I > think this change is wrong (even if it does make the (array length) > explicit). > > Similarly for some of the other (element-type guint8) annotations on UTF-8 > strings in this patch. As I tried to explain in the commit's comment, iconv can produce strings with interior NUL bytes. This is not conformant to (type utf8) and may result in data loss or corruption in the code receiving the string, even though it's perfectly valid UTF-8 as per the Unicode standard. The input parameters can indeed be annotated (type utf8), but this will narrow the set of possible inputs: the functions can also convert strings containing interior NULs. I believe iconv validates the input in all supported implementations (or at least it should by specification), so restricting the input with static typing or binding-imposed run time validation seems excessive.
(In reply to Mikhail Zabaluev from comment #7) > The input parameters can indeed be annotated (type utf8) Correction: they could, if not for the issue with the len parameter as mentioned above.
Created attachment 366399 [details] [review] gconvert: Correctly annotate string types and output parameters Note that the iconv wrapper API works with byte arrays. It's wrong to default to utf8 there, because iconv can read and produce strings with interior nul characters which are not allowed in (type utf8). The documentation was misleading about that in some places, so that got corrected as well.
Review of attachment 366399 [details] [review]: ::: glib/gconvert.c @@ +1125,3 @@ /** * g_filename_to_utf8: + * @opsysstring: (array length=len) (element-type guint8): a string in the I think we use (type filename) in enough places which don’t have a len argument that your filename encoding basically cannot have embedded nuls if it’s going to be used with GLib. However, since there is a len parameter here, we probably *should* bind it to opsysstring, so given your explanation, this change looks OK to me. @@ +1149,3 @@ * the [current locale][setlocale]. * + * Returns: (array length=bytes_written) (element-type guint8) (transfer full): GLib assumes that UTF-8 encoded strings *don’t* contain embedded nuls, so I think this really should be (type utf8). To do otherwise would be misleading. For example, g_utf8_validate() explicitly returns an error if called with (max_len > 0) on a string which contains an embedded nul. I think in this case, the bytes_written argument is a convenience to avoid having to call strlen() on the return value. It’s not intended to support embedded nuls. (Unless you have evidence to the contrary?) In the UTF-8 input path, for example (if get_filename_charset() returns true), I think g_filename_to_utf8() will return an error if called on an opsysstring with an embedded nul. @@ +1173,3 @@ /** * g_filename_from_utf8: + * @utf8string: (array length=len) (element-type guint8): Following the example of g_strndup(), I don’t think it’s necessary to mandatorily bind len to utf8string here. Just leave it as (type utf8), and the len argument can be used optionally to limit the input to less than strlen(utf8string).
Aside from these annotations changes, I think it would help if g_filename_[to|from]_utf8() were clarified about their behaviour for embedded nuls. I think the right thing to do is disallow embedded nuls in the UTF-8, but from reading the code, I think this is only checked on one out of two code paths — and it’s not mentioned in the documentation at all. Would you be able to put together another patch which tidies up handling of embedded nuls there, adds some tests, and mentions it in the documentation please? Matthias, do you have any input here?
(In reply to Philip Withnall from comment #10) > Review of attachment 366399 [details] [review] [review]: > > ::: glib/gconvert.c > @@ +1125,3 @@ > /** > * g_filename_to_utf8: > + * @opsysstring: (array length=len) (element-type guint8): a string in the > > I think we use (type filename) in enough places which don’t have a len > argument that your filename encoding basically cannot have embedded nuls if > it’s going to be used with GLib. Indeed. The convenience functions could be annotated with the constrained encoding types, at least on input. Should there be @len: (skip), then? > @@ +1149,3 @@ > * the [current locale][setlocale]. > * > + * Returns: (array length=bytes_written) (element-type guint8) (transfer > full): > > GLib assumes that UTF-8 encoded strings *don’t* contain embedded nuls, so I > think this really should be (type utf8). To do otherwise would be misleading. > > For example, g_utf8_validate() explicitly returns an error if called with > (max_len > 0) on a string which contains an embedded nul. > > I think in this case, the bytes_written argument is a convenience to avoid > having to call strlen() on the return value. It’s not intended to support > embedded nuls. (Unless you have evidence to the contrary?) > > In the UTF-8 input path, for example (if get_filename_charset() returns > true), I think g_filename_to_utf8() will return an error if called on an > opsysstring with an embedded nul. This is true, but the g_convert() path might in theory return a string with embedded nuls. The fact that it never does is more the result of most (all?) known filesystems prohibiting nul characters in file names than any runtime check. > @@ +1173,3 @@ > /** > * g_filename_from_utf8: > + * @utf8string: (array length=len) (element-type guint8): > > Following the example of g_strndup(), I don’t think it’s necessary to > mandatorily bind len to utf8string here. Just leave it as (type utf8), and > the len argument can be used optionally to limit the input to less than > strlen(utf8string). g_strndup() has safe-ish behavior when the len argument is larger than the length of the string. The g_convert() path would read past the end of the input buffer in this case, so we shouldn't expose len to the bindings. Should I skip the len argument too, then, or add a validating assertion on the input of g_filename_from_utf()?
(In reply to Mikhail Zabaluev from comment #12) > Indeed. The convenience functions could be annotated with the constrained > encoding types, at least on input. Should there be @len: (skip), then? Ignore this; I've been walking in circles. Until there is a way to tell the bindings to pass -1, we can't skip the len argument.
(In reply to Philip Withnall from comment #11) > Aside from these annotations changes, I think it would help if > g_filename_[to|from]_utf8() were clarified about their behaviour for > embedded nuls. I think the right thing to do is disallow embedded nuls in > the UTF-8, but from reading the code, I think this is only checked on one > out of two code paths — and it’s not mentioned in the documentation at all. > > Would you be able to put together another patch which tidies up handling of > embedded nuls there, adds some tests, and mentions it in the documentation > please? I've filed bug 792516 with the proposed changes.
(In reply to Philip Withnall from comment #11) > Aside from these annotations changes, I think it would help if > g_filename_[to|from]_utf8() were clarified about their behaviour for > embedded nuls. I think the right thing to do is disallow embedded nuls in > the UTF-8, but from reading the code, I think this is only checked on one > out of two code paths — and it’s not mentioned in the documentation at all. > > Would you be able to put together another patch which tidies up handling of > embedded nuls there, adds some tests, and mentions it in the documentation > please? > > Matthias, do you have any input here? Not really, no. I'm fine with not allowing embedded nuls in input here. And I think I'm also ok with a white lie for the output side and just claim that embedded nuls don't happen. We could add an extra paranoid mode that does a runtime check for this before returning
(In reply to Mikhail Zabaluev from comment #14) > (In reply to Philip Withnall from comment #11) > > Aside from these annotations changes, I think it would help if > > g_filename_[to|from]_utf8() were clarified about their behaviour for > > embedded nuls. I think the right thing to do is disallow embedded nuls in > > the UTF-8, but from reading the code, I think this is only checked on one > > out of two code paths — and it’s not mentioned in the documentation at all. > > > > Would you be able to put together another patch which tidies up handling of > > embedded nuls there, adds some tests, and mentions it in the documentation > > please? > > I've filed bug 792516 with the proposed changes. Thanks a lot, the patches there are good. I’ll come back to *this* bug once we’ve sorted those ones out.
Mikhail, I think this patch now needs updating since the changes in bug #792516 to disallow embedded nuls in various places. If you still need to specify len as -1 (and the (default -1) annotation doesn’t work), then we could at least get the rest of the changes in and defer the len stuff to a gobject-introspection task for supporting an annotation for that (bug #558620).
Created attachment 368167 [details] [review] gconvert: Correctly annotate string types and output parameters Note that the g_convert() API works with byte arrays. It's wrong to default to utf8 there, because iconv can read and produce strings with interior nul characters which are not allowed in (type utf8). The documentation was misleading about that in some places, so that got corrected as well. Strings in the locale encoding are annotated as dynamic-length byte arrays because they don't have any guaranteed format and can contain nul bytes. For UTF-8 strings in g_*_{from,to}_utf8(), GLib assumes no embedded nul bytes and the (type utf8) annotations on the UTF-8 parameters and return values remain as they were. Likewise for (type filename).
Created attachment 368168 [details] [review] gconvert: g_filename_from_utf8() returns (type filename) The existing array annotation is inconsistent with the other conversion functions. Now that the implementation guarantees no embedded NULs, the return value can be re-annotated.
Review of attachment 368168 [details] [review]: ++
Review of attachment 368167 [details] [review]: ++, thanks
All pushed, thanks. Attachment 368167 [details] pushed as 8a93e2d - gconvert: Correctly annotate string types and output parameters Attachment 368168 [details] pushed as 7f1fd24 - gconvert: g_filename_from_utf8() returns (type filename)