Bug 694735 – giscanner: give error when utf8 strings are annotated with length argument

Summary:	giscanner: give error when utf8 strings are annotated with length argument


Status:	RESOLVED OBSOLETE

Product:	gobject-introspection
Classification:	Platform
Component:	g-ir-scanner
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gobject-introspection Maintainer(s)
QA Contact:	gobject-introspection Maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-02-26 11:42 UTC by Simon Feltman
Modified:	2018-02-08 12:20 UTC

See Also:	622123 686263 686264 694448
GNOME target:	---
GNOME version:	---

Description Simon Feltman 2013-02-26 11:42:27 UTC

There have been a number of occasions where APIs either take strings with a length arg or return them. The following example came from bug 694448:

/**
 * secret_value_get:
 * @value: the value
 * @length: (out): the length of the secret
 *
 * Returns: (array length=length): the secret data
 */
const gchar *
secret_value_get (SecretValue *value, gsize *length)


This creates the following GIR:

      <method name="get" c:identifier="secret_value_get">
        <return-value transfer-ownership="none">
          <doc xml:whitespace="preserve">the secret data</doc>
          <array length="0" zero-terminated="0" c:type="gchar*">
            <type name="utf8" c:type="gchar"/>
          </array>
        </return-value>


The problem here is that language bindings will interpret this as an array of utf8 strings, based on the definition of GI_TYPE_TAG_UTF8 being "UTF-8 encoded string". It is unclear whether this is a bug in the scanner output. If it is not, it would be nice to support a "char" type so we can annotate this as follows:

Returns: (array length=length) (element-type char):

There is currently a way to work around this by specifying an element-type of "uint8" which bindings will special case. But we cannot assume utf8 encoding when this is done.
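
To make the misreading concrete: the C function hands back `length` bytes, while the GIR above describes `length` string pointers. The following is a minimal sketch (the helper names are hypothetical and not part of GLib or gobject-introspection) comparing how much memory each reading expects to find behind the returned pointer.

```c
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */

/* Reading 1: what secret_value_get() actually returns, i.e. `length`
 * bytes of character data. */
static size_t
bytes_as_char_buffer (size_t length)
{
  return length * sizeof (char);
}

/* Reading 2: what a binding infers from an array whose element type is
 * utf8, i.e. `length` pointers, each to a NUL-terminated string. */
static size_t
bytes_as_string_array (size_t length)
{
  return length * sizeof (char *);
}
```

On a platform with 8-byte pointers, a 6-byte secret would be read as 48 bytes of pointers, so a binding following the second reading walks past the end of the buffer and treats arbitrary bytes as addresses.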

Comment 1 Emmanuele Bassi (:ebassi) 2013-02-26 14:05:12 UTC

the API in question is... weird, to say the least.

if it's a user string (i.e. it's UTF-8 encoded) why does the function have a length out argument? the string will be NUL terminated.

if the string is meant to have embedded NULs then you should be returning an array of uint8, not a const char*; but embedded NULs mean that the returned value is a binary blob, and not really a UTF-8 string.
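
The distinction drawn in this comment can be seen in plain C: a NUL-terminated reading stops at the first NUL byte, so a buffer with embedded NULs can only be described by its explicit length. A minimal sketch, using hypothetical helpers that are not part of any GNOME API:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes a NUL-terminated reader would see
 * in a buffer that is really `length` bytes long. */
static size_t
nul_terminated_length (const char *data, size_t length)
{
  const char *nul = memchr (data, '\0', length);

  return nul != NULL ? (size_t) (nul - data) : length;
}

/* Hypothetical helper: non-zero when a NUL-terminated reading would
 * truncate the buffer, i.e. when the data behaves like a binary blob. */
static int
has_embedded_nul (const char *data, size_t length)
{
  return nul_terminated_length (data, length) < length;
}
```

For the 5-byte buffer `"ab\0cd"`, a string reader sees only 2 bytes, which is why such data is better described as an array of uint8 than as a string.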

Comment 2 Simon Feltman 2013-02-26 21:11:39 UTC

For the method in question, the right thing ended up being to use a raw uint8 array. However, I think the original annotation should produce better GIR output or at least give a warning. The issue from the binding perspective is that we interpret this as an array of utf8 string pointers, which crashes during marshaling for obvious reasons.

There are other APIs which take input strings with an optional length arg and suffer the same interpretation problem of the annotation (gtk_builder_add_from_string, gtk_text_buffer_insert_text).

In PyGObject we can attempt to work around these APIs but it seems like it would be beneficial for other languages to at least give a warning. This will also help minimize future support load.

Comment 3 Sébastien Wilmet 2014-06-11 13:15:47 UTC

(In reply to comment #0)
> There is currently a way to work around this by specifying an element-type of
> "uint8" which bindings will special case. But we cannot assume utf8 encoding
> when this is done.

Would it work for gtk_text_buffer_insert()? Something like:

> @text: (array length=len) (element-type uint8): text in UTF-8 format
> @len: length of text in bytes, or -1

The type of @text is "const gchar *".

Changing the type of @text to "const guint8 *" is probably not possible, it would be an API break.

Comment 4 André Klapper 2015-02-07 17:10:53 UTC

[Mass-moving gobject-introspection tickets to its own Bugzilla product - see bug 708029. Mass-filter your bugmail for this message: introspection20150207 ]

Comment 5 Mikhail Zabaluev 2018-01-06 15:25:16 UTC

Discussion about byte arrays vs (type utf8) strings aside, the element type in the array is obviously wrong. Since the return value is introspected as an array, the element should be the fundamental type "gchar", not "utf8".

Comment 6 Emmanuele Bassi (:ebassi) 2018-01-06 15:53:09 UTC

(In reply to Sébastien Wilmet from comment #3)
> (In reply to comment #0)
> > There is currently a way to work around this by specifying an element-type of
> > "uint8" which bindings will special case. But we cannot assume utf8 encoding
> > when this is done.
> 
> Would it work for gtk_text_buffer_insert()? Something like:
> 
> > @text: (array length=len) (element-type uint8): text in UTF-8 format
> > @len: length of text in bytes, or -1
> 
> The type of @text is "const gchar *".

Of course not: the argument is a UTF-8 string, NUL-terminated, like every other string in GTK.

The len argument is not an out argument like the one in the description. It's the usual "I'm lazy and I don't want to call strlen(), so do it for me" argument, which can also be used to inject slices of a larger buffer.

> Changing the type of @text to "const guint8 *" is probably not possible, it
> would be an API break.

It would also be wrong; GtkTextBuffer does not work in terms of binary blobs.
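
The length convention described in this comment (a negative length means "call strlen() for me", a non-negative one selects a slice) can be sketched as follows. The helper is hypothetical and only illustrates the convention; GTK's actual implementation is not shown in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: resolve the number of bytes to use from `text`.
 * A negative `len` means the text is NUL-terminated and its length should
 * be computed; otherwise exactly `len` bytes are used, whether or not a
 * NUL follows them. */
static size_t
resolve_text_length (const char *text, int len)
{
  if (len < 0)
    return strlen (text);

  return (size_t) len;
}
```

With the text `"hello world"`, passing -1 resolves to 11 bytes, while passing 5 selects only `hello`, even though the byte after it is a space rather than a NUL.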

Comment 7 Mikhail Zabaluev 2018-01-06 16:07:16 UTC

(In reply to Emmanuele Bassi (:ebassi) from comment #6)
> Of course not: the argument is a UTF-8 string, NUL-terminated, like every
> other string in GTK.
> 
> The len argument is not an out argument like the one in the description.
> It's the usual "I'm lazy and I don't want to call strlen(), so do it for me"
> argument, which can also be used to inject slices of a larger buffer.

The issue with the len argument here and in similar APIs is that it has to be annotated for the bindings to never expose it and always pass -1 to it, if the string argument is to be passed as a NUL-terminated string. If the function can't correctly handle strings with inner NULs, there is trouble either way.

See here for more discussion, and an example of an API that can safely take
byte arrays even if it expects valid UTF-8:
https://bugzilla.gnome.org/show_bug.cgi?id=756128#c7

Comment 8 Sébastien Wilmet 2018-01-06 18:31:01 UTC

IIRC, I commented on this bug not (only) because of the gtk_text_buffer_insert() function, but because of the GtkTextBuffer::insert-text signal. GtkSourceView was calling gtk_text_buffer_insert() with a string that was *not* NUL-terminated, taking advantage of the length argument. This made the application crash when a Python plugin in gedit listened to the signal: the handler in Python did not work as expected. The workaround was to NUL-terminate the string in GtkSourceView.
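
The general idea behind such a workaround, making a NUL-terminated copy of a length-delimited slice before handing it on, can be sketched in plain C. This is an assumption about the shape of the fix, not the actual GtkSourceView change; the helper name is hypothetical, and in GLib code g_strndup() serves a similar purpose.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of `text` into a newly allocated
 * buffer and append a terminating NUL. Returns NULL if allocation fails.
 * The caller frees the result with free(). */
static char *
dup_nul_terminated (const char *text, size_t len)
{
  char *copy = malloc (len + 1);

  if (copy == NULL)
    return NULL;

  memcpy (copy, text, len);
  copy[len] = '\0';
  return copy;
}
```

A receiver that ignores the length argument and reads up to the first NUL then sees exactly the intended slice instead of running into whatever follows it in the larger buffer.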

Comment 9 Sébastien Wilmet 2018-01-11 05:10:12 UTC

Yes, see bug #726689 (pygobject issue).

Comment 10 GNOME Infrastructure Team 2018-02-08 12:20:45 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/gobject-introspection/issues/81.