Bug 788863 – Add more filename type annotations for strings which can contain filenames

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 788863 - Add more filename type annotations for strings which can contain filenames


Summary:	Add more filename type annotations for strings which can contain filenames


Status:	RESOLVED OBSOLETE

Product:	glib
Classification:	Platform
Component:	introspection
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2017-10-12 08:50 UTC by Christoph Reiter (lazka)
Modified:	2018-05-24 19:49 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Add more filename type annotations for strings which can contain filenames. (21.07 KB, patch) 2017-10-12 08:50 UTC, Christoph Reiter (lazka)	committed	Details \| Review

Description Christoph Reiter (lazka) 2017-10-12 08:50:37 UTC

Created attachment 361406 [details] [review]
Add more filename type annotations for strings which can  contain filenames.

This continues the changes done in bug 767245

This makes it possible to pass Python path types as process arguments and env vars
in PyGObject and and makes it clear that the values are not strictly utf-8 and need
to be validated/converted first.

Comment 1 Philip Withnall 2017-10-12 12:27:01 UTC

Review of attachment 361406 [details] [review]:

::: gio/gappinfo.c
@@ +988,3 @@
  * g_app_launch_context_setenv:
  * @context: a #GAppLaunchContext
+ * @variable: (type filename): the environment variable to set

I’m not happy about using (type filename) for *all* types of non-UTF-8 string-like data. Can we introduce a new (type) to represent nul-terminated strings in arbitrary encodings? It would probably end up being an alias for (type filename).

Are there systems where the file system encoding is different from the argv or env encoding?

Comment 2 Christoph Reiter (lazka) 2017-10-12 14:30:43 UTC

(In reply to Philip Withnall from comment #1)
> Review of attachment 361406 [details] [review] [review]:
> 
> ::: gio/gappinfo.c
> @@ +988,3 @@
>   * g_app_launch_context_setenv:
>   * @context: a #GAppLaunchContext
> + * @variable: (type filename): the environment variable to set
> 
> I’m not happy about using (type filename) for *all* types of non-UTF-8
> string-like data. Can we introduce a new (type) to represent nul-terminated
> strings in arbitrary encodings? It would probably end up being an alias for
> (type filename).

We could, (rust has a type called "osstring"). But I'm wondering if it's worth it as these are only a handful of cases in glib/gio and unlikely to be needed anywhere else.

> Are there systems where the file system encoding is different from the argv
> or env encoding?

In glib argv/env have multiple encodings, depending on whether a contained value is a filename (g_filename_*) or a string (g_locale_*). In Python 3 they use the same encoding.

Comment 3 Philip Withnall 2017-10-14 08:25:57 UTC

(In reply to Christoph Reiter (lazka) from comment #2)
> (In reply to Philip Withnall from comment #1)
> > Review of attachment 361406 [details] [review] [review] [review]:
> > 
> > ::: gio/gappinfo.c
> > @@ +988,3 @@
> >   * g_app_launch_context_setenv:
> >   * @context: a #GAppLaunchContext
> > + * @variable: (type filename): the environment variable to set
> > 
> > I’m not happy about using (type filename) for *all* types of non-UTF-8
> > string-like data. Can we introduce a new (type) to represent nul-terminated
> > strings in arbitrary encodings? It would probably end up being an alias for
> > (type filename).
> 
> We could, (rust has a type called "osstring"). But I'm wondering if it's
> worth it as these are only a handful of cases in glib/gio and unlikely to be
> needed anywhere else.

I was going to say: I’m worried about setting a precedent for annotating arbitrary non-UTF-8 strings as (type filename) when (type filename) should strictly mean ‘string in the file system encoding’, which is not necessarily UTF-8, but is consistent and well-defined.

But then I double-checked the annotation glossary, and it looks like last year you defined (type filename) to be:
> The filename type represents an utf-8 string on Windows and a zero terminated guint8 array on Unix. It should be used for filenames, environment variables and process arguments. 

Did you make that change as a result of some discussion or consensus somewhere? Nobody seems to have complained since then, so actually this patch is probably totally fine.

Comment 4 Philip Withnall 2017-10-14 08:29:21 UTC

Review of attachment 361406 [details] [review]:

Looks good to me.

Comment 5 Philip Withnall 2017-10-14 08:29:50 UTC

Rico, do you have any input here?

Comment 6 Christoph Reiter (lazka) 2017-10-14 09:49:54 UTC

(In reply to Philip Withnall from comment #3)
> (In reply to Christoph Reiter (lazka) from comment #2)
> > (In reply to Philip Withnall from comment #1)
> > > Review of attachment 361406 [details] [review] [review] [review] [review]:
> > > 
> > > ::: gio/gappinfo.c
> > > @@ +988,3 @@
> > >   * g_app_launch_context_setenv:
> > >   * @context: a #GAppLaunchContext
> > > + * @variable: (type filename): the environment variable to set
> > > 
> > > I’m not happy about using (type filename) for *all* types of non-UTF-8
> > > string-like data. Can we introduce a new (type) to represent nul-terminated
> > > strings in arbitrary encodings? It would probably end up being an alias for
> > > (type filename).
> > 
> > We could, (rust has a type called "osstring"). But I'm wondering if it's
> > worth it as these are only a handful of cases in glib/gio and unlikely to be
> > needed anywhere else.
> 
> I was going to say: I’m worried about setting a precedent for annotating
> arbitrary non-UTF-8 strings as (type filename) when (type filename) should
> strictly mean ‘string in the file system encoding’, which is not necessarily
> UTF-8, but is consistent and well-defined.
> 
> But then I double-checked the annotation glossary, and it looks like last
> year you defined (type filename) to be:
> > The filename type represents an utf-8 string on Windows and a zero terminated guint8 array on Unix. It should be used for filenames, environment variables and process arguments. 
> 
> Did you make that change as a result of some discussion or consensus
> somewhere? Nobody seems to have complained since then, so actually this
> patch is probably totally fine.

I added the filename definition after asking on https://mail.gnome.org/archives/gir-devel-list/2016-June/msg00000.html and the linked bug reports there. I was hoping to get some input from other binding maintainers there, but didn't (although Jasper did some work on gjs I think). So the fact that nobody complained might just indicate that nobody cares.

I agree that it's a bit of a hijack of the "filename" annotation, but I think it's an improvement over the status quo for bindings.

Comment 7 Philip Withnall 2017-10-16 10:10:04 UTC

(In reply to Christoph Reiter (lazka) from comment #6)
> I added the filename definition after asking on
> https://mail.gnome.org/archives/gir-devel-list/2016-June/msg00000.html and
> the linked bug reports there. I was hoping to get some input from other
> binding maintainers there, but didn't (although Jasper did some work on gjs
> I think). So the fact that nobody complained might just indicate that nobody
> cares.
> 
> I agree that it's a bit of a hijack of the "filename" annotation, but I
> think it's an improvement over the status quo for bindings.

Thanks for that. Given that this is going to affect other bindings in different ways (since I guess they all potentially handle filename/argv/env encoding differently), I think it would be wise to send another e-mail to the list asking for feedback from other binding maintainers. If there’s no feedback in a week, go ahead with these changes (and probably also revive the changes in bug #767266 and bug #767268).

Comment 8 Christoph Reiter (lazka) 2017-10-19 15:58:32 UTC

Thanks, done: https://mail.gnome.org/archives/gir-devel-list/2017-October/msg00000.html

Comment 9 Christoph Reiter (lazka) 2017-10-26 16:53:03 UTC

Comment on attachment 361406 [details] [review]
Add more filename type annotations for strings which can  contain filenames.

Pushed to master: https://git.gnome.org/browse/glib/commit/?id=fed574a0c82e26187358a93c649db084b7727db5

Comment 10 Christoph Reiter (lazka) 2017-10-26 16:53:35 UTC

Thanks!

Comment 11 Sebastian Dröge (slomo) 2018-04-29 07:19:13 UTC

This change seems kind of suboptimal. While e.g. environment variables can contain filenames, they are not filenames. Bindings (e.g. the Rust ones, probably the C# ones too) would now map these strings to their internal filename/path abstractions, and that generally does not make sense: not every such string is a valid path or even supposed to be used as one.

It's also API breakage for such bindings, which I thought GLib guarantees to not do for GIR too unless something is simply completely broken.


It would make more sense to introduce a new string type to GIR to make the actual meaning more clear. From what I understand the intention here is that these are not valid (UTF8!) strings but in the encoding of the OS, whatever that is.

I'd prefer for this to be reverted for the time being until a proper solution is found.


See also
https://github.com/gtk-rs/gir-files/pull/20#issuecomment-385187692
https://github.com/gtk-rs/gir/issues/592

Comment 12 Christoph Reiter (lazka) 2018-04-29 09:33:31 UTC

(In reply to Sebastian Dröge (slomo) from comment #11)
> This change seems kind of suboptimal. While e.g. environment variables can
> contain filenames, they are not filenames. Bindings (e.g. the Rust ones,
> probably the C# ones too) would now map these strings to their internal
> filename/path abstractions, and that generally does not make sense: not
> every such string is a valid path or even supposed to be used as one.

I tried to extend the meaning of the annotation to all values which can potentially contain filenames and are thus not strictly utf-8.

> It's also API breakage for such bindings, which I thought GLib guarantees to
> not do for GIR too unless something is simply completely broken.

That's why I proposed these changed on the first day of the last cycle and one year after the first annotation changes for which I didn't get any complains.

> It would make more sense to introduce a new string type to GIR to make the
> actual meaning more clear. From what I understand the intention here is that
> these are not valid (UTF8!) strings but in the encoding of the OS, whatever
> that is.

yeah, I didn't expect that anyone would depend on "filename means path" as one year ago many filenames were still annotated as text :/

> I'd prefer for this to be reverted for the time being until a proper
> solution is found.

ok :(

Comment 13 Sebastian Dröge (slomo) 2018-04-29 09:51:39 UTC

(In reply to Christoph Reiter (lazka) from comment #12)

> > It's also API breakage for such bindings, which I thought GLib guarantees to
> > not do for GIR too unless something is simply completely broken.
> 
> That's why I proposed these changed on the first day of the last cycle and
> one year after the first annotation changes for which I didn't get any
> complains.

Sorry for not noticing this earlier!

The code-generated bindings (C++, C#, Rust, etc) are usually getting the updates whenever someone decides to update the .gir files they're generated from. So compared to dynamic bindings like the GJS/Python ones, it usually takes a while until there's an actual reason or someone is bored :)

I'm not sure how we can prevent such things in the future, having that in the final GLib release notes of the stable release for example would've been too late too. But generally, similar cases of incompatible annotation changes were rejected due to possibly breaking other bindings even if nobody knew if it actually would break anything.

Having a separate type annotation for this would also be beneficial for at least the Rust bindings, probably also others that have filename/path API and a strong string API with a defined character encoding. The new type could be handled separate from strings and filename/paths, as it should be.


Of course the previous situation was also broken as you could've gotten an supposedly UTF8 string which was not actually valid UTF8. But this now changed from bad to worse: making previously working usage (when you actually got an UTF8 string) impossible* as they are not also valid filenames/paths usually, and making it impossible* to use non-path non-UTF8 strings (because they're usually not valid paths).

* impossible without hacks, of course everything is *possible* :)


For the Rust bindings we're most likely going to add an override configuration now to change between filename, utf8 and a third new type but it would be good to have this solved properly in gobject-introspection.

Comment 14 Philip Withnall 2018-04-30 10:54:44 UTC

(In reply to Sebastian Dröge (slomo) from comment #11)
> It would make more sense to introduce a new string type to GIR to make the
> actual meaning more clear. From what I understand the intention here is that
> these are not valid (UTF8!) strings but in the encoding of the OS, whatever
> that is.

Yes. The documentation of (filename) on the wiki is correct: it represents all byte arrays in the OS character encoding (which is not necessarily UTF-8, and might differ e.g. between the file system encoding and locale encoding): https://wiki.gnome.org/Projects/GObjectIntrospection/Annotations#Default_Basic_Types

> I'd prefer for this to be reverted for the time being until a proper
> solution is found.

Given that any solution is fairly likely to result in these annotations staying the same or changing to some new `(not-utf8)` annotation, I’d rather we discuss the options a bit first before jumping to reverting things.

These changes have been in place for a while now, and were also discussed on the mailing list: https://mail.gnome.org/archives/gir-devel-list/2017-October/msg00000.html

(In reply to Sebastian Dröge (slomo) from comment #13)
> The code-generated bindings (C++, C#, Rust, etc) are usually getting the
> updates whenever someone decides to update the .gir files they're generated
> from. So compared to dynamic bindings like the GJS/Python ones, it usually
> takes a while until there's an actual reason or someone is bored :)

Regenerate them as a Jenkins/CI process, maybe? Would that fit in with the existing process/tooling?

> Having a separate type annotation for this would also be beneficial for at
> least the Rust bindings, probably also others that have filename/path API
> and a strong string API with a defined character encoding. The new type
> could be handled separate from strings and filename/paths, as it should be.

Sure, do you have any suggestions? I suspect we should come up with an outline proposal here, and then move to the gir-devel mailing list so that other bindings authors are involved.

I’m wondering about:
 • Keep (utf8)
 • Deprecate (filename) because its meaning is overloaded
 • Add (path) to indicate a byte array which is in the OS file system encoding, and which represents a path or filename
 • Add (raw) to indicate a byte array which is in a different encoding
 • Allow (raw encoding=blah) to specify an encoding; encoding types could cover things like ‘LC_DATE’, ‘any’, ‘ISO-8859-1’, ‘filesystem’
  - So (utf8) is a synonym for (raw encoding=UTF-8)
  - And (path) is a synonym for (raw encoding=filesystem), with the additional semantic that it definitely refers to a file system path/filename

The downside of this is that there’s no nice way of specifying the encoding for array elements. For example, (array) (element-type raw). There’s currently no good syntax for specifying parameters of keywords in an (element-type) annotation.

Comment 15 Sebastian Dröge (slomo) 2018-05-10 07:00:28 UTC

(In reply to Philip Withnall from comment #14)

> > I'd prefer for this to be reverted for the time being until a proper
> > solution is found.
> 
> Given that any solution is fairly likely to result in these annotations
> staying the same or changing to some new `(not-utf8)` annotation, I’d rather
> we discuss the options a bit first before jumping to reverting things.

Agreed

> (In reply to Sebastian Dröge (slomo) from comment #13)
> > The code-generated bindings (C++, C#, Rust, etc) are usually getting the
> > updates whenever someone decides to update the .gir files they're generated
> > from. So compared to dynamic bindings like the GJS/Python ones, it usually
> > takes a while until there's an actual reason or someone is bored :)
> 
> Regenerate them as a Jenkins/CI process, maybe? Would that fit in with the
> existing process/tooling?

It would be suboptimal. For any new API or annotation changes, you need to double-check if the generated bindings are sensible and safe still, and potentially add (mostly language-specific) overrides. GIR is generally not providing sufficient and/or correct information for generating bindings automatically (see Vala, C#, Rust having their own override system, and even dynamically generated bindings like Python).

> > Having a separate type annotation for this would also be beneficial for at
> > least the Rust bindings, probably also others that have filename/path API
> > and a strong string API with a defined character encoding. The new type
> > could be handled separate from strings and filename/paths, as it should be.
> 
> Sure, do you have any suggestions? I suspect we should come up with an
> outline proposal here, and then move to the gir-devel mailing list so that
> other bindings authors are involved.
> 
> I’m wondering about:
>  • Keep (utf8)
>  • Deprecate (filename) because its meaning is overloaded
>  • Add (path) to indicate a byte array which is in the OS file system
> encoding, and which represents a path or filename
>  • Add (raw) to indicate a byte array which is in a different encoding
>  • Allow (raw encoding=blah) to specify an encoding; encoding types could
> cover things like ‘LC_DATE’, ‘any’, ‘ISO-8859-1’, ‘filesystem’
>   - So (utf8) is a synonym for (raw encoding=UTF-8)
>   - And (path) is a synonym for (raw encoding=filesystem), with the
> additional semantic that it definitely refers to a file system path/filename

That was also the solution I was thinking of, sounds good to me as such.

> The downside of this is that there’s no nice way of specifying the encoding
> for array elements. For example, (array) (element-type raw). There’s
> currently no good syntax for specifying parameters of keywords in an
> (element-type) annotation.

One could add an element-type-encoding annotation for the time being, until (if ever) GIR can be updated with incompatible changes.

Comment 16 Sebastian Dröge (slomo) 2018-05-10 07:01:04 UTC

Also the problem with anything here will be that all dynamically generated bindings will stop working until updated. New things would be added to the typelib and they won't understand those.

Comment 17 Philip Withnall 2018-05-16 09:52:00 UTC

(In reply to Sebastian Dröge (slomo) from comment #15)
> (In reply to Philip Withnall from comment #14)
> > I’m wondering about:
> >  • Keep (utf8)
> >  • Deprecate (filename) because its meaning is overloaded
> >  • Add (path) to indicate a byte array which is in the OS file system
> > encoding, and which represents a path or filename
> >  • Add (raw) to indicate a byte array which is in a different encoding
> >  • Allow (raw encoding=blah) to specify an encoding; encoding types could
> > cover things like ‘LC_DATE’, ‘any’, ‘ISO-8859-1’, ‘filesystem’
> >   - So (utf8) is a synonym for (raw encoding=UTF-8)
> >   - And (path) is a synonym for (raw encoding=filesystem), with the
> > additional semantic that it definitely refers to a file system path/filename
> 
> That was also the solution I was thinking of, sounds good to me as such.

Do you want to propose it on the mailing list then? I don’t have enough bandwidth to lead on this, sorry.

> > The downside of this is that there’s no nice way of specifying the encoding
> > for array elements. For example, (array) (element-type raw). There’s
> > currently no good syntax for specifying parameters of keywords in an
> > (element-type) annotation.
> 
> One could add an element-type-encoding annotation for the time being, until
> (if ever) GIR can be updated with incompatible changes.

That’s not great, but I can’t come up with anything better, and it’s no worse than the whole (element-type) situation anyway (which similarly doesn’t allow nested types).

(In reply to Sebastian Dröge (slomo) from comment #16)
> Also the problem with anything here will be that all dynamically generated
> bindings will stop working until updated. New things would be added to the
> typelib and they won't understand those.

I guess we have a choice between:
 1. revert these (filename) changes and break bindings which already incorporate them, but make transition to a new (raw)/(path) annotation easier; or
 2. don’t revert the (filename) changes, and go straight to using (raw)/(path), which will break bindings which don’t suppose those yet.

Neither is great. This should be discussed on the mailing list. I suspect the right approach depends a lot on what various bindings do to distinguish (utf8) and (filename) — if most bindings treat them the same, then option 2 would definitely be better.

Comment 18 GNOME Infrastructure Team 2018-05-24 19:49:48 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/1293.