Bug 746564 - Python 3 file name encoding problems
Status: RESOLVED FIXED
Product: pygobject
Classification: Bindings
Component: general
Version: Git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody's working on this now (help wanted and appreciated)
QA Contact: Python bindings maintainers
Duplicates: 764457
Depends on:
Blocks:
 
 
Reported: 2015-03-21 10:53 UTC by Christoph Reiter (lazka)
Modified: 2017-03-30 08:02 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Add support for non-utf8 file names. (12.34 KB, patch), 2016-06-04 04:31 UTC, Christoph Reiter (lazka), status: none
Add support for non-utf8 file names (12.38 KB, patch), 2016-09-03 17:24 UTC, Christoph Reiter (lazka), status: none
Add support for non-utf8 file names (11.04 KB, patch), 2016-09-19 16:49 UTC, Christoph Reiter (lazka), status: none
Add support for bytes and non-utf-8 file names. (18.35 KB, patch), 2017-03-25 13:15 UTC, Christoph Reiter (lazka), status: committed

Description Christoph Reiter (lazka) 2015-03-21 10:53:41 UTC
There are a few minor problems with Python 3 and filenames, environ and argv.

Under Python 3 they can all be represented either as "bytes" or as "str" with (optional) surrogates.

Here are some problematic cases which fail under Python 3 but work fine under Python 2:

python3 -c "from gi.repository import Gtk" $(echo -e "\xff")

FOO=$(echo -e "\xff") python3 -c "from gi.repository import GLib; GLib.get_environ()"

python3 -c "from gi.repository import Gio; Gio.File.new_for_path(b'\xff'.decode('utf8', 'surrogateescape'))"
python3 -c "from gi.repository import Gio; Gio.File.new_for_path(b'asda')"

python3 -c "from gi.repository import Gio; Gio.Resource.load(b'\xff'.decode('utf-8', 'surrogateescape'))"
python3 -c "from gi.repository import Gio; Gio.Resource.load(b'ada')"

I'm not sure what's up with the "filename" type in GI, as it's not used that much (many are marked utf-8)... but it seems like it's intended for this, so the following would be a solution:

Annotate anything representing a path, env var, argv as "filename"; allow passing in "bytes"; on encoding/decoding use the "surrogateescape" error handler and sys.getfilesystemencoding().
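
To illustrate the proposed encode/decode step, here is a minimal Python sketch. The helper names path_to_bytes and path_from_bytes are hypothetical and are not part of pygobject; the sketch only assumes the rule stated above: bytes pass through unchanged, str is converted with sys.getfilesystemencoding() and the "surrogateescape" error handler.

```python
import sys


def path_to_bytes(path):
    """Hypothetical helper: turn a Python path into bytes for a C API."""
    if isinstance(path, bytes):
        # bytes are treated as an opaque path and passed through as is
        return path
    # lone surrogates produced by "surrogateescape" turn back into raw bytes
    return path.encode(sys.getfilesystemencoding(), "surrogateescape")


def path_from_bytes(data):
    """Hypothetical helper: turn bytes returned by a C API into str."""
    # undecodable bytes are kept as lone surrogates instead of raising
    return data.decode(sys.getfilesystemencoding(), "surrogateescape")


raw = b"caf\xff"                 # not valid UTF-8
as_str = path_from_bytes(raw)    # str, possibly containing a lone surrogate
round_tripped = path_to_bytes(as_str)
```

Because both directions use the same encoding and error handler, a path that came from the file system survives the round trip even when it is not valid in the file system encoding.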

Thoughts?
Comment 1 Kai Willadsen 2015-11-13 23:14:46 UTC
I just hit this as well when attempting some Python 2 -> 3 porting. It seems like it's not possible to actually handle unicode paths correctly (i.e., as opaque bytestrings in unix-land) with current pygobject.

There are some understandable quirks, e.g. Gtk.FileChooser.set_filename() not taking bytes. However, it feels weird that Gio.File.new_for_path() also requires a unicode string, and even stranger that Gio.File.new_for_uri() does the same.

I feel like the API really should assume and pass on raw bytes, and if it *does* get a str (unicode) then it should encode it as the description suggests before handing it off to Gio.
Comment 2 Christoph Reiter (lazka) 2015-11-14 07:18:09 UTC
I'd recommend switching to "str" on Python 3 for unix paths anyway. Most Python 3 APIs give you str decoded with the surrogateescape error handler, so they can contain arbitrary data. And some new APIs like pathlib only support str.
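
A small sketch of what this looks like in practice, assuming a POSIX system (on Windows os.fsdecode uses a different error handler and the non-UTF-8 example bytes would be rejected). The file name used here is made up for illustration.

```python
import os
import pathlib

raw = b"caf\xff.txt"              # a file name that is not valid UTF-8

# os.fsdecode applies the file system encoding with "surrogateescape",
# which is what os.listdir(), sys.argv and os.environ use on POSIX.
name = os.fsdecode(raw)

# The str can be handed to str-only APIs such as pathlib ...
path = pathlib.PurePosixPath("/tmp") / name

# ... and the original bytes can still be recovered afterwards.
restored = os.fsencode(name)
restored_full = os.fsencode(str(path))
```

The str form carries the undecodable byte along as a lone surrogate, so nothing is lost by staying with str throughout.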
Comment 3 Christoph Reiter (lazka) 2016-04-01 12:12:00 UTC
*** Bug 764457 has been marked as a duplicate of this bug. ***
Comment 4 Christoph Reiter (lazka) 2016-06-04 04:31:34 UTC
Created attachment 329106
Add support for non-utf8 file names.

PY2+UNIX: Convert unicode to bytes using the glib encoding. Pass bytes in as
is. Always return bytes.

PY2+Windows: Convert unicode to utf-8. Pass bytes if they are utf-8. Return
utf-8 encoded bytes.

PY3+UNIX: Convert str using os.fsencode so that the surrogateescape handler
can restore the real path if the source was a Python API such as os.listdir,
sys.argv, etc. Pass bytes as is. Return str decoded using os.fsdecode so that
it can be passed to Python APIs such as open, os.listdir, etc.

PY3+Windows: Convert str to utf-8. Pass bytes if utf-8. Return str decoded
from utf-8.

PyUnicode_EncodeFSDefault was added in CPython 3.2 so bump the requirement.

(The Windows part is untested..)
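
The two Python 3 rows above can be modelled in pure Python as follows. This is only a sketch of the described policy, not the C code from the attachment: the function names and the windows flag are hypothetical (the flag exists so both branches can be tried on one machine), and the non-Windows branch assumes it runs on a POSIX system.

```python
import os


def filename_to_c(value, windows=False):
    """Hypothetical model: Python value -> bytes passed to the C function."""
    if isinstance(value, bytes):
        if windows:
            # only accept bytes that are valid UTF-8; raises otherwise
            value.decode("utf-8")
        return value
    if isinstance(value, str):
        if windows:
            return value.encode("utf-8")
        # restores the original bytes behind surrogateescape'd characters
        return os.fsencode(value)
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


def filename_from_c(data, windows=False):
    """Hypothetical model: bytes returned by the C function -> Python str."""
    if windows:
        return data.decode("utf-8")
    return os.fsdecode(data)


unix_round_trip = filename_to_c(filename_from_c(b"caf\xff"))
windows_bytes = filename_to_c("\u00e4", windows=True)
```

On the UNIX side every byte string survives a round trip through str, while on the Windows side of this version of the patch only UTF-8 is accepted.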
Comment 5 Christoph Reiter (lazka) 2016-06-04 16:12:40 UTC
I've filed https://bugzilla.gnome.org/show_bug.cgi?id=767245 which adds annotations for GLib and Gio. If this gets accepted I'll look into the other libs.
Comment 6 Christoph Reiter (lazka) 2016-06-05 16:31:47 UTC
gtk: bug 767266
gdk-pixbuf: bug 767267
gstreamer: bug 767268
Comment 7 Christoph Reiter (lazka) 2016-06-07 16:46:42 UTC
gtk+/gstreamer don't like the filename annotation for argv/environ.
On #gtk+ it was suggested to use null-terminated uint8 arrays instead.

We could do that and add overrides for Gdk.init() and Gtk.init()
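
As a rough sketch of the conversion such an override would need, assuming a POSIX system: the helpers argv_as_bytes and argv_from_bytes are hypothetical names, and how an actual Gtk.init() override would hand the byte arrays to the C function is not shown here.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Hypothetical helper: str arguments -> list of bytes for a uint8 array API."""
    if argv is None:
        argv = sys.argv
    # os.fsencode restores the bytes originally passed on the command line
    return [os.fsencode(arg) for arg in argv]


def argv_from_bytes(raw_argv):
    """Hypothetical helper: bytes arguments returned from C -> list of str."""
    return [os.fsdecode(arg) for arg in raw_argv]


# An argument that is not valid UTF-8, as sys.argv would present it.
example = ["prog", os.fsdecode(b"\xff")]
as_bytes = argv_as_bytes(example)
back = argv_from_bytes(as_bytes)
```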
Comment 8 Christoph Reiter (lazka) 2016-09-03 17:24:57 UTC
Created attachment 334729
Add support for non-utf8 file names

rebased on master
Comment 9 Christoph Reiter (lazka) 2016-09-19 16:49:31 UTC
Created attachment 335871
Add support for non-utf8 file names

rebased on master
Comment 10 Christoph Reiter (lazka) 2017-03-25 13:15:43 UTC
Created attachment 348697
Add support for bytes and non-utf-8 file names.

New version, now also supports all paths on Windows including lone surrogates.

I've made it match the behavior of "open()" now, because with Python 3.6 switching from mbcs to utf-8 on Windows there is no good reason not to. The exception is Py2+Windows+bytes, where we use utf-8 instead of mbcs for backwards compat.
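
For context on the lone surrogate case: since Python 3.6 the file system encoding on Windows is utf-8 with the "surrogatepass" error handler. The sketch below only demonstrates that codec behavior in plain Python, which works on any platform; it does not show the patch itself.

```python
lone = "\ud800"  # a lone high surrogate, which a Windows file name may contain

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes it anyway and can decode it back unchanged.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")
```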