GNOME Bugzilla – Bug 746564
Python 3 file name encoding problems
Last modified: 2017-03-30 08:02:59 UTC
There are a few minor problems with Python 3 and filenames, environ and argv. Under Python 3 they can all be represented as "bytes", or as "str" with (optional) surrogates.

Here are some problematic cases which fail under Python 3 but work fine under Python 2:

python3 -c "from gi.repository import Gtk" $(echo -e "\xff")
FOO=$(echo -e "\xff") python3 -c "from gi.repository import GLib; GLib.get_environ()"
python3 -c "from gi.repository import Gio; Gio.File.new_for_path(b'\xff'.decode('utf8', 'surrogateescape'))"
python3 -c "from gi.repository import Gio; Gio.File.new_for_path(b'asda')"
python3 -c "from gi.repository import Gio; Gio.Resource.load(b'\xff'.decode('utf-8', 'surrogateescape'))"
python3 -c "from gi.repository import Gio; Gio.Resource.load(b'ada')"

I'm not sure what's up with the "filename" type in GI, as it's not used that much (many are marked utf-8)... but it seems like it's intended for this, so the following would be a solution: annotate anything representing a path, env var or argv as "filename"; allow passing in "bytes"; and on encoding/decoding use the "surrogateescape" error handler together with sys.getfilesystemencoding().

Thoughts?
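To illustrate the proposed encoding rule, here is a minimal sketch in plain Python (no pygobject involved). The helper names decode_filename and encode_filename are hypothetical and only mirror the suggestion above; they are not part of any existing API.

```python
import sys


def decode_filename(raw: bytes) -> str:
    """Hypothetical helper: bytes from the OS -> str, keeping undecodable bytes."""
    return raw.decode(sys.getfilesystemencoding(), "surrogateescape")


def encode_filename(name: str) -> bytes:
    """Hypothetical helper: str -> bytes, restoring any escaped bytes."""
    return name.encode(sys.getfilesystemencoding(), "surrogateescape")


raw = b"caf\xff.txt"          # not valid UTF-8
name = decode_filename(raw)   # undecodable bytes become lone surrogates
assert encode_filename(name) == raw

# A strict UTF-8 decode of the same bytes fails instead of round-tripping.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The point of the error handler is that the str form can be passed around inside Python and still be turned back into the exact original bytes at the GI boundary.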
I just hit this as well when attempting some Python 2 -> 3 porting. It seems like it's not possible to actually handle unicode paths correctly (i.e., as opaque bytestrings in unix-land) with current pygobject. There are some understandable quirks, e.g. Gtk.FileChooser.set_filename() not taking bytes. However, it feels weird that Gio.File.new_for_path() also requires a unicode string, and even stranger that Gio.File.new_for_uri() does the same. I feel like the API really should assume and pass on raw bytes, and if it *does* get a str (unicode) then it should encode it as comment 1 suggests before handing it off to Gio.
I'd recommend switching to "str" on Python 3 for unix paths anyway. Most Python 3 APIs give you str decoded with the surrogateescape error handler, so they can contain arbitrary data. And some new APIs like pathlib only support str.
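As a small illustration of that point, the sketch below assumes a POSIX system, where os.fsdecode() uses the surrogateescape handler. The path used is made up for the example.

```python
import os
from pathlib import PurePosixPath

raw = b"/tmp/\xff.txt"          # a path that is not valid UTF-8

# os.fsdecode() is what APIs such as os.listdir() effectively apply on POSIX:
# the result is a str, with the bad byte carried as a lone surrogate.
name = os.fsdecode(raw)

# pathlib only works with str, and is happy with the escaped form.
path = PurePosixPath(name)

# os.fsencode() accepts path-like objects and restores the original bytes.
recovered = os.fsencode(path)
assert recovered == raw
```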
*** Bug 764457 has been marked as a duplicate of this bug. ***
Created attachment 329106 [details] [review] Add support for non-utf8 file names.

PY2+UNIX: Convert unicode to bytes using the glib encoding. Pass bytes in as is. Always return bytes.

PY2+Windows: Convert unicode to utf-8. Pass bytes if they are utf-8. Return utf-8 encoded bytes.

PY3+UNIX: Convert str using os.fsencode so that the surrogateescape handler can restore the real path if the source was a Python API such as os.listdir, sys.argv, etc. Pass bytes as is. Return str decoded using os.fsdecode so that it can be passed to Python APIs such as open, os.listdir, etc.

PY3+Windows: Convert str to utf-8. Pass bytes if utf-8. Return str decoded from utf-8.

PyUnicode_EncodeFSDefault was added in CPython 3.2, so bump the requirement. (The Windows part is untested.)
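The Python 3 rules above can be approximated in pure Python as follows. This is only a sketch of the described behavior, not the code from the attachment (which works at the C level); the function names and the windows flag are hypothetical, and the non-Windows branch assumes it is run on a POSIX system.

```python
import os


def to_glib_filename(path, windows=False):
    """Hypothetical: convert a Python 3 value to the bytes handed to GLib."""
    if isinstance(path, bytes):
        if windows:
            path.decode("utf-8")  # only pass bytes through if valid UTF-8
        return path
    if isinstance(path, str):
        if windows:
            return path.encode("utf-8")
        return os.fsencode(path)  # surrogateescape restores the real bytes
    raise TypeError("expected str or bytes, got %s" % type(path).__name__)


def from_glib_filename(raw, windows=False):
    """Hypothetical: convert bytes returned by GLib to a Python 3 str."""
    if windows:
        return raw.decode("utf-8")
    return os.fsdecode(raw)  # usable with open(), os.listdir(), etc.


# A non-UTF-8 name obtained from a Python API round-trips unchanged (POSIX).
listed = os.fsdecode(b"\xff")
assert to_glib_filename(listed) == b"\xff"
assert from_glib_filename(b"\xff") == listed
```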
I've filed https://bugzilla.gnome.org/show_bug.cgi?id=767245 which adds annotations for GLib and Gio. If this gets accepted I'll look into the other libs.
gtk: bug 767266 gdk-pixbuf: bug 767267 gstreamer: bug 767268
gtk+/gstreamer don't like the filename annotation for argv/environ. On #gtk+ it was suggested to use null-terminated uint8 arrays instead. We could do that and add overrides for Gdk.init() and Gtk.init().
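On the Python side, such an override would mainly need to turn each argument back into its original bytes before handing it over. A rough sketch of that step, assuming a POSIX system; argv_as_bytes is a hypothetical helper, and how the result would actually be passed to Gtk.init() is not shown here:

```python
import os
import sys


def argv_as_bytes(argv):
    """Hypothetical helper: recover the raw bytes of each argument."""
    # sys.argv entries are str decoded with surrogateescape on POSIX,
    # so os.fsencode() gives back exactly what the process received.
    return [os.fsencode(arg) for arg in argv]


raw_argv = argv_as_bytes(sys.argv)
```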
Created attachment 334729 [details] [review] Add support for non-utf8 file names rebased on master
Created attachment 335871 [details] [review] Add support for non-utf8 file names rebased on master
Created attachment 348697 [details] [review] Add support for bytes and non-utf-8 file names. New version, now also supports all paths on Windows including lone surrogates. I've made it match the behavior of "open()" now, because with Python 3.6 switching from mbcs to utf-8 on Windows there is no good reason not to. Except with Py2+Windows+bytes, where we use utf-8 instead of mbcs for backwards compat.
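For context on the lone surrogate case: since Python 3.6, the filesystem encoding on Windows is utf-8 combined with the "surrogatepass" error handler (PEP 529). The sketch below applies that handler explicitly so it runs on any platform. Whether the attachment uses exactly this handler internally is an assumption, and the helper names are hypothetical.

```python
def encode_windows_path(name: str) -> bytes:
    """Hypothetical: encode like Python 3.6+ does for Windows paths."""
    return name.encode("utf-8", "surrogatepass")


def decode_windows_path(raw: bytes) -> str:
    """Hypothetical: inverse of encode_windows_path()."""
    return raw.decode("utf-8", "surrogatepass")


lone = "bad\ud800name"              # contains a lone surrogate
raw = encode_windows_path(lone)
assert decode_windows_path(raw) == lone

# Strict UTF-8 refuses the same string.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```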