GNOME Bugzilla – Bug 663610
Cannot retrieve unicode strings from Gtk.TreeModel
Last modified: 2012-02-10 08:25:10 UTC
Created attachment 200973 [details] Test case If you store a utf-8 encoded string in a TreeModel and retrieve it later you get a str object instead of a unicode object and the encoding is no longer utf-8. Therefore, the attached example throws Traceback (most recent call last):
+ Trace 229014
text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0])
The string inserted in the ListStore is u"Trödel" which in bytes looks like 0x54 0x72 0xf6 0x64 0x65 0x6c In the Gtk override u"Trödel".encode("UTF-8") is called which results in the sequence 0x54 0x72 0xc3 0xb6 0x64 0x65 0x6c I'm using version 3.0.1 with python 2.7.2 on Fedora 16.
Created attachment 200976 [details] [review] Convert all strings to utf-8 encoding The attached patch makes sure all strings are utf-8 encoded when retrieving them from a TreeModel.
Comment on attachment 200976 [details] [review] Convert all strings to utf-8 encoding Doh, UTF support is a pain. I guess this is ok. Would be nice to be able to specify byte storage but that is an edge case for stores. You need to add a test to test_overrides.py before I can approve.
Created attachment 201051 [details] [review] Updated patch with unit test I had to do some changes to the patch because columns of type UCHAR are stored as a string as well but should not be decoded. In addition, I added a unit test. Please have a look if the changes are compatible with Python 3.
Created attachment 201052 [details] [review] Cleaned up version of previous patch
Comment on attachment 201052 [details] [review] Cleaned up version of previous patch The tests won't work in python3 due to u"blah" being an invalid construct. You need to do the same unicode tricks we have at the top of the test_overrides.py file.
Created attachment 201221 [details] [review] Fixed issues with Python 3 The test looks extremely ugly now, but at least it works with both python 2 and 3 now.
Comment on attachment 201221 [details] [review] Fixed issues with Python 3 Thanks
Is it okay to commit to the 3.0 branch as well or would this break some kind of freeze?
let it sit in master for a week and if no one complains feel free to commit to 3.0
For the record, since we got the new pygobject release with this patch we start getting all sorts of crashes where retrieving a value from a tree model yields None and thus a crash like
+ Trace 229400
uri = server_store.get_value(iter, 1)
return self._decode_value(treeiter, column, value)
value = value.decode('UTF-8')
None is indeed not handled correctly by this patch, but this part is trivial to fix. I have a test case and a patch ready for it, and will commit that soon. However, it nicely points out the underlying bigger problem that the retrieved values should not be None in the first place -- we put in an actual string. Reverting the _decode_value() mangling fixes this, but the reason for this is still a mystery to me. I'll work on a small reproducer now and follow up here.
Also, this is quite a dramatic behaviour change for python 2.x: previously you always got str (which you then had to .decode() yourself when you wanted unicode), now you always get unicode. This breaks existing software which was written with assuming str data type. It's also incompatible to the rest of Gtk under introspection: Other methods such as Gtk.Label.get_label() return encoded str, not unicode objects, so now GtkTreeModel is a special case which programmers would need to keep in mind. Is there any chance this can be reverted? The original test case seems quite weird to me: The string inserted in the ListStore is u"Trödel" which in bytes looks like 0x54 0x72 0xf6 0x64 0x65 0x6c This is NOT UTF-8, it's latin-1 (ISO-8859-1), which is not allowed in GTK. The UTF-8 representation of ö is 0xc3b6. The original example needs to be fixed to use .encode(): You can't mix str and unicode in interpolation in Python: $ python -c 'print "ä%sb" % u"ö"' Traceback (most recent call last):
+ Trace 229401
You need to either only work with unicode or with str, such as $ python -c 'print "ä%sb" % u"ö".encode("UTF-8")' I think we should revert the patch and close this as invalid. Sebastian, John, WDYT?
I agree that this a change between previous versions of pygobject, but wouldn't you want a UTF-8 encoded string to always stay an UTF-8 encoded string? Considering that these are bindings you want to mimic the behaviour of the C library as close as possible. Regarding the string in the test case, unichar expects the Unicode code not the UTF-8 byte value, therefore I believe that is correct. Not sure what you mean by mixing str and unicode in the example. That said, I am okay with reverting this patch in the stable branch, but we should find a better solution and a fix for the None value issue for the next major version.
(In reply to comment #12) > I agree that this a change between previous versions of pygobject, but wouldn't > you want a UTF-8 encoded string to always stay an UTF-8 encoded string? Well, str *is* UTF-8 encoded (i. e. bytes); unicode != UTF-8 :) But terminology aside, of course you are right: If I initialize a label with an unicode like l = Gtk.Label(u'foo') I'd expect l.get_label() to return an unicode u'foo', not an UTF-8 str 'foo'. But right now pygobject just doesn't work like that, it doesn't track the original data types of arguments through the underlying C library, so what comes back out is always UTF-8 str (corresponding to the char* in C with the assumption that everything uses UTF-8). > Considering that these are bindings you want to mimic the behaviour of the C > library as close as possible. Exactly, and C uses the equivalent of UTF-8 encoded str, not unicode (whose C equivalent would be wchar*). > > Regarding the string in the test case, unichar expects the Unicode code not the > UTF-8 byte value, therefore I believe that is correct. Sorry, I didn't mean the test case in the committed patch, but the one in unicode_test.py (the "Test case") attachment: If you replace text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0]) (which uses an invalid encoding of the first Name string, and mixes unicode and str) with text = "<small>%s %s</small>" % ("Nàme:", model[aiter][0]) It works correctly without the patch here. > That said, I am okay with reverting this patch in the stable branch, but we > should find a better solution and a fix for the None value issue for the next > major version. Frankly, I think the unicode support for python 2.x and pygobject in particular is pretty much FUBAR. The only real solution that I see here is to use python 3, which finally has a clear model and distinction of strings and byte arrays; python 2.x has traditionally been a wild mix. If we want to fix it for python 2.x, we'd need to consistently make _all_ returned values from function calls unicode instead of str, but this would be a major API breakage. So I think it's better to stay with UTF-8 str for python 2.x :-/
> That said, I am okay with reverting this patch in the stable branch OK, I did that for now, but I really recommend reverting it from master as well. Thanks, and sorry for all the troubles! (If I had a penny for all UnicodeDecodeException crash I've seen in Python projects, argh..)
Could you still provide a test case where the patch causes the NoneType object issue, please?
Created attachment 204805 [details] [review] Fix None values in Gtk.TreeModel Sure: With just the test case part of that patch you get ERROR: test_tree_model_unicode (test_overrides.TestGtk) ---------------------------------------------------------------------- Traceback (most recent call last):
+ Trace 229403
self.assertEqual(model[-1][0], None)
return self.model.get_value(self.iter, key)
The patch also contains the fix.
(In reply to comment #13) > > Regarding the string in the test case, unichar expects the Unicode code not the > > UTF-8 byte value, therefore I believe that is correct. > > Sorry, I didn't mean the test case in the committed patch, but the one in > unicode_test.py (the "Test case") attachment: If you replace > > text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0]) > > (which uses an invalid encoding of the first Name string, and mixes unicode and > str) with For the record, in Python 2.x the \xNN notation inside Unicode string literals refers to Unicode code points and is equivalent to \u00NN, so I think your invalid encoding comment stems from a misconception of some kind. The bit about mixing unicode with (non-ASCII) str is spot on. > text = "<small>%s %s</small>" % ("Nàme:", model[aiter][0]) > > It works correctly without the patch here.
There was a red hat bug that this patch also caused. Can we simply return a str with utf-8 code points instead of unicode?
I reverted the patch for 3.1.0 to restore the old behaviour, we could revert it in stable and push a new release as well, if requested.
Yes please revert in stable. I'm not sure master should have reverted though. The fix is half correct. The second half would simply not be using unicode but instead returning a UTF-8 encoded string in Python 2. GTK specifies UTF-8 encoding so anything you put in will be encoded as such.
(In reply to comment #20) > The second half would simply not be using unicode but > instead returning a UTF-8 encoded string in Python 2. Isn't that you the behaviour without the patch anyway, you can put in an unicode object, but you will always get utf-8 encoded str from GTK. In my opinion, the advice is to avoid using unicode objects in Python 2 and always use utf-8 encoded str instead.
Hmm, ok, I thought we were getting a traceback when trying to access the value in the store. Looks like it is more an application issue so a revert should be enough.
This was reverted in both 3-0 and trunk, and I don't think there's much we can do any more for python 2. Closing.