Bug 663610 – Cannot retrieve unicode strings from Gtk.TreeModel

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 663610 - Cannot retrieve unicode strings from Gtk.TreeModel


Summary:	Cannot retrieve unicode strings from Gtk.TreeModel


Status:	RESOLVED WONTFIX

Product:	pygobject
Classification:	Bindings
Component:	gio
Version:	3.0.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody's working on this now (help wanted and appreciated)
QA Contact:	Python bindings maintainers

URL:
Whiteboard:

Depends on:
Blocks:	663525

Reported:	2011-11-08 11:20 UTC by Sebastian Pölsterl
Modified:	2012-02-10 08:25 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Test case (601 bytes, text/x-python) 2011-11-08 11:20 UTC, Sebastian Pölsterl		Details
Convert all strings to utf-8 encoding (1.73 KB, patch) 2011-11-08 11:39 UTC, Sebastian Pölsterl	needs-work	Details \| Review
Updated patch with unit test (3.26 KB, patch) 2011-11-09 10:58 UTC, Sebastian Pölsterl	none	Details \| Review
Cleaned up version of previous patch (3.25 KB, patch) 2011-11-09 11:02 UTC, Sebastian Pölsterl	needs-work	Details \| Review
Fixed issues with Python 3 (4.07 KB, patch) 2011-11-11 11:15 UTC, Sebastian Pölsterl	committed	Details \| Review
Fix None values in Gtk.TreeModel (1.52 KB, patch) 2012-01-07 11:11 UTC, Martin Pitt	none	Details \| Review

Description Sebastian Pölsterl 2011-11-08 11:20:21 UTC

Created attachment 200973 [details]
Test case

If you store a utf-8 encoded string in a TreeModel and retrieve it later you get a str object instead of a unicode object and the encoding is no longer utf-8. Therefore, the attached example throws

Traceback (most recent call last):

+ Trace 229014

File "unicode_test.py", line 5 in get_text

text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0])

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)


The string inserted in the ListStore is u"Trödel" which in bytes looks like

0x54 0x72 0xf6 0x64 0x65 0x6c

In the Gtk override u"Trödel".encode("UTF-8") is called which results in the sequence

0x54 0x72 0xc3 0xb6 0x64 0x65 0x6c

I'm using version 3.0.1 with python 2.7.2 on Fedora 16.

Comment 1 Sebastian Pölsterl 2011-11-08 11:39:53 UTC

Created attachment 200976 [details] [review]
Convert all strings to utf-8 encoding

The attached patch makes sure all strings are utf-8 encoded when retrieving them from a TreeModel.

Comment 2 johnp 2011-11-08 19:37:32 UTC

Comment on attachment 200976 [details] [review]
Convert all strings to utf-8 encoding

Doh, UTF support is a pain.  I guess this is ok.  Would be nice to be able to specify byte storage but that is an edge case for stores.  You need to add a test to test_overrides.py before I can approve.

Comment 3 Sebastian Pölsterl 2011-11-09 10:58:21 UTC

Created attachment 201051 [details] [review]
Updated patch with unit test

I had to do some changes to the patch because columns of type UCHAR are stored as a string as well but should not be decoded. In addition, I added a unit test. Please have a look if the changes are compatible with Python 3.

Comment 4 Sebastian Pölsterl 2011-11-09 11:02:12 UTC

Created attachment 201052 [details] [review]
Cleaned up version of previous patch

Comment 5 johnp 2011-11-10 17:05:55 UTC

Comment on attachment 201052 [details] [review]
Cleaned up version of previous patch

The tests won't work in python3 due to u"blah" being an invalid construct.  You need to do the same unicode tricks we have at the top of the test_overrides.py file.

Comment 6 Sebastian Pölsterl 2011-11-11 11:15:49 UTC

Created attachment 201221 [details] [review]
Fixed issues with Python 3

The test looks extremely ugly now, but at least it works with both python 2 and 3 now.

Comment 7 johnp 2011-11-14 18:32:14 UTC

Comment on attachment 201221 [details] [review]
Fixed issues with Python 3

Thanks

Comment 8 Sebastian Pölsterl 2011-11-14 22:10:47 UTC

Is it okay to commit to the 3.0 branch as well or would this break some kind of freeze?

Comment 9 johnp 2011-11-14 22:38:14 UTC

let it sit in master for a week and if no one complains feel free to commit to 3.0

Comment 10 Martin Pitt 2012-01-06 14:54:08 UTC

For the record, since we got the new pygobject release with this patch we start getting all sorts of crashes where retrieving a value from a tree model yields None and thus a crash like

+ Trace 229400

File "/usr/lib/python2.7/dist-packages/softwareproperties/gtk/SoftwarePropertiesGtk.py", line 477 in on_combobox_server_changed
```
uri = server_store.get_value(iter, 1)
```
File "/usr/lib/python2.7/dist-packages/gi/overrides/Gtk.py", line 978 in get_value
```
return self._decode_value(treeiter, column, value)
```
File "/usr/lib/python2.7/dist-packages/gi/overrides/Gtk.py", line 804 in _decode_value
```
value = value.decode('UTF-8')
```

AttributeError: 'NoneType' object has no attribute 'decode'


None is indeed not handled correctly by this patch, but this part is trivial to fix. I have a test case and a patch ready for it, and will commit that soon.

However, it nicely points out the underlying bigger problem that the retrieved values should not be None in the first place -- we put in an actual string. Reverting the _decode_value() mangling fixes this, but the reason for this is still a mystery to me. I'll work on a small reproducer now and follow up here.

Comment 11 Martin Pitt 2012-01-06 15:16:50 UTC

Also, this is quite a dramatic behaviour change for python 2.x: previously you always got str (which you then had to .decode() yourself when you wanted unicode), now you always get unicode. This breaks existing software which was written with assuming str data type.

It's also incompatible to the rest of Gtk under introspection: Other methods such as Gtk.Label.get_label() return encoded str, not unicode objects, so now GtkTreeModel is a special case which programmers would need to keep in mind.

Is there any chance this can be reverted? The original test case seems quite weird to me:

The string inserted in the ListStore is u"Trödel" which in bytes looks like

0x54 0x72 0xf6 0x64 0x65 0x6c

This is NOT UTF-8, it's latin-1 (ISO-8859-1), which is not allowed in GTK. The UTF-8 representation of ö is 0xc3b6. The original example needs to be fixed to use .encode(): You can't mix str and unicode in interpolation in Python:

$ python -c 'print "ä%sb" % u"ö"'
Traceback (most recent call last):

+ Trace 229401

File "<string>", line 1 in <module>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)


You need to either only work with unicode or with str, such as

  $ python -c 'print "ä%sb" % u"ö".encode("UTF-8")'

I think we should revert the patch and close this as invalid. Sebastian, John, WDYT?

Comment 12 Sebastian Pölsterl 2012-01-06 15:53:03 UTC

I agree that this a change between previous versions of pygobject, but wouldn't you want a UTF-8 encoded string to always stay an UTF-8 encoded string? Considering that these are bindings you want to mimic the behaviour of the C library as close as possible.

Regarding the string in the test case, unichar expects the Unicode code not the UTF-8 byte value, therefore I believe that is correct. Not sure what you mean by mixing str and unicode in the example.

That said, I am okay with reverting this patch in the stable branch, but we should find a better solution and a fix for the None value issue for the next major version.

Comment 13 Martin Pitt 2012-01-06 16:05:37 UTC

(In reply to comment #12)
> I agree that this a change between previous versions of pygobject, but wouldn't
> you want a UTF-8 encoded string to always stay an UTF-8 encoded string?

Well, str *is* UTF-8 encoded (i. e. bytes); unicode != UTF-8 :)

But terminology aside, of course you are right: If I initialize a label with an unicode like

  l = Gtk.Label(u'foo')

I'd expect l.get_label() to return an unicode u'foo', not an UTF-8 str 'foo'. But right now pygobject just doesn't work like that, it doesn't track the original data types of arguments through the underlying C library, so what comes back out is always UTF-8 str (corresponding to the char* in C with the assumption that everything uses UTF-8).

> Considering that these are bindings you want to mimic the behaviour of the C
> library as close as possible.

Exactly, and C uses the equivalent of UTF-8 encoded str, not unicode (whose C equivalent would be wchar*).

> 
> Regarding the string in the test case, unichar expects the Unicode code not the
> UTF-8 byte value, therefore I believe that is correct.

Sorry, I didn't mean the test case in the committed patch, but the one in unicode_test.py (the "Test case") attachment: If you replace


    text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0])

(which uses an invalid encoding of the first Name string, and mixes unicode and str) with

   text = "<small>%s %s</small>" % ("Nàme:", model[aiter][0])

It works correctly without the patch here.

> That said, I am okay with reverting this patch in the stable branch, but we
> should find a better solution and a fix for the None value issue for the next
> major version.

Frankly, I think the unicode support for python 2.x and pygobject in particular is pretty much FUBAR. The only real solution that I see here is to use python 3, which finally has a clear model and distinction of strings and byte arrays; python 2.x has traditionally been a wild mix. If we want to fix it for python 2.x, we'd need to consistently make _all_ returned values from function calls unicode instead of str, but this would be a major API breakage. So I think it's better to stay with UTF-8 str for python 2.x :-/

Comment 14 Martin Pitt 2012-01-07 08:50:27 UTC

> That said, I am okay with reverting this patch in the stable branch

OK, I did that for now, but I really recommend reverting it from master as well.

Thanks, and sorry for all the troubles! (If I had a penny for all UnicodeDecodeException crash I've seen in Python projects, argh..)

Comment 15 Sebastian Pölsterl 2012-01-07 10:54:46 UTC

Could you still provide a test case where the patch causes the NoneType object issue, please?

Comment 16 Martin Pitt 2012-01-07 11:11:54 UTC

Created attachment 204805 [details] [review]
Fix None values in Gtk.TreeModel

Sure: With just the test case part of that patch you get

ERROR: test_tree_model_unicode (test_overrides.TestGtk)
----------------------------------------------------------------------
Traceback (most recent call last):

+ Trace 229403

File "/home/martin/upstream/pygobject/tests/test_overrides.py", line 1295 in test_tree_model_unicode
```
self.assertEqual(model[-1][0], None)
```
File "../gi/overrides/Gtk.py", line 1050 in __getitem__
```
return self.model.get_value(self.iter, key)
```
File "../gi/overrides/Gtk.py", line 977 in get_value
```
return self._decode_value(treeiter, column, value)
```
File "../gi/overrides/Gtk.py", line 803 in _decode_value
```
value = value.decode('UTF-8')
```

AttributeError: 'NoneType' object has no attribute 'decode'


The patch also contains the fix.

Comment 17 Marius Gedminas 2012-02-07 02:14:06 UTC

(In reply to comment #13)
> > Regarding the string in the test case, unichar expects the Unicode code not the
> > UTF-8 byte value, therefore I believe that is correct.
> 
> Sorry, I didn't mean the test case in the committed patch, but the one in
> unicode_test.py (the "Test case") attachment: If you replace
> 
>     text = "<small>%s %s</small>" % (u"N\xe1me:", model[aiter][0])
> 
> (which uses an invalid encoding of the first Name string, and mixes unicode and
> str) with

For the record, in Python 2.x the \xNN notation inside Unicode string literals refers to Unicode code points and is equivalent to \u00NN, so I think your invalid encoding comment stems from a misconception of some kind.

The bit about mixing unicode with (non-ASCII) str is spot on.

>    text = "<small>%s %s</small>" % ("Nàme:", model[aiter][0])
> 
> It works correctly without the patch here.

Comment 18 johnp 2012-02-07 05:02:31 UTC

There was a red hat bug that this patch also caused.  Can we simply return a str with utf-8 code points instead of unicode?

Comment 19 Sebastian Pölsterl 2012-02-07 06:56:55 UTC

I reverted the patch for 3.1.0 to restore the old behaviour, we could revert it in stable and push a new release as well, if requested.

Comment 20 johnp 2012-02-07 15:56:53 UTC

Yes please revert in stable.  I'm not sure master should have reverted though.  The fix is half correct.  The second half would simply not be using unicode but instead returning a UTF-8 encoded string in Python 2.  GTK specifies UTF-8 encoding so anything you put in will be encoded as such.

Comment 21 Sebastian Pölsterl 2012-02-07 17:22:56 UTC

(In reply to comment #20)
> The second half would simply not be using unicode but
> instead returning a UTF-8 encoded string in Python 2.
Isn't that you the behaviour without the patch anyway, you can put in an unicode object, but you will always get utf-8 encoded str from GTK. In my opinion, the advice is to avoid using unicode objects in Python 2 and always use utf-8 encoded str instead.

Comment 22 johnp 2012-02-07 20:17:08 UTC

Hmm, ok, I thought we were getting a traceback when trying to access the value in the store.  Looks like it is more an application issue so a revert should be enough.

Comment 23 Martin Pitt 2012-02-10 08:25:10 UTC

This was reverted in both 3-0 and trunk, and I don't think there's much we can do any more for python 2. Closing.