GNOME Bugzilla – Bug 672546
Ensure that "utf8" arguments actually get valid UTF-8 values
Last modified: 2014-09-09 23:40:18 UTC
When I pass non-UTF-8 data to markup_escape_text, I get a segfault. I have been using GLib through introspection; when using C I could not reproduce the behavior. The bug was discovered when trying to parse an SSID that was announced by an AP as non-UTF-8 data. A simple way of reproducing it is:
{{{
# -*- coding: latin-1 -*-
from gi.repository import GLib
GLib.markup_escape_text('Gruß Möglichkeiten Quiñones')
GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')
}}}
This is the stack trace I get from the second call:
{{{
(gdb) bt
+ Trace 229927
}}}
More info about the discovery is at: http://dev.laptop.org/ticket/11698
As documented, g_markup_escape_text() expects valid UTF-8 input; passing it non-UTF-8 text is a programmer error.
(In reply to comment #0)
> When I pass non utf-8 data to markup_escape_text I get a segfault.

Yes. Don't do that. Virtually all glib/gtk+ functions that take strings require them to be valid UTF-8. They don't verify this, because verification would slow things down by adding lots of checks that are only needed if you're using the API incorrectly.

If you have data that may or may not be valid UTF-8, you have to call g_utf8_validate() on it first, and then fix up the invalid bits in some way if it's not.
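The validate-then-fix-up step described above can be sketched in plain Python, without going through the bindings. This is only a minimal sketch under assumptions: `ensure_valid_utf8` is a hypothetical helper name, the validity check uses Python's own UTF-8 decoder rather than g_utf8_validate(), and replacing invalid bytes with U+FFFD is just one possible way to "fix up the invalid bits"; the comment above does not prescribe a particular strategy.

```python
def ensure_valid_utf8(raw):
    """Return text that is safe to hand to an API expecting valid UTF-8.

    Hypothetical helper for illustration; not part of GLib or pygobject.
    """
    try:
        # Strict decoding doubles as the validity check.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # One possible fix-up: substitute U+FFFD for each invalid byte sequence.
        return raw.decode("utf-8", errors="replace")


# The byte string from the report, e.g. an SSID as announced by an access point.
ssid = b"\xc3\xdc\xc2\xeb13\xb8\xf61"
cleaned = ensure_valid_utf8(ssid)

# Valid input passes through unchanged.
unchanged = ensure_valid_utf8("Gruß".encode("utf-8"))
```

Whether replacement, dropping the bytes, or guessing a legacy encoding such as Latin-1 is the right fix-up depends on where the data comes from; for an SSID, showing replacement characters may be the least surprising choice.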
As a general rule, text handling functions in GLib assume that their input is valid UTF-8 text and don't validate it before doing the job. In the case of g_markup_escape_text() it's even explicitly documented as such:

  text : some valid UTF-8 text

So this is definitely not a bug in GLib.

That said, the binding really should not be allowing you to feed non-UTF-8 text into that function, so I'd say that is a bug indeed -- in the binding. I can reproduce it here, too. I'm going to reassign to pygobject.
The issue also happens for

>>> import glib
>>> glib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')

This is handled by static C code in glibmodule.c. It appears that the gi.repository route is not handled that way, though -- this may be two bugs, in fact.
Thanks for all the feedback, I should have read the API documentation more carefully. I tried to use g_utf8_validate to make sure we have valid data, but with pygobject this segfaults as well; I filed #672889 for that.
So I guess we can close this report? If Python crashes iff C crashes, then it does a pretty good job of maintaining API and behaviour compatibility :-)

Bug 672889 is a real issue, of course; I'm looking into it now.
Closing, with Simon's consent on IRC.
It's my opinion that this bug should be fixed -- probably by ensuring that we don't accept non-ascii strings from Python (unless of course they're unicode strings -- then we convert those to utf8). We could also say that we expect any non-ascii content in non-unicode strings to be utf8... That's a bit more magic, but it would fit with the way we use our C APIs as well...
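The two policies proposed above can be sketched as a small argument-checking function. This is a hedged illustration only, written in Python 3 terms (`str` for unicode strings, `bytes` for non-unicode strings); `to_utf8_arg` and its `strict_ascii` flag are hypothetical names and do not reflect how pygobject actually marshals "utf8" arguments.

```python
def to_utf8_arg(value, strict_ascii=True):
    """Convert a Python value to bytes suitable for a "utf8" argument.

    Hypothetical sketch of the proposed policy; not pygobject code.
    """
    if isinstance(value, str):
        # Unicode strings are always converted to UTF-8.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # strict_ascii=True: reject any non-ASCII byte string.
        # strict_ascii=False: the "more magic" variant, which accepts
        # byte strings only if they already are valid UTF-8.
        codec = "ascii" if strict_ascii else "utf-8"
        try:
            value.decode(codec)
        except UnicodeDecodeError as exc:
            raise ValueError("byte string is not valid %s" % codec) from exc
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


ok_text = to_utf8_arg("Gruß")                                # str is encoded
ok_bytes = to_utf8_arg(b"Gru\xc3\x9f", strict_ascii=False)   # valid UTF-8 bytes
```

Either variant turns the invalid byte string from the report into a Python exception before it reaches C code, at the cost of one extra pass over each string argument.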
(In reply to comment #8)
> It's my opinion that this bug should be fixed -- probably by ensuring that we
> don't accept non-ascii strings from Python (unless of course they're unicode
> strings -- then we convert those to utf8).

It would certainly be nice to throw a proper exception instead of crashing. It would add quite a lot of runtime overhead, though. But I don't mind reopening this for the more general case. I'm not a fan of adding an override with a UTF-8 check for this particular function, though.
Simon, this doesn't actually crash for me any more, I get a proper UnicodeDecodeError with python 2, and with python3 it just works (as the interpreter itself ensures that strings are properly encoded). Do you have a current reproducer for this, or did that stop being relevant now?
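Why the reproducer "just works" under Python 3 can be illustrated without the bindings. The sketch below assumes Python 3 semantics, where the literal from the report is a str of nine code points rather than nine raw bytes; the variable names are illustrative only.

```python
# In Python 3 this literal is a str: each \xNN escape is a code point
# (U+00C3, U+00DC, ...), not a raw byte.
text = '\xc3\xdc\xc2\xeb13\xb8\xf61'

# Encoding a str to UTF-8 always yields valid UTF-8, so a C function
# expecting UTF-8 receives well-formed input.
encoded = text.encode('utf-8')

# The same nine values taken as raw bytes (what Python 2 passed along)
# are not valid UTF-8.
raw = text.encode('latin-1')
try:
    raw.decode('utf-8')
    raw_is_valid = True
except UnicodeDecodeError:
    raw_is_valid = False

print(text)           # the str shown as characters
print(len(encoded))   # byte length of the UTF-8 encoding
print(raw_is_valid)
```

The six non-ASCII code points each take two bytes in UTF-8, which is why the encoded form is longer than the original nine-byte string.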
{{{
# -*- coding: latin-1 -*-
from gi.repository import GLib
GLib.markup_escape_text('Gruß Möglichkeiten Quiñones')
GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')
}}}
I can still see the issue: the second markup_escape_text call above segfaults.

Python 2.7.3, GLib 2.34, PyGObject3 3.4.2

#672889 has been fixed, so I can first validate whether the string is UTF-8 and use the API correctly.
Hm, that's what I tried; with python3 it succeeds, with python2.7 I get a UnicodeDecodeError. With print() added around the statements:

$ python3 /tmp/test.py
Gruß Möglichkeiten Quiñones
ÃÜÂë13¸ö1

(which is right as my locale is UTF-8)

$ python /tmp/test.py
Gruß Möglichkeiten Quiñones
Traceback (most recent call last):
+ Trace 231364
print(GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61'))
return GLib.markup_escape_text(text.decode('UTF-8'), length)
return codecs.utf_8_decode(input, errors, True)
That's pygobject 3.7.3, though; maybe the removal of the static glib bindings changed this?
I also cannot reproduce this with later versions of pygi; I get exactly the same results as Martin in comment #12.