Bug 672546 – Ensure that "utf8" arguments actually get valid UTF-8 values




Summary:	Ensure that "utf8" arguments actually get valid UTF-8 values


Status:	RESOLVED FIXED

Product:	pygobject
Classification:	Bindings
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody's working on this now (help wanted and appreciated)
QA Contact:	Python bindings maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2012-03-21 12:49 UTC by Simon Schampijer
Modified:	2014-09-09 23:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Simon Schampijer 2012-03-21 12:49:43 UTC

When I pass non-UTF-8 data to markup_escape_text I get a segfault. I have been using GLib through introspection. When using C I could not reproduce the behavior. The bug was discovered when trying to parse an SSID that had been announced by an AP as non-UTF-8 data.

A simple way of reproducing is:

{{{
# -*- coding: latin-1 -*-

from gi.repository import GLib

GLib.markup_escape_text('Gruß Möglichkeiten Quiñones')

GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')
}}}

This is the stack trace I get from the second call:

{{{
(gdb) bt
#0 append_escaped_text at gmarkup.c line 2107
#1 g_markup_escape_text at gmarkup.c line 2183
#2 pyglib_markup_escape_text
#3 PyCFunction_Call
#4 call_function at /usr/src/debug/Python-2.7.2/Python/ceval.c line 4090
#5 PyEval_EvalFrameEx
#6 PyEval_EvalCodeEx
}}}

More info about the discovery is at: http://dev.laptop.org/ticket/11698

Comment 1 Christian Persch 2012-03-21 13:11:36 UTC

As documented, g_markup_escape_text() expects valid UTF-8 input; passing it non-UTF-8 text is a programmer error.

Comment 2 Dan Winship 2012-03-21 13:13:07 UTC

(In reply to comment #0)
> When I pass non-UTF-8 data to markup_escape_text I get a segfault.

Yes. Don't do that. Virtually all glib/gtk+ functions that take strings require them to be valid UTF-8 (and they don't verify that they are, because verifying them would slow things down by adding lots of additional checks that are only needed if you're using the API incorrectly).

If you have data that may or may not be valid UTF-8, you have to call g_utf8_validate() on it first, and then fix up the invalid bits in some way if it's not.
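
On the Python side, the same "validate first, then fix up" step can be sketched without GLib at all. This is only an illustration: `ensure_utf8` is a hypothetical helper name, and replacing bad bytes with U+FFFD is one possible fix-up policy among several (dropping them or guessing another encoding would be others).

```python
def ensure_utf8(raw: bytes) -> str:
    """Return text that is safe to hand to an API that requires valid UTF-8.

    Hypothetical helper: valid input is decoded unchanged, invalid input
    has each undecodable byte replaced with U+FFFD.
    """
    try:
        # Strict decoding doubles as validation.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fix-up policy (an assumption): substitute the replacement character.
        return raw.decode("utf-8", errors="replace")


# The byte sequence from the report is not valid UTF-8, so it gets fixed up.
print(ensure_utf8(b"\xc3\xdc\xc2\xeb13\xb8\xf61"))
```

The returned string can then be passed to GLib.markup_escape_text(), since a Python 3 str is encoded to UTF-8 on its way into the binding.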

Comment 3 Allison Karlitskaya (desrt) 2012-03-21 13:16:35 UTC

As a general rule, text handling functions in GLib assume that their input is valid utf8 text and don't validate it before doing the job.

In the case of g_markup_escape_text() it's even explicitly documented as such:

text : some valid UTF-8 text

So this is definitely not a bug in GLib.

That said, the binding really should not be allowing you to feed non-utf8 text into that function, so I'd say that is a bug indeed -- in the binding.  I can reproduce it here, too.  I'm going to reassign to pygobject.

Comment 4 Allison Karlitskaya (desrt) 2012-03-21 13:29:06 UTC

The issue also happens for

>>> import glib
>>> glib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')

This is handled by static C code in glibmodule.c.

It appears that the gi.repository route is not handled that way, though -- this may be two bugs, in fact.

Comment 5 Simon Schampijer 2012-03-27 06:36:17 UTC

Thanks for all the feedback. I should have read the API documentation more carefully. I then tried to use g_utf8_validate to make sure we have valid data, but with pygobject this segfaults as well; I filed #672889 for that.

Comment 6 Martin Pitt 2012-03-27 09:53:28 UTC

So I guess we can close this report? If Python crashes iff C crashes, then it does a pretty good job of maintaining API and behaviour compatibility :-)

Bug 672889 is a real issue, of course; I'm looking into it now.

Comment 7 Martin Pitt 2012-03-27 11:05:01 UTC

Closing, with Simon's consent on IRC.

Comment 8 Allison Karlitskaya (desrt) 2012-03-27 13:12:05 UTC

It's my opinion that this bug should be fixed -- probably by ensuring that we don't accept non-ascii strings from Python (unless of course they're unicode strings -- then we convert those to utf8).

We could also say that we expect any non-ascii content in non-unicode strings to be utf8... That's a bit more magic, but it would fit with the way we use our C APIs as well...
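
The two policies suggested here can be sketched in pure Python. This is not pygobject's actual marshalling code; `to_utf8_argument` and its `ascii_only` flag are hypothetical names used only to show the idea: unicode strings are encoded to UTF-8, and byte strings are either restricted to ASCII or required to already be valid UTF-8.

```python
def to_utf8_argument(value, ascii_only=False):
    """Convert a Python value into bytes suitable for a "utf8" argument.

    Hypothetical sketch of a binding-side check, not pygobject's real code.
    """
    if isinstance(value, str):
        # Unicode strings are always converted to UTF-8.
        return value.encode("utf-8")
    if isinstance(value, (bytes, bytearray)):
        data = bytes(value)
        # Policy 1: only ASCII byte strings. Policy 2: any valid UTF-8.
        codec = "ascii" if ascii_only else "utf-8"
        try:
            data.decode(codec)
        except UnicodeDecodeError as exc:
            raise ValueError("byte string is not valid %s: %s" % (codec, exc)) from None
        return data
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


try:
    to_utf8_argument(b"\xc3\xdc\xc2\xeb13\xb8\xf61")
except ValueError as exc:
    # An exception is raised instead of handing invalid bytes to C code.
    print("rejected:", exc)
```

Either policy costs one validation pass over each string argument, which is the runtime overhead at stake in this discussion.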

Comment 9 Martin Pitt 2012-03-27 16:50:10 UTC

(In reply to comment #8)
> It's my opinion that this bug should be fixed -- probably by ensuring that we
> don't accept non-ascii strings from Python (unless of course they're unicode
> strings -- then we convert those to utf8).

It would certainly be nice to throw a proper exception instead of crashing. It would add quite a lot of runtime overhead, though.

But I don't mind reopening this for the more general case. I'm not a fan of adding an override with a UTF-8 check for this particular function, though.

Comment 10 Martin Pitt 2013-01-08 16:22:57 UTC

Simon, this doesn't actually crash for me any more, I get a proper UnicodeDecodeError with python 2, and with python3 it just works (as the interpreter itself ensures that strings are properly encoded). Do you have a current reproducer for this, or did that stop being relevant now?
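
The Python 2 versus Python 3 difference can be illustrated with plain Python 3, without GLib. In this sketch (variable names are illustrative only) the same literal is a sequence of code points in Python 3, so encoding it always yields valid UTF-8, whereas the raw bytes that the Python 2 literal denoted do not decode as UTF-8.

```python
# In Python 3 this literal is a str of nine code points, not nine raw bytes.
text = '\xc3\xdc\xc2\xeb13\xb8\xf61'

# Encoding a str to UTF-8 produces valid UTF-8 by construction.
encoded = text.encode('utf-8')
assert encoded.decode('utf-8') == text

# These are the raw bytes the same literal denoted in Python 2.
raw = text.encode('latin-1')
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    # Same class of error as the Python 2 traceback in this report.
    print("not valid UTF-8:", exc)
```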

Comment 11 Simon Schampijer 2013-01-10 10:27:20 UTC

{{{
# -*- coding: latin-1 -*-

from gi.repository import GLib

GLib.markup_escape_text('Gruß Möglichkeiten Quiñones')

GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61')
}}}

I can still reproduce the issue: the second markup_escape_text call above segfaults. Python 2.7.3, GLib 2.34, Pygobject3 3.4.2

#672889 has been fixed, so I can validate first if the string is UTF-8 and use the API correctly.

Comment 12 Martin Pitt 2013-01-10 11:32:52 UTC

Hm, that's what I tried; with python3 it succeeds, with python2.7 I get a UnicodeDecodeError. After adding print() around the statements:

$ python3 /tmp/test.py 
GruÃ&#x9f; MÃ¶glichkeiten QuiÃ±ones
ÃÜÂë13¸ö1

(which is right as my locale is UTF-8)

$ python /tmp/test.py 
Gruß Möglichkeiten Quiñones
Traceback (most recent call last):
  File "/tmp/test.py", line 7, in <module>
    print(GLib.markup_escape_text('\xc3\xdc\xc2\xeb13\xb8\xf61'))
  File "/usr/lib/python2.7/dist-packages/gi/overrides/GLib.py", line 438, in markup_escape_text
    return GLib.markup_escape_text(text.decode('UTF-8'), length)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte


That's pygobject 3.7.3, though; maybe the removal of the static glib bindings changed this?

Comment 13 Simon Feltman 2014-09-09 23:40:18 UTC

I also cannot reproduce this with later versions of pygi; I get exactly the same results as Martin in comment #12.