Bug 132040 – abusing setdefaultencoding()




Summary:	abusing setdefaultencoding()


Status:	RESOLVED WONTFIX

Product:	pygtk
Classification:	Bindings
Component:	pango
Version:	1.99.x/2.0.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Nobody's working on this now (help wanted and appreciated)
QA Contact:	Python bindings maintainers

URL:
Whiteboard:	gnome[unmaintained]

Duplicates:	324323
Depends on:
Blocks:

Reported:	2004-01-20 22:38 UTC by Jon Willeke
Modified:	2018-08-17 13:39 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Jon Willeke 2004-01-20 22:38:12 UTC

I'm using Python 2.3.3 and PyGTK 2.0.0 on SuSE Linux 8.2.

PyGTK sets the default encoding to UTF-8 with
PyUnicode_SetDefaultEncoding().  This is gross, because it overrides
site.py.  If a script then calls sys.setdefaultencoding(), things may break.

This looks like a hack to enable the use of the "s" and "z" parameter
codes, the behavior of which depends on the default encoding.  I think it
would be better to use "es".  Unfortunately, there doesn't seem to be an
"ez", but it could be simulated with "O&".
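
For illustration only, the idea behind "es" (and a simulated "ez") can be sketched in Python rather than in the binding's C code: the conversion names UTF-8 explicitly instead of consulting the interpreter-wide default. This sketch assumes Python 3, and the helper name encode_arg and its allow_none flag are hypothetical; they are not part of PyGTK or CPython.

```python
def encode_arg(value, allow_none=False):
    """Return UTF-8 bytes for a text argument without using the default encoding.

    Hypothetical helper: a Python-level analogue of parsing with "es" and an
    explicit "utf-8" encoding name. allow_none mimics the None -> NULL
    behaviour of the "z" code.
    """
    if value is None:
        if allow_none:
            return None  # the C function would receive NULL here
        raise TypeError("argument must be a string, not None")
    if isinstance(value, bytes):
        # Already encoded: verify it is valid UTF-8, then pass it through.
        value.decode("utf-8")
        return value
    if isinstance(value, str):
        # The encoding is named explicitly, so no global state is involved.
        return value.encode("utf-8")
    raise TypeError("argument must be a string, not %s" % type(value).__name__)


encoded = encode_arg("\u00c4")                 # U+00C4, the character from comment 5
missing = encode_arg(None, allow_none=True)    # "ez"-style optional argument
```

Because the encoding is fixed at the call site, the result is the same whatever sys.getdefaultencoding() reports.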

Comment 1 Christian Reis (not reading bugmail) 2004-02-29 00:32:08 UTC

Needs triaging

Comment 2 Gustavo Carneiro 2004-02-29 12:14:06 UTC

Unfortunately, the "es" conversion requires that the caller frees the
memory allocated by the encoding process.  That's what makes it
difficult to handle in the code generator without causing memory
leaks.  Moreover, I think the code generator needs a bit of code
refactoring to handle this.
But I agree that setdefaultencoding is not a very good idea.  This is
just like the problem of python changing LC_NUMERIC underneath us...
Let's keep this bug open.  Difficult is not impossible, but right now
we're concentrating on simple bugs to make a release soon.

Comment 3 John Ehresman 2004-07-21 21:48:10 UTC

Marking this as for a future milestone.  I think this should be done, but it's
difficult and will change the api if we want to return strings as unicode objects.

Comment 4 Johan (not receiving bugmail) Dahlin 2005-12-17 10:51:18 UTC

*** Bug 324323 has been marked as a duplicate of this bug. ***

Comment 5 Rauli Ruohonen 2006-01-21 13:10:40 UTC

This behavior is depended upon in some strange places. I'm not sure whether I should file a new bug report for this, so I decided to just add a comment, as it's closely related. I don't know what places other than
layout.set_text() have this problem, but in any case, the user shouldn't
need to do unicode -> utf-8 conversions where they're obvious (IMHO). FWIW,
I'm using the kludge in my own code for now (because "import gtk" would
require a display).

The following script should not print any errors, but it does:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pango
import pangocairo
import cairo
very_ugly_kludge = False
def bugtest():
  surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 20, 20)
  context = pangocairo.CairoContext(cairo.Context(surface))
  layout = context.create_layout()
  txt = u'Ä'
  if very_ugly_kludge: txt = txt.encode('utf-8')
  layout.set_text(txt)
try: bugtest()
except UnicodeEncodeError:
  print 'Error: layout.set_text() does not accept unicode input'
else: print 'OK without gtk imported'
try: import gtk
except RuntimeError, reason:
  print "Can't import gtk, not testing further (gtk: %s)"%reason
else:
  try: bugtest()
  except UnicodeEncodeError:
    print ('Error: layout.set_text() does not accept unicode input '
           'even with gtk imported')
  else: print 'OK with gtk imported'

Comment 6 John Ehresman 2006-01-21 16:34:01 UTC

Is the bug in the last comment that the automatic unicode -> utf8 conversion doesn't work if gtk is never imported but pango is imported and used?  If so, please open another bug for it; it is a bug and needs fixing.  If the bug is something else, please clarify.

This bug is that the automatic conversion depends on setting the interpreter-wide default encoding.
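
The dependence described here can be mimicked with a small sketch, assuming Python 3, where the "default" is passed in as a parameter instead of being read from interpreter state. The function names implicit_convert and try_convert are hypothetical and only model the behaviour; they are not PyGTK code.

```python
def implicit_convert(text, default_encoding):
    """Model an implicit unicode -> bytes conversion driven by a global default."""
    return text.encode(default_encoding)


def try_convert(text, default_encoding):
    """Return the encoded bytes, or None if the conversion fails."""
    try:
        return implicit_convert(text, default_encoding)
    except UnicodeEncodeError:
        return None


# With an ASCII default the non-ASCII character cannot be converted;
# with a UTF-8 default the same call succeeds.
ascii_result = try_convert("\u00c4", "ascii")
utf8_result = try_convert("\u00c4", "utf-8")
```

The same call succeeds or fails depending only on a value set elsewhere, which is why the outcome in comment 5 changes with whether gtk has been imported.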

Comment 7 Rauli Ruohonen 2006-01-21 18:01:17 UTC

Added as bug 328031.

Comment 8 John Dennis 2008-02-23 17:23:25 UTC

Please indulge me for a moment and allow me to rant, I think my points are valid.

It's very important to get this fixed, this should not be a low priority issue IMHO. This is a terrible hack with nasty hidden side effects. It took me a long time to diagnose why we were getting unicode encoding errors sometimes but not others. As it turns out it all came down to when and if gtk was imported :-(

The problem was seen in libraries other than PyGTK. I investigated the problem by extracting the relevant code into a small test case which did not include PyGTK. A thorough reading of the CPython documentation and source code pointed directly at python's default encoding. However it's impossible to change the default encoding from ascii to utf-8 from python code, it's immutable. Little did I know another optionally loaded component was secretly modifying this variable from C code and affecting everything globally! Something which would not be seen in an isolated test case.

It's hard enough as it is to fully understand all the issues which impinge upon i18n, a challenge exacerbated by the number of diverse components involved, all of which must follow the same rules to get the right end result. When one of the components "cheats" in a hidden obscure manner which then impacts other participating components it makes it difficult to understand if one's conception of i18n handling is correct as well as what needs to be done to assure i18n handling is correct in one's own code and in each of the participating components.

It is an unfortunate fact most developers do not understand i18n and there is much incorrect information floating around. People draw false conclusions when they observe misbehaving components producing seemingly correct results and then they perpetuate the i18n misinformation as if it were fact. At a minimum PyGTK's munging of the default encoding should be prominently and clearly documented lest it also contribute to the body of incorrect information surrounding i18n coding practices.

It would really be best to avoid the temptation to code hidden global side effects just because it's expedient. Such things inevitably cause a lot of down stream pain.

The binding needs to use the 'es' family of format conversion specifiers in the PyArg_ParseTuple and Py_BuildValue CPython API families, explicitly specifying that the encoding for the C library functions being called in the binding is utf-8.

I hope my comments are viewed as constructive and not mean spirited. I know when I'm the recipient of a rant I sometimes feel personally offended and I really don't want that to be the end result here. But I do feel strongly this is directly contributing to a lot of needless i18n woes. i18n needs less headaches, not more.
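
A hidden change of this kind can be detected with a check along the following lines. This is a minimal sketch using only the standard library; the function name default_encoding_change is hypothetical. On Python 3 the default encoding is fixed, so the check is expected to report no change there; it is meant to show the shape of the diagnosis, not to reproduce the PyGTK behaviour.

```python
import importlib
import sys


def default_encoding_change(module_name):
    """Import a module and report whether the default encoding changed.

    Returns a (before, after) tuple if the import altered
    sys.getdefaultencoding(), otherwise None. The check is only meaningful
    if the module has not been imported earlier in the process, because a
    repeated import does not run the module's initialisation again.
    """
    before = sys.getdefaultencoding()
    importlib.import_module(module_name)
    after = sys.getdefaultencoding()
    if before != after:
        return (before, after)
    return None


# "json" stands in for any module under suspicion.
change = default_encoding_change("json")
```

Running such a check around each optional import would have pointed at the module responsible without reading its C source.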

Comment 9 Gustavo Carneiro 2008-02-23 18:02:24 UTC

John, please read comment #2.  I don't like to repeat myself.

To summarize, yes, I think we all agree using "es" is desirable, but the developers also agree that using "es" is not easy to implement in the current PyGTK framework.  But working patches are, of course, welcome... :-)

Comment 10 Johan (not receiving bugmail) Dahlin 2008-02-23 18:18:29 UTC

(In reply to comment #8)
> Please indulge me for a moment and allow me to rant, I think my points are
> valid.
> 
> It's very important to get this fixed, this should not be a low priority issue
> IMHO. This is a terrible hack with nasty hidden side effects. It took me a long
> time to diagnose why we were getting unicode encoding errors sometimes but not
> others. As it turns out it all came down to when and if gtk was imported :-(
> 
> The problem was seen in libraries other than PyGTK. I investigated the problem
> by extracting the relevant code into a small test case which did not include
> PyGTK. A thorough reading of the CPython documentation and source code pointed
> directly at python's default encoding. However it's impossible to change the
> default encoding from ascii to utf-8 from python code, 

It is actually possible*, so you could do something like this if you want:

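import sys
# remember the default encoding before gtk changes it
old = sys.getdefaultencoding()
import gtk
# reload restores sys.setdefaultencoding, which site.py deleted
reload(sys)
sys.setdefaultencoding(old)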

*) the default site.py deletes the setdefaultencoding function from sys. Fortunately for the small group of persistent hackers, that technique is worked around by using reload, which re-creates the module namespace.

Comment 11 André Klapper 2018-08-17 13:39:17 UTC

pygtk is not under active development anymore and had its last code changes
in 2013. Its codebase has been archived:
https://gitlab.gnome.org/Archive/pygtk/commits/master

PyGObject at https://gitlab.gnome.org/GNOME/pygobject is its successor. See https://pygobject.readthedocs.io/en/latest/guide/porting.html for porting info.

Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect
reality. Feel free to open a task in GNOME Gitlab if the issue described in this task still applies to a recent version of PyGObject. Thanks!