GNOME Bugzilla – Bug 132040
abusing setdefaultencoding()
Last modified: 2018-08-17 13:39:17 UTC
I'm using Python 2.3.3 and PyGTK 2.0.0 on SuSE Linux 8.2. PyGTK sets the default encoding to UTF-8 with PyUnicode_SetDefaultEncoding(). This is gross, because it overrides site.py. If a script then calls sys.setdefaultencoding(), things may break. This looks like a hack to enable the use of the "s" and "z" parameter codes, the behavior of which depends on the default encoding. I think it would be better to use "es". Unfortunately, there doesn't seem to be an "ez", but it could be simulated with "O&".
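To make the dependence described above concrete, here is a minimal sketch in Python 3 syntax. It only simulates the behavior: Python 3 has no changeable default encoding, so a module-level variable stands in for it. All names (set_default_encoding, implicit_convert, explicit_convert) are hypothetical and are not PyGTK or CPython APIs.

```python
# Sketch: simulate how an implicit conversion depends on a process-wide
# default encoding, while an explicit ("es"-style) conversion does not.
# All names here are hypothetical; this is not PyGTK code.

_default_encoding = "ascii"  # stands in for the interpreter-wide default


def set_default_encoding(name):
    """Simulate PyUnicode_SetDefaultEncoding(): mutates global state."""
    global _default_encoding
    _default_encoding = name


def implicit_convert(text):
    """Like the "s" code in Python 2: the result depends on the global default."""
    return text.encode(_default_encoding)


def explicit_convert(text, encoding="utf-8"):
    """Like the "es" code: the caller names the encoding."""
    return text.encode(encoding)


text = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# With the simulated default still "ascii", the implicit conversion fails.
try:
    implicit_convert(text)
    implicit_failed_before = False
except UnicodeEncodeError:
    implicit_failed_before = True

# Simulate the effect of importing gtk: the global default changes.
set_default_encoding("utf-8")
implicit_after = implicit_convert(text)

# The explicit conversion never consults the global default.
explicit_result = explicit_convert(text)
```

The same call to implicit_convert() fails or succeeds depending on whether the global was changed first, which is the import-order sensitivity this report is about. explicit_convert() gives the same bytes either way.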
Needs triaging
Unfortunately, the "es" conversion requires that the caller frees the memory allocated by the encoding process. That's what makes it difficult to handle in the code generator without causing memory leaks. Moreover, I think the code generator needs a bit of code refactoring to handle this. But I agree that setdefaultencoding is not a very good idea. This is just like the problem of python changing LC_NUMERIC underneath us... Let's keep this bug open. Difficult is not impossible, but right now we're concentrating on simple bugs to make a release soon.
Marking this as for a future milestone. I think this should be done, but it's difficult and will change the api if we want to return strings as unicode objects.
*** Bug 324323 has been marked as a duplicate of this bug. ***
This behavior is depended upon in some strange places. I'm not sure if I should make a new bug report for this one or not, so I rather decided to add just a comment, as it's closely related. I don't know what places other than layout.set_text() have this problem, but in any case, the user shouldn't need to do unicode -> utf-8 conversions where they're obvious (IMHO). FWIW, I'm using the kludge in my own code for now (because "import gtk" would require a display). The following script should not print any errors, but it does:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import pango
import pangocairo
import cairo

very_ugly_kludge = False

def bugtest():
    surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 20, 20)
    context = pangocairo.CairoContext(cairo.Context(surface))
    layout = context.create_layout()
    txt = u'Ä'
    if very_ugly_kludge:
        txt = txt.encode('utf-8')
    layout.set_text(txt)

try:
    bugtest()
except UnicodeEncodeError:
    print 'Error: layout.set_text() does not accept unicode input'
else:
    print 'OK without gtk imported'

try:
    import gtk
except RuntimeError, reason:
    print "Can't import gtk, not testing further (gtk: %s)"%reason
else:
    try:
        bugtest()
    except UnicodeEncodeError:
        print ('Error: layout.set_text() does not accept unicode input '
               'even with gtk imported')
    else:
        print 'OK with gtk imported'
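The kludge in the script above amounts to encoding text to UTF-8 before handing it to the binding. A minimal sketch of that step as a helper, written in Python 3 syntax (the script itself is Python 2, where the types would be unicode and str). The name to_utf8 is hypothetical, not a PyGTK or pango function, and the sketch assumes the receiving API expects UTF-8 bytes.

```python
def to_utf8(text):
    """Return UTF-8 encoded bytes for text; pass bytes through unchanged.

    Hypothetical helper, assuming the receiving API expects UTF-8 bytes.
    """
    if isinstance(text, bytes):
        # Already encoded; assume the caller produced UTF-8.
        return text
    return text.encode("utf-8")
```

In the script above, the call would then look like layout.set_text(to_utf8(txt)). The conversion no longer depends on whether gtk was imported.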
Is the bug in the last comment that the automatic unicode -> utf8 conversion doesn't work if gtk is never imported, but pango is imported and used? If so, please open another bug for it; it is a bug and needs fixing. If the bug is something else, please clarify. This bug is that the automatic conversion depends on setting the interpreter-wide default encoding.
Added as bug 328031.
Please indulge me for a moment and allow me to rant, I think my points are valid.

It's very important to get this fixed, this should not be a low priority issue IMHO. This is a terrible hack with nasty hidden side effects. It took me a long time to diagnose why we were getting unicode encoding errors sometimes but not others. As it turns out it all came down to when and if gtk was imported :-(

The problem was seen in libraries other than PyGTK. I investigated the problem by extracting the relevant code into a small test case which did not include PyGTK. A thorough reading of the CPython documentation and source code pointed directly at python's default encoding. However it's impossible to change the default encoding from ascii to utf-8 from python code, it's immutable. Little did I know another optionally loaded component was secretly modifying this variable from C code and affecting everything globally! Something which would not be seen in an isolated test case.

It's hard enough as it is to fully understand all the issues which impinge upon i18n, a challenge exacerbated by the number of diverse components involved, all of which must follow the same rules to get the right end result. When one of the components "cheats" in a hidden, obscure manner which then impacts other participating components, it makes it difficult to understand if one's conception of i18n handling is correct, as well as what needs to be done to assure i18n handling is correct in one's own code and in each of the participating components. It is an unfortunate fact that most developers do not understand i18n and there is much incorrect information floating around. People draw false conclusions when they observe misbehaving components producing seemingly correct results and then they perpetuate the i18n misinformation as if it were fact.
At a minimum, PyGTK's munging of the default encoding should be prominently and clearly documented lest it also contribute to the body of incorrect information surrounding i18n coding practices. It would really be best to avoid the temptation to code hidden global side effects just because it's expedient. Such things inevitably cause a lot of downstream pain. The binding needs to use the 'es' family of format conversion specifiers in the PyArg_ParseTuple and Py_BuildValue CPython API families, explicitly specifying that the encoding for the C library functions being called in the binding is utf-8.

I hope my comments are viewed as constructive and not mean-spirited. I know when I'm the recipient of a rant I sometimes feel personally offended and I really don't want that to be the end result here. But I do feel strongly this is directly contributing to a lot of needless i18n woes. i18n needs fewer headaches, not more.
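One way to catch this kind of hidden change while debugging is to compare sys.getdefaultencoding() before and after a suspect import. A minimal sketch, in Python 3 syntax: watch_default_encoding is a hypothetical name, and json is only a stand-in for the import under suspicion (gtk in this report). On Python 3 the default encoding cannot be changed, so the sketch reports nothing there.

```python
import contextlib
import sys


@contextlib.contextmanager
def watch_default_encoding(changes):
    """Append (before, after) to `changes` if the default encoding changed.

    Hypothetical debugging aid; it only observes, it does not restore.
    """
    before = sys.getdefaultencoding()
    try:
        yield
    finally:
        after = sys.getdefaultencoding()
        if after != before:
            changes.append((before, after))


changes = []
with watch_default_encoding(changes):
    import json  # stand-in for the import under suspicion, e.g. gtk

# An empty list means the wrapped code left the default encoding alone.
```

Wrapping each optional import this way in a test case would show which component altered the global, rather than leaving it to be inferred from intermittent encoding errors.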
John, please read comment #2. I don't like to repeat myself. To summarize, yes, I think we all agree using "es" is desirable, but the developers also agree that using "es" is not easy to implement in the current PyGTK framework. But working patches are, of course, welcome... :-)
(In reply to comment #8)
> Please indulge me for a moment and allow me to rant, I think my points are
> valid.
>
> It's very important to get this fixed, this should not be a low priority issue
> IMHO. This is a terrible hack with nasty hidden side effects. It took me a long
> time to diagnose why we were getting unicode encoding errors sometimes but not
> others. As it turns out it all came down to when and if gtk was imported :-(
>
> The problem was seen in libraries other than PyGTK. I investigated the problem
> by extracting the relevant code into a small test case which did not include
> PyGTK. A thorough reading of the CPython documentation and source code pointed
> directly at python's default encoding. However it's impossible to change the
> default encoding from ascii to utf-8 from python code,

It is actually possible*, so you could do something like this if you want:

import sys
old = sys.getdefaultencoding()
import gtk
reload(sys)
sys.setdefaultencoding(old)

*) the default site.py deletes the setdefaultencoding function from sys. Fortunately for the small group of persistent hackers, that technique is worked around by using reload, which re-creates the module namespace.
pygtk is not under active development anymore and had its last code changes in 2013. Its codebase has been archived: https://gitlab.gnome.org/Archive/pygtk/commits/master PyGObject at https://gitlab.gnome.org/GNOME/pygobject is its successor. See https://pygobject.readthedocs.io/en/latest/guide/porting.html for porting info. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Feel free to open a task in GNOME Gitlab if the issue described in this task still applies to a recent version of PyGObject. Thanks!