Bug 649165 – Enumeration symbols are LC_CTYPE specific

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 649165 - Enumeration symbols are LC_CTYPE specific


Summary:	Enumeration symbols are LC_CTYPE specific


Status:	RESOLVED FIXED

Product:	pygobject
Classification:	Bindings
Component:	introspection
Version:	Git master
Hardware:	Other Linux

Importance:	Normal major
Target Milestone:	---
Assigned To:	Martin Pitt
QA Contact:	Python bindings maintainers

URL:
Whiteboard:

Duplicates:	655528 (view as bug list)
Depends on:
Blocks:

Reported:	2011-05-02 09:43 UTC by Martin Pitt
Modified:	2011-09-11 18:14 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
proposed patch (4.74 KB, patch) 2011-05-02 14:02 UTC, Martin Pitt	none	Details \| Review
proposed patch (4.74 KB, patch) 2011-05-03 06:41 UTC, Martin Pitt	none	Details \| Review
proposed patch (5.03 KB, patch) 2011-05-03 06:43 UTC, Martin Pitt	none	Details \| Review
proposed patch (4.58 KB, patch) 2011-05-03 08:49 UTC, Martin Pitt	accepted-commit_now	Details \| Review

Description Martin Pitt 2011-05-02 09:43:36 UTC

We recently got a bug report about PyGI programs failing in Turkish locales (https://launchpad.net/bugs/747796). Under tr_TR.UTF-8, exported symbols with an uppercase "I" come out as lowercase "i':

$ python -c "import locale; from gi.repository import Gtk; locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8'); print dir(Gtk.IconSize)"
['BUTTON', 'DND', 'DiALOG', 'iNVALiD', ...]

It is correct in C and e. g. de_DE.UTF-8:
 PYTHONPATH=. python -c "import locale; from gi.repository import Gtk; locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8'); print dir(Gtk.IconSize)"
['BUTTON', 'DIALOG', 'DND', 'INVALID', ...]

In tr_TR.UTF-8, you get an AttributeError when you try to access e. g. Gtk.IconSize.DIALOG.

It has been demonstrated that this can be worked around by enclosing a GI call into a pair of locale.setlocale(locale.LC_CTYPE, 'C') / locale.setlocale(locale.LC_CTYPE, '') calls. In other words, the initial "from gi.repository import Gtk" import worked correctly, but the bug seems to happen for (un)marshalling the symbol names. 

At the symbol marshalling time there should not happen any LC_CTYPE conversion? Or do we need to consider programming languages which actually support non-ASCII identifiers?

Comment 1 Martin Pitt 2011-05-02 09:47:58 UTC

A simple reproducer is to run "LANG=tr_TR.UTF-8 make check". This will fail in both test_gi and test_gdbus, due to this problem.

Interestingly, calling "LANG=tr_TR.UTF-8 TEST_NAMES=test_gi make check" does NOT fail, neither does TEST_NAMES=test_gdbus. It seems that running some other test (like the Gtk ones?) "clutters" the marshalling in some way.

Comment 2 Martin Pitt 2011-05-02 10:45:11 UTC

For the record, Turkish locales map two ASCII characters to non-ASCII characters with respect to case conversion:
    "I" (capital I) is mapped to "ı" (lowercase i-dotless)
    "i" (lowercase i) is mapped to "İ" (capital I-dotted)

This seems to be related, and it has been demonstrated that hacking the tr_TR locale definition for "toupper" and "tolower" to map i -> I and I->i also fixes the problem.

Indeed there is a case conversion going on here with enumeration values. E. g. if the gir specifies 

    <enumeration name="IconSize"
[...]
      <member name="invalid"

then this is accessed in python as "INVALID".

I think gi/module.py IntrospectionModule.__getattr__() does this conversion. If I add some debugging to the "for value_info in info.get_values():" loop, I get:

  IntrospectionModule __getattr_(IconSize): upping dialog
  == DiALOG

Importing Gtk somehow triggers a setlocale(), which explains why the gdbus and gi tests work when being called on their own, but fail under tr_TR.UTF-8 when test_overrides also run:

$ python -c 'import locale; print locale.getlocale(); from gi.repository import Gtk; print locale.getlocale();'
(None, None)
('de_DE', 'UTF8')

So I think I have found the root cause here.

The question is now whether we should really use a locale specific upper() call here, or just switch to the C locale for this? I think we should do the latter. locale-specific constant names do sound a bit crazy to me.

Comment 3 Martin Pitt 2011-05-02 11:12:35 UTC

This patch fixes it:

--- a/gi/module.py
+++ b/gi/module.py
@@ -24,6 +24,7 @@ from __future__ import absolute_import

 import os
 import gobject
+import locale

 import gi
 from .overrides import registry
@@ -121,7 +122,13 @@ class IntrospectionModule(object):
                 wrapper.__module__ = 'gi.repository.' + info.get_namespace()

                 for value_info in info.get_values():
+                    # switch to C locale for upper() to avoid locale specific
+                    # identifier conversion (e. g. in Turkish 'i'.upper() == 'i')
+                    # see https://bugzilla.gnome.org/show_bug.cgi?id=649165
+                    old_ctype = locale.getlocale(locale.LC_CTYPE)
+                    locale.setlocale(locale.LC_CTYPE, 'C')
                     value_name = value_info.get_name().upper()
+                    locale.setlocale(locale.LC_CTYPE, old_ctype)
                     setattr(wrapper, value_name, wrapper(value_info.get_value()))

I'm now working on a proper test case, and then submit a complete git-formatted patch as an attachment.

Comment 4 Martin Pitt 2011-05-02 14:02:14 UTC

Created attachment 187038 [details] [review]
proposed patch

Proposed patch, including test suite integration.

Unfortunately the standard Python library does not have an ASCII only variant of upper(). gi.repository.GLib provides g_ascii_toupper(), but this only works on a single char, and I'm not sure whether it's a good idea to use this in gi/module.py (feels so recursive). Also, performance-wise it would be a lot worse, I suppose.

Comment 5 Barry Warsaw 2011-05-02 15:40:16 UTC

A few comments.  First str.upper() is defined to be locale-dependent for 8-bit strings.  I don't know what that implies for unicode strings.  Second, I agree that locale-dependent identifiers is crazy talk :).

I think I'd actually implement the temporary switch in locales as a context manager so that 1) it works for the with-statement; 2) you can't accidentally *not* reset the locale should some exception occur during the work.  E.g. (mostly untested):


import locale
from contextlib import contextmanager


@contextmanager
def c_locale():
    old_ctype = locale.getlocale(locale.LC_CTYPE)
    locale.setlocale(locale.LC_CTYPE, 'C')
    try:
        yield
    finally:
        locale.setlocale(locale.LC_CTYPE, old_ctype)


with c_locale():
    for value_info in info.get_values():
        value_name = value_info.get_name().upper()
        setattr(wrapper, value_name, wrapper(value_info.get_value()))


If you don't like the call syntax in the with statement, you could instead create a class with __enter__() and __exit__() methods, then instantiate that class globally so everyone has access to it.

Comment 6 Martin Pitt 2011-05-03 06:41:39 UTC

Created attachment 187099 [details] [review]
proposed patch

Thanks Barry for your review. This indeed looks a bit complicated, but I agree that this should be implemented in a more robust way. So I just used a try:/finally: around the loop now, which should achieve the same effect.

Comment 7 Martin Pitt 2011-05-03 06:43:25 UTC

Created attachment 187100 [details] [review]
proposed patch

Meh, forgot to run format-patch, this is really the updated one.

Comment 8 Nils Philippsen 2011-05-03 08:34:36 UTC

Hmm, I'm not sure how this fares with threaded applications -- could this affect locale-dependent functionality in threads running concurrently with this?

Alternatively to switching locales back and forth, we could just use a private, locale-unaware toupper() function, like this one:

--- 8< ---
from string import maketrans

fromchars = 'abcdefgjhijklmnopqrstuvwxyz'
tochars = 'ABCDEFGJHIJKLMNOPQRSTUVWXYZ'

transtable = maketrans(fromchars, tochars)

def toupper(s):
    return s.translate(transtable)
--- >8 ---

Comment 9 Martin Pitt 2011-05-03 08:49:59 UTC

Created attachment 187105 [details] [review]
proposed patch

Of course, thanks Nils! KISS for the win :-)

Comment 10 johnp 2011-05-04 21:36:39 UTC

Comment on attachment 187105 [details] [review]
proposed patch

Looks good to me.  The local switching was insane but this is much better and most likely better performing.

Comment 11 Martin Pitt 2011-05-05 05:00:34 UTC

Pushed to master and 2-28 branches.

Comment 12 John Stowers 2011-07-28 19:57:34 UTC

*** Bug 655528 has been marked as a duplicate of this bug. ***

Comment 13 Alec Leamas 2011-09-11 17:23:58 UTC

The fix contains an error: the call   string.maketrans(... should read str.maketrans(... 

As a consequence of this fix, all imports of Gtk fails. See https://bugzilla.redhat.com/show_bug.cgi?id=737375. Would really like to reopen bug, but I can't (dumb? not allowed?).

Comment 14 Dieter Verfaillie 2011-09-11 18:14:49 UTC

(In reply to comment #13)
> The fix contains an error: the call   string.maketrans(... should read
> str.maketrans(... 
> 
> As a consequence of this fix, all imports of Gtk fails. See
> https://bugzilla.redhat.com/show_bug.cgi?id=737375. Would really like to reopen
> bug, but I can't (dumb? not allowed?).

This was fixed some time ago, in the pygobject-2-28 branch:
http://git.gnome.org/browse/pygobject/commit/?h=pygobject-2-28&id=667bec76ccbc85cc1d54a0e68977dbda241c028c

and in the master branch:
http://git.gnome.org/browse/pygobject/commit/?id=8f89ff24fcac627ce15ca93038711fded1a7c5ed