GNOME Bugzilla – Bug 524740
Strings should be demarshalled to unicode
Last modified: 2008-05-20 14:57:46 UTC
If I read this correctly:
http://mail.gnome.org/archives/orbit-list/2002-December/msg00008.html

It is safe to assume that CORBA strings will always be UTF-8. If we could demarshall CORBA strings to python unicode objects we would avoid a lot of confusion. I am encountering this a lot in Orca.
Created attachment 108156
Proposed patch

This one-liner creates a Python unicode object instead of a string object. The main concern, I think, is that this will affect apps that use this library, so it has breakage potential. For example, I need to alter pyatspi a bit to accommodate this.
Basically, I think we need to make sure to call encode('utf-8') wherever things need to be 1 character wide. I also think that this patch is safe for pre-unicode PyORBit too, so that should be a relief.
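To illustrate the encode-at-the-boundary idea in the comment above, here is a minimal sketch. It is written in Python 3 syntax as an assumption for readability: there, `str` plays the role of the Python 2 `unicode` type discussed in this bug, and `bytes` plays the role of the Python 2 `str` type. The helper name `to_wire` is hypothetical and is not part of PyORBit or pyatspi.

```python
def to_wire(text):
    """Return UTF-8 encoded bytes for APIs that expect a byte string.

    Hypothetical helper for illustration; not part of PyORBit or pyatspi.
    """
    if isinstance(text, bytes):
        return text  # already a byte string, pass it through unchanged
    return text.encode('utf-8')


name = 'caf\u00e9'      # 4 characters, the last one is e-acute
wire = to_wire(name)    # 5 bytes, because e-acute takes two bytes in UTF-8
print(len(name), len(wire))
```

The difference between the two lengths is the reason any code that counts or slices by byte would need an explicit encode step if strings arrived as unicode objects.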
(In reply to comment #0)
> If I read this correctly:
> http://mail.gnome.org/archives/orbit-list/2002-December/msg00008.html
>
> It is safe to assume that CORBA strings will always be UTF-8.

You are not reading it correctly, IMHO. Michael Meeks is only saying we should be using UTF-8 encoded strings everywhere. The CORBA standard only says CORBA_string is mapped to a C char* string, with no assumption being made about the encoding. You can't decode a string on the assumption that its encoding is UTF-8; what about non-GNOME apps?

> If we could demarshall CORBA strings to python unicode objects we would avoid a
> lot of confusion. I am encountering this a lot in Orca.

Besides not being standard compliant [1], this patch introduces an API change. Incidentally, PyGtk, which could more easily switch to unicode strings by default because there is no standard to forbid it, also has this problem and still uses Python non-unicode strings for the sake of API compatibility. The same change in PyORBit would both violate the CORBA standard and break backward API compatibility, and since PyORBit is part of the GNOME Language Bindings platform, it cannot change its API without creating a new parallel-installable version of itself.

So, sorry, thanks for the patch, but no thanks.

[1] "Both the bounded and the unbounded string type of IDL are mapped to the Python string type.", in "Python Language Mapping, v1.2 November 2002".
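The objection about non-GNOME peers can be made concrete with a short sketch. It uses Python 3 syntax as an assumption (`bytes` standing in for the Python 2 `str` type), and the helper `try_decode_utf8` is hypothetical. It shows that the same text sent in Latin-1 is not valid UTF-8, so an unconditional decode inside the binding would raise for such a peer.

```python
def try_decode_utf8(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


text = 'caf\u00e9'
utf8_bytes = text.encode('utf-8')      # what a GNOME application would send
latin1_bytes = text.encode('latin-1')  # what some other CORBA peer might send

decoded_ok = try_decode_utf8(utf8_bytes)       # decodes back to the original text
decoded_bad = try_decode_utf8(latin1_bytes)    # None: the lone 0xE9 byte is invalid UTF-8
```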
(In reply to comment #3)
> You are not reading it correctly IMHO. Michael Meeks is only saying we should
> be using utf-8 encoded strings everywhere. The CORBA standard only says
> CORBA_string is mapped to a C char* string, with no assumption being made on
> the encoding.

OK, so at which level can this assumption be made? If I understand correctly (I'm not a GNOME hacker), CORBA doesn't specify the encoding, but in GNOME it is always used with UTF-8. So is there a common point where decoding could be done safely? Leaving it up to the application seems quite unfortunate to me. And even if we need to do it at the application level, are we safe to always assume UTF-8?
Sure, GNOME uses UTF-8 everywhere, but PyORBit and ORBit provide a generic CORBA implementation and are not in any way GNOME-specific, except for the use of GLib as the runtime. Unicode handling has to be left to whatever sits on top of CORBA, be it the application or something else.
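One way to read "left to whatever sits on top of CORBA" is that the layer which knows its peers picks the encoding and decodes explicitly. The following is only a sketch under that reading, in Python 3 syntax as an assumption (`bytes` standing in for the Python 2 `str` type). The function `decode_corba_string` and its parameters are hypothetical and do not exist in PyORBit.

```python
def decode_corba_string(raw, encoding='utf-8', errors='strict'):
    """Decode a raw CORBA string in the application layer.

    Hypothetical helper: the caller, not the CORBA binding, decides
    which encoding its peers use and how to treat invalid bytes.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to do
    return raw.decode(encoding, errors)


# A GNOME-only application can assume UTF-8 for its own peers.
from_gnome = decode_corba_string(b'caf\xc3\xa9')

# An application talking to a Latin-1 peer states that explicitly.
from_legacy = decode_corba_string(b'caf\xe9', encoding='latin-1')

# A lenient caller can substitute U+FFFD for undecodable bytes instead of failing.
lenient = decode_corba_string(b'caf\xe9', errors='replace')
```

Keeping the decision in the caller leaves the binding's return type unchanged, which is the compatibility constraint raised in this thread.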
Gustavo: Is anyone actually using ORBit/PyORBit outside of a GNOME-related application in practice? This is similar to setting the default Python encoding in pango/gtk. It's theoretically far from correct, but in practice it'll make it easier to get the common (99%) use cases right. Or am I missing something?
I have no idea who's using it. In any case, returning unicode instead of str objects would be an incompatible API change. That alone would block this change, so...