GNOME Bugzilla – Bug 111925
Bad unicode management
Last modified: 2006-04-12 11:07:28 UTC
When compiling vte, the following error is generated in doc/reference:

*** Scanning header files ***
[...]
creating vte-scan
** ERROR **: Don't know how to read native-endian unicode data!
aborting...
Scan failed

The cause is in src/matcher.c: _vte_matcher_find_valid_encoding() uses the charset "UNICODE" to encode wide characters (g_iconv_open() works fine on AIX if we have "unicode UNICODE" in charset.alias). The following test fails:

if (memcmp(outbuf, buffer, outbytes) == 0) {

because g_iconv() encodes each character in outbuf on two bytes, but buffer is a gunichar array, and gunichar is guint32 (four bytes).

I think this algorithm is not an appropriate way to determine whether an encoding is valid, because it doesn't follow the Unicode specification. Remember: http://www.unicode.org/faq/basic_q.html#19

Q. I understand that all Unicode characters are 16 bits, and that the high byte is used to switch between code blocks. Is that correct?

A. Absolutely not! Unicode characters may be encoded at any code point from U+0000 to U+10FFFF. The size of the code unit used for expressing those code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for UTF-32) [See UTF & BOM]. Even when Unicode characters are expressed with 16-bit code units, there is no concept of a high byte switching values between "code pages" expressed in the low byte. The entire 16-bit value expresses the entire character, period. [KW]
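The mismatch described in this comment can be reproduced without iconv at all. The sketch below is only an illustration: the helper names same_bytes() and same_code_points() are hypothetical and do not appear in vte, and uint16_t/uint32_t stand in for a two-byte encoder output and a gunichar array. It shows that a raw memcmp() between 16-bit code units and 32-bit values fails even when both buffers hold the same code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise check, modeled on the memcmp() test quoted in the report.
 * It ignores the width of the code units in each buffer. */
static int same_bytes(const void *out, const void *expected, size_t outbytes)
{
    return memcmp(out, expected, outbytes) == 0;
}

/* Width-aware check: compares one 16-bit unit against one 32-bit value.
 * Only meaningful for BMP characters (no surrogate pairs are handled). */
static int same_code_points(const uint16_t *out, const uint32_t *expected,
                            size_t n)
{
    assert(out != NULL && expected != NULL);
    for (size_t i = 0; i < n; i++) {
        if ((uint32_t)out[i] != expected[i])
            return 0;
    }
    return 1;
}
```

With narrow = { 'A', 'B', 'C' } as uint16_t and wide = { 'A', 'B', 'C' } as uint32_t, same_bytes(narrow, wide, sizeof narrow) is false on both little-endian and big-endian hosts, while same_code_points(narrow, wide, 3) is true.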
After more investigation, I think this problem can appear on all systems. Another charset that is checked is ISO-10646, which exists in two encoding forms: UCS-2 (2 bytes) and UCS-4 (4 bytes). See http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

According to this document, section "2. The structure of the coding space":

The third and fourth octets gives the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane, BMP. The 4-octet representation of a character in the BMP is produced by putting two 0 octets before its 2-octet representation. Still no characters have been allocated to positions outside the BMP, and only the 2-octet form is used in practice.

-------------

(This is most likely why my AIX has UCS-2 and not UCS-4.)
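The relation between the 2-octet and 4-octet forms quoted above can be written down directly. This is a minimal sketch under the same assumption as the quoted passage (BMP characters only, octets in the order given there, most significant first); the function name ucs2_octets_to_ucs4() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Widens n_chars BMP characters from the 2-octet form to the 4-octet form
 * by putting two 0 octets before each 2-octet representation.
 * `out` must have room for 4 * n_chars octets.
 * Returns the number of octets written. */
static size_t ucs2_octets_to_ucs4(const unsigned char *in, size_t n_chars,
                                  unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_chars; i++) {
        out[4 * i + 0] = 0;             /* group octet */
        out[4 * i + 1] = 0;             /* plane octet */
        out[4 * i + 2] = in[2 * i + 0]; /* row octet   */
        out[4 * i + 3] = in[2 * i + 1]; /* cell octet  */
    }
    return 4 * n_chars;
}
```

For example, the octets 00 41 (U+0041) become 00 00 00 41, and 20 AC (U+20AC) become 00 00 20 AC. A converter that emits the 2-octet form therefore never produces bytes that match a four-byte gunichar array.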
The terminal is looking for a giconv target name which it can use to convert from multibyte encodings directly to gunichars and back. If there's a bug here, it's that the error message is potentially misleading. I expect that building glib with libiconv instead of the OS-supplied version of iconv will provide the needed capability. Changing the error message to "Don't know how to convert to/from gunichar data!" is more accurate, so marking as fixed in CVS.
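For illustration, a probe of this kind could look roughly like the sketch below. It is not the code in src/matcher.c: it uses POSIX iconv() rather than g_iconv(), uint32_t in place of gunichar, and the candidate names are assumptions that may not exist on every platform (on AIX in particular).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if converting "A" from UTF-8 to `name` yields exactly one
 * native-endian 32-bit value equal to U+0041, else 0. */
static int encodes_as_native_u32(const char *name)
{
    assert(name != NULL);
    iconv_t cd = iconv_open(name, "UTF-8");
    if (cd == (iconv_t)-1)
        return 0;               /* this iconv does not know the name */

    char in[] = "A";
    char out[16];
    char *inp = in, *outp = out;
    size_t inleft = 1, outleft = sizeof out;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return 0;

    uint32_t expected = 0x41;
    size_t produced = sizeof out - outleft;
    /* Both the size and the bytes must match a 32-bit value in memory. */
    return produced == sizeof expected &&
           memcmp(out, &expected, sizeof expected) == 0;
}

/* Returns the first candidate usable for 32-bit conversion, or NULL. */
static const char *find_native_u32_encoding(void)
{
    /* Hypothetical candidate list, not the one vte uses. */
    static const char *const candidates[] = {
        "UCS-4LE", "UCS-4BE", "UTF-32LE", "UTF-32BE"
    };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (encodes_as_native_u32(candidates[i]))
            return candidates[i];
    }
    return NULL;
}
```

Checking the produced size as well as the bytes means a two-byte target such as UCS-2 is rejected cleanly instead of being compared against the first half of a wider buffer.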
Changing this bug to be AIX-specific. I am attaching a patch that ports vte to AIX in several ways:
- detect whether we need to use /dev/ptmx or /dev/ptc
- check whether we can include both utmp.h and utmpx.h
- remove the comma at the end of an enum (xlC compiler)
- replace g_iconv("UCS-4","UTF-8") with g_utf8_to_ucs4() and g_ucs4_to_utf8()
- remove the detection of a working charset name (ISO8859-1, ISO-8859-1, ...); this work must be done in glib/libcharset using charset.alias
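To show why a direct conversion avoids the charset-name problem, here is a small standalone UTF-8 decoder. It is not GLib's implementation of g_utf8_to_ucs4(), and the name utf8_to_u32() is hypothetical. It only illustrates that going from UTF-8 to 32-bit code points needs no iconv target name and always yields four-byte values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes `len` bytes of UTF-8 into 32-bit code points.
 * Returns the number of code points written, or (size_t)-1 if the input
 * is malformed or `out` (capacity `out_cap`) is too small. */
static size_t utf8_to_u32(const unsigned char *in, size_t len,
                          uint32_t *out, size_t out_cap)
{
    size_t n = 0, i = 0;
    assert(in != NULL && out != NULL);
    while (i < len) {
        unsigned char b = in[i];
        uint32_t cp, min;
        size_t extra;

        /* The lead byte gives the sequence length and the smallest
         * code point that legitimately needs that length. */
        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;

        if (extra > len - i - 1)
            return (size_t)-1;                  /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;              /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        if (n == out_cap)
            return (size_t)-1;
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Decoding the bytes 41 E2 82 AC F0 9F 98 80 gives the three code points U+0041, U+20AC and U+1F600, the last of which lies outside the BMP and could not be held in a two-byte unit.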
Created attachment 16488: AIX patch
Lowering pri on this to be in line with normal setting.
Nalin? Is this ok?
The patch doesn't apply any longer. Sorry for the delay in getting to this. Does anyone have an updated patch? It seems some parts of it have already been applied, though, since patch complained about previously applied hunks.
Like Kjartan said, parts of this are already committed. Since the patch is three years old now, I believe we can close this. I really doubt that vte doesn't compile on AIX these days.
Yes it compiles on AIX. Can close this, thanks. Jean-Pierre Dion