Bug 111925 – Bad unicode management

Summary:	Bad unicode management


Status:	RESOLVED OBSOLETE

Product:	vte
Classification:	Core
Component:	general
Version:	0.10.x
Hardware:	Other AIX

Importance:	High critical
Target Milestone:	---
Assigned To:	VTE Maintainers
QA Contact:	VTE Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2003-04-30 11:23 UTC by Laurent Vivier
Modified:	2006-04-12 11:07 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
AIX patch (59.89 KB, patch) 2003-05-13 09:34 UTC, Laurent Vivier	needs-work

Description Laurent Vivier 2003-04-30 11:23:44 UTC

When compiling vte, in doc/reference, the following error is generated:

*** Scanning header files ***
[...]
creating vte-scan

** ERROR **: Don't know how to read native-endian unicode data!
aborting...
Scan failed

The reason is in src/matcher.c:

_vte_matcher_find_valid_encoding() uses the charset "UNICODE" to encode
wide characters (g_iconv_open() works fine on AIX if charset.alias
contains "unicode UNICODE").

The following test fails:

    if (memcmp(outbuf, buffer, outbytes) == 0) {

because g_iconv() encodes each character in outbuf as two bytes, but buffer
is a gunichar array, and gunichar is guint32 (four bytes).
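
The size mismatch can be reproduced without GLib or iconv. The sketch below is
only an illustration, not vte's code: uint32_t stands in for gunichar, a
uint16_t array stands in for what a 2-byte "UNICODE" converter would write,
and the function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* The shape of the check in question: compare the converter's raw output
 * against the 32-bit input, over outbytes bytes only. */
int output_matches_buffer(const void *outbuf, size_t outbytes,
                          const uint32_t *buffer)
{
    return memcmp(outbuf, buffer, outbytes) == 0;
}

/* Simulated 2-byte converter output for "AB": the comparison fails. */
int two_byte_output_matches(void)
{
    const uint32_t buffer[2] = { 0x41, 0x42 };   /* stand-in for gunichar[] */
    const uint16_t outbuf[2] = { 0x41, 0x42 };   /* hypothetical UCS-2 output */
    return output_matches_buffer(outbuf, sizeof outbuf, buffer);
}

/* Simulated 4-byte native-endian output for "AB": the comparison passes. */
int four_byte_output_matches(void)
{
    const uint32_t buffer[2] = { 0x41, 0x42 };
    const uint32_t outbuf[2] = { 0x41, 0x42 };   /* hypothetical UCS-4 output */
    return output_matches_buffer(outbuf, sizeof outbuf, buffer);
}
```

Both arrays use native byte order, so the two-byte case fails on little-endian
and big-endian machines alike. Only the width differs.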

I think this algorithm is not an appropriate way to check whether an
encoding is valid, because it doesn't follow the Unicode specification.

Remember:

http://www.unicode.org/faq/basic_q.html#19

Q. I understand that all Unicode characters are 16 bits, and that the high
byte is used to switch between code blocks. Is that correct?

A. Absolutely not! Unicode characters may be encoded at any code point from
U+0000 to U+10FFFF. The size of the code unit used for expressing those
code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits
(for UTF-32) [See UTF & BOM]. Even when Unicode characters are expressed
with 16-bit code units, there is no concept of a high byte switching values
between "code pages" expressed in the low byte. The entire 16-bit value
expresses the entire character, period. [KW]

Comment 1 Laurent Vivier 2003-04-30 12:36:43 UTC

After more investigation, I think this problem can appear on all systems.

Another charset checked is ISO-10646, which exists in two encoding
forms: UCS-2 (2 bytes) and UCS-4 (4 bytes).

See http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

According to this document, "2. The structure of the coding space":

The third and fourth octets gives the row number and the cell number
of the character. Those characters that can be represented by the
2-octet form of UCS belong to plane 0 of group 0, which is called the
Basic Multilingual Plane, BMP. The 4-octet representation of a
character in the BMP is produced by putting two 0 octets before its
2-octet representation.

Still no characters have been allocated to positions outside the BMP,
and only the 2-octet form is used in practice.
-------------
(This is certainly why my AIX has UCS-2 and not UCS-4.)
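
The widening rule quoted above (two 0 octets in front of the 2-octet form) can
be sketched as follows. This is a minimal illustration with a made-up function
name. It assumes both buffers hold octets in big-endian order, as in the
group/plane/row/cell layout the document describes, and that the input contains
only BMP characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Widen n_chars BMP characters from the 2-octet form to the 4-octet form.
 * out must have room for 4 * n_chars octets. */
void ucs2_octets_to_ucs4_octets(const uint8_t *in, size_t n_chars,
                                uint8_t *out)
{
    size_t i;

    for (i = 0; i < n_chars; i++) {
        out[4 * i + 0] = 0;              /* group octet: 0 for the BMP */
        out[4 * i + 1] = 0;              /* plane octet: 0 for the BMP */
        out[4 * i + 2] = in[2 * i + 0];  /* row octet */
        out[4 * i + 3] = in[2 * i + 1];  /* cell octet */
    }
}
```

For example, the octets 20 AC (U+20AC) become 00 00 20 AC.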

Comment 2 Nalin Dahyabhai 2003-04-30 18:35:50 UTC

The terminal is looking for a g_iconv target name which it can use to
convert from multibyte encodings directly to gunichars and back.  If
there's a bug here, it's that the error message is potentially
misleading.  I expect that building glib with libiconv instead of the
OS-supplied version of iconv will provide the needed capability.

Changing the error message to "Don't know how to convert to/from
gunichar data!" is more accurate, so marking as fixed in CVS.

Comment 3 Laurent Vivier 2003-05-13 09:33:13 UTC

Changing this bug to be AIX-specific.

I am adding a patch that ports vte to AIX in several ways:

- detect if we need to use /dev/ptmx or /dev/ptc
- check if we can include both utmp.h and utmpx.h
- remove comma at end of enum (xlC compiler)
- replace g_iconv("UCS-4","UTF-8") with g_utf8_to_ucs4() and
  g_ucs4_to_utf8().
- remove detection of the right charset name (ISO8859-1, ISO-8859-1, ...);
  this work must be done in glib/libcharset using charset.alias
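
The idea behind the g_iconv replacement is that UTF-8 can be decoded to 32-bit
code points directly, without depending on which charset names the platform's
iconv knows. The sketch below is a hand-rolled stand-in written for this
illustration. It is neither GLib's g_utf8_to_ucs4() nor the code in the patch,
and the function names and the uint32_t output type are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence of at most len bytes starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 if the input is truncated or malformed. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length; anything
     * smaller is an overlong encoding. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80)                { n = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */
    if (n > len)
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_for_len[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                   /* overlong, out of range, or surrogate */
    *cp = c;
    return n;
}

/* Decode len bytes of UTF-8 into out, which holds up to out_cap code points.
 * Returns the number of code points written, or (size_t)-1 on malformed
 * input or insufficient space. */
size_t utf8_to_ucs4_sketch(const unsigned char *s, size_t len,
                           uint32_t *out, size_t out_cap)
{
    size_t written = 0;

    while (len > 0) {
        uint32_t cp;
        size_t used = utf8_decode_one(s, len, &cp);

        if (used == 0 || written == out_cap)
            return (size_t)-1;
        out[written++] = cp;
        s += used;
        len -= used;
    }
    return written;
}
```

Decoding the bytes 41 E2 82 AC F0 90 8D 88 yields the three code points
U+0041, U+20AC and U+10348. The last one lies outside the BMP, so a 2-byte
form could not represent it.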

Comment 4 Laurent Vivier 2003-05-13 09:34:40 UTC

Created attachment 16488
AIX patch

Comment 5 Kjartan Maraas 2003-10-30 22:49:22 UTC

Lowering pri on this to be in line with normal setting.

Comment 6 Kjartan Maraas 2004-10-18 09:36:09 UTC

Nalin? Is this ok?

Comment 7 Kjartan Maraas 2005-02-14 21:43:45 UTC

Doesn't apply any longer. Sorry for the delay in getting to this. Does anyone
have an updated patch for this? It seems some parts of this have already been
applied, though, since patch complained about previously applied hunks.

Comment 8 Behdad Esfahbod 2006-04-12 09:58:56 UTC

Like Kjartan said, parts of this are already committed.  Since the patch is three years old now, I believe we can close this.  I really doubt that vte doesn't compile on AIX these days.

Comment 9 Jean-Pierre Dion 2006-04-12 11:07:28 UTC

Yes it compiles on AIX. Can close this, thanks.

Jean-Pierre Dion