GNOME Bugzilla – Bug 105626
Add g_unichar_iswide_cjk()
Last modified: 2011-02-18 16:13:53 UTC
The function g_unichar_iswide() returns an incorrect value for some GBK (Simplified Chinese) punctuation characters. One example is the Unicode character 0x201c (whose GBK code is 0xa1b0). The bug causes incorrect cursor positioning in gnome-terminal on lines containing such characters. (Sorry, I forget whether I have already reported this bug here. I only remember that I reported it to bugzilla.redhat.com.)
Sorry, I made a mistake in my report. The GBK code of the character which produces the bug should be 0xb0a1 (i.e., 0xa1 0xb0).
Note that a number of Unicode characters have _ambiguous_ width. http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c (a newer version of what g_unichar_iswide() is based upon) has both wcwidth() and wcwidth_cjk(). I believe current gnome-terminal has its own wcwidth function and does tricks to base the width of these ambiguous characters on the source encoding.
I'm using the version of gnome-terminal from RH8.0. It is based on vte, which uses g_unichar_iswide() to calculate the cursor position. I don't know much about Unicode, but I think g_unichar_iswide() should take the current locale into account. My workaround for gnome-terminal is to replace the function g_unichar_iswide() with a simple function (in vte/src/vte.c) which calls iconv() to calculate the width. It works well; the only drawback is the speed.
Basing it off the locale wouldn't give the right results for gnome-terminal or other applications. Imagine a user running in an English locale who goes to a Chinese web page or reads a Chinese email, or vice versa.
GNU libc's wcwidth() implementation sets the widths of characters on a per-encoding basis, and assumes that the current codeset provides the correct widths. For example, in the ja_JP.UTF-8 locale, glibc treats ambiguous-width characters as single-width, but in ja_JP.eucJP, it treats them as double-width. Something along the lines of a g_unichar_is_ambiguously_wide() function would probably be more useful because it would allow applications to select the right value for these cases, including gnome-terminal which frequently deals with data which is encoded in a non-default encoding.
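To make the "select by encoding, not by locale" idea concrete, here is a rough sketch of how an application might decide the policy for ambiguous-width characters from the charset of the data it is displaying. The function name ambiguous_is_wide_for_charset() and the list of charset names are my own assumptions for illustration; they are not taken from glib, vte, or glibc.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive string equality, ASCII only (enough for charset names). */
static int charset_name_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Hypothetical policy helper: return 1 if ambiguous-width characters should
 * occupy two cells for data in the given charset, 0 otherwise.
 * ASSUMPTION: the list below is an illustrative guess at "legacy CJK
 * encodings"; it is not exhaustive and not taken from any library. */
static int ambiguous_is_wide_for_charset(const char *charset)
{
    static const char *const legacy_cjk[] = {
        "GB2312", "GBK", "GB18030", "BIG5", "EUC-JP", "SHIFT_JIS", "EUC-KR",
    };
    size_t i;

    if (charset == NULL)
        return 0;
    for (i = 0; i < sizeof legacy_cjk / sizeof legacy_cjk[0]; i++) {
        if (charset_name_equal(charset, legacy_cjk[i]))
            return 1;
    }
    return 0; /* UTF-8 and everything else: treat ambiguous as narrow */
}
```

With this, a terminal showing GBK data in an English locale would call ambiguous_is_wide_for_charset("GBK") and get 1, independent of the locale's own codeset.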
Created attachment 17082 [details] [review] patch to implement (slow) g_unichar_is_ambiguously_wide
*** Bug 338305 has been marked as a duplicate of this bug. ***
Created attachment 64371 [details] [review] Implement g_unichar_iswide_cjk() Copying table from Markus Kuhn and using bsearch(3). The data seems to be the same for Unicode 4.1 and 5.0.
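For readers who have not seen the approach, the following is a minimal sketch of a range-table lookup with bsearch(3). It is not the attached patch: the table is only a short excerpt of ambiguous-width ranges, and the names interval_compare() and is_ambiguous_sketch() are made up for this example.

```c
#include <stdlib.h>

/* Inclusive code point range [first, last]. */
struct interval {
    unsigned int first;
    unsigned int last;
};

/* Illustrative excerpt only: a few East Asian Ambiguous ranges, sorted by
 * code point. The full table in Markus Kuhn's wcwidth.c is much longer. */
static const struct interval ambiguous_excerpt[] = {
    { 0x00A1, 0x00A1 },
    { 0x00A4, 0x00A4 },
    { 0x00A7, 0x00A8 },
    { 0x2018, 0x2019 },
    { 0x201C, 0x201D },
    { 0x2020, 0x2022 },
};

/* bsearch() comparator: the key is a code point, the element is a range.
 * Returns 0 when the code point lies inside the range. */
static int interval_compare(const void *key, const void *elt)
{
    unsigned int ch = *(const unsigned int *)key;
    const struct interval *range = elt;

    if (ch < range->first)
        return -1;
    if (ch > range->last)
        return 1;
    return 0;
}

/* Hypothetical helper: nonzero if ch falls in one of the excerpted ranges. */
static int is_ambiguous_sketch(unsigned int ch)
{
    return bsearch(&ch, ambiguous_excerpt,
                   sizeof ambiguous_excerpt / sizeof ambiguous_excerpt[0],
                   sizeof ambiguous_excerpt[0],
                   interval_compare) != NULL;
}
```

Because single characters are stored as one-element ranges, one sorted table and one comparator cover both cases, which is why a separate table for single characters is optional.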
Looks fine to me in principle. How big is the table? It's probably not worth putting the single-character ranges in a separate table to save some space, or is it?
The table is between 1kb and 1.5kb. I don't think it's worth saving 500 bytes or so (at most) at the cost of having to generate the table ourselves. And I know I'm going to replace all of this within a year or two when I write a separate library for UCD...
Fine with me (keeping one table), but separate library == more dirty pages...
2006-04-27 Behdad Esfahbod <behdad@gnome.org> * docs/reference/glib/glib-sections.txt, * glib/gunicode.h glib/guniprop.c: Implement g_unichar_iswide_cjk(). (#105626)
We can start by including the generated sources in glib. The idea is to have a single library that everybody uses to access UCD, to not have to rerun a zillion different scripts in a million modules to update to the next UCD version...
Moving off API freeze milestone, since the API was added.