Bug 50633 – More native-encoded character sets and fonts support




Summary:	More native-encoded character sets and fonts support


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	0.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	1.0.0
Assigned To:	Owen Taylor
QA Contact:	Owen Taylor

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2001-02-06 23:41 UTC by Hidetoshi Tajima
Modified:	2004-12-22 21:47 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
fixs to support big5 encoding, comments to follow... (204.04 KB, patch) 2001-07-03 17:25 UTC, Patrick Sung
fix to add big5-0, jisx0201 and jisx0212 (314.39 KB, patch) 2001-08-06 23:23 UTC, Hidetoshi Tajima
conv_euc was fixed to handle jisx0208 case properly (314.34 KB, patch) 2001-08-06 23:51 UTC, Hidetoshi Tajima
new patch for jisx* and big5 - see the above comment (315.03 KB, patch) 2001-08-08 18:06 UTC, Hidetoshi Tajima
a new patch - okay to check-in? change for charset alias is not in yet (315.01 KB, patch) 2001-11-28 18:47 UTC, Hidetoshi Tajima

Description Hidetoshi Tajima 2001-02-06 23:41:21 UTC

Native-encoded fonts for the JISX0201/JISX0212/BIG-5/CNS11643-* character
sets should be supported in Pango, in addition to the iso10646-encoded fonts
that are currently used to display these characters.

Comment 1 Patrick Sung 2001-07-03 17:25:54 UTC

Created attachment 712
fixs to support big5 encoding, comments to follow...

Comment 2 Patrick Sung 2001-07-03 17:28:21 UTC

The above patch only adds big5 encoding support; I don't have fonts for
the other character sets, so I can't test those.

With this patch, pango will look for *-big5-0 fonts.  I know there are
fonts with *-big5.eten-0, but I think we should be vendor neutral, and
you can easily make an alias of *-big5-0 to *-big5.eten-0.

Comment 3 Hidetoshi Tajima 2001-08-06 23:23:30 UTC

Created attachment 859
fix to add big5-0, jisx0201 and jisx0212

Comment 4 Hidetoshi Tajima 2001-08-06 23:34:47 UTC

Created a patch to add jisx0212, jisx0201 and big5 to table-big.i
and to modify conv_euc for codeset 2 (0x8eXX) and codeset 3 (0x8fXXXX)
of Japanese EUC. Note that conv_big5() is the same as the old
conv_euc(). I'm not sure whether charset_orderings[] in table-big.i
is really proper, especially for 'zh-cn' and 'zh-tw'.

Listed below are the tables I copied from Unicode.org and used to
create the new tables for char_mask_maps[] and char_masks[]. I hope I
have not missed anything, but is there any way to check whether the
produced tables are correct?

 iso-8859-1   iso-8859-2   iso-8859-3   iso-8859-4   iso-8859-5
 iso-8859-6   iso-8859-7   iso-8859-8   iso-8859-9   iso-8859-10
 iso-8859-13  iso-8859-14  iso-8859-15  koi8-r       tis-620
 jis-0201     jis-0208     jis-0212     gb-2312      big5

There are two more tables we may add: CNS11643 (planes 1, 2 and 15
only) and 8859-16. Can I go ahead and add these two?

Comment 5 Hidetoshi Tajima 2001-08-06 23:45:02 UTC

I attached a wrong patch, will attach a correct one shortly.

Comment 6 Hidetoshi Tajima 2001-08-06 23:51:18 UTC

Created attachment 860
conv_euc was fixed to handle jisx0208 case properly

Comment 7 Hidetoshi Tajima 2001-08-07 16:27:21 UTC

Can we read external files like "etc/pango/pangox.charsets" and
"~/.pangox.charsets"  to get charset names and their orderings?
(To be locale specific, we can read etc/pango/ja/pangox.charsets
first).

The file could look as follows:

# File defining charsets of Pango X
#
# A charset has 5 fields:
#   charset-id  the-name-for-the-charset \ 
#   registry/encoding-fields-for-the-XLFD \
#   byte_len_per_char \
#   need_conversion(TRUE or FALSE)
#
# The byte_len_per_char and need_conversion fields are used to
# determine conversion_func:
#   1 and FALSE => conv_ucs4
#   1 and TRUE  => conv_8bit
#   2 and FALSE => conv_ucs4
#   2 and TRUE  => conv_euc
# conv_euc should be modified so that it chops off the leading
# single shift bytes 0x8e and 0x8f, and calculates the glyph index from
# the remaining 1 or 2 bytes (the number of remaining bytes should be
# the same as the byte_len_per_char field).
  { 0,  "ISO-8859-1",   "iso8859-1",       1 FALSE}
  { 1,  "ISO-8859-2",   "iso8859-2",       1 TRUE}
  { 2,  "ISO-8859-3",   "iso8859-3",       1 TRUE}
  { 3,  "ISO-8859-4",   "iso8859-4",       1 TRUE}
  { 4,  "ISO-8859-5",   "iso8859-5",       1 TRUE}
  { 5,  "ISO-8859-6",   "iso8859-6",       1 TRUE}
  { 6,  "ISO-8859-7",   "iso8859-7",       1 TRUE}
  { 7,  "ISO-8859-8",   "iso8859-8",       1 TRUE}
  { 8,  "ISO-8859-9",   "iso8859-9",       1 TRUE}
  { 9,  "ISO-8859-10",  "iso8859-10",      1 TRUE}
  { 10, "ISO-8859-13",  "iso8859-13",      1 TRUE}
  { 11, "ISO-8859-14",  "iso8859-14",      1 TRUE}
  { 12, "ISO-8859-15",  "iso8859-15",      1 TRUE}
  { 13, "KOI8-R",       "koi8-r",          1 TRUE}
  { 14, "TIS-620",      "tis620.2529-1",   1 TRUE}
  { 15, "EUC-JP",       "jisx0208.1983-0", 2 TRUE}
  { 16, "EUC-CN",       "gb2312.1980-0",   2 TRUE}
  { 17, "EUC-KR",       "ksc5601.1987-0",  2 TRUE}
  { 18, "EUC-JP",       "jisx0201.1976-0", 2 TRUE}  # chop-off 0x8e
  { 19, "EUC-JP",       "jisx0212.1990-0", 2 TRUE}  # chop-off 0x8f
  { 20, "BIG5",         "big5-0",          2 TRUE}
  { 20, "CNS11643",     "cns11643-1",      2 TRUE}  # chop-off 0x8fa1
  { 21, "ISO-10646",    "iso10646-1",      1 FALSE}
}

orderings {
  18, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19, 16,
  17, 20, 21
}
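
As a rough illustration of the proposal above, the following C sketch parses one charset line of the suggested form and applies the byte_len_per_char / need_conversion rule from the file's header comment. The struct CharsetEntry and the functions parse_charset_line() and pick_conversion_func() are hypothetical names made up for this sketch; they are not taken from Pango or from any of the attached patches, and the converter is returned as a name only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one line of the proposed file. */
typedef struct {
  int  id;
  char name[32];        /* charset name, e.g. "EUC-JP"                  */
  char xlfd[32];        /* registry-encoding, e.g. "jisx0208.1983-0"    */
  int  byte_len;        /* byte_len_per_char                            */
  bool need_conversion; /* TRUE or FALSE                                */
} CharsetEntry;

/* Parse a line such as:  { 15, "EUC-JP", "jisx0208.1983-0", 2 TRUE}
 * Returns false for comments and anything else that does not match. */
static bool
parse_charset_line (const char *line, CharsetEntry *out)
{
  char flag[6];
  int n = sscanf (line, " { %d , \"%31[^\"]\" , \"%31[^\"]\" , %d %5[A-Z]",
                  &out->id, out->name, out->xlfd, &out->byte_len, flag);

  if (n != 5)
    return false;
  if (strcmp (flag, "TRUE") == 0)
    out->need_conversion = true;
  else if (strcmp (flag, "FALSE") == 0)
    out->need_conversion = false;
  else
    return false;
  return true;
}

/* Apply the selection table from the file header.  Only the name of
 * the conversion function is returned; NULL means "not covered". */
static const char *
pick_conversion_func (const CharsetEntry *e)
{
  if (e->byte_len != 1 && e->byte_len != 2)
    return NULL;
  if (!e->need_conversion)
    return "conv_ucs4";
  return e->byte_len == 1 ? "conv_8bit" : "conv_euc";
}
```

Feeding the EUC-JP line for jisx0208.1983-0 through parse_charset_line() and then pick_conversion_func() yields "conv_euc", while the ISO-8859-1 line yields "conv_ucs4".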

Comment 8 Hidetoshi Tajima 2001-08-07 20:20:07 UTC

conv_euc should be modified as below for generic EUC.

static PangoGlyph
conv_euc (CharCache  *cache,
	  Charset     *charset,
	  const char *input)
{
  GIConv cd;
  char outbuf[4];

  const char *inptr = input;
  size_t inbytesleft;
  char *outptr = outbuf;
  size_t outbytesleft = 4;

  inbytesleft = g_utf8_next_char (input) - input;
  
  cd = find_converter (cache, charset);

  g_iconv (cd, (char **)&inptr, &inbytesleft, &outptr, &outbytesleft);

  /* outbytesleft is 4 minus the number of bytes iconv wrote */
  if ((guchar)outbuf[0] < 128)
    return outbuf[0];
  else if ((guchar)outbuf[0] == 0x8e || (guchar)outbuf[0] == 0x8f)
    {
      /* skip the leading single shift byte(s) */
      if (outbytesleft == 2)
        return ((guchar)outbuf[1]);
      else if (outbytesleft == 1)
        return ((guchar)outbuf[1] & 0x7f) * 256 + ((guchar)outbuf[2] & 0x7f);
      else if (outbytesleft == 0)
        return ((guchar)outbuf[2] & 0x7f) * 256 + ((guchar)outbuf[3] & 0x7f);
      else
        return 0;  /* also covers a failed conversion, not only outbytesleft == 3 */
    }
  else if (outbytesleft == 2)
    return ((guchar)outbuf[0] & 0x7f) * 256 + ((guchar)outbuf[1] & 0x7f);
  else
    return 0;
}
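
The byte handling can also be looked at in isolation. The sketch below is a standalone helper that takes the EUC-JP bytes of one character (as an iconv call would produce them) and computes the font index in the same way as the function above, without depending on GLib. The name eucjp_bytes_to_index() is hypothetical, and the explicit length argument is an assumption made for the sketch in place of outbytesleft.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Pango: map the EUC-JP bytes of one
 * character to a glyph index.  Returns 0 for unrecognized input. */
static unsigned int
eucjp_bytes_to_index (const unsigned char *buf, size_t len)
{
  /* code set 0: a single ASCII byte */
  if (len == 1 && buf[0] < 0x80)
    return buf[0];

  /* code set 2: single shift 0x8e followed by one byte (jisx0201 kana);
   * the byte is used as-is, matching the function above */
  if (len == 2 && buf[0] == 0x8e)
    return buf[1];

  /* code set 3: single shift 0x8f followed by two bytes (jisx0212) */
  if (len == 3 && buf[0] == 0x8f)
    return (buf[1] & 0x7fu) * 256u + (buf[2] & 0x7fu);

  /* code set 1: two bytes with the high bit set (jisx0208) */
  if (len == 2 && buf[0] >= 0xa1 && buf[1] >= 0xa1)
    return (buf[0] & 0x7fu) * 256u + (buf[1] & 0x7fu);

  return 0;
}
```

For example, the two bytes 0xa4 0xa2 map to 0x2422, and the three bytes 0x8f 0xb0 0xa1 map to 0x3021.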

Comment 9 Hidetoshi Tajima 2001-08-08 18:04:03 UTC

Created a new patch for jisx0201, jisx0212 and big5-0. This time,
conv_eucjp() is used for the Japanese character sets used in euc-jp,
and the other 2-byte character sets are handled in conv_16bit(),
which is the same as the old conv_euc. Let me know if I can commit
the patch.

cns11643-* is not included yet, and neither is the code to read an
external file to get the charset ordering. I'd implement this if I can
get your permission.

Comment 10 Hidetoshi Tajima 2001-08-08 18:06:52 UTC

Created attachment 866
new patch for jisx* and big5 - see the above comment

Comment 11 Hidetoshi Tajima 2001-10-12 17:46:08 UTC

I'd like to hear what the status of this bug is.
Are there any concerns or issues with the patch?
I'd appreciate it if you could let me know what I can do.

Comment 12 Owen Taylor 2001-11-02 22:21:02 UTC

The patch looks about right. A few questions:

 * Is it really right to make jisx0201 the preferred
   charset for ASCII? For comparison, the XFree86 Xlib
   config files do:
 
   fs0     {
        charset {
                name            ISO8859-1:GL
        }
        font    {
                primary         ISO8859-1:GL
                substitute      JISX0201.1976-0:GL
                vertical_rotate all
        }
   }

   Since jisx0201 is obsolete, it seems that one might
   expect the fonts to be unmaintained or of lower
   quality than iso8859* fonts.

   I assume that the high range of jisx0201 is only
   mapped to the FF80-FFEE compat range of Unicode
   and not to the normal katakana area?

 * big5-0 is not the standard encoding name for XFree86 
   anyways. On XFree86 the standard name is big5-eten.0.
   If it's different on Solaris, that would be an argument
   for making this configurable.

 * Is there some reason for using convert-eucjp for
   Big5-0? As I understand it, Big5 is always two-byte,
   just with extended two-byte ranges.

 * Why does conv_eucjp need to check the number of bytes
   in the output?

As for the character set configuration file, I guess I see
two points to it:

 - Preferences for preferred font encodings
 - Different names between systems

It looks to me like there is a bit too much information in
your file, however, since we can't actually add new
character sets -- the information in the big char_masks
array is fixed.

The most general thing that seems useful would be something
like a tab-separated file

iso8859-1        ISO-8859-1
big5.eten-0      BIG-5
...

That is, just a list of font names and charset names - with
the charset names being derived from the ENC_ISO_8859_1, etc
enumeration.

I suppose you could argue that the way of going from unicode
to font encoding might depend on the font as well as the
charset, and might involve a system-dependent iconv name,
but we are taking care of the system-dependent iconv name
at the GLib level, and is the other problem really a
concern?
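
A minimal sketch of reading one line of such a tab-separated file is shown here. The function name split_alias_line() and its buffer-based signature are assumptions made for illustration; nothing in this report specifies how the file would actually be read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "fontname<TAB>CHARSET" into two
 * NUL-terminated strings.  Returns false if there is no tab, if a
 * field is empty, or if a field does not fit into its buffer. */
static bool
split_alias_line (const char *line,
                  char *font, size_t font_size,
                  char *charset, size_t charset_size)
{
  const char *tab = strchr (line, '\t');
  const char *rest;
  size_t font_len, charset_len;

  if (tab == NULL || tab == line)
    return false;

  font_len = (size_t) (tab - line);
  rest = tab + strspn (tab, "\t");        /* skip one or more tabs      */
  charset_len = strcspn (rest, "\r\n");   /* drop the line terminator   */

  if (charset_len == 0 || font_len >= font_size || charset_len >= charset_size)
    return false;

  memcpy (font, line, font_len);
  font[font_len] = '\0';
  memcpy (charset, rest, charset_len);
  charset[charset_len] = '\0';
  return true;
}
```

Given the line "big5.eten-0<TAB>BIG-5", this fills font with "big5.eten-0" and charset with "BIG-5"; mapping the charset name onto the ENC_* enumeration is left out of the sketch.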

Comment 13 Hidetoshi Tajima 2001-11-02 23:05:04 UTC

Thanks for the detailed comments.

 * Why jisx0201?
    In the Solaris "ja" locale, jisx0201 is the primary font and
    iso8859-1 is secondary for cs0/fs0. That was the reason,
    but I agree with what you said. Let's stay with the
    current behavior.
 * Why "big5-0"?
    That's the charset name of the big5 fonts on Solaris, so
    we would need some configuration for it. Creating
    an alias file seems a good idea.
 * convert-eucjp for Big5-0
    That should be conv_16bit. My typo.
 * Why does conv_eucjp check the number of bytes?
    It is not really needed; just testing outbuf[0] is
    enough. Perhaps I did so to make sure that outbuf[1]
    and outbuf[2] have assigned values.

Regarding the configuration file, I like yours. It
is simple enough but will meet the purposes.

Can we allow it to be configured by users or vendors,
and also make it locale-specific? For instance, Solaris/Sun might
want to use JISX0201 first in Japanese locales
while ISO8859-1 is used in the rest of the locales, but
some users would like it the other way around.

Comment 14 Hidetoshi Tajima 2001-11-28 18:47:12 UTC

Created attachment 6105
a new patch - okay to check-in? change for charset alias is not in yet

Comment 15 Hidetoshi Tajima 2001-11-30 22:32:05 UTC

I have changed the bug priority from enhancement to normal.

Comment 16 Hidetoshi Tajima 2001-12-05 03:25:17 UTC

I'd appreciate it if you could take a look at the patch again and commit
it if it is okay. I'd really like to see this in the 1.3.11 tarball if possible.

Comment 17 Owen Taylor 2001-12-06 16:47:41 UTC

Please go ahead and check this in. Thinking about it some more,
for the big-5 problem I think we need to do something like:

"big-5,big5-eten-0"

We need a list of possible font names, because the configuration 
file won't be sufficient when we need to deal with remote displays from 
Linux to Solaris or vice versa.
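
One possible reading of such a list is "try each name in order and use the first one the display actually provides". The sketch below illustrates that reading only; the function pick_first_available() is a hypothetical name, and the array of available names stands in for whatever the X server reports.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: walk a comma-separated candidate list and copy
 * the first candidate that also appears in `available` into `out`.
 * Returns false if none matches or if the match does not fit. */
static bool
pick_first_available (const char *candidates,
                      const char *const *available, size_t n_available,
                      char *out, size_t out_size)
{
  const char *p = candidates;

  while (*p != '\0')
    {
      size_t len = strcspn (p, ",");   /* length of the current candidate */
      size_t i;

      for (i = 0; i < n_available; i++)
        if (strlen (available[i]) == len
            && strncmp (available[i], p, len) == 0
            && len < out_size)
          {
            memcpy (out, p, len);
            out[len] = '\0';
            return true;
          }

      p += len;
      if (*p == ',')
        p++;                           /* step over the separator */
    }
  return false;
}
```

With the candidate list "big5-0,big5.eten-0" and a display that only offers "iso8859-1" and "big5.eten-0", the helper selects "big5.eten-0".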

Comment 18 Hidetoshi Tajima 2001-12-06 17:59:52 UTC

Thanks, I'll check in the patch.

I'll look into bug 66174 today, and see if we can address the big-5
problem under it. Otherwise, I'll log a new bug or change 
the summary of this bug.

Comment 19 Owen Taylor 2002-01-31 19:11:28 UTC

We (Red Hat) seem to use big5-0 as the encoding for our Traditional
Chinese product, so maybe it is in fact the standard on Linux.

I don't think further enhancements here are needed before Pango-1.0.0.

I've filed bug 70196 to track the issue of making the mapping
more flexible, either with a config file or just allowing multiple
X names for a charset in a hardcoded fashion.

Comment 20 Abel Cheung 2002-05-03 04:14:12 UTC

The name "big5-eten-0" is not appropriate here, since "eten" is a
vendor's name for a certain obsolete Chinese display system.