GNOME Bugzilla – Bug 50633
More native-encoded character sets and fonts support
Last modified: 2004-12-22 21:47:04 UTC
Pango should support native-encoded fonts for the JISX0201, JISX0212, BIG-5 and CNS11643-* character sets, in addition to using iso10646-encoded fonts to display these characters.
Created attachment 712 [details] [review] fixes to support big5 encoding, comments to follow...
The above patch only adds big5 encoding support. I don't have fonts of the other types, so I can't test those. With this patch, Pango will look for *-big5-0 fonts. I know there are fonts named *-big5.eten-0, but I think we should be vendor neutral, and you can easily make an alias from *-big5-0 to *-big5.eten-0.
Created attachment 859 [details] [review] fix to add big5-0, jisx0201 and jisx0212
Created a patch to add jisx0212, jisx0201 and big5 to table-big.i and to modify conv_euc for codeset 2 (0x8eXX) and codeset 3 (0x8fXXXX) of Japanese EUC. Note that conv_big5() is the same as the old conv_euc(). I'm not sure whether charset_orderings[] in table-big is really correct, especially for 'zh-cn' and 'zh-tw'.

Listed below are the tables I copied from Unicode.org and used to create the new tables for char_mask_maps[] and char_masks[]. I hope I have not missed anything, but is there any way to check that the produced tables are correct?

iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-10, iso-8859-13, iso-8859-14, iso-8859-15
koi8-r, tis-620
jis-0201, jis-0208, jis-0212
big5, gb-2312

There are two more tables we may add: CNS11643 (planes 1, 2 and 15 only) and 8859-16. Can I go ahead and add these two?
I attached a wrong patch, will attach a correct one shortly.
Created attachment 860 [details] [review] conv_euc was fixed to handle jisx0208 case properly
Can we read external files such as "etc/pango/pangox.charsets" and "~/.pangox.charsets" to get the charset names and their orderings? (To be locale-specific, we could read etc/pango/ja/pangox.charsets first.) The file could look like this:

# File defining charsets of Pango X
#
# A charset has 5 fields:
#   charset-id the-name-for-the-charset \
#   registry/encoding-fields-for-the-XLFD \
#   byte_len_per_char \
#   need_conversion(TRUE or FALSE)
#
# The byte_len_per_char and need_conversion fields are used to
# determine conversion_func:
#   1 and FALSE => conv_ucs4
#   1 and TRUE  => conv_8bit
#   2 and FALSE => conv_ucs4
#   2 and TRUE  => conv_euc
# conv_euc should be modified so that it chops off the leading
# single shift bytes 0x8e and 0x8f, and calculates the glyph index from
# the remaining 1 or 2 bytes (the number of remaining bytes should be
# the same as the byte_len_per_char field).
{ 0, "ISO-8859-1", "iso8859-1", 1 FALSE}
{ 1, "ISO-8859-2", "iso8859-2", 1 TRUE}
{ 2, "ISO-8859-3", "iso8859-3", 1 TRUE}
{ 3, "ISO-8859-4", "iso8859-4", 1 TRUE}
{ 4, "ISO-8859-5", "iso8859-5", 1 TRUE}
{ 5, "ISO-8859-6", "iso8859-6", 1 TRUE}
{ 6, "ISO-8859-7", "iso8859-7", 1 TRUE}
{ 7, "ISO-8859-8", "iso8859-8", 1 TRUE}
{ 8, "ISO-8859-9", "iso8859-9", 1 TRUE}
{ 9, "ISO-8859-10", "iso8859-10", 1 TRUE}
{ 10, "ISO-8859-13", "iso8859-13", 1 TRUE}
{ 11, "ISO-8859-14", "iso8859-14", 1 TRUE}
{ 12, "ISO-8859-15", "iso8859-15", 1 TRUE}
{ 13, "KOI8-R", "koi8-r", 1 TRUE}
{ 14, "TIS-620", "tis620.2529-1", 1 TRUE}
{ 15, "EUC-JP", "jisx0208.1983-0", 2 TRUE}
{ 16, "EUC-CN", "gb2312.1980-0", 2 TRUE}
{ 17, "EUC-KR", "ksc5601.1987-0", 2 TRUE}
{ 18, "EUC-JP", "jisx0201.1976-0", 2 TRUE} # chop-off 0x8e
{ 19, "EUC-JP", "jisx0212.1990-0", 2 TRUE} # chop-off 0x8f
{ 20, "BIG5", "big5-0", 2 TRUE}
{ 20, "CNS11643", "cns11643-1", 2 TRUE} # chop-off 0x8fa1
{ 21, "ISO-10646", "iso10646-1", 1 FALSE}
}
orderings { 18, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19, 16, 17, 20, 21 }
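For illustration only, the rule in the file comments that maps byte_len_per_char and need_conversion to a conversion function could be sketched as below. The enum and the function name pick_conversion are hypothetical and are not part of Pango; only the four combinations listed in the proposal are taken from the comment above.

```c
/* Hypothetical sketch of the selection rule from the proposed file format.
 * ConvKind and pick_conversion are illustrative names, not Pango API. */
typedef enum
{
  CONV_UCS4,    /* no conversion needed: index the font by UCS-4 value */
  CONV_8BIT,    /* one byte per character, converted through iconv */
  CONV_EUC,     /* two bytes per character, converted through iconv */
  CONV_UNKNOWN  /* combination not covered by the proposal */
} ConvKind;

static ConvKind
pick_conversion (int byte_len_per_char, int need_conversion)
{
  /* The proposal only defines byte lengths of 1 and 2. */
  if (byte_len_per_char != 1 && byte_len_per_char != 2)
    return CONV_UNKNOWN;

  /* "1 and FALSE" and "2 and FALSE" both map to conv_ucs4. */
  if (!need_conversion)
    return CONV_UCS4;

  /* "1 and TRUE" => conv_8bit, "2 and TRUE" => conv_euc. */
  return byte_len_per_char == 1 ? CONV_8BIT : CONV_EUC;
}
```

Under this reading, the iso8859-1 and iso10646-1 entries (1 FALSE) would use conv_ucs4, and every 2 TRUE entry would go through conv_euc.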
conv_euc should be modified as below for generic EUC.

static PangoGlyph
conv_euc (CharCache *cache, Charset *charset, const char *input)
{
  GIConv cd;
  char outbuf[4];
  const char *inptr = input;
  size_t inbytesleft;
  char *outptr = outbuf;
  size_t outbytesleft = 4;

  inbytesleft = g_utf8_next_char (input) - input;
  cd = find_converter (cache, charset);
  g_iconv (cd, (char **)&inptr, &inbytesleft, &outptr, &outbytesleft);

  if ((guchar)outbuf[0] < 128)
    return outbuf[0];
  else if ((guchar)outbuf[0] == 0x8e || (guchar)outbuf[0] == 0x8f)
    {
      /* Skip the leading single-shift byte; for 4-byte output, skip the
       * second byte as well. */
      if (outbytesleft == 2)
        return ((guchar)outbuf[1]);
      else if (outbytesleft == 1)
        return ((guchar)outbuf[1] & 0x7f) * 256 + ((guchar)outbuf[2] & 0x7f);
      else if (outbytesleft == 0)
        return ((guchar)outbuf[2] & 0x7f) * 256 + ((guchar)outbuf[3] & 0x7f);
      else
        return 0; /* only the shift byte (or nothing usable) was produced */
    }
  else if (outbytesleft == 2)
    return ((guchar)outbuf[0] & 0x7f) * 256 + ((guchar)outbuf[1] & 0x7f);
  else
    return 0;
}
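As a standalone illustration of the byte arithmetic, here is a minimal sketch that works on an already converted EUC-JP byte sequence and does not depend on GLib or Pango. The function name eucjp_to_glyph_index is hypothetical, and the sketch assumes the caller passes exactly the bytes of one character. Unlike the function above, it does not handle a 4-byte sequence.

```c
#include <stddef.h>

/* Hypothetical helper, not Pango code: maps one EUC-JP encoded character
 * to the index used in the matching native-encoded X font.
 * Returns 0 when the byte count does not fit the leading byte. */
static unsigned int
eucjp_to_glyph_index (const unsigned char *buf, size_t len)
{
  if (len == 0)
    return 0;

  /* Code set 0: plain ASCII, a single byte below 0x80. */
  if (buf[0] < 0x80)
    return buf[0];

  /* Code set 2: single shift 0x8e followed by one JIS X 0201 kana byte.
   * The byte is returned unmasked, as in the conv_euc proposal. */
  if (buf[0] == 0x8e)
    return len == 2 ? buf[1] : 0;

  /* Code set 3: single shift 0x8f followed by two JIS X 0212 bytes. */
  if (buf[0] == 0x8f)
    return len == 3 ? ((buf[1] & 0x7fu) << 8) | (buf[2] & 0x7fu) : 0;

  /* Code set 1: two JIS X 0208 bytes with the high bits set. */
  return len == 2 ? ((buf[0] & 0x7fu) << 8) | (buf[1] & 0x7fu) : 0;
}
```

For example, the EUC-JP bytes 0xa4 0xa2 map to index 0x2422, while 0x8e 0xb1 maps to 0xb1 after the shift byte is dropped.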
Created a new patch for jisx0201, jisx0212 and big5-0. This time, conv_eucjp() is used for the Japanese character sets used in EUC-JP, and the other 2-byte character sets are handled in conv_16bit(), which is the same as the old conv_euc. Let me know if I can commit the patch. cns11643-* is not included yet, and neither is the code to read an external file to get the charset ordering. I'd implement this if I can get your permission.
Created attachment 866 [details] [review] new patch for jisx* and big5 - see the above comment
What is the status of this bug? Are there any concerns or issues with the patch? I'd appreciate it if you could let me know what I can do.
The patch looks about right. A few questions:

* Is it really right to make jisx0201 the preferred charset for ASCII? The XFree86 Xlib config files do, for comparison:

fs0 {
    charset {
        name ISO8859-1:GL
    }
    font {
        primary ISO8859-1:GL
        substitute JISX0201.1976-0:GL
        vertical_rotate all
    }
}

Since jisx0201 is obsolete, one might expect the fonts to be unmaintained or of lower quality than the iso8859* fonts. I assume that the high range of jisx0201 is only mapped to the FF80-FFEE compat range of Unicode and not to the normal katakana area?

* big5-0 is not the standard encoding name for XFree86 anyways. On XFree86 the standard name is big5-eten.0. If it's different on Solaris, that would be an argument for making this configurable.

* Is there some reason for using convert-eucjp for Big5-0? As I understand it, Big5 is always two-byte, just with extended two-byte ranges.

* Why does conv_eucjp need to check the number of bytes in the output?

As for the character set configuration file, I guess I see two points to it:

- Preferences for preferred font encodings
- Different names between systems

It looks to me like there is a bit too much information in your file, however, since we can't actually add new character sets: the information in the big char_masks array is fixed. The most general thing that seems useful would be something like a tab-separated file:

iso8859-1	ISO-8859-1
big5.eten-0	BIG-5
...

That is, just a list of font names and charset names, with the charset names being derived from the ENC_ISO_8859_1, etc. enumeration. I suppose you could argue that the way of going from Unicode to the font encoding might depend on the font as well as the charset, and might involve a system-dependent iconv name, but we are taking care of the system-dependent iconv name at the GLib level, and is the other problem really a concern?
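A minimal sketch of reading one line of such a tab-separated file is shown below. The function name parse_charset_line is hypothetical, and skipping blank lines and lines starting with '#' is an assumption carried over from the earlier file proposal, not something this comment specifies.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parser for one "font-name<TAB>charset-name" line.
 * Splits the line in place; on success *xlfd and *charset point into it.
 * Returns 1 for a valid entry, 0 for blank, comment or malformed lines. */
static int
parse_charset_line (char *line, char **xlfd, char **charset)
{
  char *tab;

  /* Strip the trailing line ending, if any. */
  line[strcspn (line, "\r\n")] = '\0';

  /* Assumption: empty lines and '#' comments are ignored. */
  if (line[0] == '\0' || line[0] == '#')
    return 0;

  /* Both fields must be present and non-empty. */
  tab = strchr (line, '\t');
  if (tab == NULL || tab == line || tab[1] == '\0')
    return 0;

  *tab = '\0';
  *xlfd = line;
  *charset = tab + 1;
  return 1;
}
```

A caller would feed each line from the file to this function and look up the returned charset name in the fixed enumeration, rejecting names it does not know.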
Thanks for the detailed comments.

* Why jisx0201? In the Solaris "ja" locale, jisx0201 is the primary font and iso8859-1 is the secondary one for cs0/fs0. That was the reason, but I agree with what you said. Let's stay with the current behavior.

* Why "big5-0"? That's the charset name of the big5 fonts on Solaris, so we would need some configuration for it. Creating an alias file seems a good idea.

* convert-eucjp for Big5-0: that should be conv_16bit. My typo.

* Why does conv_eucjp check the number of bytes? It is not really needed; just testing outbuf[0] is enough. I probably did so to make sure that outbuf[1] and outbuf[2] had been assigned values.

Regarding the configuration file, I like yours. It is simple but serves the purpose. Can we allow it to be configured by users or vendors, and also make it locale-specific? For instance, Solaris/Sun might want to use JISX0201 first in Japanese locales while ISO8859-1 is used in the other locales, but some users would prefer it the other way round.
Created attachment 6105 [details] [review] a new patch - okay to check-in? change for charset alias is not in yet
Changed the bug priority from enhancement to normal.
I'd appreciate it if you could take another look at the patch and commit it if it is okay. I'd really like to see this in the 1.3.11 tarball if possible.
Please go ahead and check this in. Thinking about it some more, for the big-5 problem I think we need to do something like: "big-5,big5-eten-0". We need a list of possible font names, because the configuration file won't be sufficient when we need to deal with remote displays from Linux to Solaris or vice versa.
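One way to picture the list-of-names idea: the charset carries a comma-separated list of acceptable registry-encoding names, and a font found on the display matches if its name appears anywhere in that list. The sketch below is hypothetical (name_list_contains is not a Pango function) and uses exact, case-sensitive comparison as a simplification.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` is one of the comma-separated
 * items in `list`, 0 otherwise. Comparison is exact and case-sensitive,
 * which is a simplification made for this sketch. */
static int
name_list_contains (const char *list, const char *name)
{
  size_t name_len = strlen (name);
  const char *p = list;

  if (name_len == 0)
    return 0;

  while (*p != '\0')
    {
      /* Length of the current item, up to the next comma or the end. */
      size_t item_len = strcspn (p, ",");

      if (item_len == name_len && strncmp (p, name, name_len) == 0)
        return 1;

      p += item_len;
      if (*p == ',')
        p++;
    }

  return 0;
}
```

With a list such as "big5-0,big5.eten-0", a font of either name would be accepted, regardless of which system's X server is providing the fonts.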
Thanks, I'll check in the patch. I'll look into 66174 today and see whether we can address the big-5 problem under it. Otherwise, I'll log a new bug or change the summary of this bug.
We (Red Hat) seem to use big5-0 as the encoding for our Traditional Chinese product, so maybe it is in fact the standard on Linux. I don't think further enhancements here are needed before Pango-1.0.0; I've filed bug 70196 to track the issue of making the mapping more flexible, either with a config file or just by allowing multiple X names for a charset in a hardcoded fashion.
The name "big5-eten-0" is not appropriate here, since "eten" is a vendor's name for an obsolete Chinese display system.