GNOME Bugzilla – Bug 95708
enhancing Hangul shaper (Xft) with Oxxx/Nxxx fonts
Last modified: 2005-08-15 01:36:54 UTC
This is spun off from bug 95569; the text below is copied and pasted from it.

-----------
Another possibility is to make Pango do what Yudit and Lambda do with the Ogulim/Obatang/Ogungseo fonts. These fonts are distributed with Korean MS Office 2000, and Ogulim is also available as the 'Old Korean support kit' at the MS web site. They do not have OT tables for Hangul Jamos. Ogulim has a set of glyphs for all known consonant and vowel clusters, which can be assembled to render a fairly generic sequence of Hangul Jamos.

There is another set of fonts in MS Word 2000 (Korean) and the Old Korean support kit, namely Ngulim/Nbatang/Ngungseo. They have precomposed glyphs for all known Hangul syllables (thousands of them) ever found in Korean literature. Producing the mapping from Hangul Jamo sequences to those precomposed syllables is tedious, but doable.

I'm wondering whether this font-specific 'hack' can be included in Pango. It is similar to the hack for KAIST/Iyagi BDF johab fonts. If there's a way to uniquely identify these fonts, I think it's possible, and it would dramatically improve Pango's ability to render Korean text in pre-1933 orthography until Korean OTFs with Hangul Jamo support are widely available.
-----------

In Unicode, a Hangul syllable S is defined as a sequence of Hangul leading consonants (L+), a sequence of Hangul vowels (V+) and an optional sequence of Hangul trailing consonants (T*), optionally followed by a tone mark (M?):

S := L+ V+ T* M?    (1)

Although each sequence can in theory be of any length, in practice we can define a Hangul syllable S as follows:

S := L{1,3} V{1,3} T{0,3} M?    (2)

where M is either U+302E or U+302F (Hangul tone mark).

Using Ogulim/Obatang/Ogungseo, it's possible to render about 500k syllables formed out of all known instances of Jamo clusters found in existing Korean literature. This is not fully generic and does not come close to covering even the limited definition given by (2). Nonetheless, it's a great step forward and should be sufficient for virtually all Korean linguists and the general public, except for a select few with the imagination and desire to come up with and use novel (hitherto unattested) Jamo clusters.

It does not seem hard to add this to pango/modules/hangul/hangul-xft.c. One question I have is whether it's all right to call pango_ft2_font_get_face() in hangul-xft.c (it is needed to find the name of the font face, which decides whether this font-specific hack applies). That would introduce a dependency on FT2 in what is otherwise Xft-only code.

FYI, Ogulim/Ngulim fonts are available at http://office.microsoft.com/korea/assistance/2000/weboldhg.aspx
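As an illustration of definition (2), here is a minimal sketch of a check that a run of code points is one syllable of the form L{1,3} V{1,3} T{0,3} M?. It is not taken from any patch in this bug. The function names are hypothetical, and the L/V/T ranges are an assumption that simply splits the whole U+1100 block at the filler positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed ranges: the U+1100 block split into leading consonants,
 * vowels and trailing consonants. These are illustrative only. */
static int is_l(uint32_t c) { return c >= 0x1100 && c <= 0x115F; }
static int is_v(uint32_t c) { return c >= 0x1160 && c <= 0x11A7; }
static int is_t(uint32_t c) { return c >= 0x11A8 && c <= 0x11FF; }
/* Hangul tone marks, as named in the comment above. */
static int is_m(uint32_t c) { return c == 0x302E || c == 0x302F; }

/* Count a run of code points satisfying pred, starting at text[*pos],
 * and advance *pos past the run. */
static size_t run_length(const uint32_t *text, size_t len, size_t *pos,
                         int (*pred)(uint32_t))
{
    size_t n = 0;
    while (*pos < len && pred(text[*pos])) {
        (*pos)++;
        n++;
    }
    return n;
}

/* Returns 1 if text[0..len) is exactly one syllable of form (2),
 * i.e. L{1,3} V{1,3} T{0,3} M?, and 0 otherwise. */
int is_practical_syllable(const uint32_t *text, size_t len)
{
    size_t pos = 0;
    size_t nl, nv, nt;

    assert(text != NULL || len == 0);

    nl = run_length(text, len, &pos, is_l);
    nv = run_length(text, len, &pos, is_v);
    nt = run_length(text, len, &pos, is_t);
    if (pos < len && is_m(text[pos]))
        pos++;                      /* optional tone mark */

    return nl >= 1 && nl <= 3 &&
           nv >= 1 && nv <= 3 &&
           nt <= 3 &&
           pos == len;              /* nothing may follow the syllable */
}
```

A sequence with four leading consonants, or one with no vowel, is rejected by this check even though definition (1) would accept the former.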
No, FT2 code is not available in Xft modules. pango_xft_lock_face() is pretty similar though.
Owen, thanks for the info on pango_xft_lock_face(). BTW, I put up the list of extra Jamos available in Oxxx.ttf at http://jshin.net/i18n/korean/jamos_ogulim.txt The list of precomposed syllables (pre-1933 orthography) in Nxxx.ttf is at http://jshin.net/i18n/korean/ngulim.html. (Obviously, Ngulim.ttf has to be installed to view it; Mozilla-Xft works fine in that case.)
Created attachment 11622 [details] [review] patch v1
This is still a work in progress; I just wanted to put it somewhere safer than my disk. Nonetheless, it works in the sense that:
- fonts with spacing jamo glyphs and fonts with combining jamo glyphs are distinguished, and jamo sequences are rendered accordingly
- basic jamo sequences are automatically 'normalized' to jamo clusters with code points of their own; the backing store remains intact
- Oxxx should work if I can find a work-around for the problem described below
- tone marks are handled (with fallback: still more work to do here)

I encountered a problem with Ogulim and fontconfig(?), though. Ogulim (Gulim Old Hangul Jamo) has a hack encoding, and even if I explicitly select it (in an application like gedit), fontconfig (this is only my suspicion) comes back with 'New Gulim'. Therefore, pango_xft_lock_face(font)->family_name is 'New Gulim'. Perhaps this requires the kind of hackery Owen mentioned in the Mozilla bug for Xft (http://bugzilla.mozilla.org/show_bug.cgi?id=126919#c87). gedit may not be the best application to test this. Is there any Gtk application that makes use of 'fontset' in PangoXft and the 'Pango coverage map'(?) so that I can test Oxxx rendering?

* Work to do:
- clean up and optimize
- add support for Nxxx
- figure out how to use the Oxxx code (already written)
- fix the 'spacing' problem with tone marks
Tone marks work well if a font (e.g. CODE2000) has (combining) glyphs for them. The glyph positioning problem arises when I have to resort to a fallback (':' and 'middle dot' for U+302E and U+302F).
Created attachment 11660 [details] [review] patch v2
Created attachment 11693 [details] [review] patch v3
Now everything I wanted to implement is in place, except that I have yet to complete the precomposed syllable look-up table for New Gulim. There are about 5000 precomposed syllables in the font, and the table included in the patch has about 250 (a 20th of the total). However, with them I was able to test my look-up routine, and it works as intended.

Work to do:
- More extensive testing
- Clean-up (including indentation changes) and reorganization (hangul-x.c may need some of the utility functions currently in hangul-xft.c)
- Implement a 'best-possible' rendering approach for long jamo clusters that cannot be rendered as syllables even with Ngulim and Ogulim. They must be 'new' syllables invented by creative minds :-)

As for using Ogulim despite its hack encoding, I'm pretty certain that fontconfig could make things really easy if only the 'charset' and 'lang' properties of a font were 'editable' in the configuration (fonts.conf), as is the case for other properties (e.g. family). Then upstream clients of fontconfig like Pango would have very little to do except for changes similar to what my patch does.
Currently, my patch does two extra things besides Nxxx/Oxxx support: adding Hangul tone mark support, and distinguishing fonts with combining glyphs from fonts with spacing glyphs for Hangul vowels and trailing consonants. I'll file two separate bugs for them if that's deemed necessary to expedite things. Hangul tone mark handling seems to be the easier target for this separation (I've just filed bug 96299 for it). It also involves somewhat complex handling of width/x-offset setting (not yet implemented in my patch), which makes the case for the separation even stronger.
In my patch, distinguishing fonts with combining glyphs from fonts with spacing glyphs for Hangul vowels and trailing consonants is done in the same routine, set_render_func(), that selects render_func based on the family name (for Nxxx and Oxxx). This makes it hard to separate the former from this bug. Therefore, I prefer to leave it as it is for now. Later, when my patch is committed, we can file another bug about selecting the rendering function in both hangul-x and hangul-xft, because it seems that what's done in the Thai shaper has to be done here (caching per font, defining a type or two for Hangul font type and charset, using GQuark, etc.). For the record, I filed bug 96300 for this issue. As bug 96299 is spun off from this one for Hangul tone mark handling, I'll upload a new patch without Hangul tone mark handling here soon.
Created attachment 11706 [details] [review] a new patch (tone mark handling routine removed)
Attachment 11706 does not include the tables-ext-jamos.i and tables-jamos2.i files because they haven't changed since patch v3.

The Hangul tone mark handling routine (render_tone()) was separated out to bug 96299. Beginning with attachment 11706, patches uploaded here include the calls to that function from the various render_func's, but render_tone() itself won't be included. Besides, my patches (from now on) are against my personal tree with the latest patch for bug 96299 applied. That is, patches here are produced as if bug 96299 were resolved with my latest patch.

Now, let me break down my patch and explain what each part does:

* a new type __jamo_norm_map, defined in hangul-defs.h
  - used by jamo_srch_repl(), which is invoked by normalize_jamo() and og_transform()
  - also used for __u1100_jamo_clusters[] in tables-jamos2.i and __ext_xx_clusters[] (where xx is lc|vo|tc) in tables-ext-jamos.i
  - holds a mapping from a sequence of up to 3 (MAX_BASIC_JAMOS) basic Jamos to a cluster jamo. The definition of 'basic' jamo is a bit fluid: it means any jamo that can be regarded as a subcomponent of a cluster jamo.

* tables-jamos2.i: included by hangul-xft.c and used in normalize_jamo()
  - has __u1100_jamo_clusters[]. This array is generated automatically from the compatibility decomposition mappings in the Unicode 2.0 data file.

* tables-ext-jamos.i: included by hangul-xft.c and used in og_transform()
  - includes various #define's for Oxxx and Nxxx fonts (OG_*'s and NG_*'s; OG and NG stand for Ogulim and Ngulim, respectively)
  - __oj_to_ns is typedef'ed as __jamo_norm_map. A new name is used because the semantics are different: this type holds a mapping from a sequence of Oxxx Jamos (an extended Jamo set including jamos not given code points of their own in the U+1100 block) to a precomposed syllable glyph position in Nxxx (Ngulim) fonts. Thus the name: oj_to_ns stands for 'Ogulim Jamo to Ngulim Syllable'.
  - three arrays __ext_xx_clusters[] (xx: lc, vo, tc) hold the mappings from basic jamo sequences to OG cluster jamos used in og_transform(). Copied from my implementation (extending CHO Jin-Hwan's implementation) in Lambda and Yudit.
  - __ogulim_xx_gidx (xx = lc, vo, tc): mappings from (temporary) extended Jamo code points to 'glyph code points' in Oxxx fonts.
  - __ogulim_....map (4 of them): Oxxx fonts have 6 glyphs for each LC, 2 glyphs for each VO and 4 glyphs for each TC. Which of these glyphs to use when forming a syllable depends on whether the syllable has a TC and what kind of vowel it uses (horizontal, vertical, or both). These 4 arrays hold the mappings used to select a glyph based on those factors. Worked out manually by CHO Jin-Hwan and extended by me to support extended Jamos.
  - __og_jamos_to_ng_syllable[]: an array of type __oj_to_ns; a mapping table from a sequence of OG Jamos to an NG precomposed syllable. Nxxx (Ngulim) fonts have about 5000 precomposed syllables in the PUA. All those syllables are formed out of OG-extended Jamos.

* functions in hangul-xft.c
  - jamo_srch_repl(__jamo_norm_map *cluster, gunichar *in, int *len): searches for cluster->seq in 'in' and replaces it with cluster->liga in place. Returns the difference in length between before and after the replacement. Called by normalize_jamo(), og_transform() and render_...with_ngulim().
  - gunichar* normalize_jamo(const gunichar* in, int *len):
    1. Normalizes (regularizes) a jamo sequence to put it in the regular syllable form defined in Unicode 3.2 section 3.11, to the extent that it's useful for rendering by the render_func's.
    2. Replaces a compatibility-decomposed Jamo sequence (Unicode 2.0 definition) with a 'precomposed' Jamo cluster (with a code point of its own in the U+1100 block). For instance, the sequence U+1100, U+1100 is replaced by U+1101. It actually does more than the Unicode 2.0 decomposition map suggests. For a Jamo cluster made up of three basic Jamos (e.g. U+1133: Sios, Pieup, Kiyeok), not only the sequence Sios (U+1109), Pieup (U+1107), Kiyeok (U+1100) but also two more sequences, {U+1132 (Sios-Pieup), U+1100 (Kiyeok)} and {U+1109 (Sios), U+111E (Pieup-Kiyeok)}, are mapped to U+1133.
    3. The result is returned in a newly allocated (g_new'd) gunichar*. The calling function has to g_free it.
  - typedef void (* RenderSyllableFunc): same usage as in hangul-x.c. All render_syllable_xxx functions are of this type and are used in set_render_func().
  - render_as_precomp_syllable(): invoked by render_syllable_with_(combining|spacing|ngulim) when a Jamo sequence can be converted to a precomposed syllable in the U+AC00 block and the font has a glyph for it.
  - render_syllable_base(): takes one additional argument to distinguish between a font with spacing jamo glyphs and a font with combining jamo glyphs. As discussed in bug 95569, when combining glyphs for simple overstriking are available in a font, the best-possible-effort approach may lead to an undesirable result, so the two cases are treated differently. As in all other render_syllable_xxx()'s, the tone mark is processed first and normalize_jamo() is invoked before processing further. I won't mention this for the other render_syllable_xxx()'s.
  - render_syllable_with_(combining|spacing): just calls render_syllable_base() with the last argument set depending on the type of font.
  - static void og_transform (gunichar *text, int *length):
    1. Shifts jamo sequences to three disjoint code blocks in the PUA (0xF000 for LC, 0xF100 for VO, 0xF200 for TC).
    2. Replaces a jamo sequence with a precomposed OG-extended cluster jamo code point in the PUA.
    3. This replacement is done in place.
  - render_syllable_with_ogulim():
    1. og_transform() the jamo sequence.
    2. If it is renderable with OG-extended Jamo glyphs, render it using the various mapping tables defined in tables-ext-jamos.i.
    3. Otherwise, render it by enumerating the glyphs for the jamos in the sequence. V's and T's are preceded by Lf to advance the cursor position, because the glyphs for V and T in Oxxx fonts are non-spacing.
  - oj_ns_comp(): a comparison function used to bsearch() for an OG-extended Jamo sequence in the __og_jamos_to_ng_syllable[] array.
  - render_syllable_with_ngulim():
    1. After the processing common to all render_syllable_xxx()'s, try to render the sequence with a precomposed syllable in the U+AC00 block.
    2. og_transform() it.
    3. If the result is not a sequence that can form a syllable with two or three OG-extended jamos, go to the fallback (#6).
    4. bsearch() for the OG-extended Jamo sequence in __og_jamos_to_ng_syllable[]. If a match is found, use that precomposed syllable glyph.
    5. If not, render the sequence as a sequence of OG-extended jamo glyphs, which are designed so that simple overstriking results in a syllable glyph.
    6. As a fallback, enumerate stand-alone jamos. As in render_syllable_with_ogulim(), Lf is put before V and T to advance the 'cursor', because V and T glyphs are not spacing in Nxxx fonts.
  - set_render_func(PangoFont *font, RenderSyllableFunc *render_func):
    1. Use the FT_Face to find the family name of the font.
    2. Set render_func based on the family name first.
    3. Otherwise, inspect the glyph for U+1161 (vowel A) to see whether it is spacing or combining, and set render_func accordingly.
  - hangul_engine_shape():
    1. set_render_func() is called.
    2. render_func() is invoked instead of render_syllable().

I hope this explanation helps in understanding my patch and expedites committing it. Comments are all welcome.
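The search-and-replace step behind normalize_jamo() can be sketched as follows. This is not the code from the patch: JamoNormMap, jamo_search_replace() and normalize_jamo_sample() are hypothetical stand-ins for __jamo_norm_map, jamo_srch_repl() and normalize_jamo(), the table holds only the four mappings quoted in this comment, uint32_t replaces gunichar so that the block builds without GLib, and the sample works in place instead of returning a newly allocated buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BASIC_JAMOS 3

/* Hypothetical stand-in for __jamo_norm_map: up to three jamos
 * (zero-padded) and the cluster jamo that replaces them. */
typedef struct {
    uint32_t seq[MAX_BASIC_JAMOS];
    uint32_t liga;
} JamoNormMap;

/* Only the mappings mentioned in the comment above; the real tables
 * are generated from Unicode data and are much larger. */
static const JamoNormMap sample_clusters[] = {
    { { 0x1109, 0x1107, 0x1100 }, 0x1133 }, /* Sios, Pieup, Kiyeok */
    { { 0x1132, 0x1100, 0 },      0x1133 }, /* Sios-Pieup, Kiyeok  */
    { { 0x1109, 0x111E, 0 },      0x1133 }, /* Sios, Pieup-Kiyeok  */
    { { 0x1100, 0x1100, 0 },      0x1101 }, /* Kiyeok, Kiyeok      */
};

/* Replace every occurrence of m->seq in text[0..*len) with m->liga,
 * in place. Returns the number of code points the text shrank by. */
int jamo_search_replace(const JamoNormMap *m, uint32_t *text, int *len)
{
    int n = 0, i = 0, removed = 0;

    while (n < MAX_BASIC_JAMOS && m->seq[n] != 0)
        n++;
    assert(n >= 1);

    while (i + n <= *len) {
        if (memcmp(text + i, m->seq, (size_t)n * sizeof(uint32_t)) == 0) {
            text[i] = m->liga;
            /* Close the gap left by the n - 1 consumed code points. */
            memmove(text + i + 1, text + i + n,
                    (size_t)(*len - i - n) * sizeof(uint32_t));
            *len -= n - 1;
            removed += n - 1;
        }
        i++;
    }
    return removed;
}

/* Apply every table entry once, longest sequences first.
 * Returns the new length of text. */
int normalize_jamo_sample(uint32_t *text, int len)
{
    size_t k;
    for (k = 0; k < sizeof sample_clusters / sizeof sample_clusters[0]; k++)
        jamo_search_replace(&sample_clusters[k], text, &len);
    return len;
}
```

With this table, each of the three spellings of U+1133 collapses to the single cluster jamo, so a renderer only has to look up one code point per cluster.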
Created attachment 11728 [details] [review] a new patch(normalization routine put in a separate file)
Created attachment 11729 [details] hangul-utils.c (a new file) for jamo normalization
Created attachment 11730 [details] a new file(hangul-utils.h) for jamo normalization
Created attachment 11731 [details] tables-ext-jamos.i (a new file) for Oxxx/Nxxx mapping
Created attachment 11732 [details] tables-jamos2.i (a new file) for Jamo normalization
In the latest patch (one patch against HEAD, 4 new files), the dependence on bug 96299 is completely gone. This patch can go in independently of bug 96299. Jamo normalization related routines are put in two new files (hangul-utils.c and hangul-utils.h. I'm open to a suggestion for a better name if any) in the expectation of this routine being used by hangul-x.c as well in the future (see bug 96314). It's to be noted that I can't deal with Jamo-normalization in a new bug independent of this one because og_transform in hangul-xft.c shares a function (jamo_srch_repl()) and data structure (__jamo_norm_map) with normalize_jamo().
Created attachment 11755 [details] [review] a new patch(following gnome coding-style convention)
Created attachment 11756 [details] hangul-utils.h(new : variable/func/type name change)
Created attachment 11757 [details] hangul-utils.c (new: gnome convention)
Created attachment 11787 [details] [review] a new patch(use bsearch instead of lin. search reducing function calls by 30 ~ 100)
Created attachment 11788 [details] hangul-utils.h(new : modified for bsearch)
Created attachment 11789 [details] hangul-utils.c(new : modified for bsearch)
Created attachment 11790 [details] tables-ext-jamos.i (new: modified for bsearch : sorted)
Created attachment 11791 [details] tables-jamos2.i(new: modified for bsearch , sorted)
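For readers following the switch from linear search to bsearch(), the idea can be sketched as below: the table is kept sorted by jamo sequence, and a comparison function in the role of oj_ns_comp() orders keys lexicographically. ClusterEntry, cluster_cmp() and lookup_cluster() are hypothetical names, and the four entries are the sample mappings quoted earlier in this bug, not the contents of the attached tables.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_BASIC_JAMOS 3

typedef struct {
    uint32_t seq[MAX_BASIC_JAMOS];  /* key: jamo sequence, zero-padded */
    uint32_t liga;                  /* value: cluster jamo */
} ClusterEntry;

/* Must stay sorted by seq in lexicographic order, otherwise
 * bsearch() gives undefined results. */
static const ClusterEntry sorted_clusters[] = {
    { { 0x1100, 0x1100, 0 },      0x1101 },
    { { 0x1109, 0x1107, 0x1100 }, 0x1133 },
    { { 0x1109, 0x111E, 0 },      0x1133 },
    { { 0x1132, 0x1100, 0 },      0x1133 },
};

/* Compare a zero-padded key against one table entry. */
static int cluster_cmp(const void *key, const void *elem)
{
    const uint32_t *k = key;
    const ClusterEntry *e = elem;
    int i;

    for (i = 0; i < MAX_BASIC_JAMOS; i++) {
        if (k[i] != e->seq[i])
            return k[i] < e->seq[i] ? -1 : 1;
    }
    return 0;
}

/* Returns the cluster jamo for jamos[0..n), or 0 if there is none. */
uint32_t lookup_cluster(const uint32_t *jamos, int n)
{
    uint32_t key[MAX_BASIC_JAMOS] = { 0, 0, 0 };
    const ClusterEntry *hit;
    int i;

    assert(n >= 1 && n <= MAX_BASIC_JAMOS);
    for (i = 0; i < n; i++)
        key[i] = jamos[i];

    hit = bsearch(key, sorted_clusters,
                  sizeof sorted_clusters / sizeof sorted_clusters[0],
                  sizeof sorted_clusters[0], cluster_cmp);
    return hit ? hit->liga : 0;
}
```

The cost is that the generated tables have to be emitted in sorted order, which is presumably why the two table files were re-uploaded as "sorted".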
- /* Well, no unicode rendering engine could render Hangul Jamo area
-    _exactly_, I sure. */
+ /* XXX : If font is Oxxx or Nxxx, set to PANGO_COVERAGE_EXACT for U+1100 Jamos */

You can do that if the Oxxx/Nxxx code shows "U+1100 U+1100 U+1100 ... (100 times) U+1161 U+1161 ... (10000 times)" as a reasonable syllable form. As it stands, it's still PANGO_COVERAGE_FALLBACK.
Am I supposed to respond to your last comment? Did you want me to? How about these lines in hangul-x.c?

-----------
  else if (render_func == render_syllable_with_ksx1001johab)
    {
      for (i = 0x1100; i <= 0x11ff; i++)
        pango_coverage_set (result, i, PANGO_COVERAGE_EXACT);
----------------

EXACT may well be a bit of an overstatement for Oxxx/Nxxx style fonts for an example like yours, and APPROXIMATE may be about right. On the other hand, in light of the above and other similar coverage settings in hangul-x.c, EXACT is not much of an overstatement. FALLBACK is clearly an understatement.

It also has to be noted that Oxxx/Nxxx style fonts cover over one and a half million syllables (all the syllable combinations that can be composed out of all the known consonants and vowels in every single book published since 1443. Of course, there may still be some omissions, and some creative, or not so creative, minds like mine can come up with new vowel clusters and consonant clusters at any time.) Anyway, they have the best coverage and can be a good model for developing new fonts with better support for Hangul Jamos.
That also should be fixed. The ksc5601.1992-3 stuff was not written by me. Originally it's from Sun Microsystems. I did not notice the lines when applying the patch from Sun.
Hmm.. I don't recall writing that you're responsible for that in hangul-x.c. I didn't even imply it (because of Owen's comment around that part of the file), although you apparently thought so. Anyway, why don't you just tell me what the exact graphical form (as mentioned in the Pango coverage level doc) is for your example (100 U+1100's followed by 10,000 U+1161's)? While you're at it, could you tell me what the exact graphical form is for U+0041 followed by 100 combining diacritic marks for the Latin alphabet, in turn followed by a combining enclosing circle? Can Pango claim 'EXACT' coverage for U+0041 and combining diacritic marks? If it currently does, does the level have to be degraded for them because Pango can only stack up to, say, 3 diacritics over U+0041?
Moving bugs to new hangul component
Jungshik, could you make a *single* patch against current CVS HEAD? I could not build pango with your patches. BTW, Because of the End User License Agreement of the Microsoft Nxx/Oxx fonts (which are permitted to be used only with Windows OS), I will just commit the patches without testing.
Created attachment 12947 [details] [review] all in one patch
Changwoo, can you take a look and apply the patch? There are a couple of enhancements I can make, but they can be put off until 1.2.1 or later, I think. (BTW, because I don't have CVS write access here, new files are diffed against /dev/null without the RCS/Index heading, but that shouldn't be a problem when you apply it.)
I'm really not comfortable adding this patch at this point. It's 600 new lines of code (ignoring the tables). I think we're best off saving this for Pango-1.3.x, when we can get some testing before the final release. If it is working well in the 1.3.x branch, perhaps we can consider a backport to the stable 1.2.x series.
I can understand that you don't feel very comfortable committing this long patch shortly before the release, but the length of the patch should not scare you much. Most of the patch adds new features (old Korean text rendering). They wouldn't affect users who don't use them, because existing features and functionality are affected little, if at all, by the change. Functions are shuffled around in hangul-xft.c to add the new features (which makes the patch long), but existing features are well preserved. I've been using Pango with the patch for the last month and a half and it has worked rather well for me. Besides, basically the same implementation has been tested in three other programs (Omega/Lambda, Yudit and Mozilla). This doesn't guarantee that there's no bug, but at least I can tell you that there's no known regression or major bug. I feel rather strongly about implementing this, so it'd be nice if you could consider this issue one more time. Thanks.
It's sometimes possible that even if a patch is working well for someone who has everything configured right on the system, it might cause some unexpected side effect for people without a good Korean configuration. We've actually seen crashes like this earlier. As I understand it, this is a fairly specialized addition; I'm sure that it's very important for some users, but maybe not for the typical Korean user? I suspect that the average Korean user is probably more concerned by the fact that the delete key deletes an entire syllable than by the missing support for these fonts... Also, from cwryu's comments, apparently fonts that can be used with this patch and Linux aren't widely available. Given that, I just don't see it as worth the risk of adding this code right before the freeze, and without testing by a wider group of people. That's not to say I don't want this added; I just don't want this added right now.
Created attachment 19543 [details] [review] a new patch against the head
Attachment 19543 does not include the new files; it has only diffs against files in CVS. The new files (e.g. tables-jamos2.i) haven't changed since last December.
Created attachment 19546 [details] [review] updated patch (using smaller and revised arrays for cluster jamo mapping)
Created attachment 19547 [details] [review] new files (hangul-utils.*, tables-ext-jamos.i, tables-jamos2.i)
In the latest two attachments (attachment 19546 and attachment 19547), I back-ported changes I made in Mozilla (http://bugzilla.mozilla.org/show_bug.cgi?id=176315). They include fixes for some mistakes in the mapping tables and reductions in the size of some arrays. Otherwise, they're identical to the previous patch.
If you can find someone else to review this patch, I'm OK with it going in, though it does seem like a lot of code to add for fonts that are (?) only available as part of MS Office.
The fonts are available to everyone for download (e.g. http://www.korean.go.kr has a link to them for old Hangul display), except that in _some_ countries the EULA binds you not to use them on platforms other than Windows. In other countries, it doesn't. For instance, in Germany it appears that the EULA is not effective, but I'm not a lawyer and I don't feel 'comfortable' either. Anyway, this patch includes a lot of stuff that can be made use of for other Hangul fonts.

As for a reviewer, I can't think of anyone other than cwryu and noah. cwryu seems very busy, but it'd be nice if he could review it (he sort of did last year (see his comment: 2002-12-09 [1]), although the patch had to be modified to fit the new Pango framework(?)). noah, would you take a look? It's large, but the principle is rather simple. Besides, basically the same patch has been in Mozilla since 1.4(?) (except that the Mozilla patch is for a different font) and in Yudit. Actually, I may find someone from the Korean Linux community (e.g. Choi Hwanjin, who wrote the gtk2 input modules for Korean) if necessary.

[1] gnome-bugzilla has to be upgraded for easier reference to comments (like comment #15)
Hi Jungshik, could you update the patch so it applies to current HEAD?
Hi Noah, I'll do next week. Would it work?
Created attachment 22104 [details] [review] a new patch (updated against the trunk)
Hi Noah, I'm sorry it took longer than I told you. Just attached is a patch against HEAD. The new files (uploaded on Aug 27) can be used as they are. It'd be great if you could take a look and commit as appropriate.
Adding the PATCH keyword and marking the priority level to high.
I don't understand why we should make an effort to support these fonts. FYI, below is the EULA of the Oxxx/Nxxx fonts. It is NOT legal to use these fonts in non-Windows environments. Nobody can use these fonts anyway. Nobody can even test this patch without violating this damn EULA. OK, of course some free alternative fonts could use this encoding scheme in the future, but I haven't heard any news about such development.

-------------------
MICROSOFT Old Hangul Support Pack
ADDENDUM TO END USER LICENSE AGREEMENT FOR MICROSOFT PRODUCT ("EULA")

The Old Hangul support package you have installed or downloaded ("Language Support Software") enables you to use the versions of Microsoft products identified as eligible for the Old Hangul Support Software (SOFTWARE PRODUCT) to view, input, manipulate or otherwise make use of information presented in Old Hangul. You may install and use one copy of the Old Hangul Support Pack solely as an integrated component of a validly licensed copy of the SOFTWARE PRODUCT and Windows 95 or Windows NT 4.0 or later versions thereof. Your use of the Old Hangul Support Software is governed by this Addendum and the End User License Agreement applicable to the SOFTWARE PRODUCT.
I don't think we should apply this unless people are making free versions of fonts with these encodings. And since Microsoft's direction in this area is OpenType fonts (http://www.microsoft.com/typography/otfntdev/hangulot/default.htm) I don't expect people to make fonts to match what they were doing in the past.