GNOME Bugzilla – Bug 111166
Tamil: Left matra placement
Last modified: 2003-07-25 16:12:44 UTC
Tamil Unicode rendering is broken in the current Pango library. This makes all GTK+ 2 based applications unusable if localised to Tamil Unicode. The screenshot at http://tamil-Mandrake.sourceforge.net/screenshots/pango.jpg shows the problem in current Pango. With more applications like Mozilla, OpenOffice and others depending on Pango for rendering, it is very important to fix this ASAP.
To describe the problem in more detail: in Tamil (and Malayalam), when you have a vowel with a left part, that part should go to the left of the base glyph rather than to the left of the entire syllable. This is a known issue. Eric Mader has plans to work on this, but won't have time to do so for another month or two. If you want it fixed before then, you might want to contact him (mader@jtcsv.com) and ask him what is involved. [ Neither Mozilla nor OpenOffice uses Pango currently, and I don't know of plans to do so for either ]
Bug was first reported more than 12 months ago to Indic module author. (I think when RedHat 7.3 came out). Thanks
Very unlikely that *this* particular bug was reported that long ago, since the current OpenType Indic support was only released as part of Pango-1.2, which came out a couple of months ago :-) [ The older bitmap font support for Tamil could have had a similar bug, of course ]
Yes. You may be right. But the rendering issue has been around for some time. It was not there when Vikram Subramanian introduced the first Tamil module for Pango (way back when). The image at the link above (provided by Prabu) was generated on RedHat 7.3 (when it had just been released). Thanks. -D
Well, Red Hat 7.3 had *no* capability to render antialiased Tamil. So, if they were rendered on Red Hat 7.3, the software must have been installed separately (Pango CVS?) Still, I don't think debating when the problem first appeared is really at all important here :-)
Created attachment 16244: Fix left matra placement bug for Tamil in indic engine
This is a preliminary fix. It is against the pango-1.1.1 rpm for RH8.0, but I did verify it against the current CVS version and found the affected code to be the same. I have only tested it for Tamil rendering. The changes may possibly affect other scripts. This fix will not correct the problem for Malayalam. I will consult with some Malayalam friends and get another patch later.
I discussed the patch above with Eric Mader some. Fundamentally, the patch is not in the right place: it changes the ordering of the glyphs *before* the GSUB stage rather than after it, so conjunct consonants won't be formed properly. Since Tamil has very few conjunct consonants, this is only a minor problem there, but it is a big problem for Malayalam.
For Tamil, the problem is apparently with the KA+SSA conjunct. If you have, say:
KA + VIRAMA + SSA + VOWEL_SIGN_O
The display should be:
O_LEFT + KA_SSA_CONJUNCT + O_RIGHT
But with the above patch, you'll get:
KA + VIRAMA + O_LEFT + SSA + O_RIGHT
It may still make sense to try to fix things in something like this simple fashion for Pango-1.2. It should be possible to extend things to properly handle KA+SSA since, as I understand it, that's the only conjunct for Tamil and should be present in every font. The code should simply recognize KA+VIRAMA+SSA and place the left matra to the left of all the glyphs before applying GSUB.
Here's a description of the right fix from Eric:
===
The Indic code which I added to Pango is based on the Indic OpenType code in the ICU LayoutEngine (http://oss.software.ibm.com/icu). I have fixed this problem in the ICU code, but cannot easily port the change to Pango because the details of the information available are different...
In brief, the ICU solution involves remembering the location of the matra and the base consonant before the GSUB table is applied, and using the saved information to move the matra afterwards.
The complication in Pango is two-fold. The first complication is that in ICU, for each output glyph I record the character index of the character which originally produced that glyph; in Pango, what's remembered for each glyph is the character index of the first character in the syllable.
The other complication is that in ICU there are always as many glyphs as there were input characters. If I form a ligature, I replace the glyph for the first component of the ligature with the glyph for the ligature, and the glyphs for the rest of the components are replaced by special "place holder" glyphs. This means that the position of a particular glyph in the output glyph array will remain the same during GSUB processing. In Pango, when a ligature is formed, the glyph array is shortened to contain *only* the glyph for the ligature (i.e. no "place holder" glyphs are used). This makes it harder to find the glyphs for the matra and the base consonant.
So, what needs to change is that for each glyph, the Pango code needs to remember the actual character offset for that glyph rather than the offset for the first character in the syllable, and the code which locates the matra and the base glyph after GSUB processing will need to look for these character offsets in the glyph output rather than just using them as indices into the array. Also, after the matra fixups have been run, the character offsets for each glyph need to be changed back to the offset for the first character in the syllable.
I don't expect any of this to be particularly hard, though I haven't worked out the low-level details of how to do it yet.
===
Implementation note on the above: to keep track of which input glyph each output glyph corresponds to without changing the pango-ot.h interfaces, you can use glyphs->log_clusters. What you would do is, in indic-fc.c:set_glyphs(), instead of setting log_clusters[i] = indices[i], set log_clusters[i] = i. Then, on exit from pango_ot_shape(), the value of log_clusters[] for each output glyph gives the index of the corresponding input glyph. You can then compute the real log_clusters before returning from indic_engine_shape(), by looping over the output glyphs and doing:
log_clusters[i] = indices[log_clusters[i]]
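[ Editorial sketch of the log_clusters trick from the implementation note, reduced to plain arrays. This is an illustration under stated assumptions, not Pango code: GlyphBuf, tag_with_input_index(), form_ligature(), find_output_pos() and restore_clusters() are hypothetical names, and form_ligature() is only a toy stand-in for what GSUB processing does to the glyph array when it shortens it. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a glyph string; NOT the real PangoGlyphString. */
typedef struct {
    int glyph[8];
    int log_clusters[8];
    size_t n;
} GlyphBuf;

/* Before shaping: tag each glyph with its own input position
 * (log_clusters[i] = i) instead of its syllable's character offset. */
static void tag_with_input_index(GlyphBuf *g)
{
    for (size_t i = 0; i < g->n; i++)
        g->log_clusters[i] = (int)i;
}

/* Toy ligature substitution: glyphs [at, at+len) collapse into one glyph
 * and the array is shortened, with no place-holder glyphs left behind.
 * The ligature keeps the tag of its first component. */
static void form_ligature(GlyphBuf *g, size_t at, size_t len, int ligature)
{
    assert(len >= 1 && at + len <= g->n);
    g->glyph[at] = ligature;
    for (size_t i = at + len; i < g->n; i++) {
        g->glyph[i - len + 1] = g->glyph[i];
        g->log_clusters[i - len + 1] = g->log_clusters[i];
    }
    g->n -= len - 1;
}

/* After shaping: find where a given input glyph ended up, or -1 if it was
 * absorbed into a ligature as a non-first component. */
static int find_output_pos(const GlyphBuf *g, int input_index)
{
    for (size_t i = 0; i < g->n; i++)
        if (g->log_clusters[i] == input_index)
            return (int)i;
    return -1;
}

/* Last step: turn the input indices back into real cluster offsets,
 * i.e. log_clusters[i] = indices[log_clusters[i]]. */
static void restore_clusters(GlyphBuf *g, const int *indices)
{
    for (size_t i = 0; i < g->n; i++)
        g->log_clusters[i] = indices[g->log_clusters[i]];
}
```

[ With five input glyphs whose cluster offsets are {0, 0, 0, 9, 9}, collapsing the first three into a ligature leaves three output glyphs tagged 0, 3 and 4, so input glyph 3 can still be located at output position 1 even though the array has shrunk. restore_clusters() then rewrites the tags to the offsets 0, 9 and 9. ]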
If we want a quick-fix for 1.2.x, someone will need to fix up the patch above for KA+SSA (or tell me that it doesn't matter)
I have raised the issue in tamilinix mailing list. There is a problem with the way KSHA is handled elsewhere. The sequence is not always supposed to form a ligature. There are many instances where it should not form a ligature, but should be rendered as individual characters. So I would prefer it to be ligated only when it has a ZWJ within: KA + VIRAMA (pulli) + ZWJ + SSA. I will update this bug based on the response from the list.
It's not clear to me that the above is allowed by the Unicode standard; the Unicode standard explicitly describes using ZWNJ to prevent conjunct forms and does *not* mention a use of ZWJ as above. Sticking to the Unicode standard here is important, or text from various implementations won't be interchangeable.
Links to currently relevant code:
http://oss.software.ibm.com/cvs/icu/icu/source/layout/MPreFixups.h
http://oss.software.ibm.com/cvs/icu/icu/source/layout/MPreFixups.cpp
http://oss.software.ibm.com/cvs/icu/icu/source/layout/IndicReordering.cpp
It looks like this really should be close to trivial to port to Pango with the small patch I posted to bug 112433 ... an hour or two perhaps. With that in mind, I'm not sure a quick-hack fix really makes sense.
This issue is currently being debated in INFITT working group 2 for Unicode related matters. The choices currently seem to be either to define a new character XA for the ligature or to leave the current definition alone. Use of ZWJ doesn't seem to be favoured, so I am retracting my previous position. In either of the above cases the MPreFixups code makes sense. Besides, this code is needed for Malayalam anyway.
Created attachment 18428: Port of mprefixups code from ICU engine. Also includes Owen's patch from bug# 112433.
The above patch contains a port of the MPreFixups code from the ICU engine to Pango. I have tested it for Tamil script only. It corrects the pre-modifier problem and doesn't break the KSHA ligature. The patch is against pango cvs as of July 11.
However, this doesn't fix an additional problem reported in the Tamilinix list. The behaviour of backspace with respect to Tamil is incorrect in Pango. Currently backspace will delete the whole syllable. While this behaviour might make sense in other Indic scripts, it is unacceptable in Tamil. Backspace should only delete one Tamil character, namely a consonant followed by its modifiers, except in the case of KSHA and SRI, which are by definition /foreign/ characters. One of the main reasons this is incorrect is that in pure Tamil, starting any word with a CONSONANT + VIRAMA is illegal. This rule is fundamental to the Tamil language and is, like, first grade grammar. If I understand correctly, with the current setup in Pango (and ICU), many soft hyphenations will occur right before a CONSONANT + VIRAMA.
I do not know how we would be able to solve this problem with this mprefixups code. The only suitable way I can think of right now is to change the indic_ot_class_tables logic to handle this situation. I am looking into that alternative. Please let me know your comments on this.
I'll try to look at the MPre code soon. I'm not sure if I'm understanding your other issue correctly (and the golden rule of bugzilla is one issue per bug report), but the Indic modules have nothing to do with the delete key handling. The delete and cursor navigation boundaries are determined solely by the results of pango_break().
The plan for handling the problem of Delete deleting too much in Indic languages is to add another flag to PangoLogAttr, perhaps "is_delete_break". Using "is_cursor_position" as we do currently doesn't work very well, but you can't simply delete one *character* either: European combinations of character+accent are seen by users as being a single letter.
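[ Editorial sketch of the difference between deleting back to a cursor position and deleting back to a separate delete boundary. Everything here is an assumption for illustration: LogAttr is a cut-down hypothetical struct rather than the real PangoLogAttr, is_delete_break is only the flag name floated in the comment above, and the attribute values in the usage note follow the behaviour described in this thread rather than the output of pango_break(). ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down per-character attributes. attrs[i] describes the
 * boundary immediately before character i. */
typedef struct {
    unsigned is_cursor_position : 1; /* the cursor may stop here */
    unsigned is_delete_break : 1;    /* hypothetical: backspace may stop here */
} LogAttr;

/* Return the index at which a backspace issued at `pos` would start
 * deleting. With use_delete_breaks == 0 this walks back to the previous
 * cursor position; otherwise it walks back to the previous delete break. */
static size_t backspace_start(const LogAttr *attrs, size_t n, size_t pos,
                              int use_delete_breaks)
{
    assert(attrs != NULL && pos <= n);
    while (pos > 0) {
        pos--;
        if (use_delete_breaks ? attrs[pos].is_delete_break
                              : attrs[pos].is_cursor_position)
            break;
    }
    return pos;
}
```

[ For the four characters KA + VIRAMA + KA + VOWEL_SIGN_E treated as one syllable, with a cursor position only at index 0 but delete breaks at indices 0 and 2, a backspace at the end starts at 0 under the cursor-position rule (the whole syllable goes) and at 2 under the delete-break rule (only KA + VOWEL_SIGN_E goes). ]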
Thanks Owen. I will open another bug report for this problem, and also look into the code you have mentioned above. In that case, if this mpre code works, we can close this bug.
I've gone through your patch and adjusted it a bit. Here's what I did in some detail:
- Renamed mprefix_add/remove to indic_mprefixups_add/indic_mprefixups_remove, since all exported symbols need to be in a unique namespace.
- Gave mprefixups Pango style constructors/destructors indic_mprefixups_new()/indic_mprefixups_destroy().
- Moved the _FixupData structure definition into mprefixups.c.
- Moved various variables in mprefixups_apply() to be closer to their location in the original instead of at the outermost block.
- Restored a comment to indic_mprefixups_add() that got lost along the way.
- Removed the charIndices parameter from mprefixups_apply() since it was unused.
- Made mprefixups_apply() move around glyphs->log_clusters[], since that corresponds to charIndices, and needs to be reordered in order for the GPOS processing to work right.
- Made outMPreFixups in indic_ot_reorder() a MPreFixups ** as it is in the ICU code. Allowed it to be NULL in the case where we don't need the result. (Your patch seems to leak the MPreFixups from the first call to indic_ot_reorder().)
- Removed indic_ot_adjust_mpres(); it didn't add anything over calling mprefixups_apply() directly, and made the memory management more confusing.
The patch seems to work for me in very limited testing; I'd appreciate it if someone could give it a more thorough runthrough.
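[ Editorial sketch of the kind of fixup discussed in this bug: after substitution, the left matra is moved from the front of the syllable to just before the base glyph, and the per-glyph tags are moved along with it. This is a simplified illustration, not the ICU or Pango implementation. Fixup, find_tag() and apply_fixup() are hypothetical names, `tags` stands in for log_clusters[] holding input glyph indices, and a base glyph that was absorbed into a ligature is simply left alone here. ]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record saved before substitution: the input indices of the
 * base consonant and of the left (pre-base) matra. */
typedef struct {
    int base_index;
    int mpre_index;
} Fixup;

/* Locate the output glyph that came from a given input index, or -1. */
static int find_tag(const int *tags, size_t n, int tag)
{
    for (size_t i = 0; i < n; i++)
        if (tags[i] == tag)
            return (int)i;
    return -1;
}

/* Move the left matra so that it sits immediately before the base glyph.
 * Returns 1 if anything moved, 0 if the matra was already adjacent to the
 * base or if either glyph could not be found (for example because the
 * base became a non-first ligature component). */
static int apply_fixup(int *glyphs, int *tags, size_t n, Fixup f)
{
    assert(glyphs != NULL && tags != NULL);
    int mpre = find_tag(tags, n, f.mpre_index);
    int base = find_tag(tags, n, f.base_index);
    if (mpre < 0 || base < 0 || mpre + 1 >= base)
        return 0;

    int saved_glyph = glyphs[mpre];
    int saved_tag = tags[mpre];
    /* Shift everything between the matra and the base one slot left... */
    for (int i = mpre; i < base - 1; i++) {
        glyphs[i] = glyphs[i + 1];
        tags[i] = tags[i + 1];
    }
    /* ...and drop the matra into the slot just before the base. */
    glyphs[base - 1] = saved_glyph;
    tags[base - 1] = saved_tag;
    return 1;
}
```

[ For a reordered syllable E_LEFT, KA, VIRAMA, KA with the second KA as base, this yields KA, VIRAMA, E_LEFT, KA. For O_LEFT, KA_SSA_CONJUNCT, O_RIGHT, where the base SSA has been absorbed into the conjunct, nothing moves and the matra stays at the far left, which is the display the thread asks for. ]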
Created attachment 18585: Cleaned up version of port patch
I've committed my patch now, both to the stable pango-1.2 branch and to HEAD. Testing still much appreciated. I made one small change beyond the above: if outMPreFixups is NULL when passed in, I don't bother creating it at all, rather than creating it, storing information in it, then freeing it.
Fri Jul 25 12:07:21 2003 Owen Taylor <otaylor@redhat.com>
* modules/indic/mprefixups.[ch] modules/indic/indic-ot.[ch] modules/indic/indic-{xft,ft2}.c: Port pre-base-mantra fixup code from ICU, as needed for Tamil and Malayalam. (Based on patch from Sivaraj Doddannan, #111166)