GNOME Bugzilla – Bug 113551
Bugs in the Bengali rendering system of Pango.
Last modified: 2006-01-16 09:41:15 UTC
Hello,

Thanks for the work on the Bengali rendering in Pango. However, there are a few bugs which we have come across while working on font development and Bengali l10n. I am listing them below, along with two screenshots: http://www.nongnu.org/freebangfont/pango_bugs/shot_bugs_gedit.jpg shows the problem strings as rendered in Gedit2, and http://www.nongnu.org/freebangfont/pango_bugs/shot_bugs_yudit.jpg shows the correct rendering via Yudit. The file used for generating the screenshots can also be downloaded from http://www.nongnu.org/freebangfont/pango_bugs/bugs.txt . (Note that I am using Pango version 1.2.1, the one provided with Mandrake 9.1.)

1. Yaphala
---------------
a. The string য্য is rendered incorrectly. For some reason, the yaphala mark is rendered twice. More information on yaphala can be found at http://www.microsoft.com/typography/otfntdev/bengalot/features.htm (section on "Post-base form of consonant").

b. The sequence 0985 09CD 09AF 09BE (অ্যা) is not rendered properly. I quote from the Unicode Indic FAQ:

Q: What are the Bengali characters used to transcribe the sound "a" (as in English "bat") in Unicode?

A: In Bengali, the sequence "zophola" (U+09CD U+09AF) + the "aa" matra (U+09BE) is used for transcribing the English "a" in "bat". This zophola_aa can be seen as a special "composite" matra to write a new Bengali sound, imported from English. Represent these sequences using a halant (virama):

Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa )

If you need to add a candrabindu or other combining mark in the sequence, represent the sequence as:

Vowel_A_zophola_AA + candrabindu = 0985 09CD 09AF 09BE 0981 ( a- halant ya -aa candrabindu )

2. Baphala
---------------
Pango, for some reason, is getting confused by the sequence 09AC 09CD. This sequence can be substituted by two different lookups, pres and blws. Examples are given below.

pres - জব্দ
blws - জ্বদ

I have attached a screenshot of how the above two examples look in Yudit. More details on blws can be found at http://www.microsoft.com/typography/otfntdev/bengalot/features.htm (section on "Below-base substitutions").

3. ZWNJ & ZWJ
---------------------
The rendering of certain strings has led us to believe that Pango is somehow confusing Zero Width Non Joiner (ZWNJ) and Zero Width Joiner (ZWJ). <consonant> <halant> <ZWJ> <consonant> is rendered in exactly the same way as <consonant> <halant> <ZWNJ> <consonant>. This should not happen, as the screenshot taken in Yudit shows. <consonant> <halant> <ZWJ> should render the "half form" of the consonant, while Pango is rendering the "halant form" instead (or it may simply be putting the consonant followed by the halant; I am not very sure). This issue becomes important when we handle the khanda-ta character in Bengali. A short write-up on this can be found in the Unicode Indic FAQ.
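The distinction asked for in issue 3 can be written down as a tiny classifier. This is only a sketch of the expected behaviour as described in this report; the enum, the function name, and the "conjunct" default are made up for illustration and are not part of Pango.

```c
#include <assert.h>
#include <stdint.h>

#define BENGALI_HALANT 0x09CD  /* virama */
#define ZWNJ           0x200C  /* Zero Width Non Joiner */
#define ZWJ            0x200D  /* Zero Width Joiner */

/* Hypothetical names: the form a consonant + halant is expected to take. */
typedef enum {
    FORM_CONJUNCT,  /* C + halant + C: left to the shaper and the font */
    FORM_HALF,      /* C + halant + ZWJ: half form requested */
    FORM_HALANT     /* C + halant + ZWNJ: visible halant requested */
} DeadConsonantForm;

/* text[i] must be a halant that is followed by at least one code point.
 * The decision depends only on which joiner (if any) follows the halant. */
DeadConsonantForm expected_form(const uint32_t *text, int n, int i)
{
    assert(i >= 0 && i + 1 < n && text[i] == BENGALI_HALANT);

    if (text[i + 1] == ZWJ)
        return FORM_HALF;
    if (text[i + 1] == ZWNJ)
        return FORM_HALANT;
    return FORM_CONJUNCT;
}
```

The point of the sketch is that the two joiners must lead to different results; the behaviour reported here corresponds to both branches collapsing into the same one.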
Patches are, of course, much appreciated.
Created attachment 17020 [details] [review] Patch to fix 1a
Here is a small patch for 1a. This seems like a problem with indic-ot, not just Bengali. I am not quite sure if the patch is correct for other languages, but it works for Bengali, and I am hoping it will give Owen some indication of what the real problem is.
Created attachment 17022 [details] [review] Patch to fix 1a and 2
The second patch includes the previous fix for 1a and a fix for 2. Owen, can you please take a look at the third issue? It seems that a word with ZWJ or ZWNJ is broken into three items (in pango_itemize), which are then treated alike.
I don't think the patch is quite right. Having multiple post-base forms is allowed in Bengali, I believe, and your patch will prevent such cases from rendering correctly. See http://oss.software.ibm.com/cvs/icu/icu/source/layout/IndicReordering.cpp.diff?r1=1.8&r2=1.9 for how the problem was fixed in ICU. The immediately relevant part of the patch is the change:

- while (baseConsonant >= baseLimit) {
+ while (baseConsonant > baseLimit) {

But probably the other parts of the patch need to be ported to Pango as well.
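To see why that one-character change matters for 1a, here is a toy model of a backward scan for the base consonant. It is not the ICU or Pango code: the function names, the rough consonant range, and the assumption that only U+09AF has a post-base form are all simplifications made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define YA 0x09AF

/* Assumption for this sketch: only ya has a post-base form. */
bool has_post_base_form(uint32_t ch) { return ch == YA; }

/* Rough range for Bengali consonants, good enough for the examples. */
bool is_consonant(uint32_t ch) { return ch >= 0x0995 && ch <= 0x09B9; }

/* Scan backwards from the end of the syllable for the base consonant.
 * Consonants that have a post-base form are counted as post-base forms
 * and skipped. With inclusive_bound (the old ">=") the first consonant
 * is examined too; with the new ">" it is never examined and is used as
 * the base whenever the scan reaches it. */
int find_base(const uint32_t *s, int n, int base_limit,
              bool inclusive_bound, int *post_base_count)
{
    int base = n - 1;

    assert(n > 0 && base_limit >= 0 && base_limit < n);
    *post_base_count = 0;

    while (inclusive_bound ? base >= base_limit : base > base_limit) {
        if (is_consonant(s[base])) {
            if (!has_post_base_form(s[base]))
                break;               /* found a real base consonant */
            (*post_base_count)++;    /* treated as post-base, keep looking */
        }
        base--;
    }
    if (base < base_limit)
        base = base_limit;           /* fall back to the first consonant */
    return base;
}
```

For the sequence U+9AF U+9CD U+9AF, this model counts two post-base forms with the inclusive bound and one with the exclusive bound, which matches the doubled yaphala described in 1a.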
I am quite sure (99%) you can't have multiple post-base forms in Bengali (I am not sure about other Indic languages). In Bengali only 0x09AF has a post-base form, and I haven't seen any word where it repeats itself. I am not sure how to test the other languages. I'll try out what you mentioned, Owen, but I doubt I can port all the changes from ICU to Pango anytime soon...
Okay, it seems there weren't too many ICU changes for the reorder function. Attached is the port of the diff you pointed out. Please take a look and see if you can come up with an official patch some time soon. As for issue 1b, I don't think there is anything in ICU. I will try to propose something.
Created attachment 17023 [details] [review] Patch to port changes between 1.8 and 1.9 of IndicReordering.cpp in ICU code to Pango
Created attachment 17026 [details] Effect of allowing reph for U+9AC
Created attachment 17027 [details] [review] Trivial patch changing charclass for U+9AC
Regarding 2: it seems that your change disallows below-base forms for all characters, which can't be right, can it? From a brief look, perhaps the problem is that "reph" is not being done for U+9AC, which I believe, as the Bengali Ra, should be getting it? If I change U+9AC from _cb (consonant with below-base form) to _rb (consonant with below-base form and reph), I get the image that I've attached above. I have no idea if this is correct or not, though at least there are different results for the two sequences.... (If this change is correct, then ICU needs it as well.)
U+9AC should be _bb (right now in CVS it is _bb, not _cb). Reph is only for U+9B0. I am attaching two screenshots, with _bb and _rb. As you can see, for _rb the result is the same, which is not correct. The result should be as produced by _bb. Also, a very quick (and slightly ugly) hack is to change U+985 from _iv to _ct; this will fix the 1b issue. I will also upload an image with the result. There is a small side effect, but I am sure everyone can live with that rather than have Pango render it wrong.
Created attachment 17028 [details] The correct rendering result for U+9AC with _bb
Created attachment 17029 [details] The incorrect rendering result for U+9AC with _rb
Created attachment 17030 [details] fix/work around for issue 1 and 2
Created attachment 17031 [details] [review] My version of indic-ot.c (diff -upw)
Created attachment 17032 [details] [review] Ugly version of ICU port
Created attachment 17035 [details] Results with patch 17031 OR 17032
Owen, hmmm with patch 17031 or 17032 nothing is rendered as expected. The attachment 17030 [details] shows the expected result...
Created attachment 17036 [details] [review] my trivial work around for 1b
Created attachment 17037 [details] This is the text file for the images
I've attached two copies of a version of your backport: the first, for legibility, is made with diff -w (ignore whitespace); the second is a diff that can be applied. Changes from your version:

- Remove 'if (lastConsonant >= prev) {' and reindent
- Get the other part of the ICU change (remove pstf from base consonants) as well.
- Remove code that you only #if 0'ed.

If you could check whether this fixes 1a for you, that would be appreciated.

I don't want to give up on fixing 1b right and put in a hack, without making any effort to figure out

For 2, OK, my change wasn't right .... I really don't know anything about Bengali, as you can tell :-). So, do we have any idea *what* is going wrong? The output of indic_ot_reorder, with the features *not* applied, is:

U+99C U+9AC U+9CD U+9A6 dist dist dist dist rphf rphf rphf rphf bwlf bwlf half half pstf pstf

Tracing through TT_GPOS_Apply_String, the features that take effect are: first, the middle two characters are combined into a ra-below-base form by 'blwf'; then, second, 'blws' combines the first and second glyphs. Eric would know better, but I'm wondering if the problem isn't simply that the features are supposed to be applied syllable by syllable and we're doing the whole string at once.

Your issue 3 is bug 91542. In Pango currently, every character has to be assigned to *some* script. Is there an easy workaround short of fixing 91542? We can't assign ZWNJ to indic-fc, because it is needed, e.g., for displaying Persian in Arabic script, but perhaps we can add ZWJ to the list of characters that indic-fc.c handles? As it turns out, that won't work either, because the Indic engine advertises itself as one engine for each different Indic language. So only one Indic script can get ZWJ... In the end, I don't have any idea other than fixing bug 91542.
Note that my patches above do *not* contain your workaround for 2. Do they not work for the problem in 1a?
Two quick thoughts on 1b:

Does the 'independent vowel + halant + ya + aa' combination work in Windows? The OT Bengali specification strongly implies that Uniscribe doesn't handle it.

It should be pretty trivial to handle by adding an extra flag to scriptFlags and writing a special case for it in indic_ot_reorder().
I tried what you said; 1b does not get fixed without the _ct hack. Let me explain this problem. Take the following input:

U+985 U+9CD U+9AF U+9BE

The problem with this is that U+985 is an independent vowel, and right now this input will become three syllables: (U+985) (U+9CD) (U+9AF U+9BE). This is obviously not right. Even if we somehow treat it as one syllable, we end up setting the tag blwf_p on all of them. This is a very very special case for U+985, where it acts as a consonant instead of a vowel. If you want to deal with it properly, then we will have to add quite a few checks for U+985 in the reorder code to add the proper tags. But as indic-ot.c is used by all the Indic scripts, I think that would be an even bigger hack, with more risk and extra delay. As this is a pure Bengali issue, I thought it would be better to keep the hack limited to Bengali :) The only side effect of my hack is that U+985 can now take up other independent vowels, which may actually be considered a feature :) And I don't have access to a Windows box at home, so I don't know what Windows does. Can someone else please check?

For 2, the problem is with the tags. Consider the following two inputs:

U+99C U+9AC U+9CD U+9A6
U+99C U+9CD U+9AC U+9A6

After reorder, both should be (and are):

U+99C U+9AC U+9CD U+9A6

The difference is in the tags. For the first case, we should have blwf_p for U+9AC U+9CD. Without the patch I proposed, Pango sets blwf_p on everything by default, and as a result on the second case too.

As for 3, today was my first day hacking Pango... there is no way I can make a meaningful comment on this one. The only idea that crossed my mind is to consider ZWJ as part of the language to the left of it (or to the right in the case of LTR). Most of the code in the indic directory seems to be checking for CC_ZERO_WIDTH_MARK, but currently this case cannot happen. I am not sure about other engines.
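The syllable-splitting problem described for 1b in this comment can be illustrated with a toy segmenter. This is a deliberately simplified model and not Pango's state table: the class names, the rough code point ranges, and the grammar (a consonant cluster with an optional dependent vowel, anything else standing alone) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified character classes. */
typedef enum {
    CC_OTHER, CC_INDEP_VOWEL, CC_CONSONANT, CC_VIRAMA, CC_DEP_VOWEL
} CharClass;

/* Rough Bengali ranges. a_as_consonant models the workaround of
 * classifying U+985 as a consonant instead of an independent vowel. */
CharClass classify(uint32_t ch, bool a_as_consonant)
{
    if (ch == 0x0985)
        return a_as_consonant ? CC_CONSONANT : CC_INDEP_VOWEL;
    if (ch >= 0x0986 && ch <= 0x0994) return CC_INDEP_VOWEL;
    if (ch >= 0x0995 && ch <= 0x09B9) return CC_CONSONANT;
    if (ch == 0x09CD)                 return CC_VIRAMA;
    if (ch >= 0x09BE && ch <= 0x09CC) return CC_DEP_VOWEL;
    return CC_OTHER;
}

/* Toy grammar: consonant (virama consonant)* [dependent vowel] forms one
 * syllable; every other character is a syllable on its own. */
int count_syllables(const uint32_t *s, int n, bool a_as_consonant)
{
    int count = 0, i = 0;

    assert(n >= 0);
    while (i < n) {
        CharClass cc = classify(s[i], a_as_consonant);
        i++;
        if (cc == CC_CONSONANT) {
            while (i + 1 < n &&
                   classify(s[i], a_as_consonant) == CC_VIRAMA &&
                   classify(s[i + 1], a_as_consonant) == CC_CONSONANT)
                i += 2;
            if (i < n && classify(s[i], a_as_consonant) == CC_DEP_VOWEL)
                i++;
        }
        count++;
    }
    return count;
}
```

Under this model, U+985 U+9CD U+9AF U+9BE splits into three syllables when U+985 is an independent vowel and into one when it is classified as a consonant, which is the effect the workaround relies on.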
It seems to me that the next step for 1b is to:

- Find a Uniscribe-enabled copy of Microsoft Windows
- See if 'U+985 U+9CD U+9AF U+9BE' renders as desired
- Try another sequence that would make sense for a consonant but doesn't make sense for U+985, say U+985 + halant + <normal consonant>, and see how that renders.

Another approach would be simply to ask on the OpenType mailing list (http://www.microsoft.com/typography/otspec/otlist.htm) for clarification of the relationship between the Unicode Indic FAQ item and the Bengali OpenType spec.

About 2, one concern would be a case where you have a subscript form beneath a dead consonant (C + virama + C_below + virama + C) or Devanagari ra. This is described in R8 of the Unicode book's Devanagari section (the chapter is available for download from http://www.unicode.org/versions/Unicode4.0.0/). R8 is specifically mentioned as applying to other subscript consonants for Gurmukhi in the Unicode chapter as well. So you only want to suppress blwf on the *first* consonant of the syllable, not on all pre-base consonants. So something as simple as:

  gulong tag = (i == baseLimit) ? half_p : blwf_p

may be right, but I'd really like to get Eric Mader to look at this before we change things, since this affects all Indic scripts.

(This bug report supports the idea that there should be only *one* issue per bug report.)
I just looked at the Bengali part of chapter 9 of Unicode 4.0. It clearly states what to do for 1b. I don't think we need to bring it up on the OpenType mailing list, unless we want to know if they are planning to add some new feature to the OT layout table. And IMHO if Uniscribe does not render it properly, then we need to let them know, not follow them :) And your suggestion "gulong tag = (i == baseLimit) ? half_p : blwf_p" does work. Issue 3 is quite important, for Bengali at least. Unicode 4.0 seems to be using ZWJ/ZWNJ to deal with a few commonly used cases. By the way, I just tried out Qt's OT support. It works with all these cases!
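Placed in a loop over the pre-base characters, the tag selection suggested for issue 2 might look like the sketch below. The mask values and the function name are placeholders invented for illustration (Pango's real feature masks differ), and `unsigned long` stands in for GLib's `gulong` so that the example compiles without GLib.

```c
#include <assert.h>

/* Placeholder bit values, not Pango's real feature masks. */
#define half_p 0x1UL
#define blwf_p 0x2UL

/* Tag the characters before the base consonant: the first one is kept
 * away from 'blwf' by giving it half_p, the rest get blwf_p.
 * tags must have room for at least `base` entries. */
void tag_pre_base(unsigned long *tags, int base_limit, int base)
{
    assert(base_limit >= 0 && base_limit <= base);

    for (int i = base_limit; i < base; i++)
        tags[i] = (i == base_limit) ? half_p : blwf_p;
}
```

For a four-character run whose base is the last character, only index 0 receives half_p while indices 1 and 2 receive blwf_p, so a below-base form is still possible after the first consonant.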
I've split this into four separate bug reports; I'll leave this bug open to track the resolution of the four issues.
Hopefully we can fix some of the problems earlier, but fixing all of these issues won't be possible until at least 1.4.
Three of the four issues are already fixed; the one remaining is in bug 118299. Can't this bug be closed now?
Closing as per my last comment.