Bug 113551 – Bugs in the Bengali rendering system of Pango.




Summary:	Bugs in the Bengali rendering system of Pango.


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	indic
Version:	1.2.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	Medium fix
Assigned To:	Pango Indic
QA Contact:	Pango Indic

URL:
Whiteboard:

Depends on:	118297 118299 118301 118302
Blocks:

Reported:	2003-05-22 19:36 UTC by Sayamindu Dasgupta
Modified:	2006-01-16 09:41 UTC

See Also:
GNOME target:	---
GNOME version:	2.9/2.10

Attachments
Patch to fix 1a (1.23 KB, patch) 2003-06-01 01:29 UTC, Taneem Ahmed
Patch to fix 1a and 2 (1.63 KB, patch) 2003-06-01 04:58 UTC, Taneem Ahmed
Patch to port changes between 1.8 and 1.9 of IndicReordering.cpp in ICU code to Pango (2.48 KB, patch) 2003-06-01 06:12 UTC, Taneem Ahmed
Effect of allowing reph for U+9AC (3.50 KB, image/png) 2003-06-01 06:48 UTC, Owen Taylor
Trivial patch changing charclass for U+9AC (1.10 KB, patch) 2003-06-01 06:51 UTC, Owen Taylor
The correct rendering result for U+9AC with _bb (21.41 KB, image/png) 2003-06-01 07:16 UTC, Taneem Ahmed
The incorrect rendering result for U+9AC with _rb (20.84 KB, image/png) 2003-06-01 07:17 UTC, Taneem Ahmed
fix/work around for issue 1 and 2 (23.61 KB, image/png) 2003-06-01 07:26 UTC, Taneem Ahmed
My version of indic-ot.c (diff -upw) (3.55 KB, patch) 2003-06-01 07:28 UTC, Owen Taylor
Ugly version of ICU port (13.73 KB, patch) 2003-06-01 07:31 UTC, Owen Taylor
Results with patch 17031 OR 17032 (20.91 KB, image/png) 2003-06-01 07:43 UTC, Taneem Ahmed
my trivial work around for 1b (971 bytes, patch) 2003-06-01 07:51 UTC, Taneem Ahmed
This is the text file for the images (157 bytes, text/plain) 2003-06-01 07:55 UTC, Taneem Ahmed

Description Sayamindu Dasgupta 2003-05-22 19:36:05 UTC

Hello,
Thanks for the work on the Bengali rendering in Pango. 
However, there are a few bugs which we have come across while working on
font development and Bengali l10n.
I am listing them below, and there are two screenshots -
http://www.nongnu.org/freebangfont/pango_bugs/shot_bugs_gedit.jpg showing
the problem strings as rendered in Gedit2, and the other
http://www.nongnu.org/freebangfont/pango_bugs/shot_bugs_yudit.jpg showing
the correct rendering via Yudit. The file used for generating the
screenshots is also downloadable from
http://www.nongnu.org/freebangfont/pango_bugs/bugs.txt . (Note that I am
using Pango version 1.2.1 - the one provided with Mandrake 9.1)

1. Yaphala
---------------

a. The string য্য is rendered incorrectly. For some reason, the yaphala mark
is rendered twice.
    More information on yaphala can be found at
    http://www.microsoft.com/typography/otfntdev/bengalot/features.htm
    (section on "Post-base form of consonant")

b. The sequence 0985 09CD 09AF 09BE (অ্যা) is not rendered properly.

    I quote from the Unicode Indic FAQ.

    Q: What are the Bengali characters used to transcribe the sound "a" (as in
    English "bat") in Unicode?

    A: In Bengali, the sequence "zophola" (U+09CD U+09AF) + the "aa" matra
    (U+09BE) is used for transcribing the English "a" in "bat". This
    zophola_aa can be seen as a special "composite" matra to write a new
    Bengali sound, imported from English. Represent these sequences using a
    halant (virama):

        Vowel_A_zophola_AA = 0985 09CD 09AF 09BE ( a- halant ya -aa )
        Vowel_E_zophola_AA = 098F 09CD 09AF 09BE ( e- halant ya -aa )

    If you need to add a candrabindu or other combining mark in the sequence,
    represent the sequence as:

        Vowel_A_zophola_AA + candrabindu = 0985 09CD 09AF 09BE 0981
        ( a- halant ya -aa candrabindu )
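
For anyone who wants to reproduce the test strings, the FAQ sequences can be built directly from their code points and checked against the Unicode character names. This is only a minimal Python sketch; the helper name `from_hex` is made up for illustration and is not part of Pango or any test suite.

```python
import unicodedata


def from_hex(spec):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in spec.split())


# The three sequences quoted from the Unicode Indic FAQ.
vowel_a_zophola_aa = from_hex("0985 09CD 09AF 09BE")
vowel_e_zophola_aa = from_hex("098F 09CD 09AF 09BE")
with_candrabindu = from_hex("0985 09CD 09AF 09BE 0981")

# List each code point with its official name to confirm the input.
for ch in vowel_a_zophola_aa:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Pasting the resulting strings into a text file gives a reproducible input for comparing renderers.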
    

2. Baphala
---------------

Pango seems to be confused about the sequence 09AC 09CD. This sequence
can be substituted by two different lookups: pres and blws.
Examples are given below.

pres - জব্দ
blws - জ্বদ

I have attached a screenshot of how the above two examples look in Yudit.
More details on blws can be found at
http://www.microsoft.com/typography/otfntdev/bengalot/features.htm
(section on Below-base substitutions)
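
To make the difference between the two examples explicit: they contain the same four code points, and only the position of the halant (U+09CD) relative to ba (U+09AC) changes. A small Python sketch, with variable and function names chosen purely for illustration:

```python
# Same four characters, different order of ba (U+09AC) and halant (U+09CD).
pres_example = "\u099C\u09AC\u09CD\u09A6"  # ja, ba, halant, da
blws_example = "\u099C\u09CD\u09AC\u09A6"  # ja, halant, ba, da


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(pres_example))
print(code_points(blws_example))
```

Since the inputs differ only in ordering, a shaper that produces identical output for both has lost that ordering information somewhere along the way.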


3. ZWNJ & ZWJ
---------------------

Rendering of certain strings has led us to believe that Pango is somehow
confusing Zero Width Non Joiner (ZWNJ) with Zero Width Joiner (ZWJ).
<consonant> <halant> <ZWJ> <consonant> is rendered in the exact same way as
<consonant> <halant> <ZWNJ> <consonant>. This should not happen - as the
screenshot taken in Yudit shows. <consonant> <halant> <ZWJ> should render
the "half form" of the consonant, while Pango is rendering the "halant
form" instead (or it may be simply putting the consonant followed by the
halant - I am not very sure). This issue becomes important when we handle
the khanda-ta character in Bengali - a short write-up on this can be found
in the Unicode Indic FAQ.
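
The two inputs being compared can be written out as follows. This is a hedged Python sketch: the consonants ta and ka are arbitrary picks for illustration, and `with_joiner` is a hypothetical helper, not anything from Pango.

```python
import unicodedata

HALANT = "\u09CD"  # BENGALI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def with_joiner(first, joiner, second):
    """Build <consonant> <halant> <joiner> <consonant>."""
    return first + HALANT + joiner + second


ta, ka = "\u09A4", "\u0995"  # arbitrary consonants, chosen for illustration
zwj_sequence = with_joiner(ta, ZWJ, ka)    # should request the half form
zwnj_sequence = with_joiner(ta, ZWNJ, ka)  # should request the halant form

for ch in zwj_sequence:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The strings differ in exactly one code point, so they should be distinguishable by any engine that sees the joiner at all.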

Comment 1 Owen Taylor 2003-05-22 19:54:56 UTC

Patches are, of course, much appreciated.

Comment 2 Taneem Ahmed 2003-06-01 01:29:59 UTC

Created attachment 17020
Patch to fix 1a

Comment 3 Taneem Ahmed 2003-06-01 01:30:34 UTC

Here is a small patch for 1a. This seems like a problem with indic-ot, not 
just Bengali. I am not quite sure if the patch is correct for other languages 
but it works for Bengali, and I am hoping it will give Owen some 
indication about what is the real problem.

Comment 4 Taneem Ahmed 2003-06-01 04:58:06 UTC

Created attachment 17022
Patch to fix 1a and 2

Comment 5 Taneem Ahmed 2003-06-01 05:05:09 UTC

The second patch includes the previous fix for 1a, and fix for 2. 
 
Owen, can you please take a look at the third issue? It seems like a word 
with ZWJ or ZWNJ is broken into three items (in pango_itemize), which are 
then treated alike.

Comment 6 Owen Taylor 2003-06-01 05:22:34 UTC

I don't think the patch is quite right. Having multiple post-base
forms is allowed in Bengali, I believe, and your
patch will prevent such cases from rendering correctly.

See: 

http://oss.software.ibm.com/cvs/icu/icu/source/layout/IndicReordering.cpp.diff?r1=1.8&r2=1.9

For how the problem was fixed in ICU. The immediately relevant
part of the patch is the change:

- while (baseConsonant >= baseLimit) {
+ while (baseConsonant > baseLimit) {

But probably the other parts of the patch need to be ported to 
Pango as well.
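
To see why that one-character change could matter for 1a, here is a toy Python model of a backward scan for the base consonant. It is not the ICU or Pango code: the function name, the `inclusive` flag, and the rule "skip any consonant with a post-base form" are simplifying assumptions, and it assumes for illustration that U+09AF (ya) is the only such consonant.

```python
YA = 0x09AF
HALANT = 0x09CD
KA = 0x0995

# Assumption for this sketch: ya is the only consonant with a post-base form.
HAS_POST_BASE_FORM = {YA}


def find_base_consonant(chars, base_limit, last_consonant, inclusive):
    """Scan backward from last_consonant looking for the base consonant.

    inclusive=True models the old condition (>= base_limit),
    inclusive=False models the new one (> base_limit).
    """
    base = last_consonant
    while (base >= base_limit) if inclusive else (base > base_limit):
        ch = chars[base]
        if ch != HALANT and ch not in HAS_POST_BASE_FORM:
            break  # an ordinary consonant: this is the base
        base -= 1
    return base


syllable = [YA, HALANT, YA]  # the string from issue 1a
print(find_base_consonant(syllable, 0, 2, inclusive=True))
print(find_base_consonant(syllable, 0, 2, inclusive=False))
```

In this model the inclusive scan runs past the first ya and finds no base at all, so both ya characters would be treated as post-base forms, while the strict scan stops at the first consonant of the syllable and keeps it as the base. A syllable such as ka + halant + ya gives the same base under both conditions.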

Comment 7 Taneem Ahmed 2003-06-01 05:32:17 UTC

I am quite sure (99%) you can't have multiple post-base in Bengali (I am 
not sure about other indic languages). In Bengali only 0x09AF has 
post-base form, and I haven't seen any word where it repeats itself. I am 
not sure how to test the other languages. 
 
I'll try out what you mentioned Owen, but I doubt I can port all the 
changes from ICU to Pango anytime soon...

Comment 8 Taneem Ahmed 2003-06-01 06:10:34 UTC

Okay, seems like there weren't too many ICU changes for the reorder 
function. Attached is the port of the diff you pointed out. Please take a 
look and see if you can come up with an official patch some time soon. 
 
As for issue 1b, I don't think there is anything in ICU. I will try to propose 
something.

Comment 9 Taneem Ahmed 2003-06-01 06:12:34 UTC

Created attachment 17023
Patch to port changes between 1.8 and 1.9 of IndicReordering.cpp in ICU code to Pango

Comment 10 Owen Taylor 2003-06-01 06:48:57 UTC

Created attachment 17026
Effect of allowing reph for U+9AC

Comment 11 Owen Taylor 2003-06-01 06:51:51 UTC

Created attachment 17027
Trivial patch changing charclass for U+9AC

Comment 12 Owen Taylor 2003-06-01 06:55:05 UTC

Regarding 2: it seems that your change disallows below-base forms
for all characters, which can't be right, can it?

At a brief look, perhaps the problem is that "reph" is not
being done for U+9AC, which I believe, as the Bengali Ra,
should be getting it?

If I change U+9AC from _cb (consonant with below-base form)
to _rb (consonant with below-base form and reph), I get the image
that I've attached above. I have no idea if this is correct
or not, though at least there are different results for the
two sequences....

(If this change is correct, then ICU needs it as well.)

Comment 13 Taneem Ahmed 2003-06-01 07:13:58 UTC

U+9AC should be _bb (right now in CVS it is _bb not _cb). Reph is only 
for U+9B0. I am attaching two screenshots with _bb and _rb. As you can 
see for _rb the result is the same, which is not correct. The result should 
be as produced by _bb. 
 
Also, a very quick hack (and a bit ugly) is to set U+985 to _ct from _iv, 
this will fix the 1b issue. I will also upload an image with the result. There 
is a small side effect, but I am sure everyone can live with that, instead 
of pango rendering it wrong.

Comment 14 Taneem Ahmed 2003-06-01 07:16:22 UTC

Created attachment 17028
The correct rendering result for U+9AC with _bb

Comment 15 Taneem Ahmed 2003-06-01 07:17:09 UTC

Created attachment 17029
The incorrect rendering result for U+9AC with _rb

Comment 16 Taneem Ahmed 2003-06-01 07:26:53 UTC

Created attachment 17030
fix/work around for issue 1 and 2

Comment 17 Owen Taylor 2003-06-01 07:28:15 UTC

Created attachment 17031
My version of indic-ot.c (diff -upw)

Comment 18 Owen Taylor 2003-06-01 07:31:47 UTC

Created attachment 17032
Ugly version of ICU port

Comment 19 Taneem Ahmed 2003-06-01 07:43:17 UTC

Created attachment 17035
Results with patch 17031 OR 17032

Comment 20 Taneem Ahmed 2003-06-01 07:45:56 UTC

Owen, hmmm, with patch 17031 or 17032 nothing is rendered as 
expected. Attachment 17030 shows the expected result...

Comment 21 Taneem Ahmed 2003-06-01 07:51:03 UTC

Created attachment 17036
my trivial work around for 1b

Comment 22 Taneem Ahmed 2003-06-01 07:55:03 UTC

Created attachment 17037
This is the text file for the images

Comment 23 Owen Taylor 2003-06-01 08:04:51 UTC

I've attached two copies of a version of
your backport: the first, for legibility, is with diff -w
(ignore whitespace); the second is a diff that can be applied.

Changes from your version:

 - Remove 'if (lastConsonant >= prev) {' and reindent
 - Get the other part of the ICU change (remove pstf from base
   consonants) as well.
 - Remove code that you only #if 0'ed.

If you could check whether this fixes 1a for you, that
would be appreciated.

I don't want to give up on fixing 1b right and put in
a hack, without making any effort to figure out 

For 2, OK, my change wasn't right .... I really don't know
anything about Bengali, as you can tell :-). So, do we
have any idea *what* is going wrong? 

The output of indic_ot_reorder, with the features *not*
applied is:

 U+99C U+9AC U+9CD U+9A6
 dist  dist  dist  dist
 rphf  rphf  rphf  rphf
 blwf              blwf
 half              half
 pstf              pstf


Tracing through TT_GPOS_Apply_String, the features that
take effect are first, the middle two characters are
combined into a ra-below-base form by 'blwf', then
second, 'blws' combines the first and second glyphs.

Eric would know better, but I'm wondering if the
problem isn't simply that the features are supposed
to be applied syllable by syllable and we're doing
the whole string at once.

Your issue 3. is bug 91542 .. in Pango currently, 
every character has to be assigned to *some* script.

Is there an easy workaround short of fixing 91542?

We can't assign ZWNJ to indic-fc, because it is
needed, e.g., for displaying Persian in Arabic
script, but perhaps we can add ZWJ to the list of
characters that indic-fc.c handles? As it turns
out, that won't work either because the Indic engine
advertises itself as one engine for each different
Indic language. So, only one Indic script can
get ZWJ...

So, in the end, I don't have any idea other than fixing
bug 91542.

Comment 24 Owen Taylor 2003-06-01 08:06:18 UTC

Note that my patches above do *not* contain your workaround
for 2. Do they not work for the problem in 1a?

Comment 25 Owen Taylor 2003-06-01 08:42:51 UTC

Two quick thoughts on 1b:

 Does the 'independent vowel + halant + ya + aa' combination
 work in Windows? The OT Bengali specification strongly implies
 that Uniscribe doesn't handle it.

 It should be pretty trivial to handle by adding an extra
 flag to scriptFlags and writing a special case for it
 in indic_ot_reorder().

Comment 26 Taneem Ahmed 2003-06-01 08:54:17 UTC

I tried what you said; 1b does not get fixed without the _ct hack. Let me 
explain this problem. Take the following input: 
 
U+985 U+9CD U+9AF U+9BE 
 
The problem with this is that U+985 is an independent vowel, and right 
now this input will become three syllables, (U+985) (U+9CD) (U+9AF 
U+9BE). This is not right obviously. Even if we somehow treat it as one 
syllable, we end up setting the tag blwf_p to all of them. 
 
This is a very very special case for U+985 where it acts as a consonant 
instead of a vowel. If you want to deal with it properly then we will have 
to add quite a few checks for U+985 in the reorder code to add proper 
tags. But as indic-ot.c is used by all the indic scripts, I think it will be 
even a bigger hack, risk, and extra delay. As this is a pure Bengali 
issue, I thought it will be better to keep the hack limited to Bengali :) The 
only side effect for my hack is that U+985 can now take up other 
independent vowels, which may actually be considered as a feature :) 
And I don't have access to a windows box at home, don't know what 
windows does. Can someone else please check? 
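
The syllable split described above can be imitated with a toy Python splitter. This is a sketch under loud assumptions: the class labels and the very small grammar (consonant, optional halant + consonant pairs, optional matra) are invented for illustration and are far simpler than what indic-ot.c actually does.

```python
IV, CONS, HALANT, MATRA = "iv", "cons", "halant", "matra"

# Hypothetical character classes for the four code points of issue 1b.
DEFAULT_CLASSES = {0x0985: IV, 0x09CD: HALANT, 0x09AF: CONS, 0x09BE: MATRA}
# The "_ct hack": treat U+985 as a consonant instead of an independent vowel.
HACKED_CLASSES = {**DEFAULT_CLASSES, 0x0985: CONS}


def split_syllables(cps, classes):
    """Split code points into clusters using a deliberately tiny grammar."""
    syllables = []
    i, n = 0, len(cps)
    while i < n:
        start = i
        first = classes[cps[i]]
        i += 1
        if first == CONS:
            # consonant (halant consonant)* [matra]
            while (i + 1 < n and classes[cps[i]] == HALANT
                   and classes[cps[i + 1]] == CONS):
                i += 2
            if i < n and classes[cps[i]] == MATRA:
                i += 1
        # Anything else (independent vowel, stray halant) stands alone.
        syllables.append(cps[start:i])
    return syllables


sequence = [0x0985, 0x09CD, 0x09AF, 0x09BE]
print(split_syllables(sequence, DEFAULT_CLASSES))
print(split_syllables(sequence, HACKED_CLASSES))
```

With the default classes the model yields three clusters, (U+985) (U+9CD) (U+9AF U+9BE); with U+985 reclassified as a consonant the whole input becomes a single syllable, which is the effect the workaround relies on.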
 
For 2, the problem is with the tags. Consider the following two inputs: 
U+99C U+9AC U+9CD U+9A6 
U+99C U+9CD U+9AC U+9A6 
 
After reorder, both should be (and are): 
U+99C U+9AC U+9CD U+9A6 
 
The difference is in the tags. For the first case, we should have blwf_p 
for U+9AC U+9CD. Without the patch I proposed, Pango sets blwf_p by 
default on everything, and as a result on the second case too. 
 
As for 3, today was my first day hacking pango... no way I can make a 
meaningful comment on this one. The only idea that crossed my mind is 
to consider ZWJ as part of the language left (or right in case of LTR) to it. 
Most of the code in indic directory seems to be checking for 
CC_ZERO_WIDTH_MARK, but currently this case can not happen. I am 
not sure about other engines.

Comment 27 Owen Taylor 2003-06-01 14:49:05 UTC

It seems to me that the next step for 1b is to:

 - Find a Uniscribe-enabled copy of Microsoft Windows
 - See if 'U+985 U+9CD U+9AF U+9BE' renders as desired
 - Try another sequence that would make sense for a 
   consonant, but doesn't make sense for U+985, 
   say 
       U+985 + halant + <normal consonant>
   and see how that renders.

Another approach would be simply to ask on the 
OpenType mailing list
(http://www.microsoft.com/typography/otspec/otlist.htm)
for clarification of the relationship between
the Unicode Indic FAQ item and the Bengali OpenType spec.

About 2, one concern would be a case where you have 
a subscript form beneath a dead consonant 
(C + virama + C_below + virama + C)
or Devanagari ra. This is described in R8 of the
Unicode book's Devanagari section (the chapter is available
for download from 
http://www.unicode.org/versions/Unicode4.0.0/).

R8 is specifically mentioned as applying to other subscript
consonants for Gurmukhi in the Unicode chapter as well.

So, you only want to suppress blwf on the *first* 
consonant of the syllable, not on all pre-base consonants.

So, something as simple as:

 gulong tag = (i == baseLimit) ? half_p : blwf_p;

may be right, but I'd really like to get Eric Mader to 
look at this before we change things, since this affects
all Indic scripts.
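
The effect of that one-liner can be sketched in Python as a loop over the pre-base positions. The labels "half_p" and "blwf_p" stand in for Pango's tag masks as opaque strings, and `pre_base_tags` is a hypothetical name, not a function in indic-ot.c.

```python
def pre_base_tags(base_limit, base_consonant):
    """Return a tag label for each position before the base consonant.

    Only the first position of the syllable (base_limit) gets half_p;
    every later pre-base position keeps blwf_p.
    """
    return ["half_p" if i == base_limit else "blwf_p"
            for i in range(base_limit, base_consonant)]


# C + virama + C_below + virama + C: the base consonant sits at index 4.
print(pre_base_tags(0, 4))
```

Under this rule the first consonant loses its below-base tag while the C_below position in the R8-style sequence keeps it.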

(This bug report supports the idea that there should be 
only *one* issue per bug report.)

Comment 28 Taneem Ahmed 2003-06-01 20:50:26 UTC

I just looked at the Bengali part of chapter 9 of Unicode 4.0. It clearly 
states what to do for 1b. I don't think we need to bring it up with the 
OpenType mailing list, unless we want to know if they are planning to 
add some new feature to the OT layout table. And IMHO if Uniscribe does 
not render it properly then we need to let them know, not follow them :) 
 
And your suggestion "gulong tag = (i == baseLimit) ? half_p : blwf_p" 
does work. 
 
Issue 3 is quite important for Bengali at least. Unicode 4.0 seems to be 
using ZWJ/ZWNJ to deal with a few commonly used cases. 
 
btw, I just tried out Qt's OT support. It works with all these cases!

Comment 29 Owen Taylor 2003-07-25 14:41:12 UTC

I've split this into four separate bug reports; I'll leave
this bug open to track the resolution of the four issues.

Comment 30 Owen Taylor 2003-08-25 13:46:46 UTC

Hopefully we can fix some of the problems earlier, but fixing
all of these issues won't be possible until at least 1.4.

Comment 31 Behdad Esfahbod 2005-09-26 13:21:43 UTC

Three of the four issues are already fixed; the one remaining is in bug 118299. 
Can't this bug be closed now?

Comment 32 Behdad Esfahbod 2006-01-16 09:41:15 UTC

Closing as per my last comment.