Bug 441654 – prefix fails when more than one base characters (as conjuncts) present after a half form the next prefix renders incorrectly






Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	indic
Version:	1.14.x
Hardware:	Other All

Importance:	Normal major
Target Milestone:	---
Assigned To:	Pango Indic
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-05-27 19:25 UTC by Praveen A
Modified:	2008-08-06 09:59 UTC

See Also:
GNOME target:	---
GNOME version:	2.13/2.14

Attachments
image referred above (17.79 KB, image/png) 2007-05-27 19:27 UTC, Praveen A
Test rendering (3.48 KB, image/png) 2007-06-17 17:47 UTC, Sayamindu Dasgupta
fix by suresh (1.15 KB, patch) 2008-03-07 12:59 UTC, Praveen A	needs-work
test output (36.94 KB, image/jpeg) 2008-04-23 18:26 UTC, Rahul Bhalerao
Simplified patch for 'kaarkkodakan' problem (885 bytes, patch) 2008-04-24 07:24 UTC, Baiju M	none
Simplified patch for 'kaarkkodakan' problem (749 bytes, patch) 2008-04-24 12:28 UTC, Baiju M	none
Patch for mprefixups.c (470 bytes, patch) 2008-04-24 22:00 UTC, Rahul Bhalerao	none
Screen shot of test cases (41.06 KB, image/jpeg) 2008-04-24 22:05 UTC, Rahul Bhalerao
Screenshot of test case (45.20 KB, image/png) 2008-04-25 09:27 UTC, Manilal
Problem still with Patch 109861 (5.37 KB, image/jpeg) 2008-04-25 11:04 UTC, Baiju M
Further simplified patch for 'kaarkkodakan' problem (525 bytes, patch) 2008-04-25 11:19 UTC, Baiju M	none
Correct rendering of 'kaarkkodakan' bug related words (4.98 KB, image/jpeg) 2008-04-25 13:38 UTC, Baiju M
screenshot of karkotakan like words with patch from Baiju (188.84 KB, patch) 2008-04-25 14:56 UTC, Praveen A	none
Kaarkkodakan problem definition as image (48.03 KB, image/png) 2008-05-01 11:01 UTC, Baiju M
Screenshot after modified solution (33.26 KB, image/jpeg) 2008-06-02 17:14 UTC, Rahul Bhalerao
Incorrect renderings marked (29.66 KB, image/jpeg) 2008-06-02 17:25 UTC, Cibu C J
New screenshot with yet another modified solution with new orthography (54.56 KB, image/jpeg) 2008-06-03 08:36 UTC, Rahul Bhalerao
New screenshot with modified solution with old script font (32.71 KB, image/jpeg) 2008-06-03 08:41 UTC, Rahul Bhalerao
A much generic patch for mprefixups.c (453 bytes, patch) 2008-06-03 13:34 UTC, Rahul Bhalerao	committed
Screenshot of Firefox with Karkkodakan test case (154.75 KB, image/png) 2008-06-10 07:14 UTC, Manilal

Description Praveen A 2007-05-27 19:25:52 UTC

Please describe the problem:
Eg1: കാ__ര്‍ക്കോ__ടകന്‍  (karkkodakan) - 0d30+0d4d+200d+0d15+0d4d+0d15+0d4b

0d30+0d4d+200d forms the chillu ര്‍ (chillu ra, the half form), so 0d15 should now be the base for the 0d4b prefix. Instead, the renderer takes 0d30 as the base and moves the prefix all the way to the left, ahead of 0d30, rather than placing it just before 0d15.

work around: put a ZWNJ after chillu ra so that the 0d4b prefix takes 0d15 as the base ര്‍‌ക്കോ

Eg2: അ__പ്ഗ്രേ__ഡ് (upgrade) - 0d2a+0d4d+0d17+0d4d+0d30+0d47

Here too, the prefix 0d47 moves all the way to the left, taking 0d2a as its base, when 0d17 should be the base.

work around: put a ZWS after 0d2a+0d4d പ്ഗ്രേ
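
For anyone who wants to reproduce the two examples without typing them by hand, here is a minimal Python sketch that builds the strings from the code points listed above and prints the Unicode name of each one. The helper names (to_text, describe, eg1_workaround) are made up for illustration; only the Eg1 workaround is sketched, since it is the one spelled out with a specific character (ZWNJ, U+200C).

```python
import unicodedata

# Code points copied from the two examples in the report.
EG1 = [0x0D30, 0x0D4D, 0x200D, 0x0D15, 0x0D4D, 0x0D15, 0x0D4B]
EG2 = [0x0D2A, 0x0D4D, 0x0D17, 0x0D4D, 0x0D30, 0x0D47]
ZWNJ = 0x200C


def to_text(codepoints):
    """Join a list of code points into a string that can be pasted into a test page."""
    return "".join(chr(cp) for cp in codepoints)


def describe(codepoints):
    """Return one 'U+XXXX NAME' line per code point."""
    return [f"U+{cp:04X} {unicodedata.name(chr(cp))}" for cp in codepoints]


def eg1_workaround():
    """Eg1 with a ZWNJ inserted right after the chillu (RA + VIRAMA + ZWJ)."""
    return EG1[:3] + [ZWNJ] + EG1[3:]


if __name__ == "__main__":
    for label, seq in (("Eg1", EG1), ("Eg2", EG2), ("Eg1 workaround", eg1_workaround())):
        print(label, to_text(seq))
        for line in describe(seq):
            print("   ", line)
```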



Steps to reproduce:
1. open charmap
2. try the Unicode sequences given above


Actual results:
incorrectly rendered conjuncts as shown in the examples

Expected results:
correctly rendered conjuncts as given in workaround

Does this happen every time?
yes

Other information:
The conjunct formation is illustrated (both current rendering and the correct rendering) for example1 here
http://images.wikia.com/fci/images/8/81/Rkko.png

Comment 1 Praveen A 2007-05-27 19:27:58 UTC

Created attachment 88902 [details]
image referred above

example 1

Comment 2 Hiran Venugopalan 2007-05-28 04:18:27 UTC

This is true for any pre-base form (ൌ 0d4c, ോ 0d4b, ൊ 0d4a, േ 0d47, ൈ 0d48) coming after a half form (chillu or consonant+halant) followed by a conjunct, when the entire combination does not form a bigger conjunct.

Comment 3 Sayamindu Dasgupta 2007-06-16 20:24:50 UTC

This seems to be addressed in bug #427667. Can you guys confirm ?

Comment 4 Praveen A 2007-06-17 14:09:42 UTC

In case of Malayalam it is 
<consonant> <halant> (ZWJ in case of chillus or pure consonants) <consonant> <halant> <consonant> * <pre-base matra>

* It renders correctly with a single consonant, and also when the whole combination (up to the <pre-base matra>) forms one large conjunct. The issue arises only when there are two consonants ("ka halant ka" or "ga halant ra" in the examples given above).
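
The pattern above can be turned into a quick search for affected sequences in test text. This is only a sketch in Python: the character ranges are assumptions (consonants KA..HA at U+0D15..U+0D39, pre-base and two-part vowel signs at U+0D46..U+0D48 and U+0D4A..U+0D4C), and a match only means the sequence has the shape described here, not that a given font renders it wrongly.

```python
import re

# Assumed ranges, see the note above.
CONSONANT = "[\u0D15-\u0D39]"
HALANT = "\u0D4D"
ZWJ = "\u200D"
PRE_BASE_MATRA = "[\u0D46-\u0D48\u0D4A-\u0D4C]"

# <consonant> <halant> (ZWJ)? <consonant> <halant> <consonant> <pre-base matra>
AFFECTED = re.compile(
    CONSONANT + HALANT + ZWJ + "?" + CONSONANT + HALANT + CONSONANT + PRE_BASE_MATRA
)


def find_affected(text):
    """Return every substring of text that has the shape described in this comment."""
    return AFFECTED.findall(text)
```

Both examples from the description match in full, while a two-consonant sequence such as BA + halant + KA + 0d47 does not.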

Comment 5 Sayamindu Dasgupta 2007-06-17 17:47:00 UTC

Created attachment 90155 [details]
Test rendering

I made some changes to the source code, and the resulting rendering is attached. Since I do not understand the Malayalam script, could you please confirm that this is the output you want?

Comment 6 Praveen A 2007-06-18 07:09:42 UTC

That is the correct output. I assume you have not used ZWNJ at all.

Comment 7 Ani Peter 2008-02-26 06:43:26 UTC

Sayamindu: The attachment in comment #5 shows the issue is fixed. Could you please provide the code for it?

Thanks in advance.

Comment 8 Praveen A 2008-03-07 12:59:18 UTC

Created attachment 106776 [details] [review]
fix by suresh

Suresh has added a few more attempts at deciding the base glyph.

Comment 9 Praveen A 2008-03-07 13:15:03 UTC

We would love to see the fix in the next release. This is the last major bug in pango for Malayalam.

Comment 10 Santhosh Thottingal 2008-03-14 03:35:00 UTC

Please consider the patch; it helps make Malayalam rendering bug-free.

Comment 11 Praveen A 2008-03-14 04:03:05 UTC

This bug affects both typewriter script and traditional script, and the fix will make Malayalam rendering bug-free. There are no disputes about this fix. Hope to see the patch integrated in pango soon.

Comment 12 Ani Peter 2008-03-14 05:01:45 UTC

Would be really pleased to see this bug getting fixed asap as this is a critical issue.

Comment 13 Manilal 2008-03-19 13:44:50 UTC

I have tested this patch in pango-1.18.4 (Fedora 8) and it works fine. It would be really nice to see this patch in upstream, since it was a major bug in Malayalam rendering. Please commit the patch ASAP.

Comment 14 Praveen A 2008-04-14 05:40:07 UTC

A patch has been sitting in Bugzilla for more than 5 weeks, and many have tested it and requested its inclusion. Is there anything else we need to do to get this patch accepted? Please at least tell us if anything more needs to be done.

Comment 15 Rahul Bhalerao 2008-04-23 18:26:34 UTC

Created attachment 109782 [details]
test output

Attaching this screenshot for reference.

Comment 16 Rahul Bhalerao 2008-04-23 18:29:52 UTC

Behdad,
I have tested this patch (screenshot attached above). It solves the given problem very well. It was also not found to affect any other languages with similar features. Thus I think this patch can be accepted.

Comment 17 Behdad Esfahbod 2008-04-24 04:36:51 UTC

I don't really like the patch :(.  First, I'm pretty sure you need to break out of the loops on first match.

Second and more important issue, those magic numbers 2, 4, 6 are bogus.  What's intended I believe is that baseIndex and glyph[i] belong to the same cluster.  Is that all that is needed?  If yes, the code should be restructured to do that.

Comment 18 Baiju M 2008-04-24 07:24:07 UTC

Created attachment 109808 [details] [review]
Simplified patch for 'kaarkkodakan' problem

This patch fixes the first issue: "break out of the loops on first match."
It also avoids extra looping (the extended loop for determining the post-GSUB location of
baseIndex and mpreIndex).

The "magic number" issue is not addressed here. But if this is not going to cause any other
issue, please accept it.  Meanwhile we will try to make a better patch.

Comment 19 Manilal 2008-04-24 08:17:36 UTC

I have tested the latest patch (id=109808) in Fedora 7 (pango-1.16.4) and it works fine.

Comment 20 Baiju M 2008-04-24 12:28:52 UTC

Created attachment 109818 [details] [review]
Simplified patch for 'kaarkkodakan' problem

The previous patch was creating another issue, so I changed it like this. This one only brings the three if conditions inside one loop.

Comment 21 Behdad Esfahbod 2008-04-24 14:08:20 UTC

Can someone explain to me, in plain text, what the correct function of that code block should be?  Or rather, how did one come up with this patch?

Comment 22 Rahul Bhalerao 2008-04-24 21:49:50 UTC

Generally in Indic scripts, the 'pre-base Matra' (Mpre) is placed at the extreme left of the syllable cluster. But in Malayalam, if the consonant conjunct cluster contains more than two consonants (e.g. [C1 + H + C2 + H + C3] is a cluster of three consonants C1, C2 and C3), then the following pre-base Matra should be placed just left of C2, i.e. the second-last consonant, or, if C1 and C2 have a ligature, to the left of C3. In other words, essentially to the left of the final sub-cluster.

Thus to do this, only the condition glyph[i].cluster == (baseIndex - 2) is sufficient. The cases for baseIndex-4 and 6 are not required.

Comment 23 Rahul Bhalerao 2008-04-24 22:00:40 UTC

Created attachment 109861 [details] [review]
Patch for mprefixups.c

Behdad, w.r.t. my Comment #22, this is the simplest patch I could think of and test. Do you think it is the right implementation?
I have tested the above patch and would request even others to test it.

Comment 24 Rahul Bhalerao 2008-04-24 22:05:01 UTC

Created attachment 109863 [details]
Screen shot of test cases

Comment 25 Manilal 2008-04-25 09:27:28 UTC

Created attachment 109889 [details]
Screenshot of test case

I have tested the patch in pango-1.16 (Fedora 7). See the attachment.

Comment 26 Baiju M 2008-04-25 11:04:06 UTC

Created attachment 109896 [details]
Problem still with Patch 109861

Rahul, your patch does not fix some cases, as shown in the picture.
Here is the Unicode text used to create the image:
സബ്സ്ക്രൈബര്‍ & അവള്‍തന്‍സ്ത്രൈണഭാവം
We have started adding the words here: http://fci.wikia.com/wiki/SMC/Rendering_Tests

Comment 27 Baiju M 2008-04-25 11:19:05 UTC

Created attachment 109898 [details] [review]
Further simplified patch for 'kaarkkodakan' problem

I have attached a further simplified patch for the 'kaarkkodakan' problem.
I removed (baseIndex - 6) since I can't find any word like that.
If we find any word like that, it will be required.

Comment 28 Rahul Bhalerao 2008-04-25 12:45:37 UTC

Baiju, could you please also explain what the expected output is? It would also be good if you could explain your intended functionality (something Behdad has already asked for), since I could not spot any difference made by (-4).

Comment 29 Baiju M 2008-04-25 13:38:10 UTC

Created attachment 109910 [details]
Correct rendering of 'kaarkkodakan' bug related words

Rahul, I have attached the correct rendering for: സബ്സ്ക്രൈബര്‍ & അവള്‍തന്‍സ്ത്രൈണഭാവം

As you can see, the latest patch I added is a simplified
(reduced loops and if conditions; the logic is the same) version of
the original patch by Suresh.  Maybe you can compare my patch and yours and
explain why the other one is not working for all cases.  I have never been
in this business before, so it would be difficult for me to explain.
Anyway, I will try later if you or others cannot explain it, because
we (Malayalees) badly need this bug to be fixed :(

Comment 30 Praveen A 2008-04-25 14:56:49 UTC

Created attachment 109912 [details] [review]
screenshot of karkotakan like words with patch from Baiju

I have tested the patch by Baiju and here is the screenshot of correct rendering of all karkotakan like words.

Comment 31 Rahul Bhalerao 2008-04-25 20:43:39 UTC

I have already explained with respect to the test cases that were known by then. Now, with the new test cases being introduced, I am confused myself, as I do not know the exact syntax of the Malayalam script. What I could understand so far is that the left matra is to be put to the left of the final cluster, i.e. exactly what Behdad said: baseIndex and glyphs[i] should belong to the same cluster. One way to do that is to check for (baseIndex - 2n), where n is a positive integer. But arbitrarily picking a limit for n and repeating the same check for a series of n values are both poor approaches.

Comment 32 suruma 2008-04-26 02:49:02 UTC

Rahul's fix only solves the issue with a simple conjunct, i.e., C1 + H + C2. In the case of സബ്സ്ക്രൈബര്‍, with the 'സ്ക്ര' we have C1 + H + C2 + H + C3, so the baseIndex has to move down by a further 2 (the -4 case). The extreme case happens with 'സ്റ്റ്ര', e.g. സബ്സ്റ്റ്രേറ്റ്. This is where the -6 case arises. (In fact the biggest cluster is 'ഗ്ദ്ധ്ര', where we need a -8 case! But it is rarely used.)

It seems that Rahul has tested these things with a typewriter script font. A test with traditional script fonts like Rachana or Meera will help you see the above-mentioned cases.

Comment 33 suruma 2008-04-26 04:51:18 UTC

One correction: please disregard the remark that the biggest cluster, 'ഗ്ദ്ധ്ര', needs a -8 case.
The -6 case covers this as well.

Comment 34 Rahul Bhalerao 2008-04-26 09:09:16 UTC

Suresh, I am testing with all kinds of fonts. The patch up to -6, or even beyond that, will give the expected output, but that's not the best thing to do. We need to rethink and restructure the code in other ways. I am trying it myself.

Comment 35 Praveen A 2008-04-26 21:02:56 UTC

Can we get the patch committed and leave the bug open till the best way is found? Currently it is broken and this patch fixes it. We cannot leave this broken till the optimum one is found. It is a real pain to see this broken and we have to tell every single user to download a patched pango and it happens with every new user.

Comment 36 Baiju M 2008-04-30 08:39:09 UTC

The status of this bug is still 'UNCONFIRMED'. What is required to change the status to 'NEW'?

Comment 37 Behdad Esfahbod 2008-04-30 17:15:30 UTC

That simply doesn't matter...

Comment 38 Baiju M 2008-05-01 10:58:02 UTC

Behdad & Rahul, based on your suggestion I will try to define
the intended behaviour/functionality.

Terminology:-
   1. C1,C2...Cn are consonants Eg:- ക , ച , ഡ , ബ
   2. H stands for Halant -- Virama Sign (0D4D - ്)
   3. PB is a vowel sign with pre-base form. Eg:- ോ , ൊ , േ , െ
   4. ligature - ref:http://en.wikipedia.org/wiki/Ligature_(typography)
   5. ZWJ - Zero Width Joiner

Definition:-
   In the case of "C1 + H + C2 + H + .... + Cn + PB", the left part
   of PB should be placed just before the last ligature formed
   regardless of ZWJ coming in between.

Case 1 ( C1 + H + C2 + PB ):-

   ന്‍മേ = ന + ്  + ZWJ + മ + േ
   ബ്കേ =  ബ  +  ്  +  ക  +  േ
(This case is working without the proposed Patch 109898 .)

Case 2 ( C1 + H + C2 + H + C3 + PB ):-

 ര്‍ക്കോ  = ര +  ്   + ZWJ + ക  +  ്  + ക  +  ോ
 പ്ഗ്രേ  =  പ +  ്  +  ഗ +  ്  + ര +  േ
(This case is working with the -2 condition of
 the proposed Patch 109898 .)

Case 3 ( C1 + H + C2 + H + C3 + H + C4 + PB ):-

  ബ്സ്ക്രൈ =  ബ  +  ്  + സ  + ്  + ക  + ്  + ര +  ൈ
(This case is working with the -4 condition of
 the proposed Patch 109898 .)

Case 4 ( C1 + H + C2 + H + C3 + H + C4 + H + C5 + PB ):-

  ബ്സ്റ്റ്രേ = ബ +  ്  + സ +  ്  + റ +  ്  + റ +  ്  + ര + േ
(This case will work if -6 is also added as a condition
 to the proposed Patch 109898 .)

Comment 39 Baiju M 2008-05-01 11:01:39 UTC

Created attachment 110216 [details]
Kaarkkodakan problem definition as image

For those who cannot read my comment above, see the attached image with proper rendering.

Comment 40 Behdad Esfahbod 2008-05-01 22:54:09 UTC

Given that I don't read any Indic language, please tell me what the desired reordering is in each case.  Something like:

Case 1 ( C1 + H + C2 + PB ) -> ( PB1 + C1+H+C2 + PB2 )


Also, is Malayalam different from other languages here?  How?

Comment 41 Praveen A 2008-05-02 05:04:37 UTC

Let's take L12 to be the conjunct/ligature formed by C1+H+C2.

I will explain it in the following sequence

Case -> Correct rendering : Current rendering - Result

Case 1 is ( C1 + H + C2 + PB ) -> ( C1+H+PB1+C2+PB2 ): ( C1+H+PB1+C2+PB2 ) - GOOD

Case 2 is ( C1 + H + C2 + H + C3 + PB ) -> (C1+H+PB1+L23+PB2), L23=C2+H+C3 : (PB1+C1+H+L23+PB2) - INCORRECT

Case 3 is ( C1 + H + C2 + H + C3 + H + C4 + PB ) -> ( C1+H+PB1+L234+PB2 ), L234=C2+H+C3+H+C4 : ( PB1+C1+H+L234+PB2 ) - INCORRECT

Case 4 is ( C1 + H + C2 + H + C3 + H + C4 + H + C5 + PB ) -> (C1+H+PB1+L2345+PB2), L2345=C2+H+C3+H+C4+H+C5: ( PB1 +C1+H+L2345+PB2) - INCORRECT
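
To make the PB1/PB2 notation concrete: the two-part vowel signs have canonical decompositions in Unicode, so the left and right parts can be obtained with NFD normalization. The following is a small Python sketch; the function name split_vowel_sign is made up for illustration, and treating U+0D46..U+0D48 as "left part only" is an assumption based on the vowel signs discussed in this thread.

```python
import unicodedata


def split_vowel_sign(cp):
    """Return (PB1, PB2): the left and right parts of a vowel sign, None if absent."""
    parts = [ord(ch) for ch in unicodedata.normalize("NFD", chr(cp))]
    if len(parts) == 2:
        # Two-part sign, e.g. U+0D4B decomposes to U+0D47 + U+0D3E.
        return parts[0], parts[1]
    if 0x0D46 <= cp <= 0x0D48:
        # Pre-base only: there is no right part.
        return cp, None
    # Anything else is treated as post-base only.
    return None, cp
```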

Differences:

1) Most other Indian languages do not have such big ligatures, like five consonants joining to form a single ligature
2) AFAIK, most other Indian languages do not have prefixes
3) Some of them have one of the two, but Malayalam has both

Comment 42 Cibu C J 2008-05-02 05:27:34 UTC

I am not sure that ignoring ZWJ/ZWNJ will yield the right result. The reordering rule should be:

Step 1: move the reordering post-base form of the consonant to the leftmost position, without crossing ZWJ, ZWNJ or a visible virama.
Step 2: move the left part of the reordering vowel sign to the leftmost position, again without crossing ZWJ, ZWNJ or a visible virama.

This is applicable to all Indic scripts, AFAIK.

In other words, the base glyph will be the consonant which comes immediately after either one of:
ZWJ, ZWNJ, visible virama.
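
A minimal Python sketch of Step 2 only, operating on code points rather than glyphs. Which viramas are visible depends on the font, so the caller has to supply their indices; the function name reorder_prebase and that parameter are hypothetical, and this is not how the pango code is structured.

```python
import unicodedata

ZWJ, ZWNJ = 0x200D, 0x200C


def reorder_prebase(cluster, visible_viramas=frozenset()):
    """Return the visual order of one syllable that ends in a vowel sign.

    cluster: list of code points, vowel sign last.
    visible_viramas: indices (into cluster) of viramas the font shows visibly.
    """
    *body, vowel = cluster
    parts = [ord(ch) for ch in unicodedata.normalize("NFD", chr(vowel))]
    if len(parts) == 2:                 # two-part sign: left part + right part
        left, right = [parts[0]], [parts[1]]
    elif 0x0D46 <= vowel <= 0x0D48:     # pre-base only
        left, right = [vowel], []
    else:                               # post-base only: nothing to reorder
        return list(cluster)

    # Walk back from the vowel sign; stop at the nearest ZWJ, ZWNJ or visible
    # virama. The base starts right after it, or at the cluster start if none.
    base = 0
    for i in range(len(body) - 1, -1, -1):
        if body[i] in (ZWJ, ZWNJ) or i in visible_viramas:
            base = i + 1
            break
    return body[:base] + left + body[base:] + right
```

For example 1 from the description, the ZWJ at index 2 stops the walk, so the left part lands just before the first KA. For example 2, the result depends on whether the font shows the virama after PA: pass visible_viramas={1} for a font without a ligature across it.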

Comment 43 Baiju M 2008-05-02 06:10:11 UTC

Hi Behdad,

Well, I don't know much about other indic languages
(This bug is dragging me to all these issues :( ).
So, here I will explain the Malayalam case. In Malayalam,
when a consonant/conjunct and vowel (sign) is joining, it
will create three kinds of forms depending on the vowel.

C + Vowel  ->      C+PostBase  (Variant 1)
                       or
                    PreBase+C  (Variant 2)
                       or
               PreBase+C+PostBase  (Variant 3)

(Here, C stands for consonant/conjunct)

Remember, this bug is only relevant for PreBase+C+PostBase
and PreBase+C (Variant 2 and Variant 3) forms.

The variant 1 (C+PostBase) forming vowel signs:-
0D3E(ാ), 0D3F(ി), 0D40(ീ), 0D41(ു), 0D42(ൂ), 0D43(ൃ),
0D44(Not used these days)

The variant 2 (PreBase+C) forming vowel signs:-
0D46(െ), 0D47(േ), 0D48(ൈ)

The variant 3 (PreBase+C+PostBase) forming vowel signs:-
0D4A(ൊ), 0D4B(ോ), 0D4C(ൌ)
Note: 0D4C(ൌ) - 0D4C may also be written as C+PostBase 
      with only the right symbol (ൗ  - 0D57)
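
The three lists above map directly onto a small lookup. This Python sketch uses exactly the code points given in this comment; the names vowel_variant and bug_relevant are made up for illustration.

```python
# Code points as listed above.
POST_BASE = set(range(0x0D3E, 0x0D45))      # U+0D3E..U+0D44  (Variant 1)
PRE_BASE = {0x0D46, 0x0D47, 0x0D48}         # Variant 2
PRE_AND_POST = {0x0D4A, 0x0D4B, 0x0D4C}     # Variant 3


def vowel_variant(cp):
    """Return 1, 2 or 3 for the variants above, or None for any other code point."""
    if cp in POST_BASE:
        return 1
    if cp in PRE_BASE:
        return 2
    if cp in PRE_AND_POST:
        return 3
    return None


def bug_relevant(cp):
    """Only vowel signs with a pre-base part (Variants 2 and 3) are affected."""
    return vowel_variant(cp) in (2, 3)
```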

Here I will try to redefine the definition and example
using better terminology.

Terminology:-
  1. C1,C2...Cn are consonants Eg:- ക , ച , ഡ , ബ
  2. H stands for Halant -- Virama Sign (0D4D - ്)
  3. VS is a vowel sign with pre-base, post-base, or
     pre-base+post-base form. Eg:- ോ , ൊ , േ , െ
  4. ligature - a glyph formed from multiple characters
     ref: http://en.wikipedia.org/wiki/Ligature_(typography)
  5. ZWJ - Zero Width Joiner

Definition:-
   In the case of "C1 + H + C2 + H + .... + Cn + VS", 
   the left part of VS should be placed just before the last
   ligature formed regardless of ZWJ coming in between.

Case 1 ( C1 + H + C2 + VS ):-

 Variant 1: ന്‍മാ = ന + ്  + ZWJ + മ  +  ാ ( C1+H+ZWJ + C2 + PostBase)
 Variant 2: ന്‍മേ = ന + ്  + ZWJ + മ + േ (C1+H+ZWJ + PreBase + C2)
            ബ്കേ =  ബ  +  ്  +  ക  +  േ (C1+H + PreBase + C2)
 Variant 3: മ്ചോ = മ  +  ്  + ച + ോ  (C1+H + PreBase + C2 + PostBase)

(This case is working without the proposed Patch 109898 )

Case 2 ( C1 + H + C2 + H + C3 + VS ):-

 Variant 1: ര്‍ക്കാ = ര +  ്  + ZWJ + ക +  ്  + ക +  ാ 
            (C1+H+ZWJ + C2+H+C3 + PostBase)
 Variant 2: പ്ഗ്രേ  =  പ +  ്  +  ഗ +  ്  + ര +  േ 
            (C1+H + PreBase + C2+H+C3)
 Variant 3: ര്‍ക്കോ  = ര +  ്   + ZWJ + ക  +  ്  + ക  +  ോ 
           (C1+H+ZWJ + PreBase + C2+H+C3 + PostBase)
 
(This case is working with the -2 condition of
 the proposed Patch 109898 )

Case 3 ( C1 + H + C2 + H + C3 + H + C4 + VS ):-

 Variant 1: ബ്സ്ക്രാ = ബ +  ്  + സ + ്  + ക + ്  + ര +  ാ 
            (C1+H + C2+H+C3+H+C4 + PostBase)
 Variant 2: ബ്സ്ക്രൈ =  ബ  +  ്  + സ  + ്  + ക  + ്  + ര +  ൈ 
            (C1+H + PreBase + C2+H+C3+H+C4)
 Variant 3: ബ്സ്ക്രൊ = ബ + ് + സ + ് + ക + ്  + ര + ൊ 
            (C1+H + PreBase + C2+H+C3+H+C4 + PostBase)
(This case is working with the -4 condition of
 the proposed Patch 109898 )

Case 4 ( C1 + H + C2 + H + C3 + H + C4 + H + C5 + VS ):-

 Variant 1: ബ്സ്റ്റ്രാ =  ബ +  ് + സ + ് + റ + ്  + റ +  ്  + ര +  ാ
            (C1+H + C2+H+C3+H+C4+H+C5 + PostBase)
 Variant 2: ബ്സ്റ്റ്രേ = ബ +  ്  + സ +  ്  + റ +  ്  + റ +  ്  + ര + േ
            (C1+H + PreBase + C2+H+C3+H+C4+H+C5)
 Variant 3: ബ്സ്റ്റ്രോ = ബ +  ്  + സ +  ്  + റ +  ്  + റ +  ്  + ര +  ോ
            (C1+H + PreBase + C2+H+C3+H+C4+H+C5 + PostBase)
(This case will work if -6 is also added as a condition to
 the proposed Patch 109898 )

Comment 44 Baiju M 2008-05-02 06:28:00 UTC

Behdad, what Cibu said in Comment 42 is very correct.

Comment 45 Behdad Esfahbod 2008-05-05 17:25:36 UTC

Thanks for all the explanations.

So I have two questions:

  - If I understand correctly, I read that Malayalam is different from other Indic scripts with respect to where to position the prebase form.  Is that correct?  If yes, why doesn't the patch only special case this for Malayalam then?

  - This "feature" relies on the font using a single glyph as the ligature.  And on having a ligature in the first place.  That's clearly a restriction I've not seen before.  To me it looks like the correct fix here is to make sure the sequence is broken into more clusters.  Then the prebase form will always be placed at the beginning of the cluster as it is now, and it will all work.  Assumptions like what's being suggested in the patch are really out of scope of how Pango's shaping model is supposed to work.

Comment 46 Cibu C J 2008-05-05 22:03:55 UTC

1) Yes. Malayalam is different. For example, in Devanagari the prebase form crosses ZWJ. E.g.: क्‍कि (<KA, Virama, ZWJ, KA, I Sign>, /kki/). However, I just want to make sure I understand what you mean by prebase. For me it is the reordering vowel or consonant sign. It is not the C1-conjoining form in the cluster <C1 Virama C2>.

2) The feature cannot rely on a single glyph as the ligature. That is not universal to Malayalam. For example, ത്രെ (<TA, Virama, RA, E-Sign>, /thra/) in reformed orthography. However, you still need a criterion to identify the cluster. The codepoints that form a cluster are font dependent and the algorithm should take that into account.

Let me put the algorithm a little more elaborately:

Assume the codepoint subsequences are c[1], c[2] etc; where each subsequence corresponds to one glyph already chosen. Let  g[i] be the glyph for the codepoint subsequence c[i]. Let c[n] be the codepoint subsequence with reordering prebase form g[n]. We need to find the position to place g[n]. 

For that, in the codepoint subsequence list, traverse back from i == n-1, without ignoring ZWJ and ZWNJ, until you hit:
g[i] == virama glyph or
last codepoint in c[i] and first codepoint in c[i+1] are not virama or
c[i] is <fictitious start of the list>

Now we found a place for g[n], it is immediately after g[i].

Comment 47 Behdad Esfahbod 2008-05-26 15:45:02 UTC

For reference, a gtk-devel-list thread was started about this bug today:
http://mail.gnome.org/archives/gtk-devel-list/2008-May/msg00094.html

Comment 48 Rahul Bhalerao 2008-06-02 17:14:18 UTC

Created attachment 111972 [details]
Screenshot after modified solution

Can anyone from Malayalam community please confirm if the rendering of combinations related to this bug in the given screenshot are all correct or not? As this screenshot is similar to the one taken against the patch 109818, I think it is mostly correct.

Comment 49 Cibu C J 2008-06-02 17:25:13 UTC

Created attachment 111976 [details]
Incorrect renderings marked

Comment 50 Cibu C J 2008-06-02 17:26:24 UTC

Comment on attachment 111976 [details]
Incorrect renderings marked

The incorrect renderings are marked. Also, I see the new orthography testcases are completely missing here in this test.

Comment 51 Rahul Bhalerao 2008-06-03 08:36:30 UTC

Created attachment 112027 [details]
New screenshot with yet another modified solution with new orthography

More comments on following attachment..

Comment 52 Rahul Bhalerao 2008-06-03 08:41:08 UTC

Created attachment 112028 [details]
New screenshot with modified solution with old script font

This and the previous screenshot were taken after making a few more changes in the code. As can be seen in the screenshot, I have underlined the problems I could spot. But comparing the two screenshots, the problems appear to be font dependent, as the same problems are not present with the other font. Thus, from the rendering point of view, things are working fine here. If I get confirmation on this I would be glad to make and post a patch out of it.

Comment 53 Santhosh Thottingal 2008-06-03 08:49:27 UTC

The rendering in the image with Meera font is correct. And in the image with Lohit font, the underlined words are wrong.
Please post the patch so that others can test it.

Comment 54 Rahul Bhalerao 2008-06-03 13:34:01 UTC

Created attachment 112055 [details] [review]
A much generic patch for mprefixups.c

This is the patch that gave the results shown in the screenshot. It appears to work for all the test cases considered so far. It is also more generic than the earlier ones. 
It simply determines the baseGlyph in a better way. It was also not found to affect any scripts other than Malayalam, since it depends generically on the cluster formation for individual scripts. 

Please do some more testing if needed. I hope it is very close to being acceptable now.

Comment 55 Mahesh T Pai 2008-06-05 10:31:30 UTC

(In reply to comment #38)

>    3. PB is a vowel sign with pre-base form. Eg:- ോ , ൊ , േ , െ

I feel this definition is a bit confusing, going by the text of the Unicode standards (UTS). 

UTS distinguishes between "[pre|post]-base" which I have understood so far to mean a situation where glyph substitution is involved, and the vowel matras. All the examples given above are vowel signs, and do NOT involve glyph substitution. 

What Baiju has given above are code points in UTS, defined as "Vowel Signs" in the code charts. 

Does the pango code treat pre/post-base glyph substitution and dependent vowel signs (both normal "dependent" and "two part dependent" vowel signs) the same way?

> Definition:-
>    In the case of "C1 + H + C2 + H + .... + Cn + PB", the left part
>    of PB should be placed just before the last ligature formed
>    regardless of ZWJ coming in between.

Please see http://www.unicode.org/charts/PDF/U0D00.pdf

A large part of this problem, the part caused by ZWJ, is resolved by version 5.1 of TUS. The issue of conjuncts, and of where to place the pre-base ligature for complex conjuncts, still exists. We can, in most cases, put less stress on the presence or absence of ZWJ.

> Case 1 ( C1 + H + C2 + PB ):-
> 
>    ന്‍മേ = ന + ്  + ZWJ + മ + േ
>    ബ്കേ =  ബ  +  ്  +  ക  +  േ
> (This case is working without the proposed Patch 109898 .)

I would rather call this the simple case - C1 + H + C2, where only two consonants are involved. 

> 
> Case 2 ( C1 + H + C2 + H + C3 + PB ):-

Now, I cannot read code. :(

When it comes to complex conjuncts (sequences with three or more consonants with halant in between), can't the code simply do this:

1. scan the text to see for sequences with H in between.
2. count the number of consonants, (say, we have C1 to CN) and for N - 1, 
3. substitute the sequence with the relevant glyph 
4. check if the Nth consonant is a vowel sign (remember that vowel signs do NOT have a preceding halant, but a pre/post base form has a preceding halant - hence my earlier request to use the appropriate terminology)
5. If it is a single part vowel which goes to the left, put the glyph before the glyph obtained at step 3.
6. If CN is a character with pre/post -base substitution, do the required (put the substituted form before the glyph from step 3 or after, if we have a post base CN). Remember to put the prebase form / vowel sign before the last such glyph we get from step 3. 
7. If the font supports a single glyph for what we get at 6 do a substitution again. 

Ok. I am a big zero at coding, and worse at algorithms, but hope we have a starting point here.

Comment 56 Mahesh T Pai 2008-06-05 10:40:10 UTC

(In reply to comment #42)
> I am not sure, ignoring ZWJ/ZWNJ will yield right result. 

Very true.

> In other words, the base glyph will be the consonant which comes immediately
> after either one of:
> ZWJ, ZWNJ, visible virama.

I think you should, for completeness sake, elaborate on the "visible virama" thing too.

Since not all fonts are expected to have all complex conjuncts, the assumption is that a sequence C1 + H + C2 + H + ..... + H + CN can, with some fonts, result in N-1 visible halants, and sometimes in just one glyph. I hope I have put this correctly. So the vowel signs should go to the right of the last visible virama if the font does not substitute the entire sequence with a single glyph.

Comment 57 Manilal 2008-06-10 07:12:45 UTC

(In reply to comment #54)

Tested Rahul's patch in pango-1.20.1 (Fedora 9). There are no issues and the rendering seems to be perfect. It would be really nice to see this patch upstream. I have attached a screenshot of the correctly rendered page (http://fci.wikia.com/wiki/%E0%B4%B8%E0%B5%8D%E0%B4%B5%E0%B4%A4%E0%B4%A8%E0%B5%8D%E0%B4%A4%E0%B5%8D%E0%B4%B0_%E0%B4%AE%E0%B4%B2%E0%B4%AF%E0%B4%BE%E0%B4%B3%E0%B4%82_%E0%B4%95%E0%B4%AE%E0%B5%8D%E0%B4%AA%E0%B5%8D%E0%B4%AF%E0%B5%82%E0%B4%9F%E0%B5%8D%E0%B4%9F%E0%B4%BF%E0%B4%99%E0%B5%8D%E0%B4%99%E0%B5%8D/Kaarkkodakan).

Thanks Rahul.

Comment 58 Manilal 2008-06-10 07:14:52 UTC

Created attachment 112457 [details]
Screenshot of Firefox with Karkkodakan test case

Comment 59 Rahul Bhalerao 2008-06-16 09:13:50 UTC

Behdad, can you please review and commit my patch from Comment #54? It has been well tested so far.

Comment 60 Praveen A 2008-06-26 05:42:59 UTC

I have also tested it. Behdad, can you please review and commit Rahul's patch? It has been quite a long time since this bug was reported.

Comment 61 Manilal 2008-08-06 06:12:00 UTC

This patch has been here for more than 2 months. Please commit it ASAP so that it can be included in the next releases of Fedora (Fedora 10) and Debian (Lenny).

Comment 62 Behdad Esfahbod 2008-08-06 06:16:40 UTC

Going to commit this for the pango release this week, but I'm still not happy about the patch :).

Comment 63 Behdad Esfahbod 2008-08-06 07:52:08 UTC

Committed in my git tree.  Shows up in SVN later tonight.

2008-08-06  Behdad Esfahbod  <behdad@gnome.org>

        Bug 441654 – prefix fails when more than one base characters (as
        conjuncts) present after a half form the next prefix renders
        incorrectly
        Patch from  Rahul Bhalerao

        * modules/indic/mprefixups.c (indic_mprefixups_apply):
        Do what I was told to do.

Comment 64 Manilal 2008-08-06 08:57:01 UTC

This was the last known bug in GNOME Malayalam rendering. With this the GNOME Malayalam rendering is 100% perfect. Thanks Behdad. Special mention to Praveen(reporter), Rahul and others who commented.

Comment 65 Praveen A 2008-08-06 09:59:02 UTC

Thanks a lot Behdad. This is indeed sweet news (no need to maintain patched pangos any more). Thanks to Suresh for figuring out the patch and to Rahul and Baiju for refining it. Also thanks to Manilal, Hiran, Sayamindu, Ani, Santhosh, Cibu and Mahesh for testing the patches and for their suggestions.