Bug 437633 – Placement of Arabic diacritics over 3-component ligature is incorrect



Summary:	Placement of Arabic diacritics over 3-component ligature is incorrect


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	pango-maint
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-05-11 07:35 UTC by Meor Ridzuan
Modified:	2012-09-26 19:42 UTC

See Also:
GNOME target:	---
GNOME version:	2.15/2.16

Attachments
Picture comparing the result from Gnome and XP (13.45 KB, image/png) 2007-05-11 07:39 UTC, Meor Ridzuan
Screenshot against trunk (13.47 KB, image/png) 2009-08-06 18:26 UTC, Khaled Hosny

Description Meor Ridzuan 2007-05-11 07:35:51 UTC

Please describe the problem:
This bug is related to the Arabic shaping engine. I suspect it lies in Pango. In fact, it is similar to bug 302952, which I reported earlier. However, bug 302952 can be considered closed, as the latest version of GNOME no longer suffers from that problem.
The only bug left is with 3-component ligatures: the placement of the diacritic marks is incorrect. (The shaping itself is correct, unlike previously, when it did not take place at all.)

Steps to reproduce:
1. Install me_quran font from http://arabicfonts.wikispaces.com/ (it's mine)
2. Create a text file with the following sequence :بِسْمِ ٱللَّهِ ٱلرَّحْمَٟنِ ٱلرَّحِيمِ 
3. Open (or copy/paste from here) the file using gedit (or any gtk editor). Notice the placement of shadda over the ligature ALLAH.


Actual results:
The shadda+fatha mark is placed incorrectly over the ligature (see the picture attached in comment 1).

Expected results:
The shadda+fatha mark is placed above second lam (middle of ligature)

Does this happen every time?
Yes

Other information:

Comment 1 Meor Ridzuan 2007-05-11 07:39:19 UTC

Created attachment 87993 [details]
Picture comparing the result from Gnome and XP

The left is taken from Gnome (Ubuntu 7.04), and on the right is taken from XP sp1.

Comment 2 Paolo Borelli 2007-05-11 07:44:11 UTC

Hi Meor, thanks for the bugreport.

This is definitely a Pango issue, so I am reassigning the bug.

Comment 3 Behdad Esfahbod 2007-05-15 21:06:41 UTC

Does your font have ligature caret positioning tables?

Comment 4 Behdad Esfahbod 2007-05-15 21:07:59 UTC

Humm, sorry.  I probably meant, does it have the 'mark' and 'mkmk' GPOS tables?

Comment 5 Meor Ridzuan 2007-05-16 00:25:28 UTC

(In reply to comment #4)
> Humm, sorry.  I probably meant, does it have the 'mark' and 'mkmk' GPOS tables?
> 

Behdad,
Yes, definitely. I use both the MARK and MKMK features, and both are used in the example I gave. You can get the font from the address above. The VOLT tables in the font are still intact, if you want to look at those as well.

Comment 6 Meor Ridzuan 2007-05-30 07:59:10 UTC

I've gone through the Pango source to trace the possible error. I think it might be due to the code below:

harfbuzz-gpos.c, Lookup_MarkLigPos function:

    /* We must now check whether the ligature ID of the current mark glyph
       is identical to the ligature ID of the found ligature. If yes, we
       can directly use the component index. If not, we attach the mark
       glyph to the last component of the ligature. */

    if ( IN_LIGID( j ) == IN_LIGID( buffer->in_pos) )
    {
      comp_index = IN_COMPONENT( buffer->in_pos );
      if ( comp_index >= lat->ComponentCount )
        return HB_Err_Not_Covered;
    }
    else
      comp_index = lat->ComponentCount - 1;

I by no means understand how Pango works, but the above comment does not sound right. First of all, why do we compare ligature IDs? We are attaching a mark to a ligature, so how can a mark have a ligature ID? In my view, this is how it is supposed to work:

1. Before the ligature is formed (from the GSUB table), the mark should be tied to a specific component of the ligature. This information should be stored somewhere.
2. After the ligature is formed, check for the anchor class. If the ligature has the same class, attach the mark to its component. If the ligature does not have the class, some adjustment to the horizontal positioning may be required, depending on which component the mark belongs to.

Basically, there is no need to attach the mark to the last component of the glyph, because by rights we should know which component the mark belongs to. Let's take the above example:
The order of the characters is: lam lam shadda fatha heh kasra. So we know that shadda+fatha belongs to the second component of the ligature, and kasra belongs to the third component. Why do we need to compare ligature IDs to know which component a mark belongs to?
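
The bookkeeping described here can be sketched in a few lines of C. This is only an illustration of the idea, not Pango or HarfBuzz code: the Glyph struct and the assign_components function are hypothetical names, and the sketch assumes each mark belongs to the nearest preceding base letter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical glyph record for illustration; not a Pango/HarfBuzz type. */
typedef struct {
    const char *name;  /* readable label */
    int is_mark;       /* 1 for a combining mark, 0 for a base letter */
    int component;     /* output: 0-based index of the owning base, -1 if none */
} Glyph;

/* Walk the run in logical order and tie every mark to the most recent
   base letter, before any ligature substitution takes place. */
static void assign_components(Glyph *run, size_t len)
{
    int current_base = -1;

    assert(run != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (!run[i].is_mark)
            current_base++;           /* a new base starts a new component */
        run[i].component = current_base;
    }
}
```

For the run lam, lam, shadda, fatha, heh, kasra this records component 1 (the second lam) for shadda and fatha, and component 2 (the heh) for kasra.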

Also, I found out that my original post is incorrect. In fact, the bug affects 2-component ligatures as well, as in my previous report (bug 302952). It has been more than 2 years since my first report, but the bug is still there. I hope the above information helps.

Comment 7 Behdad Esfahbod 2007-07-31 20:21:06 UTC

*** Bug 302952 has been marked as a duplicate of this bug. ***

Comment 8 Behdad Esfahbod 2007-07-31 20:22:28 UTC

The duplicate bug has a sample font.  Khaled is telling me that it only happens in the first ligature of the line or something like that, not sure.

Comment 9 Behdad Esfahbod 2007-08-29 08:52:58 UTC

One problem fixed.  More to come.

2007-08-29  Behdad Esfahbod  <behdad@gnome.org>

        Bug 302952 – The placement of a diacritic marks for an arabic ligature
        is not correct

        * pango/opentype/harfbuzz-buffer.c (hb_buffer_allocate_ligid): Don't
        use zero as allocated ligature id.  Zero means no ligature id.

Comment 10 Behdad Esfahbod 2007-08-29 18:51:25 UTC

(In reply to comment #6)
> Basically, there is no need to attached the mark to the last component of the
> glyph because by right we should know which component the mark belongs to.
> Let's take the above example:
> the order of the character is: lam lam shadda fatha heh kasra. So, we know the
> shadda+fatha belongs to second component of the ligature, and kasra belongs to
> the third component of the ligature . Why do we need to compare the ligature ID
> to know which component it belongs to?

Except that the ligature for <lam lam heh> seems to be forming in two stages: first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming into the final lam-lam-heh ligature.  Is that correct?

What seems to be happening is that when seeing <lam lam fatha heh>, pango first forms the lam-heh ligature, producing <lam lam-heh fatha>. It also marks the lam-heh and fatha glyphs with ligature ID 1 and marks fatha as sitting on component 0 of the ligature.  But then, when it forms the lam-lam-heh ligature, it marks it as ligature ID 2 and loses the relation between fatha and lam-lam-heh, placing the fatha on the last component of the ligature (as if it followed all the chars in the ligature).

The fix is to scan for marks and adjust their ligature ID when forming new ligatures from old ones.  The only problem is that we don't immediately know how many components each ligature has, so adjusting component numbers is at best a heuristic.  Going to give it a try.
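
A toy model of the failure and of the proposed fix, for illustration only. The Glyph struct, pick_component and retag_marks are hypothetical names; pick_component mirrors the comparison quoted in comment 6, and the shift argument is exactly the part that is described above as a heuristic, since it assumes the number of components placed before the old ligature is known.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical glyph record; lig_id 0 means "no ligature id". */
typedef struct {
    int is_mark;
    unsigned lig_id;
    unsigned component;  /* component index inside that ligature */
} Glyph;

/* Component choice modelled on the Lookup_MarkLigPos snippet:
   matching ids use the stored component, otherwise fall back to the
   last component.  Returns -1 where the original reports "not covered". */
static int pick_component(const Glyph *lig, const Glyph *mark,
                          unsigned component_count)
{
    assert(component_count > 0);
    if (mark->lig_id == lig->lig_id)
        return mark->component < component_count ? (int)mark->component : -1;
    return (int)component_count - 1;
}

/* Proposed fix: when ligature old_id is absorbed into a new ligature
   new_id, re-tag its marks.  shift is the assumed number of components
   that now precede the old ligature's first component. */
static void retag_marks(Glyph *run, size_t len,
                        unsigned old_id, unsigned new_id, unsigned shift)
{
    for (size_t i = 0; i < len; i++) {
        if (run[i].is_mark && run[i].lig_id == old_id) {
            run[i].lig_id = new_id;
            run[i].component += shift;
        }
    }
}
```

With a fatha tagged (ID 1, component 0) and a lam-lam-heh ligature tagged ID 2, pick_component falls back to the last component. After retag_marks with old_id 1, new_id 2 and shift 1, it selects component 1, the second lam.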

Comment 11 Meor Ridzuan 2007-08-30 07:17:18 UTC

(In reply to comment #10)
> (In reply to comment #6)
> > Basically, there is no need to attached the mark to the last component of the
> > glyph because by right we should know which component the mark belongs to.
> > Let's take the above example:
> > the order of the character is: lam lam shadda fatha heh kasra. So, we know the
> > shadda+fatha belongs to second component of the ligature, and kasra belongs to
> > the third component of the ligature . Why do we need to compare the ligature ID
> > to know which component it belongs to?
> 
> Except that the ligature for <lam lam heh> seems to be forming in two stages:
> first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming
> into the final lam-lam-heh ligature.  Is that correct?

My personal opinion is that it should not be done that way. I believe that inside the font itself we set a sort of priority (in the GSUB table, I think), i.e. the order in which the ligatures should be formed. I don't think I defined my substitutions that way, even though I could have. What I did is use the following substitutions:
1. (lam initial) (lam medial) (heh final) -> (ligature Allah), ignore marks, 3-component ligature.
2. (lam initial) (heh final) -> (ligature lamheh isolated), 2-component ligature.

Thus, the first substitution should take place first. What you describe is the following:

1. (lam medial) (heh final) -> (ligature lamheh final), 2 component.
2. (lam initial) (ligature lamheh final) -> Allah

Yes, the second scenario should work as well, but I prefer the first one.
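
To make the first scenario concrete, here is a minimal sketch of matching a ligature rule against a glyph run while ignoring marks. It is not taken from the font or from Pango: the glyph enum, is_mark and match_ignoring_marks are hypothetical, and the sketch assumes marks are simply skipped between components.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical glyph ids for this illustration. */
enum { LAM_INIT, LAM_MEDI, HEH_FINA, SHADDA, FATHA, KASRA };

static int is_mark(int g)
{
    return g == SHADDA || g == FATHA || g == KASRA;
}

/* Try to match pattern at run[start], skipping marks that sit between
   components.  Returns the number of run glyphs covered by the match
   (marks included), or 0 if the pattern does not match. */
static size_t match_ignoring_marks(const int *run, size_t len, size_t start,
                                   const int *pattern, size_t pat_len)
{
    size_t i = start, matched = 0;

    while (i < len && matched < pat_len) {
        if (is_mark(run[i])) {
            if (matched == 0)
                return 0;          /* a match must begin on a base glyph */
            i++;                   /* skip the mark, keep the match going */
            continue;
        }
        if (run[i] != pattern[matched])
            return 0;
        matched++;
        i++;
    }
    return matched == pat_len ? i - start : 0;
}
```

On the run lam-initial, lam-medial, shadda, fatha, heh-final, kasra, the 3-component rule matches in one step and covers five glyphs, while the 2-component rule (lam-initial, heh-final) does not match at that position.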

> 
> What seems to be happening is that when seeing <lam lam fatha heh> pango first
> forms the lam-heh ligature, producing <lam lam-heh fatha>, it also marks the
> lam-heh and fatha glyphs with ligature ID 1 and marks fatha as sitting on
> component 0 of the ligature.  But then when it forms the lam-lam-heh ligature,
> marks it as ligature ID 2 and then loses the relation between fatha and
> lam-lam-heh, placing it on the last component of the ligature (as if it
> followed all the chars in the ligature).
> 
> The fix is to scan for marks and adjust their ligature ID when forming new
> ligatures from old ones.  The only problem is that we don't immediately know
> how many components each ligature has, so adjusting component numbers is at
> best a heuristic.  Going to give it a try.
> 

I believe every ligature defines how many components it has, so we should know that. I think we can focus on making the first scenario work first. Once it is OK, we can create the second scenario and test the method.

Regards.

Comment 12 Behdad Esfahbod 2007-08-30 21:28:14 UTC

(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #6)
> > > Basically, there is no need to attached the mark to the last component of the
> > > glyph because by right we should know which component the mark belongs to.
> > > Let's take the above example:
> > > the order of the character is: lam lam shadda fatha heh kasra. So, we know the
> > > shadda+fatha belongs to second component of the ligature, and kasra belongs to
> > > the third component of the ligature . Why do we need to compare the ligature ID
> > > to know which component it belongs to?
> > 
> > Except that the ligature for <lam lam heh> seems to be forming in two stages:
> > first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming
> > into the final lam-lam-heh ligature.  Is that correct?
> 
> My personal opinion is, it should not be done that way. I believe, inside the
> font itself, we do set sort of priority (in GSUB table, I think) or the order
> of which how the ligature should take place. In my substitution, I don't think
> I define it that way, even though I could. What I did is, using the following
> substitution:
> 1. (lam initial) (lam medial) (heh final) -> (ligature Allah),ignore marks, 3
> component ligature.
> 2. (lam initial) (heh final) -> (ligature lamheh isolated), 2 component
> ligature.
> 
> Thus, the first substitution should take place first. What you describe is the
> following:
> 
> 1. (lam medial) (heh final) -> (ligature lamheh final), 2 component.
> 2. (lam initial) (ligature lamheh final) -> Allah
> 
> Yes, the second scenario should work as well, but I prefer the first one.


But the second scenario seems to be what's happening right now.  I tried hard to avoid having to dig into the font to see how it works though, so I may be wrong.

Comment 13 Behdad Esfahbod 2007-08-31 22:43:51 UTC

Anyway, putting this on hold as it needs further debugging.  This is just a hack showing my hypothesis is correct:

Index: harfbuzz-gsub.c
===================================================================
--- harfbuzz-gsub.c     (revision 2415)
+++ harfbuzz-gsub.c     (working copy)
@@ -1080,6 +1080,11 @@ static FT_Error  Lookup_LigatureSubst( H
        if ( ADD_String( buffer, i, 1, &lig->LigGlyph,
                        0xFFFF, ligID ) )
          return error;
+       if (IN_CURITEM()->ligID)
+         {
+           IN_CURITEM()->ligID = ligID;
+           IN_CURITEM()->component = 1;
+         }
       }
     }
     else


It's not correct, and eats babies.

Comment 14 Roozbeh Pournader 2007-12-03 14:13:44 UTC

I seem to be

Comment 15 Behdad Esfahbod 2007-12-03 18:30:54 UTC

Humm?
Roozbeh do you have any insight into this?  Did you hit it too?

Comment 16 Roozbeh Pournader 2007-12-03 20:52:59 UTC

Sorry. Unintentional spam.

I am encountering various problems with mark positioning over ligatures with my new font, but I am still at the level of fixing fontforge!

Comment 17 Behdad Esfahbod 2009-08-05 20:41:04 UTC

Can someone test this with pango from master?  harfbuzz-ng has been merged which may already fix this.

Comment 18 Khaled Hosny 2009-08-06 18:26:39 UTC

Created attachment 140050 [details]
Screenshot against trunk

Still broken, and now 'mkmk' seems to be broken too.

Comment 19 Behdad Esfahbod 2009-08-06 22:14:42 UTC

Which font is this again?  Can you email me the font you tested?

Comment 20 Khaled Hosny 2009-08-07 03:20:52 UTC

I tested the font from the page linked in the original report, http://arabicfonts.wikispaces.com/file/view/me_quran_volt_newmet.zip

Comment 21 Behdad Esfahbod 2009-08-09 23:04:20 UTC

Khaled, 'mkmk' works fine here.

Comment 22 Behdad Esfahbod 2009-08-10 04:46:18 UTC

Ok, I now understand why the original bug happens.  Will be fixed soon when we order lookups from all features together.

Comment 23 Behdad Esfahbod 2012-07-30 01:42:12 UTC

With further testing, I confirmed that Uniscribe too forms the LAM,HEH ligature first, and makes the LAM,LAM,HEH out of that ligature.  hb does the same now.  We only need to update component info to take that into consideration.  Working on it.

Comment 24 Behdad Esfahbod 2012-07-30 04:01:52 UTC

Fixed in HarfBuzz master.  Leaving the bug open until Pango grabs that.

commit fe20c0f84f5ff518dc471bf22ac5a83ef079eb69
Author: Behdad Esfahbod <behdad@behdad.org>
Date:   Mon Jul 30 00:00:59 2012 -0400

    [GSUB] Fix mark component stuff when ligatures form ligatures!
    
    See comments.
    
    Fixes https://bugzilla.gnome.org/show_bug.cgi?id=437633