GNOME Bugzilla – Bug 437633
Placement of Arabic diacritics over 3-component ligature is incorrect
Last modified: 2012-09-26 19:42:32 UTC
Please describe the problem:
This bug is related to the Arabic shaping engine; I suspect it lies in Pango. In fact, it is similar to bug 302952, which I reported earlier. Bug 302952 can be considered closed, as the latest version of GNOME no longer suffers from that problem. The only bug left is with 3-component ligatures: the placement of the diacritic marks is incorrect. (The shaping itself is correct, unlike previously, when it did not take place at all.)

Steps to reproduce:
1. Install the me_quran font from http://arabicfonts.wikispaces.com/ (it's mine).
2. Create a text file with the following sequence: بِسْمِ ٱللَّهِ ٱلرَّحْمَٟنِ ٱلرَّحِيمِ
3. Open the file (or copy/paste from here) using gedit (or any GTK editor). Notice the placement of the shadda over the ligature ALLAH.

Actual results:
You can see the placement of the mark over the ligature.

Expected results:
The shadda+fatha mark is placed above the second lam (middle of the ligature).

Does this happen every time?
Yes

Other information:
Created attachment 87993
Picture comparing the result from GNOME and XP

The left image is taken from GNOME (Ubuntu 7.04), and the right one is taken from XP SP1.
Hi Meor, thanks for the bug report. This is definitely a Pango issue; I am reassigning the bug.
Does your font have ligature caret positioning tables?
Humm, sorry. I probably meant, does it have the 'mark' and 'mkmk' GPOS tables?
(In reply to comment #4)
> Humm, sorry. I probably meant, does it have the 'mark' and 'mkmk' GPOS tables?

Behdad,

Yes, definitely. I use both the MARK and MKMK features, and both are used in the example I gave. You can get the font from the address above. The VOLT tables in the font are still intact, if you want to look at those as well.
I've got the Pango source to trace the possible error. I think it might be due to the following, in harfbuzz-gpos.c, function Lookup_MarkLigPos:

    /* We must now check whether the ligature ID of the current mark glyph
       is identical to the ligature ID of the found ligature.  If yes, we
       can directly use the component index.  If not, we attach the mark
       glyph to the last component of the ligature. */

    if ( IN_LIGID( j ) == IN_LIGID( buffer->in_pos) )
    {
      comp_index = IN_COMPONENT( buffer->in_pos );
      if ( comp_index >= lat->ComponentCount )
        return HB_Err_Not_Covered;
    }
    else
      comp_index = lat->ComponentCount - 1;

I by no means understand how Pango works, but the above comment does not sound right. First of all, why do we compare ligature IDs? We are attaching a mark to a ligature, so how can a mark have a ligature ID? In my view, this is how it is supposed to work:

1. Before the ligature is formed (from the GSUB table), the mark should be tied to a specific component of the ligature. This information should be stored somewhere.
2. After the ligature is formed, check for the anchor class. If the ligature has the same class, attach the mark to its component. If the ligature does not have the class, some adjustment to the horizontal positioning may be required, depending on which component the mark belongs to.

Basically, there is no need to attach the mark to the last component of the glyph, because by rights we should know which component the mark belongs to. Let's take the above example: the order of the characters is: lam lam shadda fatha heh kasra. So we know the shadda+fatha belongs to the second component of the ligature, and the kasra belongs to the third component of the ligature. Why do we need to compare the ligature ID to know which component it belongs to?

Also, I found out that my original post is incorrect. In fact, the bug does affect 2-component ligatures as well, as in my previous report (bug 302952). It has been more than 2 years since my first post, but the bug is still there.
Hope the above information will help.
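The component-selection rule in the quoted Lookup_MarkLigPos snippet can be restated as a small stand-alone function. This is only a sketch: `glyph_info_t` and `pick_component` are hypothetical names invented for illustration, and the return value -1 stands in for the snippet's `HB_Err_Not_Covered`. It assumes that a mark consumed during ligature formation is tagged with that ligature's ID plus a component index, which is how a mark can come to carry a ligature ID at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified per-glyph record; not the real Pango/HarfBuzz type. */
typedef struct {
    uint16_t lig_id;     /* 0 means "belongs to no ligature" */
    uint16_t component;  /* component index recorded when the ligature formed */
} glyph_info_t;

/* Returns the ligature component the mark should attach to,
 * or -1 if the recorded component index is out of range. */
static int pick_component(const glyph_info_t *lig,
                          const glyph_info_t *mark,
                          unsigned component_count)
{
    assert(component_count > 0);

    if (lig->lig_id == mark->lig_id) {
        /* The mark was tagged when this very ligature formed: trust its index. */
        if (mark->component >= component_count)
            return -1;
        return (int)mark->component;
    }

    /* IDs differ: fall back to the last component, as the quoted comment says. */
    return (int)component_count - 1;
}
```

With a 3-component ligature, a mark whose ID matches and whose recorded component is 1 lands on the second component, while a mark with a mismatched ID always lands on the third. The second case is the misplacement reported here.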
*** Bug 302952 has been marked as a duplicate of this bug. ***
The duplicate bug has a sample font. Khaled tells me that it only happens in the first ligature of the line, or something like that; I'm not sure.
One problem fixed. More to come.

2007-08-29  Behdad Esfahbod  <behdad@gnome.org>

        Bug 302952 – The placement of a diacritic marks for an arabic ligature
        is not correct

        * pango/opentype/harfbuzz-buffer.c (hb_buffer_allocate_ligid):
        Don't use zero as allocated ligature id.  Zero means no ligature id.
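The idea behind this change can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `buffer_t`, `max_lig_id`, and `allocate_lig_id` are hypothetical names, and the 16-bit counter width is assumed. The point is that 0 is reserved for "no ligature id", so the allocator must never hand it out, including after the counter wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer holding only the last ligature ID handed out. */
typedef struct {
    uint16_t max_lig_id;
} buffer_t;

/* Allocates a fresh ligature ID. Never returns 0, because 0 is
 * reserved to mean "no ligature id". */
static uint16_t allocate_lig_id(buffer_t *buffer)
{
    assert(buffer != NULL);

    buffer->max_lig_id++;
    if (buffer->max_lig_id == 0)   /* counter wrapped around: skip the reserved value */
        buffer->max_lig_id = 1;
    return buffer->max_lig_id;
}
```

If 0 could be allocated, every untagged glyph (ID 0) would appear to share an ID with that ligature.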
(In reply to comment #6)
> Basically, there is no need to attach the mark to the last component of the
> glyph, because by rights we should know which component the mark belongs to.
> Let's take the above example:
> the order of the characters is: lam lam shadda fatha heh kasra. So we know the
> shadda+fatha belongs to the second component of the ligature, and the kasra
> belongs to the third component of the ligature. Why do we need to compare the
> ligature ID to know which component it belongs to?

Except that the ligature for <lam lam heh> seems to be forming in two stages: first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming into the final lam-lam-heh ligature. Is that correct?

What seems to be happening is that when seeing <lam lam fatha heh>, Pango first forms the lam-heh ligature, producing <lam lam-heh fatha>. It also marks the lam-heh and fatha glyphs with ligature ID 1 and marks fatha as sitting on component 0 of the ligature. But then, when it forms the lam-lam-heh ligature, it marks it as ligature ID 2 and loses the relation between fatha and lam-lam-heh, placing the fatha on the last component of the ligature (as if it followed all the chars in the ligature).

The fix is to scan for marks and adjust their ligature ID when forming new ligatures from old ones. The only problem is that we don't immediately know how many components each ligature has, so adjusting component numbers is at best a heuristic. Going to give it a try.
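The "scan for marks and adjust their ligature ID" idea could look roughly like the sketch below. All names (`glyph_t`, `retag_marks`, `components_before`) are hypothetical, and the component offset is supplied by the caller precisely because, as noted above, the real component counts may not be known at this point, so the offset is a guess in the general case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified glyph record. */
typedef struct {
    uint16_t lig_id;     /* 0 means "no ligature id" */
    uint16_t component;  /* component the mark sits on */
    int      is_mark;    /* nonzero for mark glyphs */
} glyph_t;

/* After the ligature old_id has been absorbed into a new ligature new_id,
 * re-point every mark that was tied to old_id. components_before is the
 * number of components that precede the old ligature inside the new one
 * (a heuristic when the real counts are unknown).
 * Returns the number of marks that were updated. */
static unsigned retag_marks(glyph_t *glyphs, size_t n,
                            uint16_t old_id, uint16_t new_id,
                            uint16_t components_before)
{
    unsigned changed = 0;

    assert(old_id != 0 && new_id != 0);
    for (size_t i = 0; i < n; i++) {
        if (glyphs[i].is_mark && glyphs[i].lig_id == old_id) {
            glyphs[i].lig_id = new_id;
            glyphs[i].component =
                (uint16_t)(glyphs[i].component + components_before);
            changed++;
        }
    }
    return changed;
}
```

For the example in this comment, the fatha starts on component 0 of ligature 1 (lam-heh). Once lam-heh becomes the second input of ligature 2 (lam-lam-heh), calling `retag_marks(glyphs, n, 1, 2, 1)` moves it to component 1 of ligature 2, the second lam.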
(In reply to comment #10)
> (In reply to comment #6)
> > Basically, there is no need to attach the mark to the last component of the
> > glyph, because by rights we should know which component the mark belongs to.
> > Let's take the above example:
> > the order of the characters is: lam lam shadda fatha heh kasra. So we know the
> > shadda+fatha belongs to the second component of the ligature, and the kasra
> > belongs to the third component of the ligature. Why do we need to compare the
> > ligature ID to know which component it belongs to?
>
> Except that the ligature for <lam lam heh> seems to be forming in two stages:
> first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming
> into the final lam-lam-heh ligature. Is that correct?

My personal opinion is that it should not be done that way. I believe that inside the font itself we set a sort of priority (in the GSUB table, I think), that is, the order in which the ligatures should be formed. In my substitutions, I don't think I defined it that way, even though I could. What I did is use the following substitutions:

1. (lam initial) (lam medial) (heh final) -> (ligature Allah), ignore marks, 3-component ligature.
2. (lam initial) (heh final) -> (ligature lamheh isolated), 2-component ligature.

Thus, the first substitution should take place first. What you describe is the following:

1. (lam medial) (heh final) -> (ligature lamheh final), 2-component ligature.
2. (lam initial) (ligature lamheh final) -> Allah

Yes, the second scenario should work as well, but I prefer the first one.

> What seems to be happening is that when seeing <lam lam fatha heh> pango first
> forms the lam-heh ligature, producing <lam lam-heh fatha>, it also marks the
> lam-heh and fatha glyphs with ligature ID 1 and marks fatha as sitting on
> component 0 of the ligature. But then when it forms the lam-lam-heh ligature,
> marks it as ligature ID 2 and then loses the relation between fatha and
> lam-lam-heh, placing it on the last component of the ligature (as if it
> followed all the chars in the ligature).
>
> The fix is to scan for marks and adjust their ligature ID when forming new
> ligatures from old ones. The only problem is that we don't immediately know
> how many components each ligature has, so adjusting component numbers is at
> best a heuristic. Going to give it a try.

I believe every ligature defines how many components it has, thus we should know it. I think we can focus on making it work for the first scenario first. Once it is OK, we can create the second scenario and test the method.

Regards.
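The first scenario (a single 3-component rule that ignores marks) can be sketched as a matcher that skips marks between components and records which component each skipped mark follows. Everything here is hypothetical: the glyph IDs, `is_mark`, and `match_ignoring_marks` are invented for illustration, and real GSUB processing is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical glyph IDs, for illustration only. */
enum { LAM_INIT = 1, LAM_MEDI, HEH_FINA, SHADDA, FATHA, KASRA };

static int is_mark(int g)
{
    return g == SHADDA || g == FATHA || g == KASRA;
}

/* Tries to match the `count` base glyphs in `rule` starting at glyphs[start],
 * skipping marks. On success, mark_component[i] holds the component index
 * each consumed mark follows (-1 for base glyphs), and the number of glyphs
 * consumed is returned. Returns 0 on failure; mark_component may then be
 * partially written and should be ignored. */
static size_t match_ignoring_marks(const int *glyphs, size_t n, size_t start,
                                   const int *rule, size_t count,
                                   int *mark_component)
{
    size_t i = start, matched = 0;

    assert(count > 0);
    while (i < n && matched < count) {
        if (is_mark(glyphs[i])) {
            if (matched == 0)
                return 0;                      /* a rule must start on a base glyph */
            mark_component[i] = (int)matched - 1;
        } else if (glyphs[i] == rule[matched]) {
            mark_component[i] = -1;
            matched++;
        } else {
            return 0;                          /* base glyph does not fit the rule */
        }
        i++;
    }
    if (matched < count)
        return 0;

    /* Marks trailing the last component belong to the last component. */
    while (i < n && is_mark(glyphs[i])) {
        mark_component[i] = (int)count - 1;
        i++;
    }
    return i - start;
}
```

Applied to lam lam shadda fatha heh kasra, the 3-component rule matches in one step and records shadda and fatha on component 1 and kasra on component 2, which is the placement expected in this report. The 2-component rule (lam initial, heh final) does not match that input at all in this sketch.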
(In reply to comment #11)
> (In reply to comment #10)
> > (In reply to comment #6)
> > > Basically, there is no need to attach the mark to the last component of the
> > > glyph, because by rights we should know which component the mark belongs to.
> > > Let's take the above example:
> > > the order of the characters is: lam lam shadda fatha heh kasra. So we know the
> > > shadda+fatha belongs to the second component of the ligature, and the kasra
> > > belongs to the third component of the ligature. Why do we need to compare the
> > > ligature ID to know which component it belongs to?
> >
> > Except that the ligature for <lam lam heh> seems to be forming in two stages:
> > first <lam heh> forming into a lam-heh ligature, then <lam lam-heh> forming
> > into the final lam-lam-heh ligature. Is that correct?
>
> My personal opinion is that it should not be done that way. I believe that
> inside the font itself we set a sort of priority (in the GSUB table, I think),
> that is, the order in which the ligatures should be formed. In my
> substitutions, I don't think I defined it that way, even though I could. What
> I did is use the following substitutions:
> 1. (lam initial) (lam medial) (heh final) -> (ligature Allah), ignore marks,
> 3-component ligature.
> 2. (lam initial) (heh final) -> (ligature lamheh isolated), 2-component
> ligature.
>
> Thus, the first substitution should take place first. What you describe is the
> following:
>
> 1. (lam medial) (heh final) -> (ligature lamheh final), 2-component ligature.
> 2. (lam initial) (ligature lamheh final) -> Allah
>
> Yes, the second scenario should work as well, but I prefer the first one.

But the second scenario seems to be what's happening right now. I tried hard to avoid having to dig into the font to see how it works, though, so I may be wrong.
Anyway, putting this on hold as it needs further debugging. This is just a hack showing my hypothesis is correct:

    Index: harfbuzz-gsub.c
    ===================================================================
    --- harfbuzz-gsub.c (revision 2415)
    +++ harfbuzz-gsub.c (working copy)
    @@ -1080,6 +1080,11 @@ static FT_Error Lookup_LigatureSubst( H
                 if ( ADD_String( buffer, i, 1, &lig->LigGlyph, 0xFFFF, ligID ) )
                   return error;
    +            if (IN_CURITEM()->ligID)
    +            {
    +              IN_CURITEM()->ligID = ligID;
    +              IN_CURITEM()->component = 1;
    +            }
               }
             }
             else

It's not correct, and eats babies.
I seem to be
Humm? Roozbeh do you have any insight into this? Did you hit it too?
Sorry. Unintentional spam. I am encountering various problems with mark positioning over ligatures with my new font, but I am still at the level of fixing fontforge!
Can someone test this with pango from master? harfbuzz-ng has been merged which may already fix this.
Created attachment 140050
Screenshot against trunk

Still broken, and now 'mkmk' seems to be broken too.
Which font is this again? Can you email me the font you tested?
I tested the font from the page linked in the original report, http://arabicfonts.wikispaces.com/file/view/me_quran_volt_newmet.zip
Khaled, 'mkmk' works fine here.
Ok, I now understand why the original bug happens. Will be fixed soon when we order lookups from all features together.
With further testing, I confirmed that Uniscribe too forms the LAM,HEH ligature first, and makes the LAM,LAM,HEH out of that ligature. hb does the same now. We only need to update component info to take that into consideration. Working on it.
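One way to picture "updating component info" when a ligature is built out of another ligature is to track how many components each input contributes, so that a mark's old component index can be shifted exactly. The sketch below is an assumption-laden illustration, not the HarfBuzz implementation: `lig_input_t`, `merge_components`, and `remap_component` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one input glyph of a ligature substitution. */
typedef struct {
    unsigned num_components;  /* 1 for a plain base glyph, k for a k-component ligature */
} lig_input_t;

/* Computes, for each input, the component index at which it starts inside
 * the new ligature, and returns the new ligature's total component count. */
static unsigned merge_components(const lig_input_t *inputs, size_t n,
                                 unsigned *offsets)
{
    unsigned total = 0;

    assert(inputs != NULL && offsets != NULL);
    for (size_t i = 0; i < n; i++) {
        offsets[i] = total;
        total += inputs[i].num_components;
    }
    return total;
}

/* Maps a mark's component index within input `input_index` to its
 * component index within the new, merged ligature. */
static unsigned remap_component(const unsigned *offsets, size_t input_index,
                                unsigned old_component)
{
    return offsets[input_index] + old_component;
}
```

For LAM followed by the LAM,HEH ligature, the inputs contribute 1 and 2 components, giving 3 in total. A mark on component 0 of LAM,HEH (the second lam) is remapped to component 1 of LAM,LAM,HEH.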
Fixed in HarfBuzz master. Leaving the bug open until Pango grabs that.

commit fe20c0f84f5ff518dc471bf22ac5a83ef079eb69
Author: Behdad Esfahbod <behdad@behdad.org>
Date:   Mon Jul 30 00:00:59 2012 -0400

    [GSUB] Fix mark component stuff when ligatures form ligatures!

    See comments.  Fixes https://bugzilla.gnome.org/show_bug.cgi?id=437633