Bug 549818 – Sequence 0D16,0D4D,0D30, 0D16,D4D,0D30, etc of Malayalam, comparison with uniscribe,icu,harfbuzz




Summary:	Sequence 0D16,0D4D,0D30, 0D16,D4D,0D30, etc of Malayalam, comparison with uni...


Status:	RESOLVED OBSOLETE

Product:	pango
Classification:	Platform
Component:	indic
Version:	1.21.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Pango Indic
QA Contact:	pango-maint

URL:
Whiteboard:

Duplicates:	679198 (view as bug list)
Depends on:
Blocks:

Reported:	2008-08-29 13:24 UTC by Caolan McNamara
Modified:	2012-08-18 17:44 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
test-case (342.75 KB, application/zip) 2008-08-29 13:24 UTC, Caolan McNamara	Details

Description Caolan McNamara 2008-08-29 13:24:09 UTC

Please describe the problem:
An analysis of rendering 0d16, 0d4d, 0d30 with uniscribe, icu, harfbuzz and pango.

If we want to render Malayalam the same as uniscribe, then I suspect that reclassifying 0x0d30 as not having a post-base form may be a hack-around, and that there is deeper magic at work. Might be of interest anyway.

Steps to reproduce:
In the attachment:

D16_D4D_D30.ttf is a font which contains only fake glyphs for those
Unicode code points.

D16_D4D_D30_random.ttf is a font which contains only fake glyphs for
those Unicode code points, plus a pstf entry for some random combination
of glyphs which we're not going to use.

D16_combined_D4D_D30.ttf is a font which has a pstf table with the combos
D4D + D30 and
D30 + D4D

combined_D16_D4D_D30.ttf is a font which has a pstf table with the combos
D16 + D4D + D30 and
D16 + D30 + D4D

Attached are screenshots of the string
D16,D4D,D30 (D16_D4D_D30.txt)
rendered with these fonts using...

vanilla icu 4.0
vanilla pango 1.21.5
vanilla uniscribe 1.0420.2600.2180

a)
The first interesting thing is that with a font with no gsubs the pure
software reordering for pango and uniscribe *appears* to be the same, i.e.
no reordering at all of 0d16,0d4d,0d30, while the icu and harfbuzz reordering
results in glyphs 0d16, 0d30, 0d4d.

Reading http://www.microsoft.com/typography/otfntdev/indicot/shaping.aspx

"The shaping engine finds the base consonant of the syllable, using the
following algorithm: starting from the end of the syllable, move backwards
until a consonant is found that does not have a below-base or post-base form
(post-base forms have to follow below-base forms), or arrive at the first
consonant. The consonant stopped at will be the base."

suggests that the base consonant should be 0xd16, given that, according
to icu and http://www.microsoft.com/typography/otfntdev/indicot/appen.aspx, 0d30 (RA) has a post-base form.

"If the base consonant is not the last one, Uniscribe moves the halant from the base consonant to the last one." That gives 0d16 0d30 0d4d, which is the order
that icu gets, and the order that harfbuzz gets.

FWIW pango gets different results because 0xd30 has been tweaked to be tagged
as a normal consonant, making 0xd30 the base consonant for the algorithm,
instead of 0xd16.
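
As a rough illustration of the two behaviours described in a), here is a minimal Python sketch of the quoted base-consonant search and halant move. It is not code from pango, icu, harfbuzz or uniscribe. The names `find_base`, `reorder` and `has_post_or_below_form` are made up for this example, the consonant table only covers the two consonants in this report, and the "post-base forms have to follow below-base forms" constraint from the quote is ignored.

```python
KHA, HALANT, RA = 0x0D16, 0x0D4D, 0x0D30
CONSONANTS = {KHA, RA}  # only the consonants used in this report


def find_base(syllable, has_post_or_below_form):
    """Scan backwards for a consonant without a below-base or post-base
    form; fall back to the first consonant if every one has such a form."""
    consonants = [i for i, cp in enumerate(syllable) if cp in CONSONANTS]
    if not consonants:
        raise ValueError("syllable has no consonant")
    for i in reversed(consonants):
        if syllable[i] not in has_post_or_below_form:
            return i
    return consonants[0]


def reorder(syllable, has_post_or_below_form):
    """If the base is not the last consonant, move the halant that follows
    the base so that it follows the last consonant instead."""
    out = list(syllable)
    base = find_base(out, has_post_or_below_form)
    last = max(i for i, cp in enumerate(out) if cp in CONSONANTS)
    if base != last and base + 1 < len(out) and out[base + 1] == HALANT:
        del out[base + 1]
        # The last consonant has shifted left by one, so inserting at its
        # old index places the halant directly after it.
        out.insert(last, HALANT)
    return out


text = [KHA, HALANT, RA]
# RA classified as having a post-base form (the icu/harfbuzz behaviour above).
icu_like = reorder(text, has_post_or_below_form={RA})
# RA classified as a normal consonant (the pango tweak described above).
pango_like = reorder(text, has_post_or_below_form=set())
print([hex(cp) for cp in icu_like])
print([hex(cp) for cp in pango_like])
```

With RA in the set, the base is 0xd16 and the halant moves to the end. With an empty set, the base is 0xd30 and the input order is kept.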

b)
Things get interesting when repeating with D16_D4D_D30_random.ttf. That now
shows that uniscribe is ordering the glyphs as 0d16, 0d30, 0d4d, i.e. agreeing
with icu and harfbuzz and disagreeing with pango. Given that the only difference
is the existence of a pstf table, it suggests that uniscribe does agree with the
basic re-ordering mechanism of icu/harfbuzz, except that it has a quirk in that
it doesn't appear to do it if there is no pstf table in the font.

c)
Looking at the same text with D16_combined_D4D_D30.ttf, *both* icu and
uniscribe select the 0d30+0d4d pstf replacement, suggesting that the sequence
sent for gsub processing by uniscribe is actually 0d30,0d4d, matching the order
of icu and harfbuzz for that subsequence, and not 0d4d,0d30 as used by pango. The clear difference between uniscribe and icu is that the replacement glyph is placed at the beginning of the syllable in uniscribe and to the right in icu. Given b), that seems to suggest that uniscribe may have a magic extra step that moves the output glyph to the start of the sequence if there has been a pstf replacement, and that this step takes place *after* gsub processing.

d)
Looking at the text with combined_D16_D4D_D30.ttf shows the same results
in icu and uniscribe of 0d16,0d30,0d4d, with neither entry in the pstf table used, while pango used the pstf table for the 0d16+0d4d+0d30 combo. This
further reinforces that uniscribe agrees with icu (and harfbuzz), and not
pango, that 0d30 should be '_pb'.
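
The observations in a) to d) can be summarised as a toy model. The Python sketch below is purely illustrative: `shape`, `pstf_rules`, `move_pstf_to_front` and the glyph names are invented for this example, it only handles the three-code-point syllable from the test case, and it assumes that pstf lookups are tried only on glyphs after the base, which is a guess that happens to reproduce d). It is not how uniscribe or icu are actually implemented.

```python
KHA, HALANT, RA = 0x0D16, 0x0D4D, 0x0D30


def shape(syllable, pstf_rules, move_pstf_to_front):
    """Toy model of the observed output order for KHA, HALANT, RA.

    pstf_rules maps a tuple of code points to a replacement glyph name,
    or is None when the font has no pstf table at all.
    """
    glyphs = list(syllable)
    if pstf_rules is None:
        return glyphs                   # a): uniscribe leaves the order alone
    if glyphs == [KHA, HALANT, RA]:
        glyphs = [KHA, RA, HALANT]      # b): halant moved after RA
    # Assumption: pstf is only tried on glyphs after the base (index 0).
    out, replaced, i = [glyphs[0]], [], 1
    while i < len(glyphs):
        for seq, name in pstf_rules.items():
            if tuple(glyphs[i:i + len(seq)]) == seq:
                out.append(name)
                replaced.append(name)
                i += len(seq)
                break
        else:
            out.append(glyphs[i])
            i += 1
    if move_pstf_to_front:              # c): hypothesised post-gsub step
        out = replaced + [g for g in out if g not in replaced]
    return out


text = [KHA, HALANT, RA]
combined = {(HALANT, RA): "pstf_D4D_D30", (RA, HALANT): "pstf_D30_D4D"}
whole = {(KHA, HALANT, RA): "pstf_a", (KHA, RA, HALANT): "pstf_b"}

uniscribe_like = shape(text, combined, move_pstf_to_front=True)
icu_like = shape(text, combined, move_pstf_to_front=False)
neither_used = shape(text, whole, move_pstf_to_front=True)
```

In this model the only difference between the uniscribe-like and icu-like results for c) is the final flag: the same 0d30+0d4d replacement is chosen, and it ends up either before or after 0d16.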

Summary:
I'm clueless about Malayalam, but if uniscribe compatibility is of interest,
it looks like icu/pango/harfbuzz need some sort of additional
post-gsub step, vaguely along the lines of re-ordering the result of a glyph substitution of this special type of sequence to the beginning of the
syllable. Also, the pango change to the classification of 0x0d30 moves it further away from uniscribe.

Actual results:


Expected results:


Does this happen every time?


Other information:

Comment 1 Caolan McNamara 2008-08-29 13:24:51 UTC

Created attachment 117587 [details]
test-case

Comment 2 Ani Peter 2009-10-01 10:17:41 UTC

I confirm that the combination works perfectly with pango.

Comment 3 Suresh P 2009-10-04 02:27:32 UTC

(In reply to comment #1)
> Created an attachment (id=117587) [details]
> test-case

It took a while for me to figure out the issue from the attached test cases and scrot pics. :)

The issue you have raised exists. But I don't think, IMHO, that it is a good idea to kowtow to the algorithm, or even the standard, put forward by proprietary software vendors. Different ideas can co-exist. The prime focus should be proper rendering, consistent with the language in question.

Coming to the issue, icu and harfbuzz use the post-base form of RA (0x0d30) and the below-base form of LA (0x0d32), while pango doesn't. Since these forms are classified as 'HAVE_POST(BELOW)_FORMS' by the shaping engine beforehand, and the engine moves the halant (0x0d4d) to get the post(below)-base form by applying the respective feature, these forms appear invariably, disregarding the orthography of the script involved. For example, there are base consonants that won't take post(below)-base forms (e.g. YA, RA, NNA etc.). Therefore, pango gives the most acceptable results in this regard.

Now, as per the new OpenType specs (v1.6), uniscribe uses a new algorithm to dynamically assign classes to consonants, so the halant-moving exercise for post(below)-base forms is done away with. The gsub rules and features finally decide the contextual nature of the characters. With this, the post-base form of RA carrying the 'pref' feature tag, if found, is moved to the pre-base position after the higher-order substitutions are made.
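
To make the last step concrete, here is a small Python sketch of "move the glyph produced by 'pref' to the pre-base position after substitutions". The `Glyph` record, the `final_reorder` function and the glyph names are hypothetical and only illustrate the idea. Real shaping engines track this information differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Glyph:
    name: str                       # hypothetical glyph name
    feature: Optional[str] = None   # feature that produced this glyph, if any
    is_base: bool = False           # True for the base consonant


def final_reorder(glyphs):
    """Move glyphs produced by 'pref' from after the base to just before it."""
    base = next(i for i, g in enumerate(glyphs) if g.is_base)
    after = glyphs[base + 1:]
    pref = [g for g in after if g.feature == "pref"]
    rest = [g for g in after if g.feature != "pref"]
    return glyphs[:base] + pref + [glyphs[base]] + rest


# State after gsub: base KHA followed by a RA form substituted via 'pref'.
after_gsub = [Glyph("kha", is_base=True), Glyph("ra.pref", feature="pref")]
names = [g.name for g in final_reorder(after_gsub)]

# A glyph that was not produced by 'pref' stays where it is.
untouched = [Glyph("kha", is_base=True), Glyph("ra")]
untouched_names = [g.name for g in final_reorder(untouched)]
```

The point of this design is that the reordering decision depends on what the font's gsub rules actually did, not on a fixed per-consonant class table.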

I have tried to incorporate the new OpenType specs in the recent version of my font, Suruma (http://suruma.freeflux.net/blog/archive/2009/09/22/new-suruma-font.html).

Thanks
Suresh

Comment 4 Praveen A 2009-12-17 14:19:37 UTC

QT bug on this issue
http://bugreports.qt.nokia.com/browse/QTBUG-1887

Comment 5 Michael Schumacher 2012-07-02 20:45:17 UTC

*** Bug 679198 has been marked as a duplicate of this bug. ***

Comment 6 Behdad Esfahbod 2012-08-18 17:44:11 UTC

We've merged the HarfBuzz branch.  Closing obsolete.