Bug 161981 – Sinhala rendering should not implicitly create conjuncts



Summary:	Sinhala rendering should not implicitly create conjuncts


Status:	RESOLVED FIXED

Product:	pango
Classification:	Platform
Component:	indic
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Pango Indic
QA Contact:	Pango Indic


Reported:	2004-12-22 14:01 UTC by Harshula Jayasuriya
Modified:	2005-03-05 15:28 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
A patch against 1.6.0 that fixes the implicit creation of conjuncts (7.11 KB, patch) 2004-12-22 14:07 UTC, Harshula Jayasuriya
Test File (40 bytes, text/plain) 2004-12-23 12:28 UTC, Harshula Jayasuriya
Image of the incorrect rendering (19.47 KB, image/png) 2004-12-23 12:37 UTC, Harshula Jayasuriya
Image of the correct rendering (5.46 KB, image/png) 2004-12-23 12:43 UTC, Harshula Jayasuriya
PATCH: fixes the implicit creation of conjuncts and emits ZWJ (7.69 KB, patch) 2004-12-27 15:25 UTC, Harshula Jayasuriya
Patch I committed (10.70 KB, patch) 2005-03-03 23:09 UTC, Owen Taylor

Description Harshula Jayasuriya 2004-12-22 14:01:57 UTC

In Sinhala a consonant + virama + consonant does NOT form a conjunct.
A conjunct is created with the sequence consonant + virama + ZWJ + consonant.
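
The rule stated above can be sketched in C as follows. This is only an illustration, not code from Pango or from the attached patch: the function names are hypothetical, and the consonant test is a simplified range check over U+0D9A..U+0DC6 that ignores the unassigned gaps in the Sinhala block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SINHALA_AL_LAKUNA 0x0DCA /* U+0DCA SINHALA SIGN AL-LAKUNA (virama) */
#define ZERO_WIDTH_JOINER 0x200D /* U+200D ZERO WIDTH JOINER */

/* Simplified assumption: treat all of U+0D9A..U+0DC6 as consonants. */
static bool
is_sinhala_consonant (uint32_t c)
{
  return c >= 0x0D9A && c <= 0x0DC6;
}

/* Hypothetical helper: returns true only if text[i..i+3] is
 * consonant + al-lakuna + ZWJ + consonant. Without the ZWJ the
 * sequence is two separate letters, so no conjunct is reported. */
bool
sinhala_conjunct_at (const uint32_t *text, size_t len, size_t i)
{
  assert (text != NULL);

  if (len < 4 || i > len - 4)
    return false;

  return is_sinhala_consonant (text[i])
      && text[i + 1] == SINHALA_AL_LAKUNA
      && text[i + 2] == ZERO_WIDTH_JOINER
      && is_sinhala_consonant (text[i + 3]);
}
```

With this sketch, 0D9A + 0DCA + 200D + 0DBB is reported as a conjunct, while 0D9A + 0DCA + 0DBB is not.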

Here is an example of incorrect rendering:
http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango.png

Here's the correct rendering:
http://www.lug.lk/lurker/attach/3@20041219.170404.f4e34ff9.attach

There's more information and a patch here:
http://www.lug.lk/lurker/message/20041219.170404.f4e34ff9.en.html

Comment 1 Harshula Jayasuriya 2004-12-22 14:07:46 UTC

Created attachment 35121
A patch against 1.6.0 that fixes the implicit creation of conjuncts

Please have a look at this patch. I can make a patch against a newer version of
pango, if requested.

Comment 2 Owen Taylor 2004-12-22 20:10:19 UTC

Can you:

 A) Attach a small UTF-8 file with the test string
 B) Attach your image links as attachments (images on external 
    websites have a habit of vanishing)
 C) Create a patch against 1.8.0 without the Virama; => AlLakuna change;
    having unrelated changes makes patch review much harder.

Thanks.

Comment 3 Anuradha Ratnaweera 2004-12-23 10:03:16 UTC

Hi Owen,

A) http://www.linux.lk/~anuradha/sinhala/australia.txt
B) Wrong rendering:
http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango.png
   Correct rendering:
http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango-corrected.png
C) The cleanest way to add the patch is to add AlLakuna IMO

Comment 4 Harshula Jayasuriya 2004-12-23 12:28:48 UTC

Created attachment 35162
Test File

Comment 5 Harshula Jayasuriya 2004-12-23 12:37:24 UTC

Created attachment 35163
Image of the incorrect rendering

Comment 6 Harshula Jayasuriya 2004-12-23 12:43:14 UTC

Created attachment 35164
Image of the correct rendering

Comment 7 Owen Taylor 2004-12-23 15:11:28 UTC

> C) The cleanest way to add the patch is to add AlLakuna IMO

Can you provide more detail here? It's a little hard for me to
understand what the patch is doing.

Comment 8 Harshula Jayasuriya 2004-12-23 15:59:16 UTC

Hi Owen,

The current state table is not valid for Sinhala. The interaction of Virama in
North Indian scripts appears to be different to Sinhala.

A consonant + Virama + consonant results in a single conjunct letter in the 
state table. In Sinhala a consonant + Virama + consonant results in two 
letters. The first has its inherent vowel suppressed, and the second is the 
standalone consonant.

The changes create another class of Virama, Al-Lakuna, which does not 
implicitly create conjuncts with surrounding consonants.
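
To illustrate the idea of a separate class, here is a minimal state-table sketch in C. It is not the table from indic-ot-class-tables.c or from the patch; the class names, state names, and transitions are all hypothetical and exist only to show how a Virama class can join consonants implicitly while an Al-Lakuna class requires an explicit ZWJ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character classes, not Pango's real class table. */
typedef enum {
  CC_CONSONANT, CC_VIRAMA, CC_AL_LAKUNA, CC_ZWJ, CC_OTHER, CC_COUNT
} CharClass;

/* Hypothetical states of the cluster-finding machine. */
typedef enum {
  ST_START, ST_CONS, ST_VIRAMA, ST_AL_LAKUNA, ST_AL_LAKUNA_ZWJ, ST_COUNT
} State;

#define ST_END (-1) /* the current character does not join the cluster */

static const signed char transitions[ST_COUNT][CC_COUNT] = {
  /*                    CONS     VIRAMA     AL_LAKUNA     ZWJ               OTHER  */
  /* START         */ { ST_CONS, ST_END,    ST_END,       ST_END,           ST_END },
  /* CONS          */ { ST_END,  ST_VIRAMA, ST_AL_LAKUNA, ST_END,           ST_END },
  /* VIRAMA        */ { ST_CONS, ST_END,    ST_END,       ST_END,           ST_END },
  /* AL_LAKUNA     */ { ST_END,  ST_END,    ST_END,       ST_AL_LAKUNA_ZWJ, ST_END },
  /* AL_LAKUNA_ZWJ */ { ST_CONS, ST_END,    ST_END,       ST_END,           ST_END },
};

/* Returns how many leading classes belong to a single cluster. */
size_t
cluster_length (const CharClass *classes, size_t len)
{
  int state = ST_START;
  size_t i;

  assert (classes != NULL || len == 0);

  for (i = 0; i < len; i++)
    {
      int next = transitions[state][classes[i]];
      if (next == ST_END)
        break;
      state = next;
    }
  return i;
}
```

In this sketch, consonant + Virama + consonant is one cluster of length 3, consonant + Al-Lakuna + consonant stops after 2 (the second consonant starts a new cluster), and consonant + Al-Lakuna + ZWJ + consonant is one cluster of length 4.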

Regards,
Harshula

Comment 9 Harshula Jayasuriya 2004-12-23 16:00:24 UTC

Hi Owen,

I've got a patch against 1.8.0, but haven't had a chance to test it. So once
I have done that I'll attach it to this bug.

The 1.6.0 patch doesn't contain a "Virama; => AlLakuna change;", as such. What 
it does contain is a 'fVirama' => fAlLakuna change because 'fVirama' was
introduced specifically to support Sinhala. This change is not unrelated because
it makes the code easier to read and understand by naming all the Sinhala
specific code consistently. 

I need to also verify whether these changes help South Indian languages.

Regards,
Harshula

Comment 10 Anuradha Ratnaweera 2004-12-24 03:00:14 UTC

> > C) The cleanest way to add the patch is to add AlLakuna IMO
>
> Can you provide more detail here? It's a little hard for me to
> understand what the patch is doing.

In some Indic languages, when the Consonant + Virama + Consonant sequence is
found, it should be treated as a single group that forms a conjunct.  However,
this is not the case for Sinhala and some other languages.  Apparently this has
not created bugs for other languages, or they have found workarounds, but for
Sinhala, it is a bug and we couldn't find a neat "font hack". :-(

In Sinhala, the conjunct should only be formed if there is an explicit ZWJ,
i.e., Consonant + Virama + ZWJ + Consonant.  The Virama is also called "Al
lakuna", probably because of this difference.

The state machine in indic-ot-class-tables.c allows only for the former.

Comment 11 Harshula Jayasuriya 2004-12-27 14:39:20 UTC

Hi Owen,

I've ported the patch to 1.8.0. Unfortunately your fix for Bug 145233 breaks
Sinhala rendering. Have a look at the attached text file whilst using pango 
1.8.0. Compare it to the two images I've attached.

The snippet in indic-fc.c:

      if (ZERO_WIDTH_CHAR (wcs[i]))
        glyph = 0;

stops ZWJ from being emitted. For Sinhala the ZWJ is needed, as indicated in 
the SLS 1134 summary:

http://fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf

0D9A + 0DCA + 200D + 0DBB
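
For anyone building a UTF-8 test file for this sequence, the following C sketch encodes one code point at a time. The function name is hypothetical, and the sketch does not reject surrogate code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point c to out
 * (which must have room for 4 bytes) and returns the number of bytes. */
size_t
utf8_encode (uint32_t c, unsigned char *out)
{
  assert (out != NULL);
  assert (c <= 0x10FFFF);

  if (c < 0x80)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (c >> 6));
      out[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (c >> 12));
      out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}
```

Each of the four code points above takes three bytes, so the sequence encodes to the 12 bytes E0 B6 9A, E0 B7 8A, E2 80 8D, E0 B6 BB.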

I'm not well versed with fonts, however, I suspect the correct glyph for the 
aforementioned sequence is only returned if pango emits the ZWJ. Hence we need 
to first fix the missing ZWJ issue.

Since it is a Zero Width *Joiner* perhaps pango should not suppress it?

Regards,
Harshula

Comment 12 Harshula Jayasuriya 2004-12-27 15:25:42 UTC

Created attachment 35225
PATCH: fixes the implicit creation of conjuncts and emits ZWJ

This patch also includes a quick workaround that allows ZWJs to be emitted.
This patch is against pango 1.8.0.

Comment 13 Owen Taylor 2004-12-28 15:40:51 UTC

[ Note I'm off on vacation right now, I'm not really going to be able to 
  look at this until next week ]

Is it really right for fonts to have rules including ZWJ? To my knowledge
none of the previously specified scripts for OpenType do that. Normally, the
presence (or absence) of the ZWJ modifies the set of features that are applied
to the adjoining characters.

Comment 14 Owen Taylor 2005-01-04 22:51:09 UTC

Hmmm, thinking about this some more... there is nothing at all in the
Unicode specification to indicate that the rendering rules for Sinhala
should be different than other Indic languages.

Can you give me a reference to indicate that the ZWJ is necessary? I'm
concerned about creating a rendering system that isn't compatible with
other rendering systems for Sinhala.

(The fact that sequences with conjunct consonants are less common than
sequences with an explicit virama doesn't necessarily imply anything about
the encoding of those sequences.)

Comment 15 Anuradha Ratnaweera 2005-01-05 05:13:55 UTC

Hi Owen,

Please see this document which describes how Sinhala should be rendered:
http://fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf .  Harshula
pointed to this earlier in this "thread".  Also, Sinhala is different from other
Indic languages in many ways.

Also, please note that this is about to be formally released as the new SLS 1134
by the Sri Lanka Standards Institute (SLSI).

Comment 16 Harshula Jayasuriya 2005-01-05 18:33:33 UTC

Hi Owen,

I think SLS 1134 has already been formally released. It was also announced
on the indic @ unicode and unicode @ unicode mailing lists:

http://www.lug.lk/lurker/message/20041127.141309.92571494.en.html

Regards,
Harshula

Comment 17 Harshula Jayasuriya 2005-01-17 15:21:49 UTC

Hi Owen,

Does emitting the ZWJ break any indic scripts? IIRC, the issue with Bug 145233 
was that ZWNJ should not be emitted - which makes sense. There was no concern 
raised about the ZWJ being emitted. I don't quite understand the technical 
reason for suppressing a *joiner*?

It appears this unicode proposal requires ZWJ to be emitted:
http://www.unicode.org/review/pr-37.pdf

"This proposal intends to rectify these problems, clarifying how the ZERO WIDTH 
JOINER is to be applied in scripts, and consolidating common mechanisms for 
equivalent problems that exist in several scripts."

which was accepted:
http://www.unicode.org/review/resolved-pri.html

"Resolution: Closed 2004-08-24. UTC accepted the proposal and will create an 
Indic conjoining behavior model."

Jump straight to page 15 to see some examples. Correct me if I'm wrong, but 
I can't see how those glyphs can be selected unless the ZWJ is emitted 
and available to the font.

Regards,
Harshula

Comment 18 Owen Taylor 2005-01-17 15:44:40 UTC

The Unicode standard is not involved in the mechanisms of OpenType
font handling. It just specifies the rendering for given input
sequences.

All other Indic languages handle the effects of ZWJ in Unicode text
without using the ZWJ explicitly in GSUB processing.

The ZWJ *has* to be stripped out of the final results. I don't know
if any harm would come of stripping after GSUB processing rather
than before. But I'm not very comfortable using a model for OpenType
font processing which is different from the existing models for 
other languages.

It would be good if this was brought up on the opentype mailing list.

(See http://www.microsoft.com/typography/otspec/otlist.htm for 
subscription information.)

Comment 19 Anuradha Ratnaweera 2005-01-17 16:06:47 UTC

Hi Owen,

I wonder if suppressing only the ZWNJ (not ZWJ) would have fixed bug 145233 ... 

> It would be good if this was brought up on the opentype mailing list.

FYI, Microsoft has already released Sinhala fonts and they use ZWJ explicitly in
GSUB processing.  See http://www.fonts.lk (the official Sinhala font resource of
the ICTA) for some samples.

Comment 20 Harshula Jayasuriya 2005-01-23 14:12:24 UTC

Hi Owen,

Go to page 15 of:
http://www.unicode.org/review/pr-37.pdf

Could you clarify how the correct glyph would be chosen if Pango did NOT emit
the ZWJ?

Thanks,
Harshula

Comment 21 Gihan Dias 2005-01-24 10:36:03 UTC

Hi Owen,

Please note that the standard encoding of Sinhala (and many other Indic fonts)
specifies that the ZWJ be used to create some conjuncts. This has been accepted
both by the Sri Lanka Standards Institute, and by Unicode.

The encoding is specified at
http://www.fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf

you said:
> All other Indic languages handle the effects of ZWJ in unicode text
> without using the ZWJ explicitly in GSUB processing.

O.K. In this case, how does the font figure out whether to create a conjunct
character or two separate characters?

Also how can we create fonts which work in both Linux and WinXP, since the
Windows Uniscribe DLL does pass through the zwj, so that the font can do its job?

Thanks,

Gihan

Comment 22 Owen Taylor 2005-01-24 17:54:55 UTC

Just so it's clear now, I'm not working on this at the moment and probably
won't get a chance to do so until the middle of February ... in other words,
I'm not ignoring your comments in particular, I'm ignoring Pango in general.

Comment 23 Owen Taylor 2005-03-03 23:09:14 UTC

I want to argue a bit here, so let me first say that I am going
to make the changes you want :-). 

 1. Encoding of Sinhala in Unicode

 The rules that Pango follows are those in the Unicode standard,
 for all scripts. It is indisputable that the Sri Lanka Standards 
 Institute knows more about the Sinhala script than the people working
 on the Unicode standard. However, it is impossible for me, as 
 Pango maintainer, to track many individual national standards. 
 And for some scripts, there are multiple relevant national
 standards bodies (think Mongolian, Arabic, Bengali, and so forth.)

 Unfortunately, the Unicode standard has completely insufficient
 specification of encoding for Sinhala. :-( One hopes that it
 will be revised to match the national standard at some point.
 (I just sent mail to indic@unicode.org asking whether there are
 plans in this area.)

 2. Encoding of OpenType fonts

 Just because the encoding of a conjunct is:

  cons + al-lakuna + zwj + cons
 
 Doesn't mean that it necessarily has to be handled with a 
 GSUB ligature of cons + al-lakuna + zwj + cons. The way that
 other Indic scripts work is that the presence of zwj/ alter the 
 features that are applied.

Specific patch comments
=======================

* I note that the revised state table doesn't allow for the combination
zwj + al-lakuna described in the "Touching Letters" section of 
the "Representation of Sinhala in Unicode". If this is in fact an issue,
could you file a separate bug for that?

* The way I'm going to handle the zero-width joiner issue for now is
conservative ... add a script flag that is set only for Sinhala
(PROCESS_ZWJ) and do something different in that case.

I'll attach what I'm committing to CVS. It seems to work with the
test case above. 

[ BTW, would you prefer something other than "Harshula" in the ChangeLog
  credits? We generally credit contributors with their full name. ]

2005-03-03  Owen Taylor  <otaylor@redhat.com>

        * modules/indic/indic-ot.[ch] modules/indic/indic-ot-class-tables.c:
        Split out handling of Sinhala al-lakuna character from
        handling of Virama in the state table to avoid implicit
        formation of conjuncts for Sinhala. (Patch from
        Harshula, #161981)

        * modules/indic/indic-fc.c modules/indic/indic-ot.h:
        Add a new script flag SF_PROCESS_ZWJ indicating
        whether zero width characters should be passed to
        gsub/gpos.

        * modules/indic/indic-ot-class-tables.c: Set SF_PROCESS_ZWJ
        for Sinhala. (#161981, Harshula)
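
The script-flag approach described above can be sketched in C roughly as follows. This is not the committed patch. Only the name SF_PROCESS_ZWJ comes from the ChangeLog; its value, the helper names, and the set of characters treated as zero-width are assumptions. The sketch also assumes that only the ZWJ (not the ZWNJ) is let through when the flag is set, and it uses the code point itself as a stand-in for a real font glyph lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZWSP 0x200B /* ZERO WIDTH SPACE */
#define ZWNJ 0x200C /* ZERO WIDTH NON-JOINER */
#define ZWJ  0x200D /* ZERO WIDTH JOINER */

/* Name taken from the ChangeLog; the value here is an assumption. */
#define SF_PROCESS_ZWJ 0x1u

/* Assumed stand-in for the ZERO_WIDTH_CHAR macro; the real set may differ. */
static bool
is_zero_width_char (uint32_t wc)
{
  return wc == ZWSP || wc == ZWNJ || wc == ZWJ;
}

/* Hypothetical helper: fills glyphs[] from wcs[]. Zero-width characters
 * map to the empty glyph 0, except that a ZWJ is kept when the script
 * sets SF_PROCESS_ZWJ so that GSUB/GPOS lookups can see it. */
void
map_to_glyphs (const uint32_t *wcs, uint32_t *glyphs, size_t len,
               unsigned script_flags)
{
  size_t i;

  assert (wcs != NULL && glyphs != NULL);

  for (i = 0; i < len; i++)
    {
      bool keep_zwj = wcs[i] == ZWJ && (script_flags & SF_PROCESS_ZWJ) != 0;

      if (is_zero_width_char (wcs[i]) && !keep_zwj)
        glyphs[i] = 0;
      else
        glyphs[i] = wcs[i]; /* stand-in for the font's glyph lookup */
    }
}
```

Scripts that do not set the flag keep the previous behaviour, in which every zero-width character is suppressed before shaping.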

Comment 24 Owen Taylor 2005-03-03 23:09:55 UTC

Created attachment 38227
Patch I committed

Comment 25 Harshula Jayasuriya 2005-03-05 15:28:08 UTC

Hi Owen,

>  Unfortunately, the Unicode standard has completely insufficient
>  specification of encoding for Sinhala. :-( One hopes that it
>  will be revised to match the national standard at some point.

Yes, you are quite right. The revised national standard is quite recent; it 
was officially released in Feb. I think Gihan (ICTA) has submitted, or is 
submitting, the revised standard to Unicode.

>  Doesn't mean that it necessarily has to be handled with a 
>  GSUB ligature of cons + al-lakuna + zwj + cons. The way that
>  other Indic scripts work is that the presence of zwj/ alter the 
>  features that are applied.

Ok, maybe we can have an offline discussion about this.

> Specific patch comments
> =======================
> 
> * I note that the revised state table doesn't allow for the combination
> zwj + al-lakuna described in the "Touching Letters" section of 
> the "Representation of Sinhala in Unicode". If this is in fact an issue,
> could you file a separate bug for that?

Yes, this is an issue too. It didn't work in 1.6.0, so I didn't consider it a 
regression. I think I need to discuss the Indic implementation with Eric 
before I make any more changes. Indic encoding appears to also use ZWJ + Virama 
(http://www.unicode.org/review/pr-37.pdf). Obviously, I'm very unfamiliar with 
past design decisions. :-)

> * The way I'm going to handle the zero-width joiner issue for now is
> conservative ... add a script flag that is set only for Sinhala
> (PROCESS_ZWJ) and do something different in that case.

Good idea.

> I'll attach what I'm committing to CVS. It seems to work with the
> test case above. 

I had a look at the patch and tested it out. Seems fine, it's only a minor 
difference.

> [ BTW, would you prefer something other than "Harshula" in the ChangeLog
>   credits? We generally credit contributors with their full name. ]

I'm not particularly fussed. My full name is Harshula Jayasuriya.

Regards,
Harshula