GNOME Bugzilla – Bug 161981
Sinhala rendering should not implicitly create conjuncts
Last modified: 2005-03-05 15:28:08 UTC
In Sinhala a consonant + virama + consonant does NOT form a conjunct. A conjunct is created with the sequence consonant + virama + ZWJ + consonant. Here is an example of incorrect rendering: http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango.png Here's the correct rendering: http://www.lug.lk/lurker/attach/3@20041219.170404.f4e34ff9.attach There's more information and a patch here: http://www.lug.lk/lurker/message/20041219.170404.f4e34ff9.en.html
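The rule described in this report can be sketched in C as follows. This is only an illustration, not Pango code: the names `sinhala_conjunct_requested` and `is_sinhala_consonant` are hypothetical, and treating every code point in U+0D9A..U+0DC6 as a consonant is a simplifying assumption (a few code points in that range are unassigned).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SINHALA_AL_LAKUNA 0x0DCAu /* Sinhala virama (al-lakuna) */
#define ZERO_WIDTH_JOINER 0x200Du /* ZWJ */

/* Assumption: the whole range U+0D9A..U+0DC6 is treated as consonants. */
static bool is_sinhala_consonant(uint32_t cp)
{
    return cp >= 0x0D9Au && cp <= 0x0DC6u;
}

/* Returns true only when the four code points starting at index i are
 * consonant + al-lakuna + ZWJ + consonant. A plain
 * consonant + al-lakuna + consonant run does not qualify. */
bool sinhala_conjunct_requested(const uint32_t *cp, size_t n, size_t i)
{
    if (n < 4 || i > n - 4)
        return false; /* not enough code points left for the full sequence */

    return is_sinhala_consonant(cp[i])
        && cp[i + 1] == SINHALA_AL_LAKUNA
        && cp[i + 2] == ZERO_WIDTH_JOINER
        && is_sinhala_consonant(cp[i + 3]);
}
```

Calling it on 0D9A 0DCA 200D 0DBB at index 0 reports a requested conjunct; the same sequence without 200D does not.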
Created attachment 35121 [details] [review] A patch against 1.6.0 that fixes the implicit creation of conjuncts Please have a look at this patch. I can make a patch against a newer version of pango, if requested.
Can you: A) Attach a small UTF-8 file with the test string B) Attach your image links as attachments (images on external websites have a habit of vanishing) C) Create a patch against 1.8.0 without the Virama; => AlLakuna change; having unrelated changes makes patch review much harder. Thanks.
Hi Owen, A) http://www.linux.lk/~anuradha/sinhala/australia.txt B) Wrong rendering: http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango.png Correct rendering: http://www.linux.lk/~anuradha/sinhala/screenshots/0.2-0.2.1/australia-pango-corrected.png C) The cleanest way to add the patch is to add AlLakuna IMO
Created attachment 35162 [details] Test File
Created attachment 35163 [details] Image of the incorrect rendering
Created attachment 35164 [details] Image of the correct rendering
> C) The cleanest way to add the patch is to add AlLakuna IMO Can you provide more detail here? It's a little hard for me to understand what the patch is doing.
Hi Owen, The current state table is not valid for Sinhala. The behaviour of the Virama in North Indian scripts appears to be different from its behaviour in Sinhala. In the state table, a consonant + Virama + consonant results in a single conjunct letter. In Sinhala, a consonant + Virama + consonant results in two letters: the first has its inherent vowel suppressed, and the second is the standalone consonant. The changes create another class of Virama, Al-Lakuna, which does not implicitly create conjuncts with surrounding consonants. Regards, Harshula
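The difference between the two Virama classes can be illustrated with a toy cluster counter. This is a hedged sketch and not the state table in indic-ot-class-tables.c: the `CharClass` enum, its member names, and `count_clusters` are all hypothetical, and only four character classes are modelled.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced set of character classes. */
typedef enum {
    CC_CONSONANT,
    CC_VIRAMA,    /* virama that implicitly joins consonants */
    CC_AL_LAKUNA, /* Sinhala virama: joins only with an explicit ZWJ */
    CC_ZWJ
} CharClass;

/* Counts how many letters (clusters) a run of classes splits into. */
size_t count_clusters(const CharClass *c, size_t n)
{
    size_t clusters = 0;
    size_t i = 0;

    while (i < n) {
        clusters++;
        if (c[i] != CC_CONSONANT) { /* stray mark: a cluster of its own */
            i++;
            continue;
        }
        i++; /* consume the base consonant */

        for (;;) {
            if (i + 1 < n && c[i] == CC_VIRAMA && c[i + 1] == CC_CONSONANT) {
                i += 2; /* implicit conjunct: stay in the same cluster */
            } else if (i + 2 < n && c[i] == CC_AL_LAKUNA
                       && c[i + 1] == CC_ZWJ && c[i + 2] == CC_CONSONANT) {
                i += 3; /* explicit conjunct requested with ZWJ */
            } else {
                break;
            }
        }

        /* A trailing virama or al-lakuna only suppresses the inherent
         * vowel; it stays attached to the current cluster. */
        if (i < n && (c[i] == CC_VIRAMA || c[i] == CC_AL_LAKUNA))
            i++;
    }
    return clusters;
}
```

With these rules, consonant + CC_VIRAMA + consonant yields one cluster, consonant + CC_AL_LAKUNA + consonant yields two, and consonant + CC_AL_LAKUNA + CC_ZWJ + consonant yields one again.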
Hi Owen, I've got a patch against 1.8.0, but haven't had a chance to test it. So once I have done that I'll attach it to this bug. The 1.6.0 patch doesn't contain a "Virama; => AlLakuna change;", as such. What it does contain is a 'fVirama' => fAlLakuna change because 'fVirama' was introduced specifically to support Sinhala. This change is not unrelated because it makes the code easier to read and understand by naming all the Sinhala specific code consistently. I need to also verify whether these changes help South Indian languages. Regards, Harshula
> > C) The cleanest way to add the patch is to add AlLakuna IMO
>
> Can you provide more detail here? It's a little hard for me to
> understand what the patch is doing.

In some Indic languages, when the Consonant + Virama + Consonant sequence is found, the characters should be treated as a single group that forms a conjunct. However, this is not the case for Sinhala and some other languages. Apparently this has not caused bugs for other languages, or they have found workarounds, but for Sinhala it is a bug and we couldn't find a neat "font hack". :-( In Sinhala, the conjunct should only be formed if there is an explicit ZWJ, i.e., Consonant + Virama + ZWJ + Consonant. The Virama is also called "Al lakuna", probably because of this difference. The state machine in indic-ot-class-tables.c allows only for the former.
Hi Owen, I've ported the patch to 1.8.0. Unfortunately, your fix for Bug 145233 breaks Sinhala rendering. Have a look at the attached text file whilst using pango 1.8.0, and compare it to the two images I've attached. The snippet in indic-fc.c: if (ZERO_WIDTH_CHAR (wcs[i])) glyph = 0; stops ZWJ from being emitted. For Sinhala the ZWJ is needed, as indicated in the SLS 1134 summary: http://fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf (for example, 0D9A + 0DCA + 200D + 0DBB). I'm not well versed with fonts; however, I suspect the correct glyph for the aforementioned sequence is only returned if pango emits the ZWJ. Hence we need to first fix the missing ZWJ issue. Since it is a Zero Width *Joiner*, perhaps pango should not suppress it? Regards, Harshula
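To make the effect of that snippet concrete, here is a small self-contained sketch of the same pattern. It is not the code from indic-fc.c: `is_zero_width` is a hypothetical stand-in for ZERO_WIDTH_CHAR that covers only ZWNJ and ZWJ, and `lookup_glyph` is a made-up font lookup that simply returns the code point as the glyph id.

```c
#include <stddef.h>
#include <stdint.h>

#define ZWNJ 0x200Cu
#define ZWJ  0x200Du

/* Hypothetical stand-in for ZERO_WIDTH_CHAR; the real macro may cover
 * more code points than these two. */
static int is_zero_width(uint32_t wc)
{
    return wc == ZWNJ || wc == ZWJ;
}

/* Hypothetical font lookup: pretend glyph id == code point. */
static uint32_t lookup_glyph(uint32_t wc)
{
    return wc;
}

/* Maps characters to glyphs, blanking every zero-width character. */
void map_to_glyphs(const uint32_t *wcs, size_t n, uint32_t *glyphs)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t glyph = lookup_glyph(wcs[i]);

        if (is_zero_width(wcs[i]))
            glyph = 0; /* the joiner never reaches later font processing */

        glyphs[i] = glyph;
    }
}
```

For the input 0D9A 0DCA 200D 0DBB, the third output slot becomes 0, so any later font rule that expects to see the joiner has nothing to match.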
Created attachment 35225 [details] [review] PATCH: fixes the implicit creation of conjuncts and emits ZWJ This patch also includes a quick workaround that allows ZWJs to be emitted. This patch is against pango 1.8.0.
[ Note I'm off on vacation right now, I'm not really going to be able to look at this until next week ] Is it really right for fonts to have rules including ZWJ? To my knowledge, none of the previously specified scripts for OpenType does that. Normally, the presence (or absence) of the ZWJ modifies the set of features that are applied to the adjoining characters.
Hmmm, thinking about this some more... there is nothing at all in the Unicode specification to indicate that the rendering rules for Sinhala should be different from those for other Indic languages. Can you give me a reference to indicate that the ZWJ is necessary? I'm concerned about creating a rendering system that isn't compatible with other rendering systems for Sinhala. (The fact that sequences with conjunct consonants are less common than sequences with an explicit virama doesn't necessarily imply anything about the encoding of those sequences.)
Hi Owen, Please see this document, which describes how Sinhala should be rendered: http://fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf . Harshula pointed to this earlier in this "thread". Also, Sinhala is different from other Indic languages in many ways. Please also note that this is about to be formally released as the new SLS 1134 by the Sri Lanka Standards Institute (SLSI).
Hi Owen, I think SLS 1134 has already been formally released. It was also announced on the indic @ unicode and unicode @ unicode mailing lists: http://www.lug.lk/lurker/message/20041127.141309.92571494.en.html Regards, Harshula
Hi Owen, Does emitting the ZWJ break any indic scripts? IIRC, the issue with Bug 145233 was that ZWNJ should not be emitted - which makes sense. There was no concern raised about the ZWJ being emitted. I don't quite understand the technical reason for suppressing a *joiner*? It appears this unicode proposal requires ZWJ to be emitted: http://www.unicode.org/review/pr-37.pdf "This proposal intends to rectify these problems, clarifying how the ZERO WIDTH JOINER is to be applied in scripts, and consolidating common mechanisms for equivalent problems that exist in several scripts." which was accepted: http://www.unicode.org/review/resolved-pri.html "Resolution: Closed 2004-08-24. UTC accepted the proposal and will create an Indic conjoining behavior model." Jump straight to page 15 to see some examples. Correct me if I'm wrong, but I can't see how those glyphs can be selected unless the ZWJ is emitted and available to the font. Regards, Harshula
The Unicode standard is not involved in the mechanisms of OpenType font handling. It just specifies the rendering for given input sequences. All other Indic languages handle the effects of ZWJ in Unicode text without using the ZWJ explicitly in GSUB processing. The ZWJ *has* to be stripped out of the final results. I don't know if any harm would come of stripping after GSUB processing rather than before. But I'm not very comfortable using a model for OpenType font processing which is different from the existing models for other languages. It would be good if this was brought up on the opentype mailing list. (See http://www.microsoft.com/typography/otspec/otlist.htm for subscription information.)
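The ordering question (strip the ZWJ before or after GSUB) can be pictured with a toy pipeline. This sketch is purely illustrative and assumes a font whose substitution rule matches the joiner explicitly: `apply_ligature` is a hypothetical stand-in for GSUB processing, `LIGATURE_GLYPH` is a made-up glyph id, and glyph ids are assumed to equal code points.

```c
#include <stddef.h>
#include <stdint.h>

#define AL_LAKUNA      0x0DCAu
#define ZWJ            0x200Du
#define LIGATURE_GLYPH 0xE000u /* hypothetical glyph id for the conjunct */

/* Toy "GSUB" pass: replaces X + AL_LAKUNA + ZWJ + Y with one ligature
 * glyph, in place. Returns the new glyph count. */
size_t apply_ligature(uint32_t *g, size_t n)
{
    size_t out = 0;
    size_t i = 0;

    while (i < n) {
        if (i + 3 < n && g[i + 1] == AL_LAKUNA && g[i + 2] == ZWJ) {
            g[out++] = LIGATURE_GLYPH;
            i += 4; /* the four-glyph sequence collapses into one */
        } else {
            g[out++] = g[i++];
        }
    }
    return out;
}

/* Removes every ZWJ from the glyph run, in place. Returns the new count. */
size_t strip_zwj(uint32_t *g, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (g[i] != ZWJ)
            g[out++] = g[i];
    }
    return out;
}
```

Running `apply_ligature` first and `strip_zwj` second turns 0D9A 0DCA 200D 0DBB into the single ligature glyph, with no joiner left in the final result. Running them in the opposite order leaves three glyphs, because the toy rule never sees the joiner. Whether real fonts should rely on such a rule is exactly what the thread is debating.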
Hi Owen, I wonder if suppressing only the ZWNJ (not the ZWJ) would have fixed bug 145233 ... > It would be good if this was brought up on the opentype mailing list. FYI, Microsoft has already released Sinhala fonts, and they use ZWJ explicitly in GSUB processing. See http://www.fonts.lk (the official Sinhala font resource of the ICTA) for some samples.
Hi Owen, Go to page 15 of: http://www.unicode.org/review/pr-37.pdf Could you clarify how the correct glyph would be chosen if Pango did NOT emit the ZWJ? Thanks, Harshula
Hi Owen, Please note that the standard encoding of Sinhala (and many other Indic fonts) specifies that the zwj be used to create some conjuncts. This has been accepted both by the Sri Lanka Standards Institute and by Unicode. The encoding is specified at http://www.fonts.lk/doc/Representation%20of%20Sinhala%20in%20Unicode.pdf You said: > All other Indic languages handle the effects of ZWJ in unicode text > without using the ZWJ explicitly in GSUB processing. O.K., in this case, how does the font figure out whether to create a conjunct character or two separate characters? Also, how can we create fonts which work in both Linux and WinXP, since the Windows Uniscribe DLL does pass through the zwj so that the font can do its job? Thanks, Gihan
Just so it's clear now, I'm not working on this at the moment and probably won't get a chance to do so until the middle of February ... in other words, I'm not ignoring your comments in particular, I'm ignoring Pango in general.
I want to argue a bit here, so let me first say that I am going to make the changes you want :-).

1. Encoding of Sinhala in Unicode

The rules that Pango follows are those in the Unicode standard, for all scripts. It is indisputable that the Sri Lanka Standards Institute knows more about the Sinhala script than the people working on the Unicode standard. However, it is impossible for me, as Pango maintainer, to track many individual national standards. And for some scripts, there are multiple relevant national standards bodies (think Mongolian, Arabic, Bengali, and so forth.)

Unfortunately, the Unicode standard has a completely insufficient specification of the encoding for Sinhala. :-( One hopes that it will be revised to match the national standard at some point. (I just sent mail to indic@unicode.org asking whether there are plans in this area.)

2. Encoding of OpenType fonts

Just because the encoding of a conjunct is:

  cons + al-lakuna + zwj + cons

doesn't mean that it necessarily has to be handled with a GSUB ligature of cons + al-lakuna + zwj + cons. The way that other Indic scripts work is that the presence of zwj/ alter the features that are applied.

Specific patch comments
=======================

* I note that the revised state table doesn't allow for the combination zwj + al-lakuna described in the "Touching Letters" section of the "Representation of Sinhala in Unicode". If this is in fact an issue, could you file a separate bug for that?

* The way I'm going to handle the zero-width joiner issue for now is conservative ... add a script flag that is set only for Sinhala (PROCESS_ZWJ) and do something different in that case.

I'll attach what I'm committing to CVS. It seems to work with the test case above.

[ BTW, would you prefer something other than "Harshula" in the ChangeLog credits? We generally credit contributors with their full name. ]

2005-03-03  Owen Taylor  <otaylor@redhat.com>

  * modules/indic/indic-ot.[ch] modules/indic-ot-class-tables.c: Split out handling of Sinhala al-lakuna character from handling of Virama in the state table to avoid implicit formation of conjuncts for Sinhala. (Patch from Harshula, #161981)

  * modules/indic/indic-fc.c modules/indic/indic-ot.h: Add a new script flag SF_PROCESS_ZWJ indicating whether zero width characters should be passed to gsub/gpos.

  * modules/indic/indic-ot-class-tables.c: Set SF_PROCESS_ZWJ for Sinhala. (#161981, Harshula)
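The per-script flag described in the ChangeLog could look roughly like the following. This is a hedged sketch rather than the committed patch: only the flag name SF_PROCESS_ZWJ comes from the ChangeLog, while its bit value, the `ScriptInfo` struct, the two script constants, and `glyph_for_char` are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag name taken from the ChangeLog; the bit value is an assumption. */
#define SF_PROCESS_ZWJ 0x1u

/* Hypothetical per-script descriptor. */
typedef struct {
    const char *name;
    uint32_t    script_flags;
} ScriptInfo;

/* Per the ChangeLog, only Sinhala sets the flag. */
static const ScriptInfo DEVANAGARI = { "Devanagari", 0 };
static const ScriptInfo SINHALA    = { "Sinhala", SF_PROCESS_ZWJ };

/* Simplified: only ZWNJ (U+200C) and ZWJ (U+200D) count as zero width. */
static bool is_zero_width(uint32_t wc)
{
    return wc == 0x200Cu || wc == 0x200Du;
}

/* Returns the glyph to hand to GSUB/GPOS: zero-width characters are
 * blanked unless the script asks for them to be passed through. */
uint32_t glyph_for_char(const ScriptInfo *script, uint32_t wc,
                        uint32_t font_glyph)
{
    if (is_zero_width(wc) && !(script->script_flags & SF_PROCESS_ZWJ))
        return 0;

    return font_glyph;
}
```

Keeping the decision in a script flag leaves the behaviour of every other Indic script untouched, which matches the conservative intent stated in the comment.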
Created attachment 38227 [details] [review] Patch I committed
Hi Owen,

> Unfortunately, the Unicode standard has completely insufficient
> specification of encoding for Sinhala. :-( One hopes that it
> will be revised to match the national standard at some point.

Yes, you are quite right. The revised national standard is quite recent; it was officially released in Feb. I think Gihan (ICTA) has submitted, or is submitting, the revised standard to Unicode.

> Doesn't mean that it necessarily has to be handled with a
> GSUB ligature of cons + al-lakuna + zwj + cons. The way that
> other Indic scripts work is that the presence of zwj/ alter the
> features that are applied.

Ok, maybe we can have an offline discussion about this.

> Specific patch comments
> =======================
>
> * I note that the revised state table doesn't allow for the combination
> zwj + al-lakuna described in the "Touching Letters" section of
> the "Representation of Sinhala in Unicode". If this is in fact an issue,
> could you file a separate bug for that?

Yes, this is an issue too. It didn't work in 1.6.0, so I didn't consider it a regression. I think I need to discuss the Indic implementation with Eric before I make any more changes. Indic encoding appears to also use ZWJ + Virama (http://www.unicode.org/review/pr-37.pdf). Obviously, I'm very unfamiliar with past design decisions. :-)

> * The way I'm going to handle the zero-width joiner issue for now is
> conservative ... add a script flag that is set only for Sinhala
> (PROCESS_ZWJ) and do something different in that case.

Good idea.

> I'll attach the what I'm committing to CVS. It seems to work with the
> test case above.

I had a look at the patch and tested it out. Seems fine, it's only a minor difference.

> [ BTW, would you prefer something other than "Harshula" in the ChangeLog
> credits? We generally credit contributors with their full name. ]

I'm not particularly fussed. My full name is Harshula Jayasuriya.

Regards, Harshula