GNOME Bugzilla – Bug 101079
opentype font support for diacritics for Latin/Greek/Cyrillic letters
Last modified: 2006-03-07 05:54:39 UTC
Combining diacritical marks for Latin/Greek/Cyrillic letters are not supported. The Code2000 font has opentype tables for some (if not all) of them, and Yudit 2.6 can render them correctly with the Code2000 font. However, Pango doesn't seem to be able to. A sample text is available at <http://www.columbia.edu/kermit/st-erkenwald.html>. The text has sequences like <U+0068, U+0305> and <U+0069, U+0304>. U+0304 and U+0305 have to be rendered above the base characters <U+0068> and <U+0069>, but they're rendered to the left of them instead.
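For readers who want to inspect the sequences mentioned above, here is a minimal sketch in Python (not Pango code; the helper name `describe` is made up for illustration). It only uses the standard `unicodedata` module to show that U+0304 and U+0305 are nonspacing marks whose canonical combining class is 230, i.e. "above".

```python
import unicodedata

def describe(seq):
    """Return (code point, name, general category, combining class) per character.

    Illustrative helper only; it is not part of Pango or of any font tool.
    """
    return [
        (f"U+{ord(c):04X}", unicodedata.name(c),
         unicodedata.category(c), unicodedata.combining(c))
        for c in seq
    ]

if __name__ == "__main__":
    # The two sequences quoted in the report.
    for sample in ("h\u0305", "i\u0304"):
        for row in describe(sample):
            print(row)
```

A combining class of 230 is what tells a renderer that the mark belongs above its base rather than beside it.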
It seems like basic-xft does take care of diacritic combining marks with simple overstriking and some heuristics. Unfortunately, it doesn't work well for the sample text I tried with Code2000 font. I'm changing the summary line because diacritic combining marks are supported but opentype tables are not made use of when present in a font.
I found a family of fonts with opentype tables for virtually all Latin/Greek/Cyrillic diacritical combining marks at http://www.sil.org/~gaultney/gentium/index.html When implementing opentype support for Latin/Greek/Cyrillic, these fonts would be of great help.
I'm confused about this Gentium font. pango/pango/opentype/ottest doesn't list any opentype tables for it.
$ ./ottest /home/nlevitt/.fonts/Gentium\ Release\ 1/GenR1.ttf
----> GSUB <----
TT_Load_GSUB_Table 8e
----> GPOS <----
TT_Load_GPOS_Table 8e
Sorry for the confusion. I forgot to clarify. Until a week ago, I thought it was an opentype font, but it turned out NOT to be. When I mentioned it as an opentype font, the download link to the font didn't work and I somehow assumed that it was an opentype font. (I returned to the site several times, but it didn't work.) A week ago, when I finally downloaded it and read the README file, I realized that it's a dumb truetype font. Code2000 font by James Kass may have some support of diacritics for Latin/Greek/Cyrillic, but I'm not sure.
oops. I forgot what I had written earlier. Code2000 font does have OT tables for some diacritic marks for Latin/Greek/Cyrillic.
Created attachment 20146 [details] [review] first attempt
Created attachment 20147 [details] sample output
The font in the sample image is code2000. I don't know if the rendering is correct or not (in the sense that the opentype rules are applied correctly). We really need a font with some sample strings and correct renderings to check against. This patch is just sort of a proof of concept. Stuff still needs to be worked out. For starters, there is opentype kerning and (I guess) "regular" kerning. This patch skips the regular kerning if the font has opentype kerning. But I'm not sure that's the right thing to do if the font has kerning for one or more but not all of the scripts (latn, cyrl, grek, armn, geor, runr, ogam). Another question is whether and which discretionary ligatures should be on by default.
Oh, there is the erkenwald link in the first comment. I won't flood you with more attachments. Suffice it to say that without the patch it renders wrong, and with the patch it renders more than half right.
> This patch skips the regular kerning if the font has
> opentype kerning.
I'm not sure either, but it's likely that you should not skip 'regular' kerning. Have you tried the other way around (i.e. turn off OT kerning and leave alone 'regular' kerning)?
BTW, basic shaper has a 'best-effort guessing' code for combining characters and your patch invokes the OT shaping function after that. You might have to block it 'selectively' (??). To determine whether or not to block it, we may have to go really 'deep' into OT 'internals'. Alternatively, we may block it per script (or unicode block)
> BTW, basic shaper has a 'best-effort guessing' code for combining
> characters and your patch invokes the OT shaping function after that.
> You might have to block it 'selectively' (??). To determine whether or
> not to block it, we may have to go really 'deep' into OT
> 'internals'. Alternatively, we may block it per script (or unicode
> block)
That's a good point. Fortunately, I did a tiny bit of testing, and it appears that the opentype rules override the heuristics. I tried my sample file and the only part that changed was the Greek part (Code2000 has no tables for Greek). Tests without Greek, like the Erkenwald one, turned out identical. I don't know if we just got lucky or what...
Doesn't sound like this patch is really ready to go in for 1.4.
Pango's rendering of IPA using the Doulos SIL font also has this problem. I have attached several files that demonstrate the problem and what it should look like. The Doulos SIL font is freeware and can be downloaded from http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=DoulosSILfont I am using version 4.0.4 for these screenshots, but the only difference between 4.0.10 and 4.0.4 is that 4.0.10 has a couple of supplementary plane characters. Particular attention has been paid to the opentype and graphite tables in this font so that it renders well on Windows with recent versions of uniscribe. Thus it is a good test for the opentype rendering in pango.
ipa_test.txt - UTF-8 encoded IPA test file.
gedit-2.6.2_pango-1.6.0.png - screenshot of gedit 2.6.2 and pango 1.6.0 showing the test data. Diacritics are about the width of an average glyph too far to the left. Also, when there are multiple diacritics there is no vertical separation, so they just go on top of each other in a confused mess (line 4). Ligatures (line 3) are not working either.
notepad.txt - same test file (but converted to DOS linebreaks) in Notepad.exe with a recent version of Uniscribe. This is what gedit and pango should be able to output using just the opentype tables in the font.
worldpad.png - the same data rendered using the graphite tables in the font. What it would ideally look like. Using Worldpad 2.0.2004.6259 on Windows 2000.
Created attachment 32983 [details] screenshot demonstrating the bug
Created attachment 32984 [details] UTF-8 encoded IPA test data
Created attachment 32985 [details] expect output from opentype in notepad
Created attachment 32986 [details] Ideal output (requires graphite rendering)
It seems the first attempt patch looks for the 'kern' tag. Doulos SIL seems to be using the 'mark' tag for base+mark glyphs and the 'mkmk' tag for *+mark+mark glyphs. I don't know how the code should be changed for pango to let the font handle anchors. This issue is quite serious, since some languages use diacritics on base characters that Unicode doesn't have like 'à'. Unicode leaves the issue for the fonts to handle. Maybe the severity should be more than "normal".
Noah, any news on this? Handling mark and mkmk should be easy these days. Do you have time to finish this?
Created attachment 52690 [details] [review] patch that fixes mark positioning This patch fixes the mark positioning. It is followed by three sample outputs. Code2000 is printed properly for marks, but mkmk doesn't work. Doulos SIL should work, but mark isn't working. Here is the error when using pangoft2topgm --font="Doulos SIL" on a file with diacritics.
Created attachment 52691 [details] Code2000 mark Code2000 mark, diacritics on latin extended letters
Created attachment 52692 [details] Doulos SIL mark Doulos SIL mark, diacritics with latin-extended letters. This will return an error: (process:11612): Pango-WARNING **: Error loading GPOS table 4096
Created attachment 52693 [details] Code2000 mark and mkmk Code2000 mark and mkmk, the IPA sample. As you can see mkmk are handled.
sorry about the spam. Comment #23 "As you can see mkmk are handled." I meant they _aren't_ handled.
Thanks to behdad for his help. Doulos SIL seems to have a bug: ttx isn't able to convert it to xml due to a GPOS error. Behdad mentioned we could work around the OT specs to have this work. Code2000 lacks ligatures for accented Is, though it does have definitions for diacritics placement. I'm currently working on a few fonts. Junicode 6.5 beta has diacritics placement definitions for vowels; mark and liga seem to work properly. Test for yourselves: http://home.sus.mcgill.ca/~moyogo/lingala/fonts/Junicode-20051005-patch.zip It contains Regular, Italic, Bold and Bold Italic. The first three have the mark diacritics placed correctly. mkmk seems to be an issue; test with attachment (id=32984), but the following attachment is clearer for testing Junicode. Pango does not place the diacritics correctly with Junicode Bold Italic, nor does it give any error, and yet the font has the definitions. Ttx does not have any problem converting Junicode Bold Italic to xml. I'm also working on another font. It has a similar problem to Junicode Bold Italic: the marks are set, but pango does not use them and does not print any errors. Yet Ttx finds a bug in GPOS. Test it: http://home.sus.mcgill.ca/~moyogo/lingala/fonts/Ubuntu-Title.otf Ligatures and diacritics are very important, not only in non-Latin scripts. Why aren't they always used? Or at least why aren't they used for Latin scripts and private Unicode blocks?
Actually, Code2000 and Pango not rendering i+dieresis (U+0069 + U+0308) as ï (U+00EF) is not a bug in Code2000 but in Pango. Pango should know from Unicode that they are canonically equivalent and therefore use the precomposed glyph if available in that font. Should I open another bug or leave that for here?
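The canonical equivalence referred to here can be checked with a short Python sketch using the standard `unicodedata` module. This is an illustration of the Unicode property only, not of how Pango does or should implement it; the function name `canonically_equivalent` is made up for the example.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

decomposed = "i\u0308"   # LATIN SMALL LETTER I + COMBINING DIAERESIS
precomposed = "\u00ef"   # LATIN SMALL LETTER I WITH DIAERESIS

if __name__ == "__main__":
    print(canonically_equivalent(decomposed, precomposed))
    # NFC composes the pair into the single precomposed code point.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Sequences with no precomposed form, such as <U+0068, U+0305> from the original report, are left unchanged by NFC, so a font's mark positioning is still needed for them.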
I'm currently adding anchors for mark and base glyphs to some DejaVu Fonts. Pango has a really strange behaviour. For DejaVu Mono without anchors it will give a 4097 error, yet ttx translates the font to xml without any warning or error. I've only managed to get DejaVu Sans and Sans Bold to have the mark working. All my other attempts simply don't render them, with no error message from either Pango or ttx.
It seems the error with the fonts that didn't work in Pango was due to the version of Fontforge I was using. Another Fontforge related bug is still there, but it only occurs depending on the order in which one adds anchor types. There is a Junicode-Regular ttf file that has the OpenType features for Latin script, and it can be fetched at http://home.sus.mcgill.ca/~moyogo/lingala/fonts/Junicode-Regular.ttf Use the gzipped patch attachment (id=52690) with this font and the file attachment (id=32984) or http://home.sus.mcgill.ca/~moyogo/lingala/fonts/text.utf8 The result, as far as the features are included in the font, is as close to Notepad or even Worldpad for some features. I will provide more fonts with similar features if needed for testing, but it seems to me the patch can be reviewed and tested in CVS.
Created attachment 53890 [details] Gedit with Junicode and the utf8 file Here's a screenshot of Junicode-Regular with OpenType features for Latin script diacritics in Gedit with the previous utf8 example with extra characters. The extra characters are there because I could not reproduce Doulos SIL's exact behaviour with one sequence of characters, but managed to with other sequences. This is a font issue, or even an OpenType issue.
Created attachment 53961 [details] [review] cleaned up patch for OpenType features for Latin and other basic Scripts Here's the cleaned up version of the patch. It handles mark, mkmk, kern for GPOS and clig, liga and ccmp for GSUB. I think it can be reviewed and go into CVS. I can provide more fonts and more text samples if you need to test it more extensively. All the fonts I've tested that have the features work. I assumed too much previously: when the display was incorrect, it was due to the font not having the right features and not due to Pango randomly acting up. Doulos SIL is broken for Pango, but that's another issue; it should not delay other OpenType fonts from being displayed correctly for the languages needing them. Code2000 isn't substituting dotless i and j when accented, so that could be another bug. Should Pango use Unicode data to know i is composed and therefore use the right glyph according to context, or should it always be defined in the font?
Can you please explain more fully why you think Doulos SIL is broken? The font has been extensively tested with the uniscribe (Microsoft) and the InDesign (Adobe) shaping engines and works correctly with them. We know it doesn't work correctly with pango (released versions), qt or the version of icu that is in openoffice2 on linux. But that is due to limitations of those shaping engines. It may work correctly with more recent versions of icu. We will try to find out about that. Of course this is no guarantee that there is not a bug in the font, but it makes it less likely. If there is a problem in Doulos SIL that you can pinpoint then we can arrange to get it fixed. And now is a good time for that as there is currently a new version in beta. Anyway, wherever the problem is, it would be good to get it fixed. Doulos SIL uses 4 GPOS lookups and 13 GSUB lookups:
GPOS - type1 single adjustment for advancewidth (this may be new), mark, mkmk udia, mkmk ldia
GSUB - context for dotless i, subst for dotless i, replacement dotless i when precompose with lower diacritic, precomposed replacements, romanian overstrikes, romanian precomposed, multi-way alternates, single alternates, vietnamese overstrikes, vietnamese precomposed, ffi replacement, pitch ligatures replacement
It also uses lots of features and other stuff. Thanks for your work on this bug. I am very interested to see it resolved. Hope this is somewhat helpful.
cc myself
Did the new patch solve comment #26 i+dieresis (U+0069 + U+0308) => ï (U+00EF)?
The patch only uses kern, mark, mkmk for GPOS and ccmp, clig, liga for GSUB as defined on http://www.microsoft.com/typography/otfntdev/standot/features.htm . Comment #26 is a bug in the font, as far as this patch is concerned. Should Pango do the extra work for fonts missing the feature? With the patch Pango can load Doulos SIL's GSUB without any problem. Ligatures and substitutions like i+dieresis (U+0069 + U+0308) => ï work, except the diacritic is misplaced since Doulos SIL's GPOS triggers an error see comment #22. I was able to have other fonts working with both GPOS and GSUB. Greg, can you run ttx from fonttools on DoulosSILR.ttf? I don't know what to do with the error. Should other features be available for basic scripts?
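To check which of these features a given font actually declares, the FeatureList of its GPOS or GSUB table can be read directly. The following is a minimal Python sketch, assuming the caller has already extracted the raw FeatureList bytes from the font; it is not taken from the patch, and the names `WANTED_GPOS`, `WANTED_GSUB`, `feature_tags` and `missing` are made up for illustration. It relies only on the documented layout: a 16-bit count followed by records of a 4-byte tag and a 16-bit offset.

```python
import struct

# Feature tags named in the comment above; the tuple names are illustrative only.
WANTED_GPOS = ("kern", "mark", "mkmk")
WANTED_GSUB = ("ccmp", "clig", "liga")

def feature_tags(feature_list):
    """Return the feature tags stored in a raw OpenType FeatureList table."""
    (count,) = struct.unpack_from(">H", feature_list, 0)
    tags = []
    for i in range(count):
        # Each FeatureRecord is a 4-byte tag followed by a 16-bit offset.
        tag, _offset = struct.unpack_from(">4sH", feature_list, 2 + 6 * i)
        tags.append(tag.decode("ascii"))
    return tags

def missing(wanted, feature_list):
    """Return the wanted tags that the FeatureList does not provide."""
    present = set(feature_tags(feature_list))
    return [tag for tag in wanted if tag not in present]

if __name__ == "__main__":
    # Synthetic FeatureList with two records, for demonstration only.
    sample = (struct.pack(">H", 2)
              + struct.pack(">4sH", b"mark", 14)
              + struct.pack(">4sH", b"mkmk", 20))
    print(feature_tags(sample), missing(WANTED_GPOS, sample))
```

A font that reports a wanted tag as missing cannot be expected to benefit from that part of the patch, which matches the earlier observation that incorrect display was often due to the font lacking the feature.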
I have tested the patch with Doulos SIL. I am getting two GPOS errors, the same as in comment 22. I have also run ttx on the font and am getting a stack trace that ends with an assertion error "assert r.StartCoverageIndex == len(glyphs), \ AssertionError: (20, 0)". I have turned on the logging in pango/opentype/ftglue.c line 13 and run gedit. I will attach the output of the log. That should help in tracking down the source of the problem. I have checked with some of the font designers and they say that the font is fine with current versions of icu and also Mellel's shaping engine. This is in addition to uniscribe and indesign (except for two or more stacked diacritics). It seems to me that there is some incompatibility between Doulos SIL and freetype. Your patch is just causing the OT tables to be loaded and thus triggering the error. So I think the problem must be in either freetype or the font. Behdad, do you agree with this, and if so should we move the discussion elsewhere? About comment 26. Windows always substitutes the precomposed form for the decomposed form. OS X does not. Of course if the font and the OS both support the right shaping the final result will look the same. I am not sure that either way is the "correct" way for this. As for your question about the other features I will have to check about that.
Created attachment 54458 [details] log from ftglue.c log from ftglue.c of gedit opening attachment 32984 [details] with the Doulos SIL font selected
The patch gives the same GPOS error with Charis SIL. http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=CharisSIL_download
Yes, I expected that. Doulos SIL and Charis SIL are not independent fonts. The glyphs are different but the opentype tables are as similar as they can be made. This is primarily to reduce the amount of work involved.
This comment is quite separate from my previous comments. It has nothing to do with Doulos SIL or Charis SIL. I think that the fallback positioning used in your patch can be improved. That is how diacritics are positioned when there are no opentype tables. Currently you are putting the diacritic back a fixed amount each time. This works well if the previous character is of average width. However, it is not so good if the previous character is wider or narrower than usual. It would be better to center the diacritic over (or under) the previous character. This will not be exactly right in every case (e.g. if the previous character is a j) but it will look better for a lot more cases. Also, if there is more than one diacritic they land on top of each other. It would be good if second and subsequent diacritics could move up (or down) a bit so they don't land on top of each other. Qt does this and it really looks better. I will attach a screenshot so you can see. The Gentium font (also from scripts.sil.org) is a good font to test this with as it has no opentype tables, contrary to what was suggested in comment 2.
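The centering and stacking idea can be sketched as follows. This is a rough Python illustration, not the patch's code and not Qt's algorithm; the function name `fallback_mark_offsets`, its parameters, and the simplifications (marks above only, each mark's ink starting at its origin, y growing upward, a fixed `line_gap` between stacked marks) are all assumptions made for the example.

```python
def fallback_mark_offsets(base_width, mark_widths, line_gap):
    """Return one (dx, dy) offset per mark, relative to the pen position
    just after the base glyph.

    Hypothetical helper: assumes marks stack above the base, that each mark's
    ink starts at its origin and spans mark_width, and that y grows upward.
    """
    offsets = []
    for level, mark_width in enumerate(mark_widths):
        # Move back over the base, then indent so the mark is centered on it.
        dx = -base_width + (base_width - mark_width) / 2
        # Raise each subsequent mark by one more line_gap to avoid overlap.
        dy = level * line_gap
        offsets.append((dx, dy))
    return offsets

if __name__ == "__main__":
    # Average-width base versus a narrow base, two stacked marks each.
    print(fallback_mark_offsets(600, [400, 400], 150))
    print(fallback_mark_offsets(200, [400, 400], 150))
```

Unlike a fixed backward shift, the horizontal offset here depends on the base width, so wide and narrow bases both get a centered mark.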
Created attachment 54460 [details] screenshot of qt's rendering of Doulos SIL screen shot of Doulos SIL in kedit showing how good it can look without the use of opentype tables
Greg, would you please attach the test text used in comment 40, and a pointer to the font used please? I want to get it working in Pango! About Doulos, I'm investigating.
The text used in comment 40 is the same UTF-8 encoded IPA test data: attachment (id=32984) The font used in the screenshot attachment 54460 [details] is Doulos SIL, but since QT doesn't use GPOS or GSUB, it is as if they weren't there. QT places the diacritics by itself, it doesn't use the tables (see comment 39). Gentium is at http://www.sil.org/~gaultney/gentium/index.html (see comment 2 )
Created attachment 54465 [details] screenshot of qt placing Gentium diacritics Here's a screenshot of the exact same file with Gentium instead of Doulos SIL in Kedit.
Denis's summary in comment 42 is spot on.
About comment #39, as a fallback case, substituting decomposed characters by composed characters if available would yield better results (at least with GPL Vietnamese fonts i have)
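A minimal Python sketch of that fallback, assuming the set of code points covered by the font is already known (here passed in as a plain set, which is a stand-in for a real cmap lookup). The helper names `clusters` and `compose_where_available` are made up for illustration and are not Pango API.

```python
import unicodedata

def clusters(text):
    """Split text into base characters, each followed by its combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def compose_where_available(text, font_code_points):
    """Use the NFC form of a cluster only if the font covers all of it.

    font_code_points is a hypothetical set of code points the font has glyphs for.
    """
    result = []
    for cluster in clusters(text):
        composed = unicodedata.normalize("NFC", cluster)
        if all(ord(c) in font_code_points for c in composed):
            result.append(composed)
        else:
            result.append(cluster)  # keep the decomposed sequence as-is
    return "".join(result)

if __name__ == "__main__":
    font = {0x68, 0x69, 0x305, 0x308, 0xEF}
    print(compose_where_available("i\u0308h\u0305", font) == "\u00efh\u0305")
```

Clusters that have no precomposed form, or whose precomposed glyph is missing from the font, pass through untouched and still rely on mark positioning.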
Created attachment 54758 [details] backtrace of failure to load GPOS table in Doulos SIL I have tracked down the point in the code at which pango fails to load the GPOS table in Doulos SIL. The attachment is a backtrace with the values of local variables made by setting a breakpoint where the error is first detected. To duplicate this problem:
1) apply the patch in attachment 53961 [details] [review] to pango 1.10.1
2) install Doulos SIL 4.0.10
3) set the font in gedit to Doulos SIL
4) open gedit (I used 2.10.5) but I doubt that the exact version matters much
Ok, I committed two patches:
2005-11-17 Behdad Esfahbod <behdad@gnome.org>
        Part of #101079:
        * pango/opentype/ftxopen.c (Load_Lookup): In extension subtables, offset is relative to the extension subtable, not the original table. (Greg Aumann)
        * pango/opentype/ftxgpos.c (Load_BaseArray): When reading BaseAnchor, skip offsets that are zero. Works around bug in Doulos SIL Regular.
============
I believe Doulos SIL Regular is wrong here:
(01:03:07) behdad: the GPOS BaseArray is at 0x0E40
(01:03:29) behdad: ClassCount is 6
(01:03:41) behdad: BaseCount is 0x2C1 which is 705
(01:03:52) behdad: so we expect 705 records of 6 offsets each
(01:04:09) behdad: and look what follows, all two zero bytes, followed by 10 nonzero, repeat
(01:04:24) behdad: the zero bytes are wrong
Testing is appreciated. See if this fixes your favorite problem.
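The record layout described in the log (a count, then one 16-bit offset per mark class for each base glyph) and the "skip zero offsets" workaround can be illustrated with a small Python sketch. This is not the code in ftxgpos.c; the function name `read_base_array` is made up, and the input is assumed to be the raw bytes of a BaseArray table with the class count supplied by the caller.

```python
import struct

def read_base_array(data, class_count):
    """Parse a raw GPOS BaseArray: a 16-bit BaseCount followed by BaseCount
    records of class_count 16-bit anchor offsets.

    Illustrative only: a zero offset is reported as None (no anchor to load)
    instead of being treated as an error.
    """
    (base_count,) = struct.unpack_from(">H", data, 0)
    records = []
    pos = 2
    for _ in range(base_count):
        offsets = struct.unpack_from(f">{class_count}H", data, pos)
        pos += 2 * class_count
        records.append([off if off != 0 else None for off in offsets])
    return records

if __name__ == "__main__":
    # Synthetic table: 2 base glyphs, 3 mark classes, class 0 offsets are zero.
    data = (struct.pack(">H", 2)
            + struct.pack(">3H", 0, 0x20, 0x26)
            + struct.pack(">3H", 0, 0x2C, 0x32))
    print(read_base_array(data, 3))
```

With 6 classes each record is 12 bytes, which matches the pattern in the log of two zero bytes followed by ten nonzero bytes.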
I have tested pango 1.10.1 with the patch in attachment 53961 [details] [review] and the two fixes mentioned in comment 47. I tested with Doulos SIL 4.0.10 and 4.0.14 and Charis SIL 4.0.2. These have fixed the incompatibility problems with pango and these fonts. I will attach two screenshots so you can see. There is still one minor issue with the diacritics under U+0260 LATIN SMALL LETTER G WITH HOOK (end of the fourth line of attachment 32984 [details]). The two diacritics are being placed on top of each other. However, this is also happening in the uniscribe screenshot (attachment 32985 [details]). And in fact pango is rendering the test data a little better than Notepad.
Created attachment 55124 [details] screenshot of Charis SIL 4.0.2 Screenshot of patched pango rendering Charis SIL 4.0.2
Created attachment 55125 [details] screenshot of Doulos SIL 4.0.10 Screenshot of patched pango rendering Doulos SIL 4.0.10
Re: comment 47 and the zeros in the GPOS BaseArray. This response is a summary of comments from Bob Hallissy.
As Behdad noted, all of the offsets for BaseAnchor[0] are 0. The key to understanding this is to look at the MarkArray and notice that none of the covered mark glyphs are given class 0. Therefore there is no need to provide a base glyph anchor for class 0 marks. Thus the font is internally consistent. In effect class 0 is unused. The reason for not using class 0 is to use the same glyph classes everywhere in the font. For GDEF the mark classes are always 1..n (omitting 0) (see the example in the spec). We use the same class numbers for all other uses, so we end up with 0 being unused.
Not sure the spec explicitly discusses the case where a given mark class is empty. The closest thing would be the statement in the BaseArray table description that says: "A BaseRecord declares one Anchor table for each mark class (including Class 0) identified in the MarkRecords of the MarkArray." Noting that, in Doulos, Class 0 is not "identified in the MarkRecords of the MarkArray", no Anchor table is needed for it.
However, it should be noted that other opentype shaping engines have no problem with null offsets here. Thus it is not really a bug in the font, nor is it a pango bug, but really a grey area in the spec. Given that other opentype shaping engines are fine with null offsets, it is probably best that pango includes the second of Behdad's patches.
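The consistency argument above can be expressed as a small check: a null base anchor offset is harmless as long as no mark in the MarkArray uses that class. The Python sketch below assumes raw MarkArray bytes (a 16-bit count, then records of a 16-bit class and a 16-bit anchor offset) and base records already read as sequences of offsets; the function names are made up for illustration and this is not how any shaping engine is known to implement it.

```python
import struct

def used_mark_classes(mark_array):
    """Return the set of mark classes that appear in a raw GPOS MarkArray."""
    (mark_count,) = struct.unpack_from(">H", mark_array, 0)
    # Each MarkRecord is (markClass, markAnchorOffset), 4 bytes in total.
    return {struct.unpack_from(">HH", mark_array, 2 + 4 * i)[0]
            for i in range(mark_count)}

def null_anchors_are_harmless(base_records, mark_array):
    """True if every zero base anchor offset belongs to an unused mark class.

    base_records: sequences of per-class anchor offsets, one per base glyph
    (hypothetical input format chosen for this example).
    """
    used = used_mark_classes(mark_array)
    return all(offset != 0
               for record in base_records
               for cls, offset in enumerate(record)
               if cls in used)

if __name__ == "__main__":
    # Marks use classes 1 and 2 only, as described for Doulos SIL.
    marks = (struct.pack(">H", 2)
             + struct.pack(">HH", 1, 0x10)
             + struct.pack(">HH", 2, 0x16))
    print(null_anchors_are_harmless([(0, 32, 38)], marks))
```

A font passing this check can never ask for the missing anchor during mark attachment, which is why skipping zero offsets at load time is safe for it.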
Thanks Greg for the clarifications. That's exactly what I thought, that class 0 is not used. Still it wouldn't harm to point it to a useless small (zero-item?) lookup table. Anyway, the current approach in Pango takes care of that. All good. I'll review and commit the shaper patch soon.
Created attachment 55143 [details] [review] Committed patch. A reworked patch committed. Sweet rendering of Doulos confirmed. Happy all :)
2005-11-23 Behdad Esfahbod <behdad@gnome.org>
        * modules/basic/basic-fc.c: Reworked basic shaper with OpenType support. (#101079, based on patch from Denis Jacquerye and Noah Levitt)
        * modules/basic/basic-fc.c (basic_scripts): Added Unicode 4.1 addition script PANGO_SCRIPT_GLAGOLITIC that is a "simple" script.
        * modules/arabic/arabic-fc.c, modules/syriac/syriac-fc.c: Replace g_utf8_to_ucs4_fast() with g_utf8_strlen()!
        * pango/opentype/pango-ot-ruleset.c (pango_ot_ruleset_add_feature): Remove reference in docs to pango_ot_ruleset_shape() that was removed long ago.
I opened a couple of bugs that come from this discussion: Bug 322234: Diacritics should not overlap Bug 322273: Pango should use canonical decomposition data
Created attachment 60815 [details] Comparing bluefish with yudit under pango-1.10.4 I still experience a similar problem on my Gentoo Linux box with Pango-1.10.4 installed. I tried DejaVu, Doulos SIL, and Gentium fonts but none of them seemed to display these combining characters correctly. Here is a screenshot comparing bluefish with yudit while displaying the same characters.
Pango-1.11.1 fixes this for fonts with the OpenType features. Gentium and other fonts actually don't have anchors for diacritics; Pango doesn't handle that yet. Doulos SIL, Charis SIL and DejaVu should work.
The fix is only in 1.11.x and later.