Bug 485556 - Update pango_is_zero_width
Status: RESOLVED OBSOLETE
Product: pango
Classification: Platform
Component: general
Version: unspecified
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: pango-maint
QA Contact: pango-maint
Reported: 2007-10-10 23:50 UTC by Behdad Esfahbod
Modified: 2018-05-22 12:34 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to change pango_is_zero_width to return Default Ignorables (3.76 KB, patch)
2009-11-23 23:21 UTC, Roozbeh Pournader

Description Behdad Esfahbod 2007-10-10 23:50:23 UTC
This FAQ: http://www.unicode.org/faq/unsup_char.html

Suggests that the following may need to be added to is_zero_width():

  - Jamo filler characters (e.g., U+115F HANGUL CHOSEONG FILLER)
  - variation selectors

but they may well need more support before being removed, variation selectors especially.  Anyway, there's more to do to fully support that FAQ.  Let this be the placeholder bug.
Comment 1 Behdad Esfahbod 2009-11-23 04:19:17 UTC
Quick update: In harfbuzz we support variation selectors and silently drop them if the font doesn't support them.
Comment 2 Behdad Esfahbod 2009-11-23 04:26:58 UTC
Also, from Chapter 5 of the Unicode Standard:

To allow a greater degree of compatibility across versions of the standard, the ranges U+2060..U+206F, U+FFF0..U+FFFB, and U+E0000..U+E0FFF are reserved for format and control characters (General Category = Cf). Unassigned code points in these ranges should be ignored in processing and display. For more information, see Section 5.21, Default Ignorable Code Points.
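
A minimal sketch of the range test that the quoted passage implies. The function name in_reserved_format_range is hypothetical and is not part of Pango or HarfBuzz; only the three ranges come from the quoted text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a Pango API): true if ch lies in one of the
 * three ranges that the quoted passage reserves for format and control
 * characters. */
static bool
in_reserved_format_range (uint32_t ch)
{
  return (ch >= 0x2060  && ch <= 0x206F) ||
         (ch >= 0xFFF0  && ch <= 0xFFFB) ||
         (ch >= 0xE0000 && ch <= 0xE0FFF);
}
```

A caller would still need to know whether a code point in these ranges is unassigned before applying the "should be ignored" rule; that lookup is outside this sketch.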
Comment 3 Roozbeh Pournader 2009-11-23 23:19:08 UTC
The whole list of characters with "Default Ignorable" property is defined here:

http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

(Search for Default_Ignorable_Code_Point). I can definitely confirm that all should be treated as pango_is_zero_width-positive.

Still, pango_is_zero_width may need clearer documentation. Saying something like 'with "ZERO WIDTH" in their name' is so non-kosher: UTC hates people using a character name for deriving any property. They are just names, and should only be used by human beings for identification purposes.

We may simply change the definition of the function to say all characters with the Default Ignorable property.

The only character that is pango_is_zero_width at the moment but not Default Ignorable, is U+2028 LINE SEPARATOR. It has been there since day 1, but I think it's a mistake.
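
To make the proposal in comment 3 concrete, here is a rough sketch of a table-driven Default_Ignorable_Code_Point check. It is not the attached patch. The table is a hand-copied, deliberately incomplete subset, and the names CodeRange and is_default_ignorable_subset are made up for illustration; a real implementation should be generated from DerivedCoreProperties.txt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t first, last; } CodeRange;

/* Illustrative subset only; NOT the full Default_Ignorable_Code_Point list. */
static const CodeRange default_ignorable_subset[] = {
  { 0x00AD,  0x00AD  }, /* SOFT HYPHEN */
  { 0x034F,  0x034F  }, /* COMBINING GRAPHEME JOINER */
  { 0x115F,  0x1160  }, /* Hangul choseong and jungseong fillers */
  { 0x200B,  0x200F  }, /* ZWSP, ZWNJ, ZWJ, LRM, RLM */
  { 0x202A,  0x202E  }, /* bidi embedding and override controls */
  { 0x2060,  0x206F  }, /* WORD JOINER and other format characters */
  { 0xFE00,  0xFE0F  }, /* variation selectors 1..16 */
  { 0xFEFF,  0xFEFF  }, /* ZERO WIDTH NO-BREAK SPACE */
  { 0xE0000, 0xE0FFF }, /* tags and variation selectors supplement */
};

/* Hypothetical helper: linear scan over the subset table above. */
static bool
is_default_ignorable_subset (uint32_t ch)
{
  size_t n = sizeof default_ignorable_subset / sizeof default_ignorable_subset[0];
  for (size_t i = 0; i < n; i++)
    if (ch >= default_ignorable_subset[i].first &&
        ch <= default_ignorable_subset[i].last)
      return true;
  return false;
}
```

With this table, U+2028 LINE SEPARATOR is not matched, which is the one behavioural difference from the current function that comment 3 points out.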
Comment 4 Roozbeh Pournader 2009-11-23 23:21:10 UTC
Created attachment 148352
Patch to change pango_is_zero_width to return Default Ignorables

Keeping with the tradition of doing a patch, if trivial.

There are some possible optimizations, like replacing range comparisons with bitwise operators and equality matches. But not knowing if they would help in any way, I went for keeping the code more understandable. Feel free to replace.
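
The patch itself is not reproduced on this page. As an illustration of the kind of optimization comment 4 mentions, the sketch below shows one range whose bounds happen to be aligned to a 16-code-point block, so the two comparisons can be replaced by a mask and an equality test. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Readable form: two comparisons for U+FE00..U+FE0F. */
static bool
in_vs_range_cmp (uint32_t ch)
{
  return ch >= 0xFE00 && ch <= 0xFE0F;
}

/* Equivalent form: clear the low four bits, then compare once.
 * This only works because the range starts on a multiple of 16
 * and is exactly 16 code points long. */
static bool
in_vs_range_mask (uint32_t ch)
{
  return (ch & ~(uint32_t) 0xF) == 0xFE00;
}
```

Ranges such as U+200B..U+200F are not aligned this way and would still need ordinary comparisons, and a compiler may well generate similar code for both forms, which fits the comment's doubt about whether the rewrite helps.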
Comment 5 Behdad Esfahbod 2009-11-23 23:58:27 UTC
CC'ing Jonathan.

Thanks Roozbeh.  I'm not particularly interested in pango_is_zero_width() docs or semantics, since we are moving to HarfBuzz and that function will be kinda deprecated.  But yeah, I like documenting it as Default_Ignorables.

What I'm trying to understand in this bug is how should the shaper deal with these characters.  For example, just replacing them with a zero-width empty glyph (what we do now) breaks GSUB/GPOS around them.  Now that's correct for ZWNJ and GSUB (should inhibit ligation), but other than that, I think being Default_Ignorable, these should be completely removed from the stream.  Except that they may be needed in some Indic shapers...  Now it gets even trickier if we want to support a "show hidden" mode...

Maybe I hack my layout engine to skip over those in GSUB/GPOS.  That should be pretty trivial.  We already have the skipping facility in the lookup_flags.


Re Line Separator, I kinda agree that it may make more sense to not include that in the shaped item at all.  Not sure...  Whatever we do, we should also do to other New Line characters (see chapter 5):
http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf

Can we come up with any recommendation re those characters?
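
A toy sketch of the "skip over those in GSUB/GPOS" idea from comment 5. It works on code points rather than glyphs, it does not use HarfBuzz's lookup_flags machinery, and is_skippable and next_unskipped are hypothetical names. The skippable set here is only the variation selectors U+FE00..U+FE0F, chosen as a placeholder.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder predicate: which items a lookup should step over. */
static bool
is_skippable (uint32_t ch)
{
  return ch >= 0xFE00 && ch <= 0xFE0F;
}

/* Return the index of the next item after position i that is not
 * skippable, or len if there is none. A context-matching loop would
 * call this instead of using i + 1 directly. */
static size_t
next_unskipped (const uint32_t *text, size_t len, size_t i)
{
  if (i >= len)
    return len;
  size_t j = i + 1;
  while (j < len && is_skippable (text[j]))
    j++;
  return j;
}
```

With this shape the skipped items stay in the buffer, so a "show hidden" mode could still display them, while lookups match as if they were absent.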
Comment 6 Behdad Esfahbod 2009-11-24 00:00:13 UTC
Re Line Separator family, also see bug 501482.
Comment 7 Roozbeh Pournader 2009-11-24 00:35:48 UTC
Thinking of other examples for use of these things in GSUB, Sinhala needs to do GSUB lookups using the zero width *joiner*. I'm quite sure some of the other Default Ignorables, like the Mongolian variation selectors, would be supported in a similar way, using GSUB.

The exact definition of Default Ignorable is something like this: "These may affect the shape and semantics of things around them. If you know how to do that, good! Otherwise, don't display a visual glyph for these, including error boxes. Unless it's Show Hidden mode." See Section 5.21 of TUS, which says the same thing.

Going back to your problem, we need to somehow know what character it is, in order to know what to do with it. Some characters should be skipped over in GSUB shaping (like the VS-es for unsupported combinations, or the deprecated characters for tags, symmetric swapping and national digits), while some have an important role and should not be ignored/removed (like ZWJ for Sinhala). Some others are actually there to break things from getting composed or reordered (ZWNJ, CGJ), so it would be wrong to GSUB over them sometimes, but not all the time.

In other words, I think the best approach is just making the pango_is_zero_width function return Default Ignorables, saying that these are characters that never have a visible glyph, and one may use the function only for determining whether or not to show a missing glyph box.

Treating all these weirdos similarly for your shaper could be dangerous. These are special in the way that they are nasty in shaping, not that they are mostly ignorable in shaping. They're more like a Top Ten wanted list of criminals than a family of happy siblings.
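
The narrow use proposed in comment 7 (deciding only whether to draw a missing glyph box) could look roughly like this. The function name and its three boolean inputs are assumptions for illustration; none of this is an existing Pango API.

```c
#include <stdbool.h>

/* Hypothetical decision helper.
 *   font_has_glyph       : the font maps this character to a glyph
 *   is_default_ignorable : result of a Default Ignorable check
 *   show_hidden          : the application is in a "show hidden" mode
 */
static bool
should_draw_missing_glyph_box (bool font_has_glyph,
                               bool is_default_ignorable,
                               bool show_hidden)
{
  if (font_has_glyph)
    return false;               /* the real glyph is drawn instead */
  if (is_default_ignorable && !show_hidden)
    return false;               /* stay invisible, no error box */
  return true;                  /* unsupported character: show the box */
}
```

What a "show hidden" mode should actually draw for an ignorable character is left open in the thread; returning true in that case is just one possible reading.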
Comment 8 Behdad Esfahbod 2009-11-24 00:45:24 UTC
Roozbeh,

I appreciate you writing the long comment.  However, you mostly reflected what I said myself in comment 5.  I'm still "trying to understand how should the shaper deal with these characters."  Not as a class.  But individually.

Something like:

  - Do this with ZWNJ.

  - Do that with ZWJ.

  - Do that other thing with CGJ

  - Get all the rest out of the way.

Or maybe this should be moved to harfbuzz list.

Thanks.
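
The per-character breakdown asked for in comment 8 could take the shape of a small dispatch table. The thread never settles what each action should be, so the mapping below is only a guess pieced together from comment 7 (ZWJ takes part in lookups, ZWNJ and CGJ block composition, the rest get out of the way). All names are hypothetical, and the "rest" predicate is a placeholder subset.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
  SHAPER_NOT_SPECIAL,      /* ordinary character */
  SHAPER_KEEP_FOR_LOOKUPS, /* stays visible to GSUB (e.g. ZWJ in Sinhala) */
  SHAPER_ACT_AS_BARRIER,   /* blocks ligation or composition across it */
  SHAPER_SKIP              /* lookups step over it */
} ShaperAction;

/* Placeholder for "all the rest": variation selectors and U+2060..U+206F. */
static bool
is_other_ignorable (uint32_t ch)
{
  return (ch >= 0xFE00 && ch <= 0xFE0F) || (ch >= 0x2060 && ch <= 0x206F);
}

/* Assumed mapping, not a decision recorded in this bug. */
static ShaperAction
classify_for_shaper (uint32_t ch)
{
  switch (ch)
    {
    case 0x200D: /* ZERO WIDTH JOINER */
      return SHAPER_KEEP_FOR_LOOKUPS;
    case 0x200C: /* ZERO WIDTH NON-JOINER */
    case 0x034F: /* COMBINING GRAPHEME JOINER */
      return SHAPER_ACT_AS_BARRIER;
    default:
      return is_other_ignorable (ch) ? SHAPER_SKIP : SHAPER_NOT_SPECIAL;
    }
}
```

Comment 7 notes that the right behaviour for ZWNJ and CGJ is "sometimes, but not all the time", so a real shaper would likely need script-specific overrides on top of a table like this.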
Comment 9 GNOME Infrastructure Team 2018-05-22 12:34:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/102.