Bug 654195 - Add g_unichar_compose() and g_unichar_decompose()
Status: RESOLVED FIXED
Product: glib
Classification: Platform
Component: general
Version: unspecified
Platform: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: gtkdev
Reported: 2011-07-07 20:30 UTC by Behdad Esfahbod
Modified: 2011-07-14 20:57 UTC
CC: 1 user



Attachments:
first cut (142.30 KB, patch), 2011-07-08 05:01 UTC, Matthias Clasen
revised patch (109.86 KB, patch), 2011-07-08 18:51 UTC, Matthias Clasen

Description Behdad Esfahbod 2011-07-07 20:30:39 UTC
gboolean g_unichar_compose (gunichar a, gunichar b, gunichar *ab);
gboolean g_unichar_decompose (gunichar ab, gunichar *a, gunichar *b);

These do 2:1 / 1:2 composition / decomposition.  All of the NFC/NFD transformations can be done as a chain of 2:1 or 1:2 transformations.  We have the data in there, but it's in UTF-8 and needs massaging.

I need this API in HarfBuzz to be able to shape NFC and NFD input equivalently: try to compose two characters when the font has a glyph for the composition, and decompose when the font doesn't have a glyph for it.
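For illustration, here is a minimal self-contained sketch of the proposed pair semantics, with a single hardcoded mapping (U+00C5 Å maps to/from U+0041 A + U+030A COMBINING RING ABOVE) standing in for GLib's real tables. The `sketch_*` names and `uchar32` typedef are hypothetical, not GLib API:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t uchar32;   /* stand-in for GLib's gunichar */

/* Hypothetical one-entry table: U+00C5 (Å) <-> U+0041 (A) + U+030A
 * (combining ring above).  The real functions would consult canonical
 * composition/decomposition data derived from UnicodeData.txt. */
static int
sketch_compose (uchar32 a, uchar32 b, uchar32 *ab)
{
  if (a == 0x0041 && b == 0x030A)
    {
      *ab = 0x00C5;
      return 1;             /* 2:1 composition succeeded */
    }
  return 0;
}

static int
sketch_decompose (uchar32 ab, uchar32 *a, uchar32 *b)
{
  if (ab == 0x00C5)
    {
      *a = 0x0041;
      *b = 0x030A;
      return 1;             /* 1:2 decomposition succeeded */
    }
  return 0;
}
```

A shaper would try the compose direction first and fall back to decomposition when the font lacks the composed glyph.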
Comment 1 Matthias Clasen 2011-07-07 21:42:24 UTC
I can see that we have a static function called combine() which does the 2:1 thing, but I am not 100% sure how the 1:2 thing is supposed to work - is that even always possible? The table we have in glib seems to list full decompositions - those can be more than two characters, I guess?
Comment 2 Behdad Esfahbod 2011-07-07 21:48:38 UTC
UnicodeData.txt's sixth field has the decomposition.  To get the full decomposition one has to recursively apply the 1:2 decomposition.  That's why our current data has more-than-two.  But I'm interested in the 1:2 one.

Just take the sixth field of UnicodeData.txt and ignore any of the decompositions starting with a "<" character as they are compatibility decompositions.

This grep suggests that all of the non-compat decompositions are 1:2:

grep '^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;< ]* [^; ]* ' UnicodeData.txt

Returns empty.
Comment 3 Behdad Esfahbod 2011-07-07 22:32:39 UTC
Seems like it's not documented, but all canonical decompositions in UnicodeData.txt are pairwise.  I'll write to the UTC to see if this is guaranteed in the future.  I'm 99.99% sure it is.

Also, the Hangul Jamo composition / decomposition is algorithmic and not included in UnicodeData.txt (I guess).

One last thing to keep in mind: there are some 1:1 decompositions too.  In that case the API can simply return 0 for the second part.
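As a follow-up on the Hangul point: the algorithmic decomposition can indeed be phrased as two chained 1:2 steps, per the arithmetic in UAX #15 (constants from the standard; the function name is hypothetical, not GLib API):

```c
#include <assert.h>
#include <stdint.h>

#define SBASE  0xAC00    /* first precomposed Hangul syllable */
#define LBASE  0x1100    /* first leading jamo */
#define VBASE  0x1161    /* first vowel jamo */
#define TBASE  0x11A7    /* trailing jamo base (TIndex 0 = none) */
#define VCOUNT 21
#define TCOUNT 28
#define SCOUNT 11172     /* 19 * 21 * 28 precomposed syllables */

/* Decompose one precomposed Hangul syllable into exactly two parts:
 * an LVT syllable splits into (LV syllable, trailing jamo), and an LV
 * syllable splits into (leading jamo, vowel jamo).  Chaining the two
 * steps recovers the full L V T sequence.  Returns 0 for non-Hangul. */
static int
hangul_pair_decompose (uint32_t ch, uint32_t *a, uint32_t *b)
{
  uint32_t s = ch - SBASE;

  if (ch < SBASE || s >= SCOUNT)
    return 0;

  if (s % TCOUNT != 0)
    {                                   /* LVT -> LV + T */
      *a = SBASE + (s - s % TCOUNT);
      *b = TBASE + s % TCOUNT;
    }
  else
    {                                   /* LV -> L + V */
      *a = LBASE + s / (VCOUNT * TCOUNT);
      *b = VBASE + (s % (VCOUNT * TCOUNT)) / TCOUNT;
    }
  return 1;
}
```

For example, U+AC01 (GAG) splits into U+AC00 (GA) plus the trailing jamo, and U+AC00 in turn splits into the leading and vowel jamo.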
Comment 4 Behdad Esfahbod 2011-07-07 23:04:35 UTC
Apparently the 1:2 rule is documented in a side corner of UAX#15 Unicode Normalization Forms:

“The canonical decomposition mapping for all other characters maps each character to one or two others. A character may have a canonical decomposition to more than two characters, but it is expressed as the recursive application of mappings to at most a pair of characters at a time.”


“all other characters” here refers to all characters other than precomposed Hangul syllable characters.
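That recursive scheme is easy to sketch: keep applying the pairwise mapping to the first element of each pair. The two hardcoded mappings below are the actual canonical mappings for U+1E14 and U+0112 from UnicodeData.txt, but the function names are hypothetical, not GLib API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two real pair mappings from UnicodeData.txt:
 * U+1E14 (E with macron and grave) -> U+0112 U+0300
 * U+0112 (E with macron)           -> U+0045 U+0304 */
static int
pair_decompose (uint32_t ch, uint32_t *a, uint32_t *b)
{
  switch (ch)
    {
    case 0x1E14: *a = 0x0112; *b = 0x0300; return 1;
    case 0x0112: *a = 0x0045; *b = 0x0304; return 1;
    default:     return 0;
    }
}

/* Full canonical decomposition by recursing on the first element of
 * each pair; writes up to 'cap' code points into 'out' and returns
 * the number of code points in the full decomposition. */
static size_t
full_decompose (uint32_t ch, uint32_t *out, size_t cap)
{
  uint32_t a, b;

  if (pair_decompose (ch, &a, &b))
    {
      size_t n = full_decompose (a, out, cap);
      if (n < cap)
        out[n] = b;
      return n + 1;
    }
  if (cap > 0)
    out[0] = ch;
  return 1;
}
```

Applied to U+1E14, this yields the three-character sequence U+0045 U+0304 U+0300, even though each individual mapping is only 1:2.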
Comment 5 Behdad Esfahbod 2011-07-07 23:25:43 UTC
Ok, the canonical documentation is here:

http://www.unicode.org/policies/stability_policy.html#Property_Value

It guarantees that decompositions will remain 1:1 or 1:2 in the future:

"Canonical mappings (Decomposition_Mapping property values) are always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping."
Comment 6 Matthias Clasen 2011-07-08 05:01:07 UTC
Created attachment 191505 [details] [review]
first cut

First attempt.

Unfortunately, we have to keep separate tables for this, since the fully applied tables are not really suitable for doing this. Also, the functions don't do the algorithmic part for Hangul - if you want to include that, you have to allow for 1->3 decompositions. I'm also not 100% sure if it is right to include the 1->1 cases here, since there are some situations in CJK where you get a -> b and a' -> b, which makes the reverse table not a function.

Let me know what you think.
Comment 7 Behdad Esfahbod 2011-07-08 17:23:16 UTC
Thanks Matthias.

1. Hangul can be done 1:2 and 2:1.  If you check the algorithm it's evident that it's doing two chained decompositions.

2. We don't want the 1:1 ones in compose(). They are not in the standard, and not useful either.
Comment 8 Matthias Clasen 2011-07-08 17:32:12 UTC
> 1. Hangul can be done 1:2 and 2:1.  If you check the algorithm it's evident
> that it's doing two chained decompositions.

I'll have a look.

> 2. We don't want the 1:1 ones in compose(). They are not in the standard, and
> not useful either.

Oh, but they are in the standard... if you look at my patch, the data is directly extracted from UnicodeData.txt. I can easily see them being useful for your purpose, too. Your font might not have a glyph for Angstrom (212B), but have one for Aring (00C5).
Comment 9 Matthias Clasen 2011-07-08 18:51:04 UTC
Created attachment 191531 [details] [review]
revised patch

Next iteration includes Hangul, and drops the 1:1 steps from composition (but keeps them for decomposition). Is this closer?

If you have any pointers to suitable test data for this, that would be nice.
Comment 10 Behdad Esfahbod 2011-07-08 21:56:48 UTC
(In reply to comment #8)
> > 1. Hangul can be done 1:2 and 2:1.  If you check the algorithm it's evident
> > that it's doing two chained decompositions.
> 
> I'll have a look
> 
> > 2. We don't want the 1:1 ones in compose(). They are not in the standard, and
> > not useful either.
> 
> Oh, but they are in the standard... if you look at my patch, the data is
> directly extracted from UnicodeData.txt. I can easily see them being useful for your
> purpose, too. Your font might not have a glyph for Angstrom (212B), but have
> one for Aring (00C5).

They are part of the decomposition, not the composition. I.e., quoting from the standard:

"Certain characters are known as singletons. They never remain in the text after normalization. Examples include the angstrom and ohm symbols, which map to their normal letter counterparts a-with-ring and omega, respectively."

Your patch sounds right. Lemme have a quick check. I'm commenting from my phone. Will dig up test cases later, but the UAX has a minimal set of tests in the tables.
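To spell out the agreed behavior for singletons, a hypothetical sketch (not GLib's actual code): a 1:1 mapping such as U+212B ANGSTROM SIGN -> U+00C5 appears only on the decomposition side, reported with the second part set to 0, and never in the compose direction:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical singleton handling.  U+212B ANGSTROM SIGN has the 1:1
 * canonical mapping U+212B -> U+00C5, so decompose reports it with
 * the second part set to 0. */
static int
singleton_decompose (uint32_t ab, uint32_t *a, uint32_t *b)
{
  if (ab == 0x212B)
    {
      *a = 0x00C5;
      *b = 0;               /* 1:1 case: no second part */
      return 1;
    }
  return 0;
}

/* Singletons are excluded from the compose table, so no input pair
 * ever composes to U+212B; this sketch's table is simply empty. */
static int
singleton_compose (uint32_t a, uint32_t b, uint32_t *ab)
{
  (void) a; (void) b; (void) ab;
  return 0;
}
```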
Comment 11 Behdad Esfahbod 2011-07-08 22:03:22 UTC
The second patch has incorrect docs for compose(). Other than that, any reason not to use a stock bsearch implementation?
Comment 12 Matthias Clasen 2011-07-11 00:31:15 UTC
(In reply to comment #11)
> The second patch has incorrect docs for compose(). 

Oh, yeah. Forgot to update that.

> Other than that, any reason
> not using a stock bsearch implementation?

Not really, no. I just copied what's in the full normalization function, but might as well use bsearch.
Comment 13 Behdad Esfahbod 2011-07-14 20:57:10 UTC
Fixed in 7041b701dd9fd4f617ca762860447d8fc015a2ab.
