GNOME Bugzilla – Bug 654195
Add g_unichar_compose() and g_unichar_decompose()
Last modified: 2011-07-14 20:57:10 UTC
gboolean g_unichar_compose (gunichar a, gunichar b, gunichar *ab);
gboolean g_unichar_decompose (gunichar ab, gunichar *a, gunichar *b);

These do 2:1 composition and 1:2 decomposition. All of the NFC/NFD transformations can be done as a chain of 2:1 or 1:2 steps. We have the data in glib already, but it's in UTF-8 and needs massaging. I need this API in HarfBuzz to be able to shape NFC and NFD text equivalently, by trying to compose two characters when the font has a glyph for the composition, and decomposing when the font doesn't have a glyph for it.
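For illustration, the intended use in a shaper looks roughly like this; font_has_glyph() is a hypothetical stand-in for the real font query:

#include <glib.h>

/* Hypothetical stand-in for the shaper's actual font query. */
extern gboolean font_has_glyph (gunichar ch);

/* Prefer the precomposed character when the font can render it,
 * so NFC and NFD input end up shaping the same way. */
static gboolean
try_compose (gunichar a, gunichar b, gunichar *out)
{
  gunichar ab;

  if (g_unichar_compose (a, b, &ab) && font_has_glyph (ab))
    {
      *out = ab;
      return TRUE;
    }

  return FALSE;
}

/* And the reverse: split a precomposed character the font lacks. */
static gboolean
try_decompose (gunichar ab, gunichar *a, gunichar *b)
{
  return !font_has_glyph (ab) && g_unichar_decompose (ab, a, b);
}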
I can see that we have a static function called combine() which does the 2:1 thing, but I am not 100% sure how the 1:2 thing is supposed to work. Is that even always possible? The table we have in glib seems to list full decompositions; those can be more than two characters, I guess?
UnicodeData.txt's sixth field has the decomposition. To get the full decomposition one has to apply the 1:2 decomposition recursively; that's why our current data has more-than-two entries. But I'm interested in the 1:2 step. Just take the sixth field of UnicodeData.txt and ignore any decomposition starting with a "<" character, as those are compatibility decompositions. This grep suggests that all of the non-compat decompositions have at most two characters:

grep '^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;< ]* [^; ]* ' UnicodeData.txt

It returns empty.
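So the full decomposition falls out of the pairwise step by recursion. A minimal sketch, assuming the proposed g_unichar_decompose() signature:

#include <glib.h>

/* Expand 'ch' into its full canonical decomposition using only the
 * pairwise (1:2 or 1:1) step, applied recursively.  'out' must have
 * room for the full decomposition (a handful of code points in
 * practice).  Returns the number of code points written.  Note that
 * full NFD would additionally need canonical reordering of the
 * combining marks afterwards. */
static gsize
full_decompose (gunichar ch, gunichar *out)
{
  gunichar a, b;

  if (g_unichar_decompose (ch, &a, &b))
    {
      gsize n = full_decompose (a, out);
      if (b != 0)
        n += full_decompose (b, out + n);
      return n;
    }

  out[0] = ch;
  return 1;
}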
Seems like it's not documented, but all canonical decompositions in UnicodeData.txt are pairwise. I'll write to the UTC to see if this is guaranteed for the future; I'm 99.99% sure it is. Also, the Hangul Jamo composition / decomposition is algorithmic and not included in UnicodeData.txt (I guess). One last thing to keep in mind: there are some 1:1 decompositions too. In that case the API can simply return 0 for the second part.
Apparently the 1:2 rule is documented in a side corner of UAX #15, Unicode Normalization Forms: “The canonical decomposition mapping for all other characters maps each character to one or two others. A character may have a canonical decomposition to more than two characters, but it is expressed as the recursive application of mappings to at most a pair of characters at a time.” “All other characters” here refers to all characters other than precomposed Hangul syllable characters.
Ok, the canonical documentation is here:

http://www.unicode.org/policies/stability_policy.html#Property_Value

It guarantees that decompositions will remain 1:1 or 1:2 in the future: "Canonical mappings (Decomposition_Mapping property values) are always limited either to a single value or to a pair. The second character in the pair cannot itself have a canonical mapping."
Created attachment 191505 [details] [review]
first cut

First attempt. Unfortunately, we have to keep separate tables for this, since the fully applied tables are not really suitable for doing single steps. Also, the functions don't do the algorithmic part for Hangul; if you want to include that, you have to allow for 1->3 decompositions. I'm also not 100% sure it is right to include the 1->1 cases here, since there are some situations in CJK where you get a -> b and a' -> b, which makes the reverse table not a function. Let me know what you think.
Thanks Matthias.

1. Hangul can be done 1:2 and 2:1. If you check the algorithm, it's evident that it's doing two chained decompositions.

2. We don't want the 1:1 ones in compose(). They are not in the standard, and not useful either.
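For reference, here is what the chaining looks like for Hangul, with the constants and arithmetic from UAX #15 (a sketch, not the patch's code). An LVT syllable splits as LV + T, and the LV part then splits again as L + V through the same function:

#include <glib.h>

#define SBase  0xAC00
#define LBase  0x1100
#define VBase  0x1161
#define TBase  0x11A7
#define TCount 28
#define NCount (21 * TCount)   /* VCount * TCount */
#define SCount (19 * NCount)   /* LCount * NCount */

static gboolean
hangul_decompose_step (gunichar ab, gunichar *a, gunichar *b)
{
  gint si = (gint) ab - SBase;

  if (si < 0 || si >= SCount)
    return FALSE;               /* not a precomposed Hangul syllable */

  if (si % TCount)              /* LVT -> LV + T */
    {
      *a = SBase + (si / TCount) * TCount;
      *b = TBase + si % TCount;
    }
  else                          /* LV -> L + V */
    {
      *a = LBase + si / NCount;
      *b = VBase + (si % NCount) / TCount;
    }

  return TRUE;
}

Composition is the exact reverse of the same two steps: L + V gives LV, and LV + T gives LVT.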
> 1. Hangul can be done 1:2 and 2:1. If you check the algorithm it's evident
> that it's doing two chained decompositions.

I'll have a look.

> 2. We don't want the 1:1 ones in compose(). They are not in the standard, and
> not useful either.

Oh, but they are in the standard... if you look at my patch, the data is directly extracted from UnicodeData.txt. I can easily see them being useful for your purpose, too. Your font might not have a glyph for Angstrom (212B), but have one for Aring (00C5).
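To make that concrete, with the 1:1 mappings kept in decompose() (as in my patch), the singleton case would look roughly like this:

gunichar a, b;

/* U+212B ANGSTROM SIGN is a singleton: it decomposes 1:1 to
 * U+00C5, and the second output comes back as 0. */
if (g_unichar_decompose (0x212B, &a, &b))
  g_assert (a == 0x00C5 && b == 0);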
Created attachment 191531 [details] [review]
revised patch

Next iteration includes Hangul, and drops the 1:1 steps from composition (but keeps them for decomposition). Is this closer? If you have any pointers to suitable test data for this, that would be nice.
(In reply to comment #8)
> > 1. Hangul can be done 1:2 and 2:1. If you check the algorithm it's evident
> > that it's doing two chained decompositions.
>
> I'll have a look.
>
> > 2. We don't want the 1:1 ones in compose(). They are not in the standard, and
> > not useful either.
>
> Oh, but they are in the standard... if you look at my patch, the data is
> directly extracted from UnicodeData.txt. I can easily see them being useful
> for your purpose, too. Your font might not have a glyph for Angstrom (212B),
> but have one for Aring (00C5).

They are part of the decomposition, not the composition. I.e., quoting from the standard: "Certain characters are known as singletons. They never remain in the text after normalization. Examples include the angstrom and ohm symbols, which map to their normal letter counterparts a-with-ring and omega, respectively."

Your patch sounds right. Lemme have a quick check. I'm commenting from my phone. Will dig up test cases later, but the UAX has a minimal set of tests in the tables.
The second patch has incorrect docs for compose(). Other than that, any reason not to use a stock bsearch implementation?
(In reply to comment #11)
> The second patch has incorrect docs for compose().

Oh, yeah. Forgot to update that.

> Other than that, any reason not to use a stock bsearch implementation?

Not really, no. I just copied what's in the full normalization function, but might as well use bsearch.
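Something like this, say; the ComposeEntry layout is hypothetical, just to show the shape of the stock bsearch() version:

#include <stdlib.h>
#include <glib.h>

/* Sorted array of {a, b, ab} triples, ordered by (a, b). */
typedef struct { gunichar a, b, ab; } ComposeEntry;

static int
compose_cmp (const void *key, const void *elem)
{
  const ComposeEntry *k = key;
  const ComposeEntry *e = elem;

  if (k->a != e->a)
    return k->a < e->a ? -1 : 1;
  if (k->b != e->b)
    return k->b < e->b ? -1 : 1;
  return 0;
}

static gboolean
compose_lookup (const ComposeEntry *table, gsize n_entries,
                gunichar a, gunichar b, gunichar *ab)
{
  ComposeEntry key = { a, b, 0 };
  const ComposeEntry *hit;

  hit = bsearch (&key, table, n_entries,
                 sizeof (ComposeEntry), compose_cmp);
  if (hit == NULL)
    return FALSE;

  *ab = hit->ab;
  return TRUE;
}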
Fixed in 7041b701dd9fd4f617ca762860447d8fc015a2ab.