GNOME Bugzilla – Bug 107974
glib Unicode data is outdated, update to Unicode 4.0
Last modified: 2011-02-18 16:07:18 UTC
glib Unicode data is outdated. It should be updated with Unicode 3.2 data.
This is a reasonable amount of work ... while the interfaces to GLib are fully Unicode-clean, a lot of the table formats take advantage of only having characters in the BMP, and of the fact that older versions of Unicode are compact in their code point assignments. Simply extending the same code to the full 17 planes would produce a lot of bloat ... the table formats will need to be changed in various cases.
I understand. But what about just updating the data to the BMP part of Unicode 3.2?
Pretty easy (basically it should just be a matter of running various scripts and verifying that the updated tables still make sense, unless the format of the Unicode data has changed significantly), but not _that_ high priority for me, since that still wouldn't give real Unicode 3.2 support. I'm hoping someone will have time to do the full job prior to GLib 2.4.
Created attachment 14997 [details] Patch against HEAD to update glib data to Unicode 3.2's BMP
My patch is broken, it seems, please ignore it.
In gucharmap, I store my stuff in binary-searchable structs:

typedef struct
{
  gunichar first;
  gunichar last;
  GUnicodeType category;
} UnicodeCategory;

const UnicodeCategory unicode_categories[] =
{
  { 0x0000, 0x001F, G_UNICODE_CONTROL },
  { 0x0020, 0x0020, G_UNICODE_SPACE_SEPARATOR },
  { 0x0021, 0x0023, G_UNICODE_OTHER_PUNCTUATION },
  { 0x0024, 0x0024, G_UNICODE_CURRENCY_SYMBOL },
  [...]
  { 0xE0020, 0xE007F, G_UNICODE_FORMAT },
  { 0xF0000, 0xFFFFD, G_UNICODE_PRIVATE_USE },
  { 0x100000, 0x10FFFD, G_UNICODE_PRIVATE_USE },
};

GUnicodeType
unichar_type (gunichar uc)
{
  gint min = 0;
  gint mid;
  gint max = sizeof (unicode_categories) / sizeof (UnicodeCategory) - 1;

  if (uc < unicode_categories[0].first || uc > unicode_categories[max].last)
    return G_UNICODE_UNASSIGNED;

  while (max >= min)
    {
      mid = (min + max) / 2;
      if (uc > unicode_categories[mid].last)
        min = mid + 1;
      else if (uc < unicode_categories[mid].first)
        max = mid - 1;
      else
        return unicode_categories[mid].category;
    }

  return G_UNICODE_UNASSIGNED;
}

I ran tests, and unichar_type is about 100 times slower than g_unichar_type. :-D But after all, it's still really fast: roughly 600 ns per lookup on a Pentium II 400 MHz. Are you at all interested in a patch?
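(For reference, a minimal timing harness along these lines is enough to reproduce a per-lookup figure; this is a hypothetical sketch that assumes the table and unichar_type() above are in scope.)

===
/* Hypothetical timing sketch: average cost per unichar_type() lookup
   over the BMP. Assumes the struct, table and function above are in
   scope; compile against glib, e.g. with pkg-config --cflags glib-2.0. */
#include <glib.h>
#include <stdio.h>
#include <sys/time.h>

int
main (void)
{
  struct timeval tv0, tv1;
  gunichar uc;
  volatile GUnicodeType t; /* volatile so the loop isn't optimized away */
  double elapsed;

  gettimeofday (&tv0, NULL);
  for (uc = 0; uc <= 0xFFFF; uc++)
    t = unichar_type (uc);
  gettimeofday (&tv1, NULL);

  elapsed = (tv1.tv_sec - tv0.tv_sec) + (tv1.tv_usec - tv0.tv_usec) / 1e6;
  printf ("%.0f ns per lookup\n", elapsed / 65536.0 * 1e9);
  return 0;
}
===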
Seems like a pretty good place to apply a well-chosen hash function that maps page numbers to arrays of per-character data. Either find a perfect hash function, or find good hash/rehash functions that limit probes to some reasonable length. This will bound all searches quite nicely while using memory proportional to the occupied pages.
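(A minimal sketch of the two-level idea, using a direct page index rather than a hash, which is the simplest way to bound probes; the names here are illustrative, not actual GLib tables.)

===
/* Hypothetical two-level lookup: a page index maps each 256-character
   page to a row of per-character data, or to -1 for fully unassigned
   pages. Names are illustrative, not the actual GLib tables. */
#include <glib.h>

#define PAGE_BITS 8
#define PAGE_SIZE (1 << PAGE_BITS)

extern const guint8 type_pages[][PAGE_SIZE]; /* packed categories */
extern const gint16 page_index[0x1100];      /* 0x110000 >> 8 pages */

static GUnicodeType
lookup_type (gunichar uc)
{
  gint16 row;

  if (uc > 0x10FFFF)
    return G_UNICODE_UNASSIGNED;

  row = page_index[uc >> PAGE_BITS];
  if (row < 0)
    return G_UNICODE_UNASSIGNED;

  return (GUnicodeType) type_pages[row][uc & (PAGE_SIZE - 1)];
}
===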
Just wanted to mention that 600 ns is actually significant. I'd guess that Pango's speed on your machine is ~200k chars/second; Pango calls unichar_type() twice per character during layout (once for determining break boundaries, once while shaping). 600 ns * 200,000 * 2 comes to roughly 0.24 seconds per second of layout, so that's about a 25% slowdown.
Unicode 4.0 introduces a couple of new line breaking classes, NL and WJ. I think this complicates matters, since Pango may have to know about them. I guess for the time being Pango can treat the new classes the way it treats the old classes of the characters that now belong to them (see the sketch below).
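(A compatibility shim along those lines might look like the following sketch; the enum names for the new classes are assumptions here, and the NL -> BK and WJ -> GL fallbacks are the ones UAX #14 suggests.)

===
/* Hypothetical fallback: fold the new Unicode 4.0 line breaking classes
   into classes pango already handles. NL (next line) is treated like
   BK (mandatory break), WJ (word joiner) like GL (non-breaking glue).
   G_UNICODE_BREAK_NEXT_LINE / _WORD_JOINER are assumed names. */
#include <glib.h>

static GUnicodeBreakType
compat_break_type (gunichar uc)
{
  GUnicodeBreakType t = g_unichar_break_type (uc);

  switch (t)
    {
    case G_UNICODE_BREAK_NEXT_LINE:
      return G_UNICODE_BREAK_MANDATORY;         /* NL -> BK */
    case G_UNICODE_BREAK_WORD_JOINER:
      return G_UNICODE_BREAK_NON_BREAKING_GLUE; /* WJ -> GL */
    default:
      return t;
    }
}
===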
Just FYI, the ICU people seem to have done something interesting. See a thread of postings to the Unicode mailing list beginning with http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0048.html (username: unicode-ml, password: unicode)
Unicode 4.0 says a bunch more about case conversion in Lithuanian and Turkish/Azeri than 3.1 did. It looks to me like this means we have to hard-code more special cases. :-\

Unicode 4.0 also looks like it's going to force case folding to be locale-sensitive (T means Turkic):

0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

From Unicode 4.0 SpecialCasing.txt:

[...]
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

[...]
# ================================================================================
# Locale-sensitive mappings
# ================================================================================

# Lithuanian

# Lithuanian retains the dot in a lowercase i when followed by accents.

# Remove DOT ABOVE after "i" with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

# Introduce an explicit dot above when lowercasing capital I's and J's
# whenever there are more accents above.
# (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

# ================================================================================
# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

# Note: the following case is already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

From Unicode 3.1 SpecialCasing.txt:

[...]
# ================================================================================
# Locale-sensitive mappings
# ================================================================================

# Lithuanian

0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase

# Turkish, Azeri

0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
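(To make the Turkic rules above concrete, here is a rough sketch of locale-sensitive lowercasing of U+0049. It is illustrative only, not GLib's implementation, and the function and parameter names are made up.)

===
/* Illustrative sketch of the Turkic lowercasing of U+0049 LATIN CAPITAL
   LETTER I per the SpecialCasing.txt excerpt above; not GLib code.
   'locale' is e.g. "tr" or "az"; 'next' is the character following the
   I in the string, or 0 at end of string. */
#include <glib.h>
#include <string.h>

static gunichar
lower_capital_i (const char *locale, gunichar next)
{
  gboolean turkic = strncmp (locale, "tr", 2) == 0 ||
                    strncmp (locale, "az", 2) == 0;

  /* Not_Before_Dot: in Turkic locales an I not followed by a combining
     dot above lowercases to dotless i. */
  if (turkic && next != 0x0307)
    return 0x0131; /* LATIN SMALL LETTER DOTLESS I */

  /* Otherwise I lowercases to plain i; in the Turkic case the following
     U+0307 is then dropped by the After_I rule. */
  return 0x0069; /* LATIN SMALL LETTER I */
}
===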
[ Note for future reference: it would be better to make one "issue" like this a separate bug report and mark a dependency. Long bug reports with multiple things in them are hard to manage. ]

The GLib case folding operation is defined to be a locale-insensitive approximation, and I'm pretty sure I decided to take the route of merging all the i variants together - see: http://mail.gnome.org/archives/gtk-i18n-list/2001-June/msg00053.html
Created attachment 16792 [details] proposed patch
That's a .diff.gz file. This patch covers most stuff except for the Lithuanian and Turkic special cases I commented on above. I'm running GNOME using my patched glib right now, and it seems to work, and all the "make check" tests pass. I tried to avoid making the tables gratuitously large; libglib with the patch applied has about 64k more text, according to size(1).
I eyeballed the whole patch. Also eyeballed the UTF-8 test cases with the latest version of Markus Kuhn's 10x20 BDFs (which support 4.0 to some degree).

I have a few worries about binary compatibility, especially when things like 'gushort' are changed into 'gunichar'. But I never understood binary compatibility well. Anyway, my comments:

1) There are two cases of 65535. Change them to 0xFFFF.

2) I don't like the special treatment of the U+E0000 boundary. Can't you make determination of that boundary a little more automatic? We know that there are no plans for encoding anything in planes 4-13 yet, but since these parts of glib will be updated less and less often, I guess we should plan to do these things more automatically.

3) I can't say much about the casing of I's, but I know the Unicode Technical Committee worked a lot to make it right once and for all for Turkic languages. Owen, it's really different now than when you first posted that, which becomes especially important when one uses combining diacritics over I's. Are you sure we still want to remain locale-insensitive?

4) I couldn't check the non-BMP parts of the test files on Linux. Noah, see if you can find any software that can show them to you. SC UniPad for MS Windows is such a candidate, IIRC.
Hey Roozbeh, thanks for looking at the patch.

> I have a few worries about binary compatibility, especially when things
> like 'gushort' are changed into 'gunichar'. But I never understood
> binary compatibility well.

The tables aren't exposed, so I *think* this shouldn't be a problem.

> Anyway, my comments:
>
> 1) There are two cases of 65535. Change them to 0xFFFF.

Well, the place where I used 65535 is not for codepoints (it's for offsets into a string).

> 2) I don't like the special treatment of the U+E0000 boundary. Can't you
> make determination of that boundary a little more automatic?

Possibly. This will matter if a future version of Unicode encodes some characters closer to U+E0000 than to U+2FAFF. Is there a chance of that? (The E0000 thing is only to save memory; if a future version of Unicode encodes U+DFFFD, everything will still work, the tables will just be bigger.)

> 3) I can't say much about the casing of I's, but I know the Unicode
> Technical Committee worked a lot to make it right once and for all for
> Turkic languages. Owen, it's really different now than when you first
> posted that, which becomes especially important when one uses
> combining diacritics over I's. Are you sure we still want to remain
> locale-insensitive?

Note that it's only case folding that Owen says should be locale-insensitive (not uppercasing, lowercasing and titlecasing). Still, perhaps Owen can comment. :)

> 4) I couldn't check the non-BMP parts of the test files on Linux.
> Noah, see if you can find any software that can show them to you. SC
> UniPad for MS Windows is such a candidate, IIRC.

I can view the files in gnome-terminal and in mozilla.
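(For the record, the E0000 split mentioned above amounts to something like the following sketch; the table and macro names are hypothetical, not the ones in the patch.)

===
/* Illustrative sketch of the split table layout (hypothetical names):
   part 1 covers U+0000..U+2FAFF directly, part 2 covers U+E0000 and up,
   so the empty planes 4-13 take no table space at all. */
#include <glib.h>

#define LAST_PART1  0x2FAFF
#define FIRST_PART2 0xE0000

extern const guint8 attr_part1[][256]; /* pages for U+0000..U+2FAFF */
extern const guint8 attr_part2[][256]; /* pages for U+E0000..U+10FFFF */

static guint8
char_attr (gunichar uc)
{
  if (uc <= LAST_PART1)
    return attr_part1[uc >> 8][uc & 0xFF];
  if (uc >= FIRST_PART2 && uc <= 0x10FFFF)
    return attr_part2[(uc - FIRST_PART2) >> 8][uc & 0xFF];
  return 0; /* everything in the gap is unassigned */
}
===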
> Well, the place where I used 65535 is not for codepoints (it's for
> offsets into a string).

Anyway, that looks like too magic a number.

> Possibly. This will matter if a future version of Unicode encodes
> some characters closer to U+E0000 than to U+2FAFF. Is there a
> chance of that?

Well, honestly, there is currently no plan to encode anything after U+3FFFD or before U+E0000. That practically means this area will be empty for the next four years or so; I guess you should plan on this. Even Plane 3 (U+3xxxx) is not roadmapped there in detail; there was just a resolution at the last JTC1/SC2/WG2 meeting in Tokyo:

RESOLUTION M43.14 (Roadmap - Plane 3): WG2 accepts the recommendation in document N 2515 from the Roadmap ad hoc committee for adding Plane 3 as an additional supplementary plane to the roadmap, identifying it as ‘Plane 3’.

Document N2515 is at <http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2515.pdf> and mentions that the plane is reserved for ancient or rarely-used ideographs.

> I can view the files in gnome-terminal and in mozilla.

The non-BMP parts, you mean?
>> I can view the files in gnome-terminal and in mozilla.
> The non-BMP parts, you mean?

Mozilla (Win/Xft/X11core) is fine with non-BMP characters as long as you have a font (or fonts). I'm not sure about gnome-terminal (gedit doesn't work with non-BMP, which prompted me to file bug 101081). You can also use yudit (http://www.yudit.org) to view/edit non-BMP text.
> > Well, the place where I used 65535 is not for codepoints (it's for
> > offsets into a string).
>
> Anyway, that looks like too magic a number.

Quite right; it means "none" or "N/A" here.

> Well, honestly, there is currently no plan to encode anything after
> U+3FFFD or before U+E0000. That practically means this area will be
> empty for the next four years or so.

Wonderful, there is no problem then.

> > I can view the files in gnome-terminal and in mozilla.
>
> The non-BMP parts, you mean?

Yeah.

> I'm not sure about gnome-terminal (gedit doesn't
> work with non-BMP, which prompted me to file bug 101081).

I guess gnome-terminal works because it uses Xft directly, not through Pango.
Not really a dependency, but note to self: be aware of bug 114749.
OK, finally had a chance to write up my notes. The following comments are all details; in general the patch looks right and is OK to commit.

Where we have tables that you've changed from guint16 to gunichar but there aren't actually any elements > 0xFFFF, I would like to see them kept as guint16, with an assertion added in the Perl code in case that is violated in the future.

* In general, before the change, gen-unicode-tables.pl was using uppercase hex - 0xFFFF - most of the new stuff is lowercase.

* It would be good to add a check/die in gen-unicode-tables.pl if the generated max table index exceeds:

===
+ printf OUT "#define G_UNICODE_MAX_TABLE_INDEX 10000\n\n";
===

* Remove the commented-out line in escape() if the new stuff is right.

* I also think the use of the literal 65535 as a flag value for $canon_offset/$compat_offset is a bit odd looking; 0xFFFF would be better and more consistent with what we have elsewhere, but I'd probably suggest something like:

===
$NOT_PRESENT_OFFSET = 65535;
printf OUT "#define G_UNICODE_NOT_PRESENT_OFFSET $NOT_PRESENT_OFFSET\n"
===

* Would be good to add an assertion that the real offsets stay less than $NOT_PRESENT_OFFSET.

* The block of code

===
+ if (defined $canon_decomp)
+   {
+     if (defined $decomp_offsets{$canon_decomp})
+       {
+         $canon_offset = $decomp_offsets{$canon_decomp};
+       }
+     else
+       {
+         $canon_offset = $decomp_string_offset;
+         $decomp_offsets{$canon_decomp} = $canon_offset;
+         $decomp_string .= "\n  \"" . &escape ($canon_decomp) . "\\0\" /* offset $decomp_string_offset */";
+         $decomp_string_offset += &length_in_bytes ($canon_decomp) + 1;
+       }
+   }
===

is repeated twice almost identically; might make sense to use a subroutine.

* I don't think the + 1 is right in:

===
+ $bytes_out += $decomp_string_offset + 1;
===

===
+ print STDERR "Generated " . ($special_case_offset + 1) . " bytes in special case table\n";
===

* The change:

===
- my $recordlen = (2+$casefoldlen+1) & ~1;
+ my $recordlen = (4+$casefoldlen+1) & ~1;
  printf "Generated %d bytes for casefold table\n", $recordlen * @casefold;
===

needs to also change the "+ 1 ... & ~1" to "+ 3 ... & ~3"; the idea is that the structure size gets rounded up to its alignment.

* In gunibreak.c, you removed the PROP() macro and folded it into g_unichar_break_type(), which is fine in isolation, but since the overall macro was retained in other places where it was used more than once, I think it's best retained in gunibreak.c as well.

* I don't see why you made the addition of a 'return FALSE' in:

===
@@ -227,6 +232,8 @@
       *result = res;
       return TRUE;
     }
+  else
+    return FALSE;
 }

 return FALSE;
===

* In output_special_case(), you have

===
if (which == 1)
  {
    while (*p != '\0')
      p = g_utf8_next_char (p);
    p = g_utf8_next_char (p);
  }
===

which could simply be written as:

===
if (which == 1)
  p += strlen (p) + 1;
===

And you have:

===
if (out_buffer)
  return g_strlcpy (out_buffer, p, -1);
else
  return strlen (p) + 1;
===

I'd really avoid using strlcpy(); it's a bit obscure, and the semantics are wrong here - we don't want NUL termination.

===
len = strlen (p);
if (out_buffer)
  memcpy (out_buffer, p, len);
return len;
===
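(The rounding idiom in question, as a standalone illustration rather than anything from the patch:)

===
/* Round len up to a multiple of align, where align is a power of two.
   With a leading 2-byte field a record needs (len + 1) & ~1; once the
   leading field is a 4-byte gunichar it needs (len + 3) & ~3. */
#define ALIGN_UP(len, align) (((len) + (align) - 1) & ~((align) - 1))

/* e.g.: recordlen = ALIGN_UP (4 + casefoldlen + 1, 4); */
===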
> * I don't think the + 1 is right in:
>
> ===
> + $bytes_out += $decomp_string_offset + 1;
> ===
>
> ===
> + print STDERR "Generated " . ($special_case_offset + 1) . " bytes
>   in special case table\n";
> ===

sizeof() seems to agree with me :-) I'm attaching the program I used to check the byte counts I was reporting. Put it in the glib subdir and compile with "gcc -I.. sizes.c".
Created attachment 18774 [details] sizes.c
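(A minimal sketch of what such a size-checking program could look like; the table names here are made up, and this is not necessarily what the attachment contains.)

===
/* Hypothetical sketch of a table-size checker; the table names are
   made up. Put it in the glib subdir and build with: gcc -I.. sizes.c */
#include <glib.h>
#include <stdio.h>
#include "gunichartables.h" /* generated tables, assumed visible here */

int
main (void)
{
  printf ("decomp expansion string: %u bytes\n",
          (unsigned int) sizeof (decomp_expansion_string));
  printf ("special case table: %u bytes\n",
          (unsigned int) sizeof (special_case_table));
  return 0;
}
===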
Created attachment 18777 [details] patch as I'm about to apply it (.diff.gz)
I made all the changes you suggested except two: the bytes_out one I commented on above, and "needs to also change the +1 & ~1 to +3 & ~3", because it was superseded by "there aren't actually any elements > 0xFFFF".

Do we close this bug, despite bug 114681 which is marked as a dependency?

2003-07-30  Noah Levitt  <nlevitt@columbia.edu>

        * glib/gen-unicode-tables.pl:
        * glib/gunibreak.c:
        * glib/gunibreak.h:
        * glib/gunichartables.h:
        * glib/gunicode.h:
        * glib/gunicomp.h:
        * glib/gunidecomp.c:
        * glib/gunidecomp.h:
        * glib/guniprop.c:
        * tests/casefold.txt:
        * tests/casemap.txt:
        * tests/gen-casefold-txt.pl:
        * tests/gen-casemap-txt.pl: Update Unicode data to 4.0. (#107974)
I forgot about the fact that the strings get NUL-terminated, even though we don't *use* that termination; so the + 1 should be right.

I'd go ahead and close this bug; the other bug is still open in its own right, and it's easier to keep track of one bug than two.