GNOME Bugzilla – Bug 107974
glib Unicode data is outdated, update to Unicode 4.0
Last modified: 2011-02-18 16:07:18 UTC
glib Unicode data is outdated. It should be updated with Unicode 3.2 data.
This is a reasonable amount of work ... while the interfaces to GLib are fully Unicode-clean, a lot of the table formats take advantage of only having characters in the BMP, and of the fact that older versions of Unicode are compact in their code point assignments. Simply extending the same code to the full 17 planes would produce a lot of bloat ... the table formats will need to be changed in various cases.
I understand. But what about just updating the data to the BMP part of Unicode 3.2?
Pretty easy (basically it should just be a matter of running various scripts and verifying that the updated tables still make sense, unless the format of the Unicode data has changed significantly), but not _that_ high priority for me, since that still wouldn't give real Unicode 3.2 support. I'm hoping someone will have time to do the full job prior to GLib 2.4.
Created attachment 14997 [details] Patch against HEAD to update glib data to Unicode 3.2's BMP
My patch is broken, it seems, please ignore it.
In gucharmap, I store my stuff in binary-searchable structs:

typedef struct
{
  gunichar first;
  gunichar last;
  GUnicodeType category;
} UnicodeCategory;

const UnicodeCategory unicode_categories[] =
{
  { 0x0000, 0x001F, G_UNICODE_CONTROL },
  { 0x0020, 0x0020, G_UNICODE_SPACE_SEPARATOR },
  { 0x0021, 0x0023, G_UNICODE_OTHER_PUNCTUATION },
  { 0x0024, 0x0024, G_UNICODE_CURRENCY_SYMBOL },
  [...]
  { 0xE0020, 0xE007F, G_UNICODE_FORMAT },
  { 0xF0000, 0xFFFFD, G_UNICODE_PRIVATE_USE },
  { 0x100000, 0x10FFFD, G_UNICODE_PRIVATE_USE },
};

GUnicodeType
unichar_type (gunichar uc)
{
  gint min = 0;
  gint mid;
  gint max = sizeof (unicode_categories) / sizeof (UnicodeCategory) - 1;

  if (uc < unicode_categories[0].first || uc > unicode_categories[max].last)
    return G_UNICODE_UNASSIGNED;

  while (max >= min)
    {
      mid = (min + max) / 2;
      if (uc > unicode_categories[mid].last)
        min = mid + 1;
      else if (uc < unicode_categories[mid].first)
        max = mid - 1;
      else
        return unicode_categories[mid].category;
    }

  return G_UNICODE_UNASSIGNED;
}

I ran tests, and unichar_type is about 100 times slower than g_unichar_type. :-D But after all, it's still really fast: roughly 600 ns per lookup on a Pentium II 400 MHz. Are you at all interested in a patch?
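(For reference, a minimal timing harness along these lines is enough to reproduce a per-lookup figure; this is a hypothetical sketch that assumes the table and unichar_type() above are in scope.)

===
/* Hypothetical timing sketch: average cost per unichar_type() lookup
   over the BMP. Assumes the struct, table and function above are in
   scope; compile against glib, e.g. with pkg-config --cflags glib-2.0. */
#include <glib.h>
#include <stdio.h>
#include <sys/time.h>

int
main (void)
{
  struct timeval tv0, tv1;
  gunichar uc;
  volatile GUnicodeType t; /* volatile so the loop isn't optimized away */
  double elapsed;

  gettimeofday (&tv0, NULL);
  for (uc = 0; uc <= 0xFFFF; uc++)
    t = unichar_type (uc);
  gettimeofday (&tv1, NULL);

  elapsed = (tv1.tv_sec - tv0.tv_sec) + (tv1.tv_usec - tv0.tv_usec) / 1e6;
  printf ("%.0f ns per lookup\n", elapsed / 65536.0 * 1e9);
  return 0;
}
===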
Seems like a pretty good place to apply a well-chosen hash function that maps page numbers to arrays of per-character data. Either find a perfect hash function, or find good hash/rehash functions that limit probes to some reasonable length. This will bound all searches quite nicely while using memory proportional to the occupied pages.
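(A minimal sketch of the two-level idea, using a direct page index rather than a hash, which is the simplest way to bound probes; the names here are illustrative, not actual GLib tables.)

===
/* Hypothetical two-level lookup: a page index maps each 256-character
   page to a row of per-character data, or to -1 for fully unassigned
   pages. Names are illustrative, not the actual GLib tables. */
#include <glib.h>

#define PAGE_BITS 8
#define PAGE_SIZE (1 << PAGE_BITS)

extern const guint8 type_pages[][PAGE_SIZE]; /* packed categories */
extern const gint16 page_index[0x1100];      /* 0x110000 >> 8 pages */

static GUnicodeType
lookup_type (gunichar uc)
{
  gint16 row;

  if (uc > 0x10FFFF)
    return G_UNICODE_UNASSIGNED;

  row = page_index[uc >> PAGE_BITS];
  if (row < 0)
    return G_UNICODE_UNASSIGNED;

  return (GUnicodeType) type_pages[row][uc & (PAGE_SIZE - 1)];
}
===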
Just wanted to mention that 600 ns is actually significant. I'd guess that Pango's speed on your machine is ~200k chars/second; Pango calls unichar_type() twice per character during layout (once for determining break boundaries, once while shaping). 600 ns * 200,000 * 2 comes to roughly 0.24 seconds per second of layout, so that's about a 25% slowdown.
Unicode 4.0 introduces a couple of new line breaking classes, NL and WJ. I think this complicates matters, since Pango may have to know about them. I guess for the time being Pango can treat the new classes the way it treats the old classes of the characters that now belong to them (see the sketch below).
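(A compatibility shim along those lines might look like the following sketch; the enum names for the new classes are assumptions here, and the NL -> BK and WJ -> GL fallbacks are the ones UAX #14 suggests.)

===
/* Hypothetical fallback: fold the new Unicode 4.0 line breaking classes
   into classes pango already handles. NL (next line) is treated like
   BK (mandatory break), WJ (word joiner) like GL (non-breaking glue).
   G_UNICODE_BREAK_NEXT_LINE / _WORD_JOINER are assumed names. */
#include <glib.h>

static GUnicodeBreakType
compat_break_type (gunichar uc)
{
  GUnicodeBreakType t = g_unichar_break_type (uc);

  switch (t)
    {
    case G_UNICODE_BREAK_NEXT_LINE:
      return G_UNICODE_BREAK_MANDATORY;         /* NL -> BK */
    case G_UNICODE_BREAK_WORD_JOINER:
      return G_UNICODE_BREAK_NON_BREAKING_GLUE; /* WJ -> GL */
    default:
      return t;
    }
}
===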
Just FYI, the ICU people seem to have done something interesting. See a thread of postings to the Unicode mailing list beginning with http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0048.html (username: unicode-ml, password: unicode)
Unicode 4.0 says a bunch more about case conversion in Lithuanian and Turkish/Azeri than 3.1 did. It looks to me like this means we have to hard-code more special cases. :-\

Unicode 4.0 also looks like it's going to force case folding to be locale-sensitive (T means Turkic):

0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

From Unicode 4.0 SpecialCasing.txt:

[...]
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

[...]
# ================================================================================
# Locale-sensitive mappings
# ================================================================================

# Lithuanian

# Lithuanian retains the dot in a lowercase i when followed by accents.

# Remove DOT ABOVE after "i" with upper or titlecase

0307; 0307; ; ; lt After_Soft_Dotted; # COMBINING DOT ABOVE

# Introduce an explicit dot above when lowercasing capital I's and J's
# whenever there are more accents above.
# (of the accents used in Lithuanian: grave, acute, tilde above, and ogonek)

0049; 0069 0307; 0049; 0049; lt More_Above; # LATIN CAPITAL LETTER I
004A; 006A 0307; 004A; 004A; lt More_Above; # LATIN CAPITAL LETTER J
012E; 012F 0307; 012E; 012E; lt More_Above; # LATIN CAPITAL LETTER I WITH OGONEK
00CC; 0069 0307 0300; 00CC; 00CC; lt; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; 0069 0307 0301; 00CD; 00CD; lt; # LATIN CAPITAL LETTER I WITH ACUTE
0128; 0069 0307 0303; 0128; 0128; lt; # LATIN CAPITAL LETTER I WITH TILDE

# ================================================================================
# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

# When lowercasing, unless an I is before a dot_above, it turns into a dotless i.

0049; 0131; 0049; 0049; tr Not_Before_Dot; # LATIN CAPITAL LETTER I
0049; 0131; 0049; 0049; az Not_Before_Dot; # LATIN CAPITAL LETTER I

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

# Note: the following case is already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I

From Unicode 3.1 SpecialCasing.txt:

[...]
# ================================================================================
# Locale-sensitive mappings
# ================================================================================

# Lithuanian

0307; 0307; ; ; lt AFTER_i; # Remove DOT ABOVE after "i" with upper or titlecase

# Turkish, Azeri

0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

# Note: the following cases are already in the UnicodeData file.
# 0131; 0131; 0049; 0049; tr; # LATIN SMALL LETTER DOTLESS I
# 0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
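(To make the Turkic rules above concrete, here is a rough sketch of locale-sensitive lowercasing of U+0049. It is illustrative only, not GLib's implementation, and the function and parameter names are made up.)

===
/* Illustrative sketch of the Turkic lowercasing of U+0049 LATIN CAPITAL
   LETTER I per the SpecialCasing.txt excerpt above; not GLib code.
   'locale' is e.g. "tr" or "az"; 'next' is the character following the
   I in the string, or 0 at end of string. */
#include <glib.h>
#include <string.h>

static gunichar
lower_capital_i (const char *locale, gunichar next)
{
  gboolean turkic = strncmp (locale, "tr", 2) == 0 ||
                    strncmp (locale, "az", 2) == 0;

  /* Not_Before_Dot: in Turkic locales an I not followed by a combining
     dot above lowercases to dotless i. */
  if (turkic && next != 0x0307)
    return 0x0131; /* LATIN SMALL LETTER DOTLESS I */

  /* Otherwise I lowercases to plain i; in the Turkic case the following
     U+0307 is then dropped by the After_I rule. */
  return 0x0069; /* LATIN SMALL LETTER I */
}
===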
[ Note for future reference: it would be better to make one "issue" like this a separate bug report and mark a dependency. Long bug reports with multiple things in them are hard to manage. ]

The GLib case folding operation is defined to be a locale-insensitive approximation, and I'm pretty sure I decided to take the route of merging all the i variants together - see: http://mail.gnome.org/archives/gtk-i18n-list/2001-June/msg00053.html
Created attachment 16792 [details] proposed patch
That's a .diff.gz file. This patch covers most stuff except for the Lithuanian and Turkic special cases I commented on above. I'm running GNOME using my patched glib right now, and it seems to work, and all the "make check" tests pass. I tried to avoid making the tables gratuitously large; libglib with the patch applied has about 64k more text, according to size(1).
I eyeballed the whole patch. Also eyeballed the UTF-8 test cases with the latest version of Markus Kuhn's 10x20 BDFs (which support 4.0 to some degree).

I have a few worries about binary compatibility, especially when things like 'gushort' are changed into 'gunichar'. But I never understood binary compatibility well. Anyway, my comments:

1) There are two cases of 65535. Change them to 0xFFFF.

2) I don't like the special treatment of the U+E0000 boundary. Can't you make determination of that boundary a little more automatic? We know that there are no plans for encoding anything in planes 4-13 yet, but since these parts of glib will be updated less and less often, I guess we should plan to do these things more automatically.

3) I can't say much about the casing of I's, but I know the Unicode Technical Committee worked a lot to make it right once and for all for Turkic languages. Owen, it's really different now than when you first posted that, which becomes especially important when one uses combining diacritics over I's. Are you sure we still want to remain locale-insensitive?

4) I couldn't check the non-BMP parts of the test files on Linux. Noah, see if you can find any software that can show them to you. SC UniPad for MS Windows is such a candidate, IIRC.
Hey Roozbeh, thanks for looking at the patch.

> I have a few worries about binary compatibility, especially when things
> like 'gushort' are changed into 'gunichar'. But I never understood
> binary compatibility well.

The tables aren't exposed, so I *think* this shouldn't be a problem.

> Anyway, my comments:
>
> 1) There are two cases of 65535. Change them to 0xFFFF.

Well, the place where I used 65535 is not for codepoints (it's for offsets into a string).

> 2) I don't like the special treatment of the U+E0000 boundary. Can't you
> make determination of that boundary a little more automatic?

Possibly. This will matter if a future version of Unicode encodes some characters closer to U+E0000 than to U+2FAFF. Is there a chance of that? (The E0000 thing is only to save memory; if a future version of Unicode encodes U+DFFFD, everything will still work, the tables will just be bigger.)

> 3) I can't say much about the casing of I's, but I know the Unicode
> Technical Committee worked a lot to make it right once and for all for
> Turkic languages. Owen, it's really different now than when you first
> posted that, which becomes especially important when one uses
> combining diacritics over I's. Are you sure we still want to remain
> locale-insensitive?

Note that it's only case folding that Owen says should be locale-insensitive (not uppercasing, lowercasing and titlecasing). Still, perhaps Owen can comment. :)

> 4) I couldn't check the non-BMP parts of the test files on Linux.
> Noah, see if you can find any software that can show them to you. SC
> UniPad for MS Windows is such a candidate, IIRC.

I can view the files in gnome-terminal and in mozilla.
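(For the record, the E0000 split mentioned above amounts to something like the following sketch; the table and macro names are hypothetical, not the ones in the patch.)

===
/* Illustrative sketch of the split table layout (hypothetical names):
   part 1 covers U+0000..U+2FAFF directly, part 2 covers U+E0000 and up,
   so the empty planes 4-13 take no table space at all. */
#include <glib.h>

#define LAST_PART1  0x2FAFF
#define FIRST_PART2 0xE0000

extern const guint8 attr_part1[][256]; /* pages for U+0000..U+2FAFF */
extern const guint8 attr_part2[][256]; /* pages for U+E0000..U+10FFFF */

static guint8
char_attr (gunichar uc)
{
  if (uc <= LAST_PART1)
    return attr_part1[uc >> 8][uc & 0xFF];
  if (uc >= FIRST_PART2 && uc <= 0x10FFFF)
    return attr_part2[(uc - FIRST_PART2) >> 8][uc & 0xFF];
  return 0; /* everything in the gap is unassigned */
}
===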
> Well, the place where I used 65535 is not for codepoints (it's for
> offsets into a string).

Anyway, that looks like too magic a number.

> Possibly. This will matter if a future version of Unicode encodes
> some characters closer to U+E0000 than to U+2FAFF. Is there a
> chance of that?

Well, honestly, there is currently no plan to encode anything after U+3FFFD or before U+E0000. That practically means this area will be empty for the next four years or so; I guess you should plan on this. Even Plane 3 (U+3xxxx) is not roadmapped there in detail; there was just a resolution at the last JTC1/SC2/WG2 meeting in Tokyo:

RESOLUTION M43.14 (Roadmap - Plane 3): WG2 accepts the recommendation in document N 2515 from the Roadmap ad hoc committee for adding Plane 3 as an additional supplementary plane to the roadmap, identifying it as ‘Plane 3’.

Document N2515 is at <http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2515.pdf> and mentions that the plane is reserved for ancient or rarely-used ideographs.

> I can view the files in gnome-terminal and in mozilla.

The non-BMP parts, you mean?
>> I can view the files in gnome-terminal and in mozilla.
> The non-BMP parts, you mean?

Mozilla (Win/Xft/X11core) is fine with non-BMP characters as long as you have a font (or fonts). I'm not sure about gnome-terminal (gedit doesn't work with non-BMP, which prompted me to file bug 101081). You can also use yudit (http://www.yudit.org) to view/edit non-BMP text.
> > Well, the place where I used 65535 is not for codepoints (it's for
> > offsets into a string).
>
> Anyway, that looks like too magic a number.

Quite right; it means "none" or "N/A" here.

> Well, honestly, there is currently no plan to encode anything after
> U+3FFFD or before U+E0000. That practically means this area will be
> empty for the next four years or so.

Wonderful, there is no problem then.

> > I can view the files in gnome-terminal and in mozilla.
>
> The non-BMP parts, you mean?

Yeah.

> I'm not sure about gnome-terminal (gedit doesn't
> work with non-BMP, which prompted me to file bug 101081).

I guess gnome-terminal works because it uses Xft directly, not through Pango.
Not really a dependency, but note to self: be aware of bug 114749.
OK, finally had a chance to write up my notes. The following comments are all details; in general the patch looks right and is OK to commit.

Where we have tables that you've changed from guint16 to gunichar but there aren't actually any elements > 0xFFFF, I would like to see them kept as guint16, with an assertion added in the Perl code in case that is violated in the future.

* In general, before the change, gen-unicode-tables.pl was using uppercase hex - 0xFFFF - most of the new stuff is lowercase.

* It would be good to add a check/die in gen-unicode-tables.pl if the generated max table index exceeds:

===
+ printf OUT "#define G_UNICODE_MAX_TABLE_INDEX 10000\n\n";
===

* Remove the commented-out line in escape() if the new stuff is right.

* I also think the use of the literal 65535 as a flag value for $canon_offset/$compat_offset is a bit odd looking; 0xFFFF would be better and more consistent with what we have elsewhere, but I'd probably suggest something like:

===
$NOT_PRESENT_OFFSET = 65535;
printf OUT "#define G_UNICODE_NOT_PRESENT_OFFSET $NOT_PRESENT_OFFSET\n"
===

* Would be good to add an assertion that the real offsets stay less than $NOT_PRESENT_OFFSET.

* The block of code

===
+ if (defined $canon_decomp)
+   {
+     if (defined $decomp_offsets{$canon_decomp})
+       {
+         $canon_offset = $decomp_offsets{$canon_decomp};
+       }
+     else
+       {
+         $canon_offset = $decomp_string_offset;
+         $decomp_offsets{$canon_decomp} = $canon_offset;
+         $decomp_string .= "\n  \"" . &escape ($canon_decomp) . "\\0\" /* offset $decomp_string_offset */";
+         $decomp_string_offset += &length_in_bytes ($canon_decomp) + 1;
+       }
+   }
===

is repeated twice almost identically; might make sense to use a subroutine.

* I don't think the + 1 is right in:

===
+ $bytes_out += $decomp_string_offset + 1;
===

===
+ print STDERR "Generated " . ($special_case_offset + 1) . " bytes in special case table\n";
===

* The change:

===
- my $recordlen = (2+$casefoldlen+1) & ~1;
+ my $recordlen = (4+$casefoldlen+1) & ~1;
  printf "Generated %d bytes for casefold table\n", $recordlen * @casefold;
===

needs to also change the "+ 1 ... & ~1" to "+ 3 ... & ~3"; the idea is that the structure size gets rounded up to its alignment.

* In gunibreak.c, you removed the PROP() macro and folded it into g_unichar_break_type(), which is fine in isolation, but since the overall macro was retained in other places where it was used more than once, I think it's best retained in gunibreak.c as well.

* I don't see why you made the addition of a 'return FALSE' in:

===
@@ -227,6 +232,8 @@
       *result = res;
       return TRUE;
     }
+  else
+    return FALSE;
 }

 return FALSE;
===

* In output_special_case(), you have

===
if (which == 1)
  {
    while (*p != '\0')
      p = g_utf8_next_char (p);
    p = g_utf8_next_char (p);
  }
===

which could simply be written as:

===
if (which == 1)
  p += strlen (p) + 1;
===

And you have:

===
if (out_buffer)
  return g_strlcpy (out_buffer, p, -1);
else
  return strlen (p) + 1;
===

I'd really avoid using strlcpy(); it's a bit obscure, and the semantics are wrong here - we don't want NUL termination.

===
len = strlen (p);
if (out_buffer)
  memcpy (out_buffer, p, len);
return len;
===
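(The rounding idiom in question, as a standalone illustration rather than anything from the patch:)

===
/* Round len up to a multiple of align, where align is a power of two.
   With a leading 2-byte field a record needs (len + 1) & ~1; once the
   leading field is a 4-byte gunichar it needs (len + 3) & ~3. */
#define ALIGN_UP(len, align) (((len) + (align) - 1) & ~((align) - 1))

/* e.g.: recordlen = ALIGN_UP (4 + casefoldlen + 1, 4); */
===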
> * I don't think the + 1 is right in:
>
> ===
> + $bytes_out += $decomp_string_offset + 1;
> ===
>
> ===
> + print STDERR "Generated " . ($special_case_offset + 1) . " bytes
>   in special case table\n";
> ===

sizeof() seems to agree with me :-) I'm attaching the program I used to check the byte counts I was reporting. Put it in the glib subdir and compile with "gcc -I.. sizes.c".
Created attachment 18774 [details] sizes.c
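(A minimal sketch of what such a size-checking program could look like; the table names here are made up, and this is not necessarily what the attachment contains.)

===
/* Hypothetical sketch of a table-size checker; the table names are
   made up. Put it in the glib subdir and build with: gcc -I.. sizes.c */
#include <glib.h>
#include <stdio.h>
#include "gunichartables.h" /* generated tables, assumed visible here */

int
main (void)
{
  printf ("decomp expansion string: %u bytes\n",
          (unsigned int) sizeof (decomp_expansion_string));
  printf ("special case table: %u bytes\n",
          (unsigned int) sizeof (special_case_table));
  return 0;
}
===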
Created attachment 18777 [details] patch as I'm about to apply it (.diff.gz)
I made all the changes you suggested except two: the bytes_out one I commented on above, and "needs to also change the +1 & ~1 to +3 & ~3", because it was superseded by "there aren't actually any elements > 0xFFFF".

Do we close this bug, despite bug 114681 which is marked as a dependency?

2003-07-30  Noah Levitt  <nlevitt@columbia.edu>

        * glib/gen-unicode-tables.pl:
        * glib/gunibreak.c:
        * glib/gunibreak.h:
        * glib/gunichartables.h:
        * glib/gunicode.h:
        * glib/gunicomp.h:
        * glib/gunidecomp.c:
        * glib/gunidecomp.h:
        * glib/guniprop.c:
        * tests/casefold.txt:
        * tests/casemap.txt:
        * tests/gen-casefold-txt.pl:
        * tests/gen-casemap-txt.pl: Update Unicode data to 4.0. (#107974)
I forgot about the fact that the strings get NUL-terminated, even though we don't *use* that termination; so the + 1 should be right.

I'd go ahead and close this bug; the other bug is still open in its own right, and it's easier to keep track of one bug than two.