Bug 391261 - glib should not create/handle long UTF-8 forms
Status: RESOLVED OBSOLETE
Product: glib
Classification: Platform
Component: i18n
Version: 2.12.x
Hardware: Other
OS: All
Priority: Normal
Severity: major
Target Milestone: ---
Assigned To: gtkdev
QA Contact: gtkdev
Depends on:
Blocks:
 
 
Reported: 2006-12-31 11:14 UTC by Roozbeh Pournader
Modified: 2018-05-24 10:55 UTC
See Also:
GNOME target: ---
GNOME version: 2.17/2.18


Attachments
patch to handle non-"unicode scalar values" (7.10 KB, patch)
2007-01-01 14:53 UTC, Roozbeh Pournader
patch to fix test file and program (998 bytes, patch)
2007-01-02 09:05 UTC, Roozbeh Pournader

Description Roozbeh Pournader 2006-12-31 11:14:35 UTC
Presently, glib's UTF-8 functions use the ISO/IEC 10646 definition of UTF-8 both when handling and when generating UTF-8 data. As a result, they accept and generate UTF-8 for values larger than the largest allowed Unicode code point, U+10FFFF, so applications get invalid Unicode characters instead of an error, making glib non-conformant to The Unicode Standard.

As an example, the following piece of code accepts the "ill-formed" UTF-8 sequence and yields the invalid Unicode code point U+110000 without an error:

#include <glib.h>
#include <stdio.h>

int
main (void)
{
  gunichar *result;
  gchar input[] = "\xF4\x90\x80\x80";

  result = g_utf8_to_ucs4 (input, -1, NULL, NULL, NULL);

  if (result != NULL)
    printf ("result is: U+%x\n", (unsigned int) result[0]);

  g_free (result);
  return 0;
}

The same happens with g_unichar_to_utf8, which takes invalid Unicode code points and generates an ill-formed UTF-8 sequence.

Quoting relevant parts from the Unicode 5.0 book:

Page 73:
"[Conformance clause] C9  When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences.
[...]

C10  When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.
"

Page 103:
"Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed." [The patterns in Table 3-7, on page 104, do not match <F4 90 80 80>.]

We can of course claim that "we support ISO/IEC 10646's UTF-8" and ignore the problem altogether, but this is considered a security problem. Quoting UTR #36, Unicode Security Considerations:

http://www.unicode.org/reports/tr36/#Non_Visual_Recommendations

"A. Ensure that all implementations of UTF-8 used in a system are conformant to the latest version of Unicode. In particular,

   A. Always use the so-called "shortest form" of UTF-8
   B. Never go outside of 0..10FFFF_16
   C. Never use 5 or 6 byte UTF-8."

Going this way also increases the performance of at least those functions that handle UTF-8 data, as the tests become simpler.

Not doing a patch yet as this may be controversial. Please comment.
Comment 1 Matthias Clasen 2006-12-31 13:56:40 UTC
I don't think a patch to do this would be very controversial.
Comment 2 Owen Taylor 2006-12-31 15:37:41 UTC
The original conception of the conversion functions is that they
are mechanical transformations of the input sequences between encoding
forms. Overlong forms of UTF-8 are checked for because they are
a particular security risk, but no attempt is made to comprehensively
make sure that the input is all Unicode characters.

I don't have a strong view on the proposal to add such checking 
to g_ucs4_to_utf8(), but:

 - If you do change it, it should be changed to *fully* validate the
   Unicode characters in the generated UTF-8 output using
   the UNICODE_VALID() macro. Testing for out-of-range characters
   while not catching isolated surrogates and other disallowed
   codepoints doesn't make any sense to me.

 - You would presumably want to change utf8_to_ucs4, utf16_to_utf8,
   ucs4_to_utf16, utf16_to_ucs4 as well. In some cases, out-of-range
   characters aren't an issue and the checks could be
   correspondingly optimized, but there are non-characters in
   all those cases as well.

 - Tests showing the performance impact of the changes would be
   useful.

 - Note that there is some small risk of compatibility problems if
   this change is made, since the current functions could be used
   to encode non-textual data, or data that used non-Unicode code
   points for other purposes. It certainly would be an unstable-branch-only
   change.
 
"Going this way, also increases the performance of at least those functions 
 that handle UTF-8 data, as the tests become simpler."

I don't see how that works; the enormous majority of UTF-8 handled
by GLib doesn't come from g_ucs4_to_utf8() and most of GLib assumes
that input UTF-8 is valid. The exceptions are basically g_utf8_validate(),
g_utf8_get_char_validated(). (g_utf8_to_ucs4() does some checking
but doesn't check for valid characters)
Comment 3 Roozbeh Pournader 2006-12-31 17:22:38 UTC
(In reply to comment #2)
>  - If you do change it, should be changed to *fully* validate the
>    Unicode characters in the generated UTF-8 output using 
>    the UNICODE_VALID() macro. Testing for out of range characters
>    while not catching isolated surrogates and other disallowed
>    codepoints doesn't make any sense to me.
>  - You would presumably want to change utf8_to_ucs4, utf16_to_utf8,
>    ucs4_to_utf16, utf16_to_utf4 as well.

Generally agreed. We should also handle surrogates properly. But I think we should allow noncharacters (such as U+FFFE and U+FFFF) to pass, as the standard seems fine with them, and they may have some usage as sentinels. The same table in the standard with the UTF-8 patterns, while excluding surrogates, allows noncharacters.

I can dig the exact places in the standard about noncharacters and the related recommendations, if you wish.

> I don't see how that works; [...]

I was thinking mostly places like g_utf8_get_char, all the places that glib also checks to handle the larger values, but looking at the code, it seems that I was wrong, since they are checked last. If anything is improved, it will be the handling of non-shortest form UTF-8, I guess, which shouldn't be a common case.
Comment 4 Roozbeh Pournader 2007-01-01 11:46:58 UTC
> I can dig the exact places in the standard about noncharacters and the related
> recommendations, if you wish.

I did this anyway. Some quotes:

Page 71, details under conformance clause C2:
"The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly."

Page 72, under C7:
"If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point."

Page 80, definition D14 says that noncharacters are "permanently reserved for internal use". D15 then calls normal empty positions "Reserved code points".

Page 98, intro to section 3.9 that defines Unicode Encoding Forms:
"Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Page 98, definition D76 defines a "Unicode scalar value" to be things in the range 0 to 10FFFF, excluding the surrogate area, and then explicitly says that the range is from 0 to D7FF and E000 to 10FFFF. This is what is then mapped to the UTFs, and is directly or indirectly mentioned as the range of things each of the UTFs encodes when they are discussed. It always includes things like FFFF and 10FFFF.

Page 104, under D93: "In implementations of the Unicode Standard, a typical API will logically convert the input code unit sequence into Unicode scalar values [...] A conformant encoding form conversion will treat any ill-formed code unit sequence as an error condition. (See conformance clause C10.) This guarantees that it will neither interpret nor emit an ill-formed code unit sequence. Any implementation of encoding form conversion must take this requirement into account, because an encoding form conversion implicitly involves a verification that the Unicode strings being converted do, in fact, contain well-formed code unit sequences."

Page 549:
"U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF_16. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF_16. This attribute renders these two noncharacter code points useful for internal purposes as sentinels."

From all these, it seems that noncharacters are not really as invalid as glib presently considers them, and they are allowed for internal usage for various things. I think we can still have some of the validating functions like g_unichar_validate check for them, but we should pass them on in the UTF conversions.

BTW, while we fix these, perhaps we can make new functions named utf32 instead of ucs4 and deprecate the ucs4 ones, to emphasize that we are now not allowing anything beyond U+10FFFF.
Comment 5 Roozbeh Pournader 2007-01-01 14:53:55 UTC
Created attachment 79140 [details] [review]
patch to handle non-"unicode scalar values"

Attaching a first patch for review. I am intentionally not handling noncharacters, per comment #4. I will also create a patch for the necessary changes in tests if this is fine.
Comment 6 Roozbeh Pournader 2007-01-02 09:05:33 UTC
Created attachment 79171 [details] [review]
patch to fix test file and program
Comment 7 Roozbeh Pournader 2007-01-13 15:48:51 UTC
Owen, ping? (Matthias says it's your code, so we should await your judgement.)

Comment 8 Behdad Esfahbod 2007-07-09 20:23:38 UTC
I have wanted to comment here for what seems like 6 months now :(.
Will do later this week.
Comment 9 Behdad Esfahbod 2007-09-18 21:49:48 UTC
(In reply to comment #4)

> Page 104, under D93: "In implementations of the Unicode Standard, a typical API
> will logically convert the input code unit sequence into Unicode scalar values
> [...] A conformant encoding form conversion will treat any ill-formed code unit
> sequence as an error condition. (See conformance clause C10.) This guarantees
> that it will neither interpret nor emit an ill-formed code unit sequence. Any
> implementation of encoding form conversion must take this requirement into
> account, because an encoding form conversion implicitly involves a verification
> that the Unicode strings being converted do, in fact, contain well-formed code
> unit sequences."

This is a very limiting, narrow-minded, theoretical point of view.  In practice you cannot refuse to render text just because one invalid byte is in the stream.  What I had to do in firefox2+pango was to first validate input and "fix up" any invalid code points so that conversion doesn't fail...

So, I would like to suggest that we fix glib to have a more extended API: you get an illegal output if and only if you pass in an illegal input.  This is still conformant to the Unicode standard, just with a different definition of reporting errors.  The existence of an error in the input can still be reported, via the function return value.
Comment 10 Roozbeh Pournader 2008-01-13 15:08:58 UTC
I found just recently that there is a very nice improvement to g_utf8_next_char documented by Donald Knuth in his drafts for The Art of Computer Programming that doesn't use any memory reference or branching. We can use that to get rid of the 256-byte g_utf8_skip array, replacing it with only five boolean arithmetic operations:

#define g_utf8_skipper(c) (((0xE5000000 >> (((c) >> 3) & 0xFE)) & 3) + 1)
#define g_utf8_next_char(p) (char *)((p) + g_utf8_skipper(*(const guchar *)(p)))    

See exercise 197 (PostScript page 74) and its answer (PS p. 111) here: http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

But Knuth's formula is only usable if we switch to 4-byte UTF-8, as suggested in this bug. I tried to extend the method to 6-byte UTF-8, but that would need 64-bit arithmetic and (I believe) two more operations.

As this was originally asked for by Federico [http://www.gnome.org/~federico/news-2005-10.html#31], I'm CC-ing him.
Comment 11 Behdad Esfahbod 2008-01-13 17:36:47 UTC
I don't think that's worth changing.  This is all very, very architecture/cache dependent, as can be seen in the comments of my blog post here:

  http://mces.blogspot.com/2005/11/false-alarm-on-gutf8offsettopointer.html

Oh BTW Pango uses the fact that glib's UTF-8 to gunichar conversion converts a byte of -1 to a gunichar of -1.
Comment 12 Behdad Esfahbod 2008-08-24 00:06:10 UTC
BTW, Roozbeh's method is buggy (uses shifting more than the width of the integer, which is undefined in C).  I've got a correct one.  Both are available here:

http://mces.blogspot.com/2008/04/utf-8-bit-manipulation.html

The relevant code is:

def behdad_utf8_skipper(c):
        v = c ^ 0xff
        r = (v > 0xF) << 2
        v >>= r
        s = (v > 0x3) << 1
        v >>= s
        r |= s
        r |= (v >> 1)
        return (0x11234561 >> (r << 2)) & 7
Comment 13 GNOME Infrastructure Team 2018-05-24 10:55:33 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/glib/issues/72.