GNOME Bugzilla – Bug 107427
some invalid characters considered valid, in g_unichar_validate and elsewhere
Last modified: 2011-02-18 15:57:04 UTC
According to http://www.unicode.org/reports/tr27/:

"There are 34 specific code points in Unicode 3.0 that are characterized as noncharacters. Unicode 3.1 adds an additional 32 noncharacters. To clarify the status of all 66, a definition (page 41) is added, and conformance rules C5 and C10 (pages 38, 39) are amended as follows: D7b Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged. In Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF."

g_unichar_validate() should return false for all of these, in addition to the surrogates.
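The noncharacter definition quoted above can be sketched in C as follows. This is only an illustration of the rule, not the attached patch or the GLib implementation; the helper names `is_noncharacter`, `is_surrogate`, `sketch_unichar_validate` and `count_noncharacters` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the noncharacters described in TR27,
 * i.e. U+FDD0..U+FDEF and U+nFFFE / U+nFFFF for planes 0x0..0x10. */
static bool is_noncharacter(uint32_t c)
{
    if (c >= 0xFDD0 && c <= 0xFDEF)
        return true;
    return c <= 0x10FFFF &&
           ((c & 0xFFFF) == 0xFFFE || (c & 0xFFFF) == 0xFFFF);
}

/* Hypothetical helper: true for the surrogate range U+D800..U+DFFF. */
static bool is_surrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

/* Sketch of the behaviour requested in this bug: reject out-of-range
 * values, surrogates, and noncharacters. Not the real g_unichar_validate(). */
static bool sketch_unichar_validate(uint32_t c)
{
    return c <= 0x10FFFF && !is_surrogate(c) && !is_noncharacter(c);
}

/* Counts the noncharacters by brute force over the whole code space:
 * 32 in U+FDD0..U+FDEF plus 2 per plane across 17 planes. */
static unsigned count_noncharacters(void)
{
    unsigned n = 0;
    for (uint32_t c = 0; c <= 0x10FFFF; c++)
        if (is_noncharacter(c))
            n++;
    return n;
}
```

Counting by brute force is a cheap way to confirm that a check covers exactly the 66 code points the report mentions.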
Created attachment 14740 I think this fixes it.
Created attachment 14741 Sorry, use this one. Should have the parens around (Char)
Created attachment 14742 argh, that was the same as the first one. This one is fixed, promise.
Actually, it's only 2 in 3.0 - the nFFFE/FFFF weren't added until 3.0.1.

The patch looks OK to me to commit (both glib-2-2 and HEAD).

Since performance here is actually quite important, I'll suggest one optimization:

    (a & 0xffff) != 0xfffe && (a & 0xffff) != 0xffff

is the same as:

    (a & 0xfffe) != 0xfffe

(Hmmm, I guess the surrogate check could also be done like that:

    ((Char) < 0xd800 || (Char) >= 0xe000)

is, if I'm not mistaken, the same as:

    ((Char) & 0xfffff800) != 0xd800

You'd have to time that to see if it is a performance win or not.)
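The two equivalences suggested in this comment can be checked exhaustively rather than taken on trust. The sketch below compares the comparison-based and mask-based forms over a range of code points. All function names are hypothetical. The range form of the surrogate test is written with `>= 0xe000`, since that is the bound at which the two forms agree for U+E000 itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range form: true when c is NOT a surrogate (U+D800..U+DFFF). */
static bool not_surrogate_range(uint32_t c)
{
    return c < 0xD800 || c >= 0xE000;
}

/* Mask form: clearing the low 11 bits maps every surrogate to 0xD800. */
static bool not_surrogate_mask(uint32_t c)
{
    return (c & 0xFFFFF800) != 0xD800;
}

/* Two-comparison form: low 16 bits are neither 0xFFFE nor 0xFFFF. */
static bool not_fffe_ffff_pair(uint32_t c)
{
    return (c & 0xFFFF) != 0xFFFE && (c & 0xFFFF) != 0xFFFF;
}

/* Single-mask form: ignores bit 0, so it covers both values at once. */
static bool not_fffe_ffff_mask(uint32_t c)
{
    return (c & 0xFFFE) != 0xFFFE;
}

/* Returns how many values in [0, limit) make either pair of forms disagree. */
static uint32_t count_mismatches(uint32_t limit)
{
    uint32_t mismatches = 0;
    for (uint32_t c = 0; c < limit; c++) {
        if (not_surrogate_range(c) != not_surrogate_mask(c))
            mismatches++;
        if (not_fffe_ffff_pair(c) != not_fffe_ffff_mask(c))
            mismatches++;
    }
    return mismatches;
}
```

Calling `count_mismatches(0x110000)` covers every valid code point plus the surrogates. This checks equivalence only; whether the mask forms are actually faster still has to be measured.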
Committed to both branches with the changed surrogate check, which seemed to be a small performance win.
*** Bug 109378 has been marked as a duplicate of this bug. ***