Bug 694669 – consider unicode corrigendum #9




Summary:	consider unicode corrigendum #9


Status:	RESOLVED FIXED

Product:	glib
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtkdev
QA Contact:	gtkdev

URL:
Whiteboard:

Duplicates:	690531 (view as bug list)
Depends on:
Blocks:

Reported:	2013-02-25 13:49 UTC by Christian Persch
Modified:	2013-11-04 10:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch (1.92 KB, patch) 2013-02-25 13:49 UTC, Christian Persch	accepted-commit_now
tests: clean up for Unicode corrigendum #9 (1.84 KB, patch) 2013-03-12 16:39 UTC, Allison Karlitskaya (desrt)	committed
tests: unicode-encoding: Update for unicode corrigendum #9 (807 bytes, patch) 2013-03-18 22:23 UTC, Christian Persch	none

Description Christian Persch 2013-02-25 13:49:37 UTC

Created attachment 237356
patch

From gutf8.c:

/*
 * Check whether a Unicode (5.2) char is in a valid range.
 *
 * The first check comes from the Unicode guarantee to never encode
 * a point above 0x0010ffff, since UTF-16 couldn't represent it.
 * 
 * The second check covers surrogate pairs (category Cs).
 * 
 * The last two checks cover "Noncharacter": defined as:
 *   "A code point that is permanently reserved for
 *    internal use, and that should never be interchanged. In
 *    Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF
 *    (where n is from 0 to 10_16) and the values U+FDD0..U+FDEF."
 *
 * @param Char the character
 */
#define UNICODE_VALID(Char)                   \
    ((Char) < 0x110000 &&                     \
     (((Char) & 0xFFFFF800) != 0xD800) &&     \
     ((Char) < 0xFDD0 || (Char) > 0xFDEF) &&  \
     ((Char) & 0xFFFE) != 0xFFFE)
   

Unicode Corrigendum #9 [http://www.unicode.org/versions/corrigendum9.html] strikes the "and that should never be interchanged" clause, so IMHO we should update this code to allow the noncharacters through.
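
For illustration only, here is a minimal sketch of what relaxing the check could look like. It is not the attached patch: the function names `unicode_valid_strict` and `unicode_valid_relaxed` are hypothetical, and the relaxed version simply assumes that the two noncharacter tests are dropped while the range and surrogate tests stay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Current behaviour, mirroring the UNICODE_VALID macro quoted above:
 * rejects out-of-range values, surrogates, and noncharacters. */
static bool unicode_valid_strict(uint32_t c)
{
    return c < 0x110000 &&
           (c & 0xFFFFF800) != 0xD800 &&      /* surrogates U+D800..U+DFFF */
           (c < 0xFDD0 || c > 0xFDEF) &&      /* noncharacters U+FDD0..U+FDEF */
           (c & 0xFFFE) != 0xFFFE;            /* noncharacters U+nFFFE, U+nFFFF */
}

/* Hypothetical relaxed check (an assumption, not the actual patch):
 * only the range and surrogate tests remain, so noncharacters pass. */
static bool unicode_valid_relaxed(uint32_t c)
{
    return c < 0x110000 &&
           (c & 0xFFFFF800) != 0xD800;
}

/* Small usage example: U+FFFE is rejected today but would be accepted
 * by the relaxed check; a surrogate is rejected either way. */
static void unicode_valid_demo(void)
{
    assert(!unicode_valid_strict(0xFFFE));
    assert(unicode_valid_relaxed(0xFFFE));
    assert(!unicode_valid_relaxed(0xD800));
}
```

Surrogates and values above 0x10FFFF stay invalid in this sketch because the corrigendum only concerns noncharacters.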

Comment 1 Matthias Clasen 2013-03-02 01:56:10 UTC

Review of attachment 237356:

seems right

Comment 2 Christian Persch 2013-03-05 16:28:37 UTC

Pushed to master.

Comment 3 Allison Karlitskaya (desrt) 2013-03-12 16:00:37 UTC

This regresses the test suite:

/utf8/validate/29: **
GLib:ERROR:utf8-validate.c:285:do_test: assertion failed: (result == test->valid)

Comment 4 Allison Karlitskaya (desrt) 2013-03-12 16:22:15 UTC

After reading the corrigendum, it is utterly clear that the unexpected passing of this testcase is the entire point of the change.  I'll update the test.
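
To make the flipped expectation concrete, here is a self-contained sketch, not the real test from utf8-validate.c. The exact input of /utf8/validate/29 is not shown in this thread, so the byte sequences are assumptions (EF BF BF is the UTF-8 encoding of U+FFFF), and `decode_utf8_3byte` and `sequence_valid` are hypothetical helpers that handle only 3-byte sequences.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
 * Returns false if the bytes do not have that shape or are overlong. */
static bool decode_utf8_3byte(const unsigned char *s, uint32_t *out)
{
    if ((s[0] & 0xF0) != 0xE0 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return false;
    uint32_t c = ((uint32_t)(s[0] & 0x0F) << 12) |
                 ((uint32_t)(s[1] & 0x3F) << 6) |
                 (uint32_t)(s[2] & 0x3F);
    if (c < 0x800)              /* overlong encoding */
        return false;
    *out = c;
    return true;
}

/* Validate a single 3-byte sequence. With allow_noncharacters == false
 * this follows the old rule; with true it follows the relaxed rule. */
static bool sequence_valid(const unsigned char *s, bool allow_noncharacters)
{
    uint32_t c;
    if (!decode_utf8_3byte(s, &c))
        return false;
    if ((c & 0xFFFFF800) == 0xD800)   /* surrogates are rejected either way */
        return false;
    if (allow_noncharacters)
        return true;
    return (c < 0xFDD0 || c > 0xFDEF) && (c & 0xFFFE) != 0xFFFE;
}

/* The expectation that flips: U+FFFF was invalid, and is now valid. */
static void demo_expectation_flip(void)
{
    const unsigned char *ffff = (const unsigned char *)"\xEF\xBF\xBF";
    assert(!sequence_valid(ffff, false));
    assert(sequence_valid(ffff, true));
}
```

A test table that marks such a string as invalid would therefore start failing once the validator accepts noncharacters, which matches the assertion failure reported above.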

Comment 5 Allison Karlitskaya (desrt) 2013-03-12 16:39:50 UTC

Created attachment 238709
tests: clean up for Unicode corrigendum #9

Unicode corrigendum #9 spells out in no uncertain terms that on
conversion interfaces we should not reject characters like U+FFFE and
U+FFFF, which we were doing before.

Commit f91ef4ef15d220f6899c97aaf5b1c0a8f68cfe9a started accepting these
characters, but we had some testcases that were checking that strings
containing these characters should be rejected.

Update the tests.

Comment 6 Christian Persch 2013-03-12 16:45:49 UTC

Comment on attachment 238709
tests: clean up for Unicode corrigendum #9

Looks good to me, thanks for catching the problem. I only ran the 'unicode' test (computer trouble).

Comment 7 Allison Karlitskaya (desrt) 2013-03-12 16:47:24 UTC

Attachment 238709 pushed as e359bc0 - tests: clean up for Unicode corrigendum #9

Comment 8 Christian Persch 2013-03-18 22:23:30 UTC

Created attachment 239212
tests: unicode-encoding: Update for unicode corrigendum #9

Comment 9 Behdad Esfahbod 2013-11-04 10:40:01 UTC

*** Bug 690531 has been marked as a duplicate of this bug. ***