GNOME Bugzilla – Bug 690531
gedit should work with Unicode noncharacters
Last modified: 2013-11-04 10:40:01 UTC
I am editing Unicode and CLDR data files. Some of those files contain noncharacters like U+FDD0 (UTF-8: EF B7 90). When I open one of those files, gedit 3.4.1 complains with a big scary banner at the top:

"There was a problem opening the file /home/mscherer/svn.cldr/…k/common/collation/ko.xml."

and

"The file you opened has some invalid characters. If you continue editing this file you could corrupt this document. You can also choose another character encoding and try again."

[Retry | Edit Anyway | Cancel]

The file displays fine except that U+FDD0 is shown as \x-escaped UTF-8 bytes as if it were ill-formed:

<p>\EF\B7\90⼀</p><!-- INDEX 1 -->

Please fix the Unicode in gedit such that files on unicode.org can be edited...

The ko.xml file is available here:
http://unicode.org/cldr/trac/browser/trunk/common/collation/ko.xml

Most noncharacters are permitted in HTML and XML. Editors should not flag them as errors.
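For reference, a minimal Python sketch (not gedit's code, and it says nothing about how gedit validates input) showing that the byte sequence EF B7 90 quoted above is well-formed UTF-8. It uses only the standard strict decoder, which raises an error on ill-formed input; the variable names are illustrative.

```python
# The three bytes quoted in the report as the UTF-8 form of U+FDD0.
raw = bytes([0xEF, 0xB7, 0x90])

# A strict decode raises UnicodeDecodeError if the bytes are ill-formed.
text = raw.decode("utf-8", errors="strict")

# The result is a single code point, and encoding it gives back the same bytes.
code_point = ord(text)
round_trip = text.encode("utf-8")

print(f"U+{code_point:04X}", " ".join(f"{b:02X}" for b in round_trip))
# prints: U+FDD0 EF B7 90
```

A strict decoder accepting these bytes is consistent with the report: the sequence is well-formed, so showing it as escaped bytes treats a valid code point as an encoding error.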
FYI

Unicode Corrigendum #9: Clarification About Noncharacters
http://www.unicode.org/versions/corrigendum9.html

Unicode FAQ
Q: Are there any 16-bit values that are invalid?
http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: What about noncharacters? Are they invalid?
http://www.unicode.org/faq/utf_bom.html#utf16-8

Also XML 1.1
"XML processors must accept any character in the range specified for Char."
http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets
(XML chose to forbid U+FFFE & U+FFFF but not the other 64 noncharacters. Noncharacters are "discouraged" but so are compatibility characters like full-width ASCII.)
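The counts in the parenthetical above can be checked with a short Python sketch. The helper names `is_noncharacter` and `is_xml11_char` are hypothetical and written for this illustration; the ranges are the noncharacter definition (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes) and the Char production from the XML 1.1 section linked above.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


# Enumerate all noncharacters and split them by XML 1.1 validity.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
allowed = [cp for cp in noncharacters if is_xml11_char(cp)]
forbidden = [cp for cp in noncharacters if not is_xml11_char(cp)]

print(len(noncharacters), len(allowed), [f"U+{cp:04X}" for cp in forbidden])
# prints: 66 64 ['U+FFFE', 'U+FFFF']
```

So U+FDD0, the code point in ko.xml, falls in the set that XML 1.1 processors must accept.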
*** This bug has been marked as a duplicate of bug 660633 ***
This was fixed earlier this year.

*** This bug has been marked as a duplicate of bug 694669 ***