GNOME Bugzilla – Bug 675373
Incorrect Name and NCName validation for non-ASCII characters (with fix)
Last modified: 2021-07-05 13:26:43 UTC
The validation functions within the 'Check Name, NCName and QName strings' section of the file 'tree.c' do not seem to conform to the W3C XML 1.0 (5th edition) and Namespaces in XML (3rd edition) recommendations. This causes xmllint to report false errors in atomic types for non-ASCII IDs and Names, such as those containing Japanese ideographic characters.

It seems that the validation code following the 'try_complex:' labels in these functions is currently based on orphaned definitions. The suggestion is to add some new macros to 'parserInternals.h' to cover the current definitions and then modify the 'try_complex:' sub-sections to use them, e.g.

    /**
     * IS_NAMESTARTCHAR:
     * @c: a Unicode code point (int)
     *
     * [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
     *                       [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
     *                       [#x37F-#x1FFF] | [#x200C-#x200D] |
     *                       [#x2070-#x218F] | [#x2C00-#x2FEF] |
     *                       [#x3001-#xD7FF] | [#xF900-#xFDCF] |
     *                       [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
     *
     */
    #define IS_NAMESTARTCHAR(c) (\
        (0x3a == (c)) ||\
        ((0x41 <= (c)) && ((c) <= 0x5a)) ||\
        (0x5f == (c)) ||\
        ((0x61 <= (c)) && ((c) <= 0x7a)) ||\
        ((0xc0 <= (c)) && ((c) <= 0xd6)) ||\
        ((0xd8 <= (c)) && ((c) <= 0xf6)) ||\
        ((0xf8 <= (c)) && ((c) <= 0x2ff)) ||\
        ((0x370 <= (c)) && ((c) <= 0x37d)) ||\
        ((0x37f <= (c)) && ((c) <= 0x1fff)) ||\
        ((0x200c <= (c)) && ((c) <= 0x200d)) ||\
        ((0x2070 <= (c)) && ((c) <= 0x218f)) ||\
        ((0x2c00 <= (c)) && ((c) <= 0x2fef)) ||\
        ((0x3001 <= (c)) && ((c) <= 0xd7ff)) ||\
        ((0xf900 <= (c)) && ((c) <= 0xfdcf)) ||\
        ((0xfdf0 <= (c)) && ((c) <= 0xfffd)) ||\
        ((0x10000 <= (c)) && ((c) <= 0xeffff))\
    )

    /**
     * IS_NAMECHAR:
     * @c: a Unicode code point (int)
     *
     * [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
     *                   [#x0300-#x036F] | [#x203F-#x2040]
     *
     * Note: ':' is not tested here; callers add it where it is allowed.
     */
    #define IS_NAMECHAR(c) (\
        (0x2d == (c)) ||\
        (0x2e == (c)) ||\
        ((0x30 <= (c)) && ((c) <= 0x39)) ||\
        ((0x41 <= (c)) && ((c) <= 0x5a)) ||\
        (0x5f == (c)) ||\
        ((0x61 <= (c)) && ((c) <= 0x7a)) ||\
        (0xb7 == (c)) ||\
        ((0xc0 <= (c)) && ((c) <= 0xd6)) ||\
        ((0xd8 <= (c)) && ((c) <= 0xf6)) ||\
        ((0xf8 <= (c)) && ((c) <= 0x2ff)) ||\
        ((0x300 <= (c)) && ((c) <= 0x36f)) ||\
        ((0x370 <= (c)) && ((c) <= 0x37d)) ||\
        ((0x37f <= (c)) && ((c) <= 0x1fff)) ||\
        ((0x200c <= (c)) && ((c) <= 0x200d)) ||\
        ((0x203f <= (c)) && ((c) <= 0x2040)) ||\
        ((0x2070 <= (c)) && ((c) <= 0x218f)) ||\
        ((0x2c00 <= (c)) && ((c) <= 0x2fef)) ||\
        ((0x3001 <= (c)) && ((c) <= 0xd7ff)) ||\
        ((0xf900 <= (c)) && ((c) <= 0xfdcf)) ||\
        ((0xfdf0 <= (c)) && ((c) <= 0xfffd)) ||\
        ((0x10000 <= (c)) && ((c) <= 0xeffff))\
    )

and then (in, for example, xmlValidateName)...

    try_complex:
        /*
         * Second check for chars outside the ASCII range
         */
        cur = value;
        c = CUR_SCHAR(cur, l);
        if (space) {
            while (IS_BLANK(c)) {
                cur += l;
                c = CUR_SCHAR(cur, l);
            }
        }
        if (!IS_NAMESTARTCHAR(c))
            return(1);
        cur += l;
        c = CUR_SCHAR(cur, l);
        while (IS_NAMECHAR(c) || (c == ':')) {
            cur += l;
            c = CUR_SCHAR(cur, l);
        }
    //  if ((!IS_LETTER(c)) && (c != '_') && (c != ':'))
    //      return(1);
    //  cur += l;
    //  c = CUR_SCHAR(cur, l);
    //  while (IS_LETTER(c) || IS_DIGIT(c) || (c == '.') || (c == ':') ||
    //         (c == '-') || (c == '_') || IS_COMBINING(c) || IS_EXTENDER(c)) {
    //      cur += l;
    //      c = CUR_SCHAR(cur, l);
    //  }
        if (space) {
            while (IS_BLANK(c)) {
                cur += l;
                c = CUR_SCHAR(cur, l);
            }
        }
        if (c != 0)
            return(1);
        return(0);
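To make the effect of the proposed character classes easier to check in isolation, here is a minimal standalone sketch in plain C that does not depend on libxml2. The names is_name_start_char, is_name_char, decode_utf8 and sketch_validate_name are hypothetical and exist only for this illustration; they are not part of libxml2 or of the attached patch. The UTF-8 decoder is deliberately simplified (it does not reject overlong encodings), and the optional leading/trailing blank handling of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Production [4] NameStartChar of XML 1.0 (5th edition), on a code point. */
static int is_name_start_char(unsigned long c)
{
    return c == ':' || (c >= 'A' && c <= 'Z') || c == '_' ||
           (c >= 'a' && c <= 'z') ||
           (c >= 0xC0 && c <= 0xD6) || (c >= 0xD8 && c <= 0xF6) ||
           (c >= 0xF8 && c <= 0x2FF) || (c >= 0x370 && c <= 0x37D) ||
           (c >= 0x37F && c <= 0x1FFF) || (c >= 0x200C && c <= 0x200D) ||
           (c >= 0x2070 && c <= 0x218F) || (c >= 0x2C00 && c <= 0x2FEF) ||
           (c >= 0x3001 && c <= 0xD7FF) || (c >= 0xF900 && c <= 0xFDCF) ||
           (c >= 0xFDF0 && c <= 0xFFFD) || (c >= 0x10000 && c <= 0xEFFFF);
}

/* Production [4a] NameChar: NameStartChar plus the extra classes. */
static int is_name_char(unsigned long c)
{
    return is_name_start_char(c) || c == '-' || c == '.' ||
           (c >= '0' && c <= '9') || c == 0xB7 ||
           (c >= 0x300 && c <= 0x36F) || (c >= 0x203F && c <= 0x2040);
}

/*
 * Hypothetical helper: decode one UTF-8 sequence at s.
 * On success stores the byte length in *len and returns the code point.
 * At the terminating NUL or on malformed input, sets *len to 0 and returns 0.
 * Simplified on purpose: overlong encodings are not rejected.
 */
static unsigned long decode_utf8(const unsigned char *s, int *len)
{
    unsigned long c;
    int n, i;

    assert(s != NULL && len != NULL);
    *len = 0;
    if (s[0] == 0)
        return 0;
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0)      { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also catches a premature NUL */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *len = n;
    return c;
}

/* Returns 0 if value is a valid Name and 1 otherwise, as in the report. */
int sketch_validate_name(const char *value)
{
    const unsigned char *cur = (const unsigned char *)value;
    unsigned long c;
    int l;

    if (cur == NULL)
        return 1;
    c = decode_utf8(cur, &l);
    if (l == 0 || !is_name_start_char(c))
        return 1;
    cur += l;
    c = decode_utf8(cur, &l);
    while (l > 0 && is_name_char(c)) {
        cur += l;
        c = decode_utf8(cur, &l);
    }
    /* Valid only if the scan stopped at the end of the string. */
    return (*cur == 0) ? 0 : 1;
}
```

With these ranges, a name made only of Japanese ideographs is accepted because its code points fall in [#x3001-#xD7FF], while U+00D7 (the multiplication sign) is still rejected because it sits in the gap between [#xC0-#xD6] and [#xD8-#xF6].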
Created attachment 213409
Test file with ASCII IDs that pass validation

Test against schema: http://www.collada.org/2005/11/COLLADASchema.xsd
Note that this test file does have one validation bug for missing source element, which can be ignored. It is the IDs being tested here.
Created attachment 213411
Test file with Japanese IDs that fail validation

Test against schema: http://www.collada.org/2005/11/COLLADASchema.xsd
Note that this test file does have one validation bug for missing source element, which can be ignored. It is the IDs being tested here. This file simply substitutes some valid UTF-8 Japanese characters for the ASCII IDs of file test1.dae.
Created attachment 213412
Test file with ASCII IDs that pass validation

Test against schema: http://www.collada.org/2005/11/COLLADASchema.xsd
Note that this test file does have one validation bug for missing source element, which can be ignored. It is the IDs being tested here.
Created attachment 213413
Modified parserInternals.h

Added macros used by validation fixes for non-ASCII IDs and names
Created attachment 213414
Modified tree.c

Made changes to some (but not all) validation code for non-ASCII characters in names and IDs, using new macros from parserInternals.h
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.