GNOME Bugzilla – Bug 472786
Improve error message when XML file contains \0 bytes
Last modified: 2021-07-05 13:25:32 UTC
Please describe the problem:
Characters in the range U+0000-U+001F, except U+0009, U+000A and U+000D, are not allowed in XML files. If a file contains e.g. U+0001, the error message is "PCDATA invalid Char value 1". However, if a file contains U+0000, the error message is confusing: "Premature end of data in tag foo line 1". It looks as if U+0000 is treated as an end-of-file marker.

U+0000 characters are difficult to find because they are not visible in many editors, and some tools even get confused when they encounter them. This makes it hard to find out what is going on when a file fails to parse due to U+0000 characters.

Steps to reproduce:
Issue the following commands in a shell:
  echo -e "<foo>\x00</foo>" > foo0.xml
  echo -e "<foo>\x01</foo>" > foo1.xml
  xmllint foo?.xml

Actual results:
  foo0.xml:1: parser error : Premature end of data in tag foo line 1
  <foo>
  ^
  foo1.xml:1: parser error : PCDATA invalid Char value 1
  <foo></foo>
  ^

Expected results:
  foo0.xml:1: parser error : PCDATA invalid Char value 0
  <foo></foo>
  ^
  foo1.xml:1: parser error : PCDATA invalid Char value 1
  <foo></foo>
  ^

Does this happen every time?
Yes

Other information:
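To locate the offending byte in a file like foo0.xml, something along these lines might help. This is only a sketch: first_forbidden_control() is a hypothetical helper, not a libxml2 function, and it assumes the data is in an ASCII-compatible encoding such as UTF-8 (in UTF-16, 0x00 bytes are a normal part of many code units).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2.
 * Returns the offset of the first byte in the range 0x00-0x1F that is
 * not TAB (0x09), LF (0x0A) or CR (0x0D), or -1 if there is none.
 * Assumption: the buffer holds ASCII-compatible data (e.g. UTF-8).
 */
long first_forbidden_control(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    /* Iterate by length, so an embedded 0x00 does not end the scan. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D)
            return (long)i;
    }
    return -1;
}
```

The caller has to pass the real length of the data (for example from fread() or stat()), not strlen(), since strlen() would itself stop at the first 0x00 byte.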
The problem with the U+0000 character is that it usually has a special meaning in programs written in C: it is interpreted as an "end-of-string" marker. I took a look at the libxml2 sources, and there were many routines which used (char==U+0000) instead of (position<size) to decide whether all data had been processed. Changing (fixing) all these places would be tricky. Perhaps we could check for U+0000 before the parsing routines kick in, e.g. somewhere inside xmlParserInputBufferGrow(). We must take special care with different encodings, since it is OK to have 0x00 bytes in a file encoded in UTF-16. Any other opinions?
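To illustrate the difference between the two termination tests, here is a small sketch. Both functions are made up for this comment and are not taken from the libxml2 sources; the second one only hints at what a check before parsing could look like, and it assumes the bytes have already been converted to UTF-8, so a 0x00 byte really is U+0000.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sentinel-style scan, the (char==U+0000) pattern: the first 0x00 byte
 * is taken as the end of the data, so anything after it is never seen.
 */
size_t scan_until_nul(const char *buf)
{
    assert(buf != NULL);

    size_t n = 0;
    while (buf[n] != '\0')
        n++;
    return n;
}

/*
 * Length-based scan, the (position<size) pattern: every byte is visited,
 * so an embedded 0x00 can be reported instead of ending the input.
 * Returns the offset of the first 0x00 byte, or -1 if there is none.
 * Assumption: buf holds UTF-8 data, i.e. after any UTF-16 conversion.
 */
long find_embedded_nul(const char *buf, size_t size)
{
    assert(buf != NULL);

    for (size_t pos = 0; pos < size; pos++) {
        if (buf[pos] == '\0')
            return (long)pos;
    }
    return -1;
}
```

For the 12 bytes of "<foo>\0</foo>", scan_until_nul() stops after 5 bytes, which matches the "Premature end of data" symptom, while find_embedded_nul() reports offset 5 and would let the caller raise an "invalid Char value 0" style error instead.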
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.