Bug 472786 - Improve error message when XML file contains \0 bytes
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
Version: 2.6.30
Hardware: Other All
Importance: Normal minor
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Reported: 2007-09-02 10:48 UTC by Christian Schmidt
Modified: 2021-07-05 13:25 UTC
See Also:
GNOME target: ---
GNOME version: 2.17/2.18



Description Christian Schmidt 2007-09-02 10:48:46 UTC
Please describe the problem:
Characters in the range U+0000-U+001F, except U+0009, U+000A and U+000D, are not allowed in XML files.

If a file contains e.g. U+0001, the error message is "PCDATA invalid Char value 1".

However, if a file contains U+0000, the error message is confusing: "Premature end of data in tag foo line 1". It looks as if U+0000 is treated as an end-of-file marker.

U+0000 characters are difficult to find, because they are not visible in many editors, and some tools even get confused when they encounter them. This makes it hard to find out what is going on, if a file fails to parse due to U+0000 characters.
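
As a stopgap for locating such bytes, a small script can report where they occur. The following is only a sketch: the function name find_nul_bytes is made up for illustration, and it assumes the file uses an ASCII-compatible encoding such as UTF-8 (in UTF-16, 0x00 bytes are a normal part of the encoding).

```python
def find_nul_bytes(data: bytes) -> list:
    """Return (line, column) pairs, both 1-based, for every 0x00 byte.

    Hypothetical helper; assumes an ASCII-compatible encoding such as UTF-8.
    """
    positions = []
    line, col = 1, 1
    for byte in data:
        if byte == 0:
            positions.append((line, col))
        if byte == 0x0A:
            # Newline: the next byte starts a new line.
            line += 1
            col = 1
        else:
            col += 1
    return positions
```

Reading a file in binary mode and passing its contents, e.g. find_nul_bytes(open("foo0.xml", "rb").read()), would list the offending positions for the test file from the steps below.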


Steps to reproduce:
Issue the following commands in a shell:
echo -e "<foo>\x00</foo>" > foo0.xml
echo -e "<foo>\x01</foo>" > foo1.xml
xmllint foo?.xml 

Actual results:
foo0.xml:1: parser error : Premature end of data in tag foo line 1
<foo>
     ^
foo1.xml:1: parser error : PCDATA invalid Char value 1
<foo></foo>
     ^


Expected results:
foo0.xml:1: parser error : PCDATA invalid Char value 0
<foo></foo>
     ^
foo1.xml:1: parser error : PCDATA invalid Char value 1
<foo></foo>
     ^


Does this happen every time?
Yes

Other information:
Comment 1 Miroslav Bajtoš 2009-05-30 20:33:52 UTC
The problem with the U+0000 character is that it usually has a special meaning in programs written in C: it is interpreted as an "end-of-string" marker.

I took a look at the libxml2 sources and found many routines that use (char==U+0000) instead of (position<size) to decide whether all data has been processed. Changing (fixing) all of these places would be tricky.
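
To illustrate the difference between the two termination checks, here is a simplified simulation in Python. It is not libxml2 code; both function names are invented for this sketch, and the buffer is assumed to be in an ASCII-compatible encoding.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def scan_until_sentinel(buf: bytes) -> int:
    """Mimic a loop that treats a 0 byte as end of input.

    Returns the number of bytes consumed before stopping.
    """
    data = buf + b"\x00"  # simulate the terminator after a C buffer
    pos = 0
    while data[pos] != 0:  # the (char == U+0000) style of check
        pos += 1
    return pos


def scan_with_length(buf: bytes) -> tuple:
    """Mimic a loop bounded by the buffer size.

    Returns (offset, byte) for the first forbidden control byte,
    or (len(buf), None) if none is found.
    """
    for pos in range(len(buf)):  # the (position < size) style of check
        byte = buf[pos]
        if byte < 0x20 and byte not in ALLOWED_CONTROLS:
            return pos, byte
    return len(buf), None
```

With the input b"<foo>\x00</foo>", the sentinel version stops after 5 of the 14 bytes as if the data had ended there, which matches the "Premature end of data" symptom. The length-bounded version can instead report byte value 0 at offset 5, the same way it reports value 1 for the U+0001 case.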

Perhaps we could check for U+0000 before the parsing routines kick in, e.g. somewhere inside xmlParserInputBufferGrow(). We must take special care with different encodings, since it is OK to have 0x00 bytes in a file encoded in UTF-16.
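
A rough sketch of such an encoding-aware pre-check, written in Python for brevity. It does not show how xmlParserInputBufferGrow() works; the function name first_nul_char is hypothetical, and the encoding is assumed to be known before the check runs.

```python
def first_nul_char(data: bytes, encoding: str = "utf-8"):
    """Return the character index of the first U+0000, or None.

    Hypothetical pre-check: decode first, then look for the U+0000
    character, so that 0x00 bytes belonging to UTF-16 code units
    are not mistaken for NUL characters.
    """
    text = data.decode(encoding)
    idx = text.find("\x00")
    return None if idx < 0 else idx
```

For example, "<foo/>".encode("utf-16-le") contains 0x00 bytes but no U+0000 character, so the check passes for it, while b"<foo>\x00</foo>" decoded as UTF-8 is flagged at index 5.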

Any other opinions?
Comment 2 GNOME Infrastructure Team 2021-07-05 13:25:32 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.