Bug 694228 - libxml2 fails to load external entities encoded as UTF-8 with BOM
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Mac OS
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2013-02-20 00:52 UTC by Mark Rowe
Modified: 2013-03-27 05:39 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
XML file from test case (99 bytes, text/xml)
2013-02-20 00:53 UTC, Mark Rowe
DTD from test case (129 bytes, application/octet-stream)
2013-02-20 00:53 UTC, Mark Rowe

Description Mark Rowe 2013-02-20 00:52:25 UTC
Some time between versions 2.7.3 and 2.7.8, libxml2 stopped being able to load external entities that are saved as UTF-8 with a BOM.

With the following input (test.xml as a regular UTF-8 file, test.dtd as UTF-8 with BOM):

mrowe@apollo:~$ cat test.xml 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "test.dtd">
<root>
  &entity;
</root>
mrowe@apollo:~$ cat test.dtd 
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY entity "This is an external entity that uses non-ASCII characters… 日本">
mrowe@apollo:~$ xxd test.dtd | head -1
0000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e  ...<?xml version
mrowe@apollo:~$ 


Feeding the files to libxml2 results in parse errors:

mrowe@apollo:~$ xmllint --loaddtd --noent test.xml 
test.dtd:1: parser error : Content error in the external subset
<?xml version="1.0" encoding="UTF-8"?>
^
test.xml:4: parser error : Entity 'entity' not defined
  &entity;
          ^
mrowe@apollo:~$ 

According to <http://www.w3.org/TR/xml11/#charencoding> UTF-8 with a BOM is an encoding that should be supported.

This appears to happen because xmlParseExternalSubset sees that ctxt->encoding is non-NULL and skips its encoding detection support, meaning that the BOM is not consumed.
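To illustrate what "consuming the BOM" means here, the following is a minimal sketch of the check that is being skipped. It is not libxml2's code: `skip_utf8_bom` is a hypothetical helper written only for this illustration, and it assumes the start of the entity is available as a plain byte buffer.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only; not part of libxml2.
 * Returns the number of leading bytes to skip when buf starts with the
 * UTF-8 byte order mark (EF BB BF), and 0 otherwise.
 */
static size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3; /* BOM present: content starts after these three bytes */
    return 0;     /* no BOM: content starts at the first byte */
}
```

For the test.dtd shown above, whose first bytes are `ef bb bf 3c 3f 78 6d 6c`, this helper would return 3, leaving `<?xml` as the first bytes the parser sees. When the check is skipped, the parser instead meets the three BOM bytes where it expects the XML declaration.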
Comment 1 Mark Rowe 2013-02-20 00:53:13 UTC
Created attachment 236864
XML file from test case
Comment 2 Mark Rowe 2013-02-20 00:53:41 UTC
Created attachment 236865
DTD from test case
Comment 3 Mark Rowe 2013-02-20 01:06:01 UTC
Support for UTF-8 with BOMs in external entities was added to address bug 440415.
Comment 4 Daniel Veillard 2013-03-27 05:39:00 UTC
Okay, problem found and fixed upstream:

https://git.gnome.org/browse/libxml2/commit/?id=ab0e35044c0e83936a8042de3dcee328173c273b

I also added your test data to the regression suite in a subsequent commit
to keep this from recurring.

Thanks for the report and the test!

Daniel