GNOME Bugzilla – Bug 694228
libxml2 fails to load external entities encoded as UTF-8 with BOM
Last modified: 2013-03-27 05:39:00 UTC
Some time between libxml2 2.7.3 and 2.7.8 libxml2 stopped being able to load external entities that are saved as UTF-8 with a BOM.

With the following input (test.xml as a regular UTF-8 file, test.dtd as UTF-8 with BOM):

mrowe@apollo:~$ cat test.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "test.dtd">
<root>
&entity;
</root>
mrowe@apollo:~$ cat test.dtd
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY entity "This is an external entity that uses non-ASCII characters… 日本">
mrowe@apollo:~$ xxd test.dtd | head -1
0000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e  ...<?xml version
mrowe@apollo:~$

Feeding the files to libxml2 results in parse errors:

mrowe@apollo:~$ xmllint --loaddtd --noent test.xml
test.dtd:1: parser error : Content error in the external subset
<?xml version="1.0" encoding="UTF-8"?>
^
test.xml:4: parser error : Entity 'entity' not defined
&entity;
^
mrowe@apollo:~$

According to <http://www.w3.org/TR/xml11/#charencoding> UTF-8 with a BOM is an encoding that should be supported.

This appears to happen because xmlParseExternalSubset sees that ctxt->encoding is non-NULL and skips its encoding detection support, meaning that the BOM is not consumed.
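To illustrate what "consuming the BOM" means here, the following is a minimal standalone sketch in C. It is not libxml2 code and not the upstream fix. The helper names utf8_bom_length and skip_utf8_bom are hypothetical and introduced only for illustration. The sketch checks whether a buffer starts with the three bytes EF BB BF (the same bytes that xxd shows at the start of test.dtd) and, if so, steps past them so that the next byte seen is the '<' of the XML declaration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): returns the number of leading
 * bytes taken up by a UTF-8 BOM (EF BB BF), or 0 if the buffer does not
 * start with one. */
static size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (buf != NULL && len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}

/* Hypothetical helper (not part of libxml2): returns a pointer to the first
 * byte after an optional UTF-8 BOM and reduces *len by the bytes skipped. */
static const unsigned char *skip_utf8_bom(const unsigned char *buf, size_t *len)
{
    size_t n = utf8_bom_length(buf, *len);

    if (n == 0)
        return buf;      /* no BOM: leave the buffer and length untouched */
    *len -= n;
    return buf + n;      /* BOM present: content starts right after it */
}
```

If a parser skips this step for an external entity, the first bytes it tries to interpret as markup are the BOM itself rather than "<?xml", which would be consistent with the "Content error in the external subset" reported at test.dtd:1.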
Created attachment 236864 [details] XML file from test case
Created attachment 236865 [details] DTD from test case
Support for UTF-8 with BOMs in external entities was added to address bug 440415.
Okay, problem found and fixed upstream:

https://git.gnome.org/browse/libxml2/commit/?id=ab0e35044c0e83936a8042de3dcee328173c273b

I also added your test data to the regression suite in a subsequent commit to keep this from recurring. Thanks for the report and test!

Daniel