GNOME Bugzilla – Bug 162613
UTF-8 BOM not recognised with push parser
Last modified: 2005-02-11 14:36:02 UTC
When using a push parser, a document that begins with the UTF-8 BOM cannot be parsed; it fails with the "Document is empty" error. ctxt->charset seems to get initialized to XML_CHAR_ENCODING_UTF8 by the call to xmlCreatePushParserCtxt when no initial chunk is provided. The encoding auto-detection code in xmlParseTryOrFinish is hence never reached: case XML_PARSER_START: if (ctxt->charset == XML_CHAR_ENCODING_NONE) { ...
using 2.6.16, FC3 package libxml2-2.6.16-3
Please provide an example. I assume xmllint --push --noout fails while xmllint --noout on the same instance succeeds. Daniel
--push doesn't trigger it because it *does* pass an initial chunk to the xmlCreatePushParserCtxt call; this bug only occurs when an initial chunk is not provided. Any document with a UTF-8 BOM is an example: (printf '\xEF\xBB\xBF'; cat anyxml.xml) > utf8-bom.xml
I changed xmlCreatePushParserCtxt so that, if no initial chunk is given, ctxt->charset is set to XML_CHAR_ENCODING_NONE (instead of the previous XML_CHAR_ENCODING_UTF8, automatically provided by xmlNewParserCtxt). After this change, an additional change to xmlParseTryOrFinish was required to properly take care of this case. The changed code (parser.c) is in CVS. Thanks for the report. Bill
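For readers following the thread, here is a minimal self-contained sketch of the gating problem described above. It is not the libxml2 code: sketch_encoding, sketch_ctxt, sketch_utf8_bom_len and sketch_start are hypothetical names, and the model is deliberately simplified to only the two charset states mentioned in this report. A return value of 0 stands in for the "Document is empty" failure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for XML_CHAR_ENCODING_NONE / XML_CHAR_ENCODING_UTF8;
 * only the two states discussed in this report are modelled. */
typedef enum { ENC_NONE, ENC_UTF8 } sketch_encoding;

typedef struct {
    sketch_encoding charset;  /* plays the role of ctxt->charset */
    size_t          skipped;  /* number of leading bytes consumed as a BOM */
} sketch_ctxt;

/* Returns 3 when buf starts with the UTF-8 BOM (EF BB BF), otherwise 0. */
static size_t sketch_utf8_bom_len(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return (len >= 3 && memcmp(buf, bom, 3) == 0) ? 3 : 0;
}

/* Simplified model of the XML_PARSER_START gate: BOM detection only runs
 * while no encoding has been chosen yet. Returns 1 if the first byte after
 * any skipped BOM is '<', and 0 otherwise (the modelled failure). */
static int sketch_start(sketch_ctxt *ctxt, const unsigned char *buf, size_t len)
{
    ctxt->skipped = 0;
    if (ctxt->charset == ENC_NONE) {
        ctxt->skipped = sketch_utf8_bom_len(buf, len);
        ctxt->charset = ENC_UTF8;  /* assumption: default to UTF-8 after detection */
    }
    return ctxt->skipped < len && buf[ctxt->skipped] == '<';
}
```

In this model, a context that starts out as ENC_UTF8 (the state before the fix, with no initial chunk) never skips the BOM, so sketch_start returns 0 for a BOM-prefixed document. A context that starts out as ENC_NONE (the state after the fix) skips the three BOM bytes and returns 1.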