GNOME Bugzilla – Bug 126197
xmlParseChunk with UTF-16LE fails in a specific scenario
Last modified: 2009-08-15 18:40:50 UTC
It seems that the push parser fails to parse when the following scenario is given:
1. multiple calls to xmlParseChunk are performed
2. the input is UTF-16LE encoded with a BOM (BE as well?)
3. the XML declaration states a different encoding

I can reproduce this with xmllint and a modified version of the file "libxml2\test\wap.xml". Note that 'encoding="iso-8859-1"' was added to the prolog of the file "wap.xml" and the file was UTF-16LE encoded (with BOM).

P:\tests\unicodeConsole>xmllint --push wap.xml
wap.xml:12: parser error : AttValue: ' expected
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^
wap.xml:12: parser error : attributes construct error
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^
wap.xml:12: parser error : Couldn't find end of Start Tag postfield
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^

- Everything works fine if I recompile xmllint with a chunk size of 4096 instead of 1024.
- Everything works fine if the file is *not* encoded in UTF-16 (e.g. UTF-8).
- Everything works fine if the file *is* encoded in UTF-16LE *and* the prolog declares an encoding of "UTF-16LE".
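For reference, a minimal C sketch of the push-parser loop described above. The 4-byte priming read and the 1024-byte chunk size mirror what xmllint --push roughly does; the file name and the error handling are illustrative assumptions, not a copy of xmllint.c.

    /* Sketch: feed a file to libxml2's push parser in fixed-size chunks. */
    #include <stdio.h>
    #include <libxml/parser.h>

    int main(void) {
        const char *filename = "wap.xml";   /* assumed path */
        char chunk[1024];                   /* chunk size from the report */
        FILE *f = fopen(filename, "rb");
        if (f == NULL)
            return 1;

        /* Prime the context with the first 4 bytes so libxml2 can
         * auto-detect the encoding from the BOM / first bytes. */
        size_t res = fread(chunk, 1, 4, f);
        xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                                        chunk, (int)res,
                                                        filename);
        if (ctxt == NULL) {
            fclose(f);
            return 1;
        }

        /* Feed the rest of the file chunk by chunk. */
        while ((res = fread(chunk, 1, sizeof(chunk), f)) > 0)
            xmlParseChunk(ctxt, chunk, (int)res, 0);
        xmlParseChunk(ctxt, chunk, 0, 1);   /* signal end of input */

        int ok = ctxt->wellFormed;
        xmlDocPtr doc = ctxt->myDoc;
        xmlFreeParserCtxt(ctxt);
        if (doc != NULL)
            xmlFreeDoc(doc);
        fclose(f);
        return ok ? 0 : 1;
    }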
I get the same results with libxml2 ver. 2.5.10.
You seem to be a little confused about encoding. The encoding attribute of the XML declaration is supposed to specify the encoding of the source file. If you have a UTF-16 (or UTF-16LE or UTF-16BE) encoded file, and within that file you specify that the encoding is ISO-8859-1, you are "not being truthful" :-). When the parser encounters this declaration, it switches out of UTF-16(LE,BE) mode and into ISO-8859-1. Because of the internal buffering, the first few lines apparently work "ok", and it is only on the later lines that the trouble becomes apparent (in fact, this depends upon the chunk size).

I'm not certain what you are actually trying to accomplish. If you want to use xmllint to process a UTF-16 input and produce an ISO-8859-1 output, then use the "--encode" parameter on xmllint. If you are working with your own code, you must use an appropriate output function (look at the coding of xmllint.c for an example).
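For illustration, assuming the goal really were re-encoding the output (which, as the next comment shows, it is not), the two routes mentioned above would look roughly like this. The command line uses xmllint's --encode option; the C snippet is a simplified sketch using xmlSaveFileEnc() rather than the lower-level output buffers xmllint.c itself uses, and the function name and file names are made up for the example.

    xmllint --encode ISO-8859-1 wap.xml

    /* Sketch of "an appropriate output function": parse the document,
     * then serialize it in an explicitly chosen output encoding. */
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    int save_as_latin1(const char *in, const char *out) {
        xmlDocPtr doc = xmlReadFile(in, NULL, 0);
        if (doc == NULL)
            return -1;
        int ret = xmlSaveFileEnc(out, doc, "ISO-8859-1");
        xmlFreeDoc(doc);
        return ret;   /* bytes written, or -1 on error */
    }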
I'm trying to implement the DOMString (DOM level 3) in Delphi. I need it to be UTF-16LE encoded, regardless of the declared encoding (as defined by the W3C specs). So if I parse a DOMString with libxml2, the actual encoding of the DOMString can differ from the declared one. Since I learned that libxml2 will switch encoding if it auto-detects UTF-16 (it looks at the first 4 bytes), this seemed to me the only chance to let libxml2 eat UTF-16LE while the declaration says otherwise.

William M. Brack wrote: "When the parser encounters this declaration, it switches out of UTF-16(LE,BE) mode and into ISO-8859-1."
If this is so, why does it *not* switch to ISO-8859-1 if it gets the whole XML with the first chunk? Does it use the declared encoding only after the first chunk? I assumed that libxml2 would stick with the encoding it detected first.

William M. Brack wrote: "If you want to use xmllint to process a UTF-16 input and produce an ISO-8859-1 output [...]"
No, this wasn't intended.

OK, I see that my approach simply doesn't seem to work with libxml2. So this is not a bug. Thanks.
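As an aside, one possible way to make libxml2 treat a UTF-16LE DOMString as UTF-16LE regardless of what the declaration claims is to pass the encoding explicitly instead of relying on auto-detection. This is only a sketch: it assumes a libxml2 version that provides the xmlReadMemory API (2.6 and later), and the function name, buffer name, and URL argument are illustrative, not anything from this report.

    /* Sketch: parse an in-memory DOMString while telling libxml2 up
     * front that the bytes are UTF-16LE, rather than trusting the BOM
     * or the (possibly different) declared encoding. */
    #include <libxml/parser.h>

    xmlDocPtr parse_domstring(const char *bytes, int len) {
        /* An explicitly passed encoding is meant to take precedence
         * over both BOM auto-detection and the encoding declaration. */
        return xmlReadMemory(bytes, len, "domstring.xml", "UTF-16LE",
                             XML_PARSE_NOERROR | XML_PARSE_NOWARNING);
    }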
Whatever the initial encoding of your document, libxml2 APIs will only work with UTF-8 internally. No amount of tricks will bypass that; you will have to convert from UTF-8 to UTF-16 no matter what,

Daniel
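In case it helps a later reader, one way to do that UTF-8 to UTF-16 conversion with libxml2's own encoding layer is sketched below. The handler-based route is just one option (iconv or a platform API would work equally well), and the function name is made up for the example.

    /* Sketch: convert a UTF-8 string coming out of libxml2 (e.g. node
     * content) into UTF-16LE bytes using libxml2's encoding handlers. */
    #include <string.h>
    #include <libxml/encoding.h>
    #include <libxml/tree.h>   /* xmlBuffer API */

    /* Returns a buffer holding the UTF-16LE bytes, or NULL on failure.
     * The caller frees it with xmlBufferFree(). */
    xmlBufferPtr utf8_to_utf16le(const xmlChar *utf8) {
        xmlCharEncodingHandlerPtr handler =
            xmlFindCharEncodingHandler("UTF-16LE");
        if (handler == NULL)
            return NULL;

        xmlBufferPtr in = xmlBufferCreate();
        xmlBufferPtr out = xmlBufferCreate();
        xmlBufferAdd(in, utf8, (int)strlen((const char *)utf8));

        /* xmlCharEncOutFunc() converts from the internal UTF-8 into the
         * handler's target encoding. */
        if (xmlCharEncOutFunc(handler, out, in) < 0) {
            xmlBufferFree(in);
            xmlBufferFree(out);
            return NULL;
        }
        xmlBufferFree(in);
        return out;
    }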
Daniel? Pardon me, but I don't understand your statement. I am aware that libxml2 uses UTF-8 internally. What do you mean by "convert from UTF-8 to UTF-16"? I just need to parse a DOMString that is encoded in UTF-16LE but carries a different encoding declaration. Or did you just hit the wrong report?