GNOME Bugzilla – Bug 126197
xmlParseChunk with UTF-16LE fails in a specific scenario
Last modified: 2009-08-15 18:40:50 UTC
It seems that the push parser fails to parse when the following scenario is given:
1. multiple calls to xmlParseChunk are performed
2. the input is UTF-16LE encoded with a BOM (BE as well?)
3. the XML declaration states a different encoding

I can reproduce this with xmllint and a modified version of the file "libxml2\test\wap.xml". Note that 'encoding="iso-8859-1"' was added to the prolog of the file "wap.xml" and the file was UTF-16LE encoded (with BOM).

P:\tests\unicodeConsole>xmllint --push wap.xml
wap.xml:12: parser error : AttValue: ' expected
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^
wap.xml:12: parser error : attributes construct error
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^
wap.xml:12: parser error : Couldn't find end of Start Tag postfield
<postfield name="tp" value="wml/state/variables/parsing/1
                                                         ^

- Everything works fine if I recompile xmllint with a chunk size of 4096 instead of 1024.
- Everything works fine if the file is *not* encoded in UTF-16 (e.g. UTF-8).
- Everything works fine if the file *is* encoded in UTF-16LE *and* the prolog declares an encoding of "UTF-16LE".
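For reference, a minimal C sketch of the push-parser loop described above. The 4-byte priming read and the 1024-byte chunk size mirror what xmllint --push roughly does; the file name and the error handling are illustrative assumptions, not a copy of xmllint.c.

    /* Sketch: feed a file to libxml2's push parser in fixed-size chunks. */
    #include <stdio.h>
    #include <libxml/parser.h>

    int main(void) {
        const char *filename = "wap.xml";   /* assumed path */
        char chunk[1024];                   /* chunk size from the report */
        FILE *f = fopen(filename, "rb");
        if (f == NULL)
            return 1;

        /* Prime the context with the first 4 bytes so libxml2 can
         * auto-detect the encoding from the BOM / first bytes. */
        size_t res = fread(chunk, 1, 4, f);
        xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                                                        chunk, (int)res,
                                                        filename);
        if (ctxt == NULL) {
            fclose(f);
            return 1;
        }

        /* Feed the rest of the file chunk by chunk. */
        while ((res = fread(chunk, 1, sizeof(chunk), f)) > 0)
            xmlParseChunk(ctxt, chunk, (int)res, 0);
        xmlParseChunk(ctxt, chunk, 0, 1);   /* signal end of input */

        int ok = ctxt->wellFormed;
        xmlDocPtr doc = ctxt->myDoc;
        xmlFreeParserCtxt(ctxt);
        if (doc != NULL)
            xmlFreeDoc(doc);
        fclose(f);
        return ok ? 0 : 1;
    }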
I get the same results with libxml2 ver. 2.5.10.
You seem to be a little confused about encoding. The encoding attribute of the XML declaration is supposed to specify the encoding of the source file. If you have a UTF-16 (or UTF-16LE or UTF-16BE) encoded file, and within that file you specify that the encoding is ISO-8859-1, you are "not being truthful" :-). When the parser encounters this declaration, it switches out of UTF-16(LE,BE) mode and into ISO-8859-1. Because of the internal buffering, the first few lines apparently work "ok", and it is only on the later lines that the trouble becomes apparent (in fact, this depends upon the chunk size).

I'm not certain what you are actually trying to accomplish. If you want to use xmllint to process a UTF-16 input and produce an ISO-8859-1 output, then use the "--encode" parameter on xmllint. If you are working with your own code, you must use an appropriate output function (look at the coding of xmllint.c for an example).
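For illustration, assuming the goal really were re-encoding the output (which, as the next comment shows, it is not), the two routes mentioned above would look roughly like this. The command line uses xmllint's --encode option; the C snippet is a simplified sketch using xmlSaveFileEnc() rather than the lower-level output buffers xmllint.c itself uses, and the function name and file names are made up for the example.

    xmllint --encode ISO-8859-1 wap.xml

    /* Sketch of "an appropriate output function": parse the document,
     * then serialize it in an explicitly chosen output encoding. */
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    int save_as_latin1(const char *in, const char *out) {
        xmlDocPtr doc = xmlReadFile(in, NULL, 0);
        if (doc == NULL)
            return -1;
        int ret = xmlSaveFileEnc(out, doc, "ISO-8859-1");
        xmlFreeDoc(doc);
        return ret;   /* bytes written, or -1 on error */
    }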
I'm trying to implement the DOMString (DOM level 3) in Delphi. I need it to be UTF-16LE encoded, regardless of the declared encoding (as defined by the W3C specs). So if I parse a DOMString with libxml2, the actual encoding of the DOMString can differ from the declared one. Since I learned that libxml2 will switch encoding if it auto-detects UTF-16 (it looks at the first 4 bytes), this seemed to me the only chance to let libxml2 eat UTF-16LE while the declaration says otherwise.

William M. Brack wrote: "When the parser encounters this declaration, it switches out of UTF-16(LE,BE) mode and into ISO-8859-1."
If this is so, why does it *not* switch to ISO-8859-1 if it gets the whole XML with the first chunk? Does it use the declared encoding only after the first chunk? I assumed that libxml2 would stick with the encoding it detected first.

William M. Brack wrote: "If you want to use xmllint to process a UTF-16 input and produce an ISO-8859-1 output [...]"
No, this wasn't intended.

OK, I see that my approach simply doesn't seem to work with libxml2. So this is not a bug. Thanks.
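As an aside, one possible way to make libxml2 treat a UTF-16LE DOMString as UTF-16LE regardless of what the declaration claims is to pass the encoding explicitly instead of relying on auto-detection. This is only a sketch: it assumes a libxml2 version that provides the xmlReadMemory API (2.6 and later), and the function name, buffer name, and URL argument are illustrative, not anything from this report.

    /* Sketch: parse an in-memory DOMString while telling libxml2 up
     * front that the bytes are UTF-16LE, rather than trusting the BOM
     * or the (possibly different) declared encoding. */
    #include <libxml/parser.h>

    xmlDocPtr parse_domstring(const char *bytes, int len) {
        /* An explicitly passed encoding is meant to take precedence
         * over both BOM auto-detection and the encoding declaration. */
        return xmlReadMemory(bytes, len, "domstring.xml", "UTF-16LE",
                             XML_PARSE_NOERROR | XML_PARSE_NOWARNING);
    }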
Whatever the initial encoding of your document, libxml2 APIs will only work with UTF-8 internally. No amount of tricks will bypass that; you will have to convert from UTF-8 to UTF-16 no matter what,

Daniel
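In case it helps a later reader, one way to do that UTF-8 to UTF-16 conversion with libxml2's own encoding layer is sketched below. The handler-based route is just one option (iconv or a platform API would work equally well), and the function name is made up for the example.

    /* Sketch: convert a UTF-8 string coming out of libxml2 (e.g. node
     * content) into UTF-16LE bytes using libxml2's encoding handlers. */
    #include <string.h>
    #include <libxml/encoding.h>
    #include <libxml/tree.h>   /* xmlBuffer API */

    /* Returns a buffer holding the UTF-16LE bytes, or NULL on failure.
     * The caller frees it with xmlBufferFree(). */
    xmlBufferPtr utf8_to_utf16le(const xmlChar *utf8) {
        xmlCharEncodingHandlerPtr handler =
            xmlFindCharEncodingHandler("UTF-16LE");
        if (handler == NULL)
            return NULL;

        xmlBufferPtr in = xmlBufferCreate();
        xmlBufferPtr out = xmlBufferCreate();
        xmlBufferAdd(in, utf8, (int)strlen((const char *)utf8));

        /* xmlCharEncOutFunc() converts from the internal UTF-8 into the
         * handler's target encoding. */
        if (xmlCharEncOutFunc(handler, out, in) < 0) {
            xmlBufferFree(in);
            xmlBufferFree(out);
            return NULL;
        }
        xmlBufferFree(in);
        return out;
    }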
Daniel? Pardon me, but I don't understand your statement. I am aware that libxml2 uses UTF-8 internally. What do you mean by "convert from UTF-8 to UTF-16"? I just need to parse a DOMString that is encoded in UTF-16LE but carries a different encoding declaration. Or did you just hit the wrong report?