Bug 566012 – Incomplete EBCDIC parsing support

Summary:	Incomplete EBCDIC parsing support


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-12-30 08:38 UTC by Martin Kögler
Modified:	2009-08-28 12:55 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
test data (63 bytes, application/octet-stream) 2009-08-25 05:53 UTC, Martin Kögler
long <?xml tag 1 (69 bytes, text/xml) 2009-08-26 21:23 UTC, Martin Kögler
long <?xml tag 2 (68 bytes, application/octet-stream) 2009-08-26 21:24 UTC, Martin Kögler
short <?xml tag (60 bytes, text/xml) 2009-08-26 21:25 UTC, Martin Kögler

Description Martin Kögler 2008-12-30 08:38:21 UTC

Parsing an EBCDIC document does not work on a normal Linux distribution, because glibc (iconv) and libxml2 do not agree on the name of the EBCDIC encoding.

libxml2 detects the EBCDIC encoding correctly and tries to load a handler for it:

encoding.c, line 1456
 case XML_CHAR_ENCODING_EBCDIC:
 	handler = xmlFindCharEncodingHandler("EBCDIC");
 	if (handler != NULL) return(handler);
 	handler = xmlFindCharEncodingHandler("ebcdic");
 	if (handler != NULL) return(handler);
 	break;

The problem is that glibc has no encoding named "EBCDIC", only EBCDIC variants with a country code appended (see iconv -l, http://repo.or.cz/w/glibc.git?a=blob;f=iconvdata/gconv-modules;h=e70432fcaab12449b41c7a726a268b46bcb0ddb6;hb=eec0e3dcca8ec0005c6a2296057e4e46f8a6481a).
This is probably because there are many EBCDIC code pages, which share only the basic characters [see http://www-01.ibm.com/software/globalization/cp/cp_cpgid.jsp for details].

For decoding the <?xml version="XXX" encoding="XXX"?>, EBCDIC-US should be sufficient. Therefore, I suggest adding the following as a third EBCDIC encoding to try:
--- encoding.c.old      2008-12-30 09:21:13.000000000 +0100
+++ encoding.c  2008-12-30 09:21:56.000000000 +0100
@@ -1458,6 +1458,8 @@ xmlGetCharEncodingHandler(xmlCharEncodin
             if (handler != NULL) return(handler);
             handler = xmlFindCharEncodingHandler("ebcdic");
             if (handler != NULL) return(handler);
+            handler = xmlFindCharEncodingHandler("EBCDIC-US");
+            if (handler != NULL) return(handler);
            break;
         case XML_CHAR_ENCODING_UCS4BE:
             handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4");

With this change, it is possible to parse XML documents in EBCDIC. 
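
Editor's note: the fallback idea can be tried outside libxml2. The following is a minimal sketch, assuming a POSIX iconv (as in glibc); it is not the libxml2 code, and the helper names iconv_knows and first_known_encoding are hypothetical. It walks a list of candidate names and returns the first one that iconv_open accepts.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Return 1 if the local iconv can convert from `name` to UTF-8, else 0.
 * (Hypothetical helper, not part of libxml2.) */
static int iconv_knows(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}

/* Return the first of the `n` names that iconv accepts, or NULL if none is. */
static const char *first_known_encoding(const char *const *names, size_t n)
{
    size_t i;
    assert(names != NULL);
    for (i = 0; i < n; i++)
        if (iconv_knows(names[i]))
            return names[i];
    return NULL;
}
```

Called with the list { "EBCDIC", "ebcdic", "EBCDIC-US" }, the result depends on which names the installed iconv provides; on the system described above only the third would be expected to match.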

There is a second problem:
After the real encoding has been read from the <?xml tag, xmlSwitchInputEncoding is called with the encoding declared in the document. The function changes the encoding of the input stream, but does not discard the already converted data, so the old encoding is still used for the first few kB of the document.

How to reproduce:
* patch libxml2 with the above patch
* create EBCDIC XML file
/tmp/libxml2-2.7.2$ cat t1.xml
<?xml version="1.0" encoding="IBM-1141" ?>
<test attr="ÄÖÜ" />
/tmp/libxml2-2.7.2$ cat t1.xml  |hexdump -C
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 42  |.0" encoding="IB|
00000020  4d 2d 31 31 34 31 22 20  3f 3e 0a 3c 74 65 73 74  |M-1141" ?>.<test|
00000030  20 61 74 74 72 3d 22 c4  d6 dc 22 20 2f 3e 0a     | attr="ÄÖÜ" />.|
0000003f
/tmp/libxml2-2.7.2$ iconv -f ISO-8859-1 -t IBM-1141 < t1.xml > t2.xml
/tmp/libxml2-2.7.2$ hexdump -C t2.xml
00000000  4c 6f a7 94 93 40 a5 85  99 a2 89 96 95 7e 7f f1  |Lo§..@¥..¢...~.ñ|
00000010  4b f0 7f 40 85 95 83 96  84 89 95 87 7e 7f c9 c2  |Kð.@........~.ÉÂ|
00000020  d4 60 f1 f1 f4 f1 7f 40  6f 6e 25 4c a3 85 a2 a3  |Ô`ññôñ.@on%L£.¢£|
00000030  40 81 a3 a3 99 7e 7f 4a  e0 5a 7f 40 61 6e 25     |@.££.~.JàZ.@an%|
0000003f
* parse it with libxml2:
/tmp/libxml2-2.7.2$ ./xmllint -format -encode ISO-8859-1 t2.xml | hexdump -C
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 53  |.0" encoding="IS|
00000020  4f 2d 38 38 35 39 2d 31  22 3f 3e 0a 3c 74 65 73  |O-8859-1"?>.<tes|
00000030  74 20 61 74 74 72 3d 22  a2 5c 21 22 2f 3e 0a     |t attr="¢\!"/>.|
0000003f

If the encoding switch had been done correctly, the attribute attr would contain "ÄÖÜ".

Martin Kögler
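
Editor's note: the garbage bytes a2 5c 21 are what the three attribute bytes 4a e0 5a decode to when the generic handler is still in use. The sketch below hard-codes only those three code points, with values read from the hexdumps in this report rather than from a full code page table; the enum and function names are hypothetical.

```c
#include <assert.h>

enum ebcdic_variant { EBCDIC_US, IBM_1141 };

/* Decode one EBCDIC byte to its ISO-8859-1 value for the three code points
 * exercised by t2.xml. Returns -1 for any other byte. The values come from
 * the hexdumps in this report, not from a complete mapping table. */
static int decode_sample_byte(enum ebcdic_variant v, unsigned char b)
{
    assert(v == EBCDIC_US || v == IBM_1141);
    switch (b) {
    case 0x4A: return v == IBM_1141 ? 0xC4 /* A-umlaut */ : 0xA2 /* cent sign */;
    case 0xE0: return v == IBM_1141 ? 0xD6 /* O-umlaut */ : 0x5C /* backslash */;
    case 0x5A: return v == IBM_1141 ? 0xDC /* U-umlaut */ : 0x21 /* exclamation */;
    default:   return -1;
    }
}
```

Decoding 4a e0 5a with IBM_1141 gives c4 d6 dc ("ÄÖÜ"), while EBCDIC_US gives a2 5c 21, matching the xmllint output above.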

Comment 1 Daniel Veillard 2009-08-24 14:50:08 UTC

Okay, trivial, no problem, applied and committed, but
next time you have a patch, please add it as a Bugzilla attachment
flagged as such; it will help make sure it is handled quickly

  thanks !

Daniel

Comment 2 Martin Kögler 2009-08-25 05:53:27 UTC

Created attachment 141609 [details]
test data

XML in encoding IBM-1141

Comment 3 Martin Kögler 2009-08-25 06:01:02 UTC

The second part of the problem is still present in the current Git version.

I have created an XML file containing some special characters (see attachment). If I convert the XML file via iconv, all characters in the attr attribute are translated properly:

$ iconv -f IBM-1141 -t ISO-8859-1 < t2.xml |hexdump -C
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 42  |.0" encoding="IB|
00000020  4d 2d 31 31 34 31 22 20  3f 3e 0a 3c 74 65 73 74  |M-1141" ?>.<test|
00000030  20 61 74 74 72 3d 22 c4  d6 dc 22 20 2f 3e 0a     | attr="ÄÖÜ" />.|
0000003f

If I convert the encoding via xmllint, the attr attribute contains garbage:
$ ./xmllint -format -encode ISO-8859-1 t2.xml | hexdump -C
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 49 53  |.0" encoding="IS|
00000020  4f 2d 38 38 35 39 2d 31  22 3f 3e 0a 3c 74 65 73  |O-8859-1"?>.<tes|
00000030  74 20 61 74 74 72 3d 22  a2 5c 21 22 2f 3e 0a     |t attr="¢\!"/>.|
0000003f

The problem is that libxml2 decodes a larger part of the XML file with a generic EBCDIC encoding while reading the <?xml tag. After switching to the correct encoding from the <?xml tag, it keeps some data that was decoded with the generic EBCDIC encoding and does not reconvert it using the correct encoding.

For reference, the hexdump of the attachment.
$ hexdump -C t2.xml
00000000  4c 6f a7 94 93 40 a5 85  99 a2 89 96 95 7e 7f f1  |Lo§..@¥..¢...~.ñ|
00000010  4b f0 7f 40 85 95 83 96  84 89 95 87 7e 7f c9 c2  |Kð.@........~.ÉÂ|
00000020  d4 60 f1 f1 f4 f1 7f 40  6f 6e 25 4c a3 85 a2 a3  |Ô`ññôñ.@on%L£.¢£|
00000030  40 81 a3 a3 99 7e 7f 4a  e0 5a 7f 40 61 6e 25     |@.££.~.JàZ.@an%|
0000003f
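
Editor's note: one way to picture the boundary that matters here is to locate the end of the <?xml tag in the raw EBCDIC bytes, so that only that prefix needs the generic encoding. This is a sketch of the idea only, not how libxml2 does it; the function name is hypothetical and the byte values 6f ('?') and 6e ('>') are taken from the hexdump above.

```c
#include <assert.h>
#include <stddef.h>

/* EBCDIC byte values for '?' and '>' as they appear in the hexdump
 * (6f 6e, followed by the 25 line feed). */
#define EBCDIC_QMARK 0x6F
#define EBCDIC_GT    0x6E

/* Return the offset just past the first "?>" in an EBCDIC buffer,
 * or 0 if the sequence does not occur. */
static size_t ebcdic_xmldecl_end(const unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL);
    for (i = 0; i + 1 < len; i++)
        if (buf[i] == EBCDIC_QMARK && buf[i + 1] == EBCDIC_GT)
            return i + 2;
    return 0;
}
```

For the 63 bytes of t2.xml this returns 42 (0x2a), the offset of the line feed that ends the first line; everything from there on would have to be decoded with the declared encoding.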

Comment 4 Daniel Veillard 2009-08-26 12:40:42 UTC

Okay, you're exhibiting a pathological worst case where the initial
autodetected encoding is not fully compatible with the declared one,
and where the conflict shows up at the very beginning of the document.
I had to change quite a bit to avoid having the initial encoder convert
more than the first line. This raised an actual bug in the regression suite
and fixing the problem for the push parser mode was even more crazy.

  But this should all be sorted out now, I added your test to
the regression suite, thanks for following up !

   fix in git head,

paphio:~/XML -> ./xmllint -format -encode UTF-8 t2.xml 
<?xml version="1.0" encoding="UTF-8"?>
<test attr="ÄÖÜ"/>
paphio:~/XML -> ./xmllint --push -format -encode UTF-8 t2.xml 
<?xml version="1.0" encoding="UTF-8"?>
<test attr="ÄÖÜ"/>
paphio:~/XML -> 

  thanks,

Daniel

Comment 5 Martin Kögler 2009-08-26 21:23:32 UTC

Created attachment 141796 [details]
long <?xml tag 1

Comment 6 Martin Kögler 2009-08-26 21:24:14 UTC

Created attachment 141797 [details]
long <?xml tag 2

Comment 7 Martin Kögler 2009-08-26 21:25:42 UTC

Created attachment 141798 [details]
short <?xml tag

Comment 8 Martin Kögler 2009-08-26 21:38:49 UTC

Your 45-byte limit is still broken in some corner cases:

Let's start with tc2.xml (attachment "long <?xml tag 2"):
$ iconv -f IBM-1141 -t ISO-8859-15 < tc2.xml
<?xml version="1.0"  encoding="EBCDIC-AT-DE" ?>
<test attr="äöü" />
$ ./xmllint -format --encode ISO-8859-15 tc2.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<test attr="äöü"/>

=> Everything works.

Let's move on to a slightly modified version, ta2.xml (attachment "long <?xml tag 1"):
$ iconv -f IBM-1141 -t ISO-8859-15 < ta2.xml
<?xml version="1.0"   encoding="EBCDIC-AT-DE" ?>
<test attr="äöü" />
$ ./xmllint -format --encode ISO-8859-15 ta2.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<test attr="{&#166;}"/>

ta2.xml has only one additional blank in the XML tag (thereby exceeding the 45-byte limit), which makes the decoding fail.

Finally, an example where the 45 bytes are too long: tb2.xml (attachment "short <?xml tag")
$ iconv -f CP273 -t ISO-8859-15 < tb2.xml
<?xml version="1.0" encoding="CP273"?>
<ätest attr="ÄÖÜ" />
$ ./xmllint -format --encode ISO-8859-15 tb2.xml
tb2.xml:2: parser error : StartTag: invalid element name
<{test attr="ÃÃÃ" />
 ^
tb2.xml:2: parser error : Extra content at the end of the document
<{test attr="ÃÃÃ" />

This shows that it is only broken on EBCDIC:
$ iconv -f CP273 -t ISO-8859-15 < tb2.xml > tb3.xml
$ sed -ri s/CP273/ISO-8859-15/g tb3.xml
$ ./xmllint -format --encode ISO-8859-15 tb3.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<ätest attr="ÄÖÜ"/>

So, huge improvement, but not perfect.
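
Editor's note: to compare the three test cases against the 45-byte window, it helps to measure how long each <?xml tag actually is. The sketch below does this on the ASCII form shown by iconv; because both encodings are single-byte, the lengths are the same in the EBCDIC files. The function name is hypothetical, and the sketch makes no claim about how libxml2 applies its limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the XML declaration at the start of `doc`, including
 * the closing "?>". Returns 0 if `doc` does not start with a declaration.
 * Intended for single-byte encodings only. */
static size_t xmldecl_length(const char *doc)
{
    const char *end;
    assert(doc != NULL);
    if (strncmp(doc, "<?xml", 5) != 0)
        return 0;
    end = strstr(doc, "?>");
    return end ? (size_t)(end - doc) + 2 : 0;
}
```

This gives 47 bytes for tc2.xml, 48 for ta2.xml and 38 for tb2.xml. For tb2.xml a 45-byte window therefore reaches 7 bytes past the declaration, into the start tag, which is consistent with the "<{test" in the error message. The lengths alone do not explain why 47 works and 48 fails; that depends on libxml2 internals not shown in this report.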

Comment 9 Daniel Veillard 2009-08-28 12:55:54 UTC

Yes, it's only EBCDIC: all other encodings are at least compatible
with ASCII for the characters needed to decode the XMLDecl, or
don't have incompatible variations, like UTF-16 or UCS4.

So this is an EBCDIC-only bug, and I think I fixed it. You will
always be able to tweak the autodetection on the first line;
my goal is to solve the problem for real use, not to spend a decade
chasing nonexistent EBCDIC-related problems.

So I won't spend more time on this unless it raises a real problem.
I consider this resolved fixed, and I may apply more patches if they
are provided and don't break the normal case.

Daniel
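
Editor's note: the distinction drawn in the last comment can be seen in the first four bytes of a document. The XML 1.0 specification (Appendix F) lists 3C 3F 78 6D for ASCII-compatible encodings and 4C 6F A7 94 for EBCDIC, the latter matching the start of the hexdump of t2.xml. A minimal sketch of that check, with hypothetical names and without the UTF-16/UCS-4 cases:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum decl_family { FAMILY_UNKNOWN, FAMILY_ASCII, FAMILY_EBCDIC };

/* Classify a document by the first four bytes of "<?xm".
 * Only the ASCII-compatible and EBCDIC signatures are handled here. */
static enum decl_family sniff_decl_family(const unsigned char *buf, size_t len)
{
    static const unsigned char ascii_sig[4]  = { 0x3C, 0x3F, 0x78, 0x6D };
    static const unsigned char ebcdic_sig[4] = { 0x4C, 0x6F, 0xA7, 0x94 };
    assert(buf != NULL);
    if (len < 4)
        return FAMILY_UNKNOWN;
    if (memcmp(buf, ascii_sig, 4) == 0)
        return FAMILY_ASCII;
    if (memcmp(buf, ebcdic_sig, 4) == 0)
        return FAMILY_EBCDIC;
    return FAMILY_UNKNOWN;
}
```

In the ASCII-compatible case the rest of the declaration can be read safely whatever the declared encoding is; in the EBCDIC case the family is known but the exact code page is not, which is where the problem in this report comes from.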