GNOME Bugzilla – Bug 566012
Incomplete EBCDIC parsing support
Last modified: 2009-08-28 12:55:54 UTC
Parsing an EBCDIC document does not work on a normal Linux distribution, because glibc (iconv) and libxml2 disagree on the name of the encoding. libxml2 detects the EBCDIC encoding correctly and tries to load a handler for it:

encoding.c, line 1456

        case XML_CHAR_ENCODING_EBCDIC:
            handler = xmlFindCharEncodingHandler("EBCDIC");
            if (handler != NULL) return(handler);
            handler = xmlFindCharEncodingHandler("ebcdic");
            if (handler != NULL) return(handler);
            break;

The problem is that glibc has no encoding named "EBCDIC", only EBCDIC with a country appended (iconv -l, http://repo.or.cz/w/glibc.git?a=blob;f=iconvdata/gconv-modules;h=e70432fcaab12449b41c7a726a268b46bcb0ddb6;hb=eec0e3dcca8ec0005c6a2296057e4e46f8a6481a). This is probably because there are lots of EBCDIC codepages, which only share the basic characters [see http://www-01.ibm.com/software/globalization/cp/cp_cpgid.jsp for details].

For decoding the <?xml version="XXX" encoding="XXX"?>, EBCDIC-US should be sufficient. Therefore, I suggest adding the following as a third EBCDIC encoding to try:

--- encoding.c.old 2008-12-30 09:21:13.000000000 +0100
+++ encoding.c 2008-12-30 09:21:56.000000000 +0100
@@ -1458,6 +1458,8 @@ xmlGetCharEncodingHandler(xmlCharEncodin
             if (handler != NULL) return(handler);
             handler = xmlFindCharEncodingHandler("ebcdic");
             if (handler != NULL) return(handler);
+            handler = xmlFindCharEncodingHandler("EBCDIC-US");
+            if (handler != NULL) return(handler);
             break;
         case XML_CHAR_ENCODING_UCS4BE:
             handler = xmlFindCharEncodingHandler("ISO-10646-UCS-4");

With this change, it is possible to parse XML documents in EBCDIC.

There is a second problem: after the real encoding has been read from the <?xml tag, xmlSwitchInputEncoding is called with the encoding contained in the document. The function changes the encoding of the input stream, but does not discard the already converted data. So the old encoding is used for the first few kB of the document.
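The lookup order proposed in the patch can be sketched in isolation. This is only an illustration, not libxml2 code: first_available, encoding_lookup_fn and glibc_like_lookup are hypothetical names, and the list of "known" converters is a stand-in for a system that offers only country-specific EBCDIC names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: nonzero if a converter with this name exists.
 * In libxml2 this role is played by xmlFindCharEncodingHandler. */
typedef int (*encoding_lookup_fn)(const char *name);

/* Return the first candidate name the lookup accepts, or NULL if none. */
const char *first_available(const char *const *candidates, size_t n,
                            encoding_lookup_fn lookup)
{
    size_t i;

    assert(candidates != NULL && lookup != NULL);
    for (i = 0; i < n; i++) {
        if (lookup(candidates[i]))
            return candidates[i];
    }
    return NULL;
}

/* Stand-in (assumption) for a converter set without a generic "EBCDIC". */
int glibc_like_lookup(const char *name)
{
    static const char *const known[] = { "EBCDIC-US", "EBCDIC-AT-DE", "IBM-1141" };
    size_t i;

    for (i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(name, known[i]) == 0)
            return 1;
    }
    return 0;
}
```

With the candidates "EBCDIC" and "ebcdic" alone, the sketch finds nothing. Appending "EBCDIC-US" as a third candidate makes the lookup succeed, which is the effect the patch aims for.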
How to reproduce:

* patch libxml2 with the above patch

* create EBCDIC XML file

/tmp/libxml2-2.7.2$ cat t1.xml
<?xml version="1.0" encoding="IBM-1141" ?>
<test attr="ÄÖÜ" />
/tmp/libxml2-2.7.2$ cat t1.xml |hexdump -C
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 49 42 |.0" encoding="IB|
00000020 4d 2d 31 31 34 31 22 20 3f 3e 0a 3c 74 65 73 74 |M-1141" ?>.<test|
00000030 20 61 74 74 72 3d 22 c4 d6 dc 22 20 2f 3e 0a | attr="ÄÖÜ" />.|
0000003f
/tmp/libxml2-2.7.2$ iconv -f ISO-8859-1 -t IBM-1141 < t1.xml > t2.xml
/tmp/libxml2-2.7.2$ hexdump -C t2.xml
00000000 4c 6f a7 94 93 40 a5 85 99 a2 89 96 95 7e 7f f1 |Lo§..@¥..¢...~.ñ|
00000010 4b f0 7f 40 85 95 83 96 84 89 95 87 7e 7f c9 c2 |Kð.@........~.ÉÂ|
00000020 d4 60 f1 f1 f4 f1 7f 40 6f 6e 25 4c a3 85 a2 a3 |Ô`ññôñ.@on%L£.¢£|
00000030 40 81 a3 a3 99 7e 7f 4a e0 5a 7f 40 61 6e 25 |@.££.~.JàZ.@an%|
0000003f

* parse it with libxml2:

/tmp/libxml2-2.7.2$ ./xmllint -format -encode ISO-8859-1 t2.xml | hexdump -C
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 49 53 |.0" encoding="IS|
00000020 4f 2d 38 38 35 39 2d 31 22 3f 3e 0a 3c 74 65 73 |O-8859-1"?>.<tes|
00000030 74 20 61 74 74 72 3d 22 a2 5c 21 22 2f 3e 0a |t attr="¢\!"/>.|
0000003f

If the encoding switch had been done correctly, the attribute attr would contain "ÄÖÜ".

Martin Kögler
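The claim that a basic EBCDIC table is enough for the XML declaration can be checked against the two hexdumps of t1.xml and t2.xml. The sketch below is a hand-built partial table, not an iconv or libxml2 converter: it lists only the byte pairs that can be read off those dumps, and decode_decl is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* EBCDIC byte -> ASCII pairs read off the t1.xml/t2.xml hexdumps.
 * Only the bytes occurring in the declaration line are listed. */
struct pair { unsigned char ebcdic; char ascii; };

static const struct pair decl_pairs[] = {
    {0x4c, '<'}, {0x6f, '?'}, {0x6e, '>'}, {0x7e, '='}, {0x7f, '"'},
    {0x40, ' '}, {0x4b, '.'}, {0x60, '-'}, {0x25, '\n'},
    {0xf0, '0'}, {0xf1, '1'}, {0xf4, '4'},
    {0xc2, 'B'}, {0xc9, 'I'}, {0xd4, 'M'},
    {0x83, 'c'}, {0x84, 'd'}, {0x85, 'e'}, {0x87, 'g'}, {0x89, 'i'},
    {0x93, 'l'}, {0x94, 'm'}, {0x95, 'n'}, {0x96, 'o'}, {0x99, 'r'},
    {0xa2, 's'}, {0xa5, 'v'}, {0xa7, 'x'},
};

/* Decode up to n bytes into out (which must hold n + 1 chars).
 * Stops at the first byte missing from the table; returns the count. */
size_t decode_decl(const unsigned char *in, size_t n, char *out)
{
    const size_t count = sizeof decl_pairs / sizeof decl_pairs[0];
    size_t i, k;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        for (k = 0; k < count; k++) {
            if (decl_pairs[k].ebcdic == in[i])
                break;
        }
        if (k == count)
            break;              /* byte outside the partial table */
        out[i] = decl_pairs[k].ascii;
    }
    out[i] = '\0';
    return i;
}
```

Fed the first 42 bytes of t2.xml, this yields the declaration text. The attribute bytes 4a e0 5a are not in the table, which is exactly where the codepages start to differ.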
Okay, trivial, no problem, applied and committed. But next time you have a patch, please add it as a bugzilla attachment and flag it as such; that will help make sure it is handled quickly. Thanks! Daniel
Created attachment 141609 test data XML in encoding IBM-1141
The second part of the problem is still present in the current GIT version. I have created an XML file containing some special characters (see attachment).

If I convert the XML file via iconv, all characters in the attr attribute are translated properly:

$ iconv -f IBM-1141 -t ISO-8859-1 < t2.xml |hexdump -C
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 49 42 |.0" encoding="IB|
00000020 4d 2d 31 31 34 31 22 20 3f 3e 0a 3c 74 65 73 74 |M-1141" ?>.<test|
00000030 20 61 74 74 72 3d 22 c4 d6 dc 22 20 2f 3e 0a | attr="ÄÖÜ" />.|
0000003f

If I convert the encoding via xmllint, the attr attribute contains garbage:

$ ./xmllint -format -encode ISO-8859-1 t2.xml | hexdump -C
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 49 53 |.0" encoding="IS|
00000020 4f 2d 38 38 35 39 2d 31 22 3f 3e 0a 3c 74 65 73 |O-8859-1"?>.<tes|
00000030 74 20 61 74 74 72 3d 22 a2 5c 21 22 2f 3e 0a |t attr="¢\!"/>.|
0000003f

The problem is that libxml2 converts a larger part of the XML file via a generic EBCDIC encoding while reading the <?xml tag. After switching to the correct encoding from the <?xml tag, it keeps some data in the generic EBCDIC encoding and does not reconvert it using the correct encoding.

For reference, the hexdump of the attachment:

$ hexdump -C t2.xml
00000000 4c 6f a7 94 93 40 a5 85 99 a2 89 96 95 7e 7f f1 |Lo§..@¥..¢...~.ñ|
00000010 4b f0 7f 40 85 95 83 96 84 89 95 87 7e 7f c9 c2 |Kð.@........~.ÉÂ|
00000020 d4 60 f1 f1 f4 f1 7f 40 6f 6e 25 4c a3 85 a2 a3 |Ô`ññôñ.@on%L£.¢£|
00000030 40 81 a3 a3 99 7e 7f 4a e0 5a 7f 40 61 6e 25 |@.££.~.JàZ.@an%|
0000003f
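The two hexdumps above differ only in how the three attribute bytes 4a e0 5a come out in ISO-8859-1. A minimal sketch of that divergence follows. The mappings are copied from the dumps and cover just these three bytes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ISO-8859-1 result for an attribute byte when it is converted with the
 * generic EBCDIC handler (values taken from the xmllint output above).
 * Returns 0 for bytes this sketch does not cover. */
unsigned char latin1_via_generic_ebcdic(unsigned char b)
{
    switch (b) {
    case 0x4a: return 0xa2; /* cent sign */
    case 0xe0: return 0x5c; /* backslash */
    case 0x5a: return 0x21; /* exclamation mark */
    default:   return 0;
    }
}

/* ISO-8859-1 result for the same byte when it is converted as IBM-1141
 * (values taken from the iconv output above). */
unsigned char latin1_via_ibm1141(unsigned char b)
{
    switch (b) {
    case 0x4a: return 0xc4; /* A with diaeresis */
    case 0xe0: return 0xd6; /* O with diaeresis */
    case 0x5a: return 0xdc; /* U with diaeresis */
    default:   return 0;
    }
}

/* Apply one of the mappings to n bytes. */
void convert_bytes(const unsigned char *in, size_t n, unsigned char *out,
                   unsigned char (*map)(unsigned char))
{
    size_t i;

    assert(in != NULL && out != NULL && map != NULL);
    for (i = 0; i < n; i++)
        out[i] = map(in[i]);
}
```

Running both mappings over 4a e0 5a reproduces c4 d6 dc (the iconv result) and a2 5c 21 (the xmllint result), so the garbage is what one gets when the attribute is still decoded with the provisional handler.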
Okay, you're exhibiting a pathological worst case where the initial autodetected encoding is not fully compatible with the declared one, and where the conflict shows up at the very beginning of the document. I had to change quite a bit to avoid having the initial encoder convert more than the first line. This exposed an actual bug in the regression suite, and fixing the problem for the push parser mode was even more crazy. But this should all be sorted out now. I added your test to the regression suite, thanks for following up!

Fix in git head:

paphio:~/XML -> ./xmllint -format -encode UTF-8 t2.xml
<?xml version="1.0" encoding="UTF-8"?>
<test attr="ÄÖÜ"/>
paphio:~/XML -> ./xmllint --push -format -encode UTF-8 t2.xml
<?xml version="1.0" encoding="UTF-8"?>
<test attr="ÄÖÜ"/>
paphio:~/XML ->

thanks, Daniel
Created attachment 141796 long <?xml tag 1
Created attachment 141797 long <?xml tag 2
Created attachment 141798 short <?xml tag
Your 45 byte limit is still broken in some corner cases.

Let's start with tc2.xml (attachment "long <?xml tag 2"):

$ iconv -f IBM-1141 -t ISO-8859-15 < tc2.xml
<?xml version="1.0" encoding="EBCDIC-AT-DE" ?>
<test attr="äöü" />
$ ./xmllint -format --encode ISO-8859-15 tc2.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<test attr="äöü"/>

=> Everything works.

Let's move on to a slightly modified version, ta2.xml (attachment "long <?xml tag 1"):

$ iconv -f IBM-1141 -t ISO-8859-15 < ta2.xml
<?xml version="1.0" encoding="EBCDIC-AT-DE" ?>
<test attr="äöü" />
$ ./xmllint -format --encode ISO-8859-15 ta2.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<test attr="{¦}"/>

ta2.xml has only one additional blank in the XML tag (=> exceeding the 45 byte limit), which makes the correct decoding fail.

Finally, an example where the 45 bytes are too long: tb2.xml (attachment "short <?xml tag")

$ iconv -f CP273 -t ISO-8859-15 < tb2.xml
<?xml version="1.0" encoding="CP273"?>
<ätest attr="ÄÖÜ" />
$ ./xmllint -format --encode ISO-8859-15 tb2.xml
tb2.xml:2: parser error : StartTag: invalid element name
<{test attr="ÃÃÃ" />
^
tb2.xml:2: parser error : Extra content at the end of the document
<{test attr="ÃÃÃ" />

This shows that it is only broken on EBCDIC:

$ iconv -f CP273 -t ISO-8859-15 < tb2.xml > tb3.xml
$ sed -ri s/CP273/ISO-8859-15/g tb3.xml
$ ./xmllint -format --encode ISO-8859-15 tb3.xml
<?xml version="1.0" encoding="ISO-8859-15"?>
<ätest attr="ÄÖÜ"/>

So, huge improvement, but not perfect.
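Both failing cases come from the boundary being a fixed byte count rather than the end of the declaration. One possible alternative, sketched here purely as an illustration and not as what libxml2 does, is to scan the raw bytes for the EBCDIC form of "?>" and hand only that prefix to the provisional handler. The byte values 6f and 6e are taken from the t2.xml hexdump earlier in this report. That they are identical in every EBCDIC codepage of interest is an assumption, and ebcdic_decl_length is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the prefix ending with the EBCDIC "?>" (bytes 0x6f 0x6e),
 * searching at most `limit` bytes. Returns 0 if no terminator is found,
 * in which case the caller would need some other fallback. */
size_t ebcdic_decl_length(const unsigned char *buf, size_t len, size_t limit)
{
    size_t i;

    assert(buf != NULL);
    if (len > limit)
        len = limit;
    for (i = 0; i + 1 < len; i++) {
        if (buf[i] == 0x6f && buf[i + 1] == 0x6e)
            return i + 2;   /* include both terminator bytes */
    }
    return 0;
}
```

For t2.xml this returns 42, so neither a longer declaration nor a shorter one would move the boundary into the document body. The limit argument keeps the scan bounded when the input has no declaration at all.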
Yes, it's only EBCDIC. All other encodings are at least compatible with ASCII for the characters needed to decode the XMLDecl, or, like UTF-16 or UCS4, don't have incompatible variations. So this is an EBCDIC-only bug, and I think I fixed it. You will always be able to tweak the autodetection on the first line; my goal is to solve the problem for real use, not to spend a decade chasing nonexistent EBCDIC-related problems. So I won't spend more time on this unless it raises a real problem. I consider this resolved and fixed, and I may apply more patches if they are provided and don't break the normal case. Daniel