GNOME Bugzilla – Bug 101415
odd characters are inserted when gnumeric 1.0.11 reads non-ascii characters in xml
Last modified: 2009-08-15 18:40:50 UTC
The release notes for gnumeric 1.0.11 say that it fixes importing XML files with non-ASCII characters. Indeed, the 4096-byte problem disappears in this version. However, it now inserts extra characters when it imports XML files with non-ASCII characters. As a result, multi-byte characters are turned into nonsense characters in this version. The test data is: http:/shino.pos.to/linux/section.xml.gz It contains '§' as the entry in cell A1. A section mark is expected, and gnumeric 1.0.10 shows one. But gnumeric 1.0.11 adds another character and turns it into 'Â§' ('A' with a hat, followed by a section mark).
Created attachment 13057: test data
Daniel, I enabled the use of the xml2 parser in gnumeric 1.0.11. Now, instead of dropping things, we're seeing this. Is this known or fixable in libxml1?
I am using gnumeric-1.0.11 with libxml-1.8.16. libxml seems to convert ASCII to UTF-8 by adding 194 as the upper byte if the character is > 127. This conversion is not right when the input encoding is non-ASCII. Even so, it seems to be not a bug but a 'specification' of the new XML parser in libxml. Stripping off the upper bytes in gnumeric when it uses the new XML parser could be a practical solution, I suppose.
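The effect described in this comment can be reproduced without libxml. The following is a minimal sketch, assuming each input byte above 127 is treated as an ISO-8859-1 code point and re-encoded as two-byte UTF-8. The helper name latin1_to_utf8 is hypothetical and is not part of libxml.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not libxml API): re-encode one byte, read as an
 * ISO-8859-1 code point, as UTF-8. Returns the number of bytes written. */
size_t latin1_to_utf8(unsigned char in, unsigned char out[2])
{
    assert(out != NULL);
    if (in < 0x80) {
        out[0] = in;                                 /* ASCII is unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (in >> 6));      /* lead byte: 0xC2 or 0xC3 */
    out[1] = (unsigned char)(0x80 | (in & 0x3F));    /* continuation byte */
    return 2;
}
```

For the section mark (0xA7) this yields the two bytes 0xC2 0xA7, that is, 194 followed by the original byte. A program that still reads its strings as ISO-8859-1 shows those two bytes as 'Â§'. Note that the lead byte is 194 (0xC2) only for inputs 0x80 to 0xBF; inputs 0xC0 to 0xFF get 195 (0xC3).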
The only real solution is to move to UTF8 all the way and libxml2. Sorry, there is a time where piling up patches around a broken design just is not possible anymore... Daniel
*** Bug 101663 has been marked as a duplicate of this bug. ***
Looking at the parser.c source code of libxml, 'COPY_BUF' uses the function xmlCopyChar, which calls 'xmlCopyCharMultiByte' when the value is greater than 0x7F. xmlCopyCharMultiByte then converts the value into UTF-8. This procedure may be a bug in the new parser in libxml. The conversion produces nonsense strings when the input encoding is not iso-8859-1, even when it parses legal XML using UTF-8. The following patch against libxml-1.8.16 seems to work fine with gnumeric-1.0.11.

--- parser.org Fri Sep 14 23:09:41 2001
+++ parser.c Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if true /* suppress converting */
     if (val >= 0x80) {
         return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }

The patch does not affect the old parser, which is the default in libxml.
I mean...

+++ parser.c Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if true /* suppress converting */
     if (val >= 0x80) {
         return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }
I mean... #if 0

+++ parser.c Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if 0 /* suppress converting */
     if (val >= 0x80) {
         return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }
This simply doesn't work. libxml1 is BROKEN w.r.t. I18N. There is no clear definition of the encoding of the internal representation. This was fixed 3 years ago when libxml2 was branched. The internal encoding in the new framework is UTF8. Again, you can try to patch over and over and over on top of a broken design. This DOES NOT FIX THE PROBLEM. The only RELIABLE solution is to switch to libxml2. I will not apply patches to libxml1. I will not rerelease a new libxml1. That branch is dead, move over! Daniel
Gnumeric 1.0.x cannot move. Making it utf8 clean internally is not really an option. We have switched back to the libxml1 parser for 1.0.12 and will use a different kludgy workaround for the problems seen there. Rather than reading 4k blocks, we now read 1M blocks. That is obviously a vile putrid hack, but it will solve the vast majority of the issues. Hopefully 1.2, which uses libxml2, will be available soon.
Okay, please understand that I think there is no good technical solution to the problem, considering libxml1's serious deficiencies. Sorry about that, Daniel
I wonder why Daniel says the patch above does not work. UTF-8 is not required by Gnumeric-1.0.x. I made another patch against gnumeric-1.0.11. It worked fine with the patched libxml.

diff -ur src.org/xml-io.c src/xml-io.c
--- src.org/xml-io.c 2002-12-03 05:34:58.000000000 +0900
+++ src/xml-io.c 2002-12-22 00:17:47.000000000 +0900
@@ -3357,6 +3357,8 @@
 	XmlParseContext *ctxt;
 	GnumericXMLVersion version;
 	gboolean xml_parser_flag;
+	int xmlCopyChar(int len, xmlChar *out, int val);
+	xmlChar workbuf[3];
 
 	g_return_if_fail (filename != NULL);
 
@@ -3391,6 +3393,13 @@
 
 	pctxt = xmlCreatePushParserCtxt (NULL, NULL, buffer, bytes, filename);
 	xml_parser_flag = xmlUseNewParser (TRUE);
+	/* see how the new parser works */
+	workbuf[1] = NULL;
+	xmlCopyChar ( 1, &workbuf[0], 0x80 );
+	if ( workbuf[1] != NULL ) {
+		/* if it converts then use the old parser */
+		xmlUseNewParser (FALSE);
+	}
 	while ((bytes = gzread (f, buffer, XML_INPUT_BUFFER_SIZE)) > 0) {
 		xmlParseChunk (pctxt, buffer, bytes, 0);
 		value_io_progress_update (context, lseek (fd, 0, SEEK_CUR));
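The run-time probe in this patch can be illustrated in isolation. This is only a sketch of the idea and does not call libxml. The function pointer type mirrors the xmlCopyChar declaration quoted in the patch, with unsigned char standing in for xmlChar, and copy_raw_standin and copy_utf8_standin are hypothetical stand-ins for an unpatched and a patched library.

```c
#include <stddef.h>

/* Same shape as the xmlCopyChar declaration quoted in the patch. */
typedef int (*copy_char_fn)(int len, unsigned char *out, int val);

/* Probe: copy 0x80 into a zeroed buffer and report whether a second
 * byte was written, i.e. whether the function expanded the value. */
int copy_expands_high_bytes(copy_char_fn copy)
{
    unsigned char workbuf[3] = {0, 0, 0};
    copy(1, workbuf, 0x80);
    return workbuf[1] != 0;
}

/* Hypothetical stand-in: copies the value as a single byte. */
int copy_raw_standin(int len, unsigned char *out, int val)
{
    (void)len;
    *out = (unsigned char)val;
    return 1;
}

/* Hypothetical stand-in: emits two-byte UTF-8 for 0x80..0x7FF. */
int copy_utf8_standin(int len, unsigned char *out, int val)
{
    (void)len;
    if (val >= 0x80) {
        out[0] = (unsigned char)(0xC0 | (val >> 6));
        out[1] = (unsigned char)(0x80 | (val & 0x3F));
        return 2;
    }
    *out = (unsigned char)val;
    return 1;
}
```

The probe reports an expansion for copy_utf8_standin and none for copy_raw_standin, which is the distinction the patch uses to decide whether to fall back to the old parser.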
That patch is not a good idea. The xml2 parser assumes that input is coming in as utf8. gnumeric-1.0 is not shipping in utf8, nor can it be forced to do so without great pain, which is inappropriate for the stable 1.0 series. Reverting to the xml1 parser brings back the problem of dropping escaped characters on buffer boundaries. The solution we're adopting is to use much bigger buffers. It does not solve the problem, but it makes it much, much less likely. This will have to suffice until gnumeric-1.2 is ready in the next few months.
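A rough way to see why bigger buffers help is to count how many buffer boundaries fall inside the data being read. The sketch below is plain arithmetic and not gnumeric code. The function name internal_boundaries is hypothetical, and it assumes "4k" means 4096 bytes and "1M" means 1048576 bytes.

```c
#include <stddef.h>

/* Number of chunk boundaries that fall strictly inside a stream of
 * `total` bytes when it is read in chunks of `chunk` bytes. */
size_t internal_boundaries(size_t total, size_t chunk)
{
    if (total == 0 || chunk == 0)
        return 0;
    return (total - 1) / chunk;
}
```

Under those assumptions, a 10 MiB uncompressed stream has 2559 internal boundaries with 4096-byte reads and 9 with 1 MiB reads, and any stream of 1 MiB or less has none. An escaped character can still straddle one of the remaining boundaries, which is why this reduces the problem without removing it.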
*** Bug 101990 has been marked as a duplicate of this bug. ***