Bug 101415 - odd characters are inserted when gnumeric 1.0.11 reads non-ASCII characters in XML
Status: VERIFIED INCOMPLETE
Product: Gnumeric
Classification: Applications
Component: General
Version: 1.0.x
Hardware: Other Linux
Importance: Normal major
Target Milestone: ---
Assigned To: Jody Goldberg
Jody Goldberg
Duplicates: 101663 101990
Depends on:
Blocks:
 
 
Reported: 2002-12-17 06:34 UTC by Masaki Shinomiya
Modified: 2009-08-15 18:40 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
test data (1.41 KB, application/x-gzip)
2002-12-17 06:41 UTC, Masaki Shinomiya

Description Masaki Shinomiya 2002-12-17 06:34:41 UTC
The release notes for gnumeric 1.0.11 say that it fixes importing XML
files with non-ASCII characters.
Indeed, the 4096-byte problem disappears in this version.
However, it inserts extra characters when it imports XML files with non-ASCII
characters.
As a result, multi-byte characters are transformed into nonsense characters
when we use this version.

The test data is:
http:/shino.pos.to/linux/section.xml.gz

It contains '§' as the entry of cell A1.
A section mark is expected, and gnumeric 1.0.10 shows one.
But gnumeric 1.0.11 adds another character and turns it into 'Â§'
('A' with hat and section mark).
Comment 1 Masaki Shinomiya 2002-12-17 06:41:21 UTC
Created attachment 13057
test data
Comment 2 Jody Goldberg 2002-12-18 04:01:03 UTC
Daniel, I enabled the use of the xml2 parser in gnumeric 1.0.11. Now, instead of
dropping things, we're seeing this. Is this known or fixable in libxml1?
Comment 3 Masaki Shinomiya 2002-12-18 08:21:27 UTC
I am using gnumeric-1.0.11 with libxml-1.8.16.
libxml seems to convert ASCII to UTF-8 by adding 194 as the upper
byte if the character is > 127.
This conversion is wrong when the input encoding is not ASCII.
Even so, it seems to be not a bug but a 'specification' of the new XML
parser in libxml.

Stripping off the upper bytes in gnumeric when it uses the new XML
parser could be a practical solution, I suppose.
Comment 4 Daniel Veillard 2002-12-18 12:16:32 UTC
The only real solution is to move to UTF8 all the way and
libxml2. Sorry, there comes a time when piling up patches
around a broken design just is not possible anymore...

Daniel
Comment 5 Morten Welinder 2002-12-20 15:14:13 UTC
*** Bug 101663 has been marked as a duplicate of this bug. ***
Comment 6 Masaki Shinomiya 2002-12-21 05:27:53 UTC
Looking at the parser.c source code of libxml,
'COPY_BUF' uses the function xmlCopyChar,
which calls 'xmlCopyCharMultiByte' when the value is greater than 0x7F.
The function xmlCopyCharMultiByte then converts the value into UTF-8.

This procedure may be a bug in the new parser in libxml.
The conversion produces nonsense strings when the input encoding is not
iso-8859-1,
even when it parses legal XML that uses UTF-8.

The following patch against libxml-1.8.16 seems to work fine with gnumeric-1.0.11.
--- parser.org	Fri Sep 14 23:09:41 2001
+++ parser.c	Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if true    /* suppress converting */
     if  (val >= 0x80) {
 	return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }

The patch does not affect the old parser, which libxml uses
by default.
Comment 7 Masaki Shinomiya 2002-12-21 14:39:19 UTC
I mean...

+++ parser.c	Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if true    /* suppress converting */
     if  (val >= 0x80) {
 	return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }
Comment 8 Masaki Shinomiya 2002-12-21 14:40:19 UTC
I mean...#if 0

+++ parser.c	Sat Dec 21 13:22:00 2002
@@ -1126,9 +1126,11 @@
 int
 xmlCopyChar(int len, xmlChar *out, int val) {
     /* the len parameter is ignored */
+#if 0    /* suppress converting */
     if  (val >= 0x80) {
 	return(xmlCopyCharMultiByte (out, val));
     }
+#endif
     *out = (xmlChar) val;
     return 1;
 }
Comment 9 Daniel Veillard 2002-12-21 14:52:52 UTC
This simply doesn't work. libxml1 is BROKEN w.r.t. I18N.
There is no clear definition of the encoding of
the internal representation. This was fixed 3 years ago when
it was branched to libxml2. The internal encoding in the new framework
is UTF8. Again, you can try to patch over and over and over
on top of a broken design. This DOES NOT FIX THE PROBLEM.
The only RELIABLE solution is to switch to libxml2.
I will not apply patches to libxml1. I will not release a new
libxml1. That branch is dead, move over!

Daniel
Comment 10 Jody Goldberg 2002-12-21 17:16:51 UTC
Gnumeric 1.0.x cannot move.  Making it utf8 clean internally is not really an
option.  We have switched back to the libxml1 parser for 1.0.12 and will take a
different kludgy workaround for the problems seen there.  Rather than reading
4k blocks, we now read 1M blocks.  That is obviously a vile putrid hack, but it
will solve the vast majority of the issues.  Hopefully 1.2, which uses libxml2,
will be available soon.
Comment 11 Daniel Veillard 2002-12-21 23:47:39 UTC
Okay, please understand that I think there is no good technical
solution to the problem, considering libxml1's serious deficiencies.

  sorry about that,

Daniel
Comment 12 Masaki Shinomiya 2002-12-22 00:51:15 UTC
I wonder why Daniel says the patch above does not work.
UTF-8 is not required by Gnumeric-1.0.x.

I made another patch against gnumeric-1.0.11.
It worked fine with the patched libxml.

diff -ur src.org/xml-io.c src/xml-io.c
--- src.org/xml-io.c	2002-12-03 05:34:58.000000000 +0900
+++ src/xml-io.c	2002-12-22 00:17:47.000000000 +0900
@@ -3357,6 +3357,8 @@
 	XmlParseContext *ctxt;
 	GnumericXMLVersion    version;
 	gboolean xml_parser_flag;
+	int xmlCopyChar(int len, xmlChar *out, int val);
+	xmlChar workbuf[3];
 
 	g_return_if_fail (filename != NULL);
 
@@ -3391,6 +3393,13 @@
 	pctxt = xmlCreatePushParserCtxt (NULL, NULL, buffer, bytes, filename);
 
 	xml_parser_flag = xmlUseNewParser (TRUE);
+	/* see how the new parser works  */
+	workbuf[1] = NULL;
+	xmlCopyChar ( 1, &workbuf[0], 0x80 );
+	if ( workbuf[1] != NULL ) {
+	/* if it converts then use the old parser */
+		xmlUseNewParser (FALSE);
+	} 
 	while ((bytes = gzread (f, buffer, XML_INPUT_BUFFER_SIZE)) > 0) {
 		xmlParseChunk (pctxt, buffer, bytes, 0);
 		value_io_progress_update (context, lseek (fd, 0, SEEK_CUR));
Comment 13 Jody Goldberg 2002-12-22 05:04:51 UTC
That patch is not a good idea.
The xml2 parser assumes that input is coming in as utf8.
gnumeric-1.0 is not shipping in utf8, nor can it be forced to do so without
great pain, which is inappropriate for the stable 1.0 series.

Reverting to the xml1 parser brings back the problems of dropping escaped
characters on buffer boundaries.  The solution we're adopting is to use much
bigger buffers.  It does not solve the problem, but it makes it much much less
likely.

This will have to suffice until gnumeric-1.2 is ready in the next few months.
Comment 14 Morten Welinder 2002-12-26 15:23:16 UTC
*** Bug 101990 has been marked as a duplicate of this bug. ***