GNOME Bugzilla – Bug 60152
libxml2-in-libxml1 parser is a bit ignorant
Last modified: 2004-12-22 21:47:04 UTC
... it doesn't know about a lot of encodings. Basically, it works well for latin1 and UTF-8, and that's about it.

The problem is that when we write, we generate "local" encoding files using libxml1. Unfortunately, libxml1 will escape any byte with the 8th bit set (unless there's a way to disable this?), and it does so erroneously in the case of UTF-8: it escapes each byte individually, which is plain wrong.

So now we can't just pipe the file through iconv() to make sure it's UTF-8 data. We can't do the escapes ourselves either, because libxml1 will escape the escapes.

I'm thinking about doing this in the write step (libxml1 code only):

* Write into a temporary "XML" file, encoded in UTF-8 with Bad Exuberant Escapes (no gzipping at that point).
* Read the file back and unescape everything that should be (basically, everything of the form &#NNN; where NNN > 0x80. Fortunately, for the rare case below 0x80, entity names are generated by libxml1).
* Write the result into the final file, perhaps gzipping it in the process.

Uh. We can use xmlDocDumpMemory, handle the BEE-removal step in memory, and avoid the temp file (but we have to write/gzip ourselves).

We can't release with this bug as it stands (we'd remove !latin1 from dia's user base); I'll cobbl^H^H^He up something along these lines tomorrow afternoon, European time (14 to 16 hours from now).
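
For illustration only, the unescape step described above could look roughly like the Python sketch below. It is not the code that went into dia (which is C), and the function name `unescape_high_bytes` is made up for this example. It assumes the escapes are decimal character references of the form &#NNN;, and it treats every value from 0x80 to 0xFF as an escaped byte, since 0x80 itself is a valid UTF-8 continuation byte. References below 0x80 are left alone.

```python
import re

# Matches a decimal character reference such as b"&#195;".
# Assumption: the per-byte escapes are decimal, as the "&#NNN;" form suggests.
_CHAR_REF = re.compile(rb"&#([0-9]+);")


def unescape_high_bytes(data: bytes) -> bytes:
    """Turn per-byte escapes (&#128; to &#255;) back into raw bytes.

    Hypothetical helper: every reference whose value fits in a single
    byte with the 8th bit set is replaced by that byte. Anything else
    (for example &#38;) is left untouched.
    """
    def replace(match):
        value = int(match.group(1))
        if 0x80 <= value <= 0xFF:
            return bytes([value])
        return match.group(0)

    return _CHAR_REF.sub(replace, data)


# "é" is 0xC3 0xA9 in UTF-8, so escaping each byte gives "&#195;&#169;".
escaped = b"<name>caf&#195;&#169; &#38; th&#195;&#169;</name>"
fixed = unescape_high_bytes(escaped)
text = fixed.decode("utf-8")  # raises UnicodeDecodeError if the bytes are not UTF-8
```

Decoding the result as UTF-8 doubles as a sanity check: if it fails, the input probably held escapes that were not produced one byte at a time.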
done, and somewhat stabilised