Bug 60152 - libxml2-in-libxml1 parser is a bit ignorant
Status: RESOLVED FIXED
Product: dia
Classification: Other
Component: general
Version: CVS head
Hardware: Other Linux
Importance: Normal critical
Target Milestone: ---
Assigned To: Cyrille Chépélov
Cyrille Chépélov
Depends on:
Blocks:
Reported: 2001-09-06 22:48 UTC by Cyrille Chépélov
Modified: 2004-12-22 21:47 UTC
See Also:
GNOME target: ---
GNOME version: ---

Description Cyrille Chépélov 2001-09-06 22:48:29 UTC
... it doesn't know about a lot of encodings.

Basically, this works well for latin1 and UTF-8, and that's about it.

The problem is that when we write, we generate files in the "local"
encoding, using libxml1. Unfortunately, libxml1 escapes every byte with the
8th bit set (unless there's a way to disable this?), and it does so
incorrectly for UTF-8: it escapes each byte individually, which is plain
wrong. As a result, we can't just pipe the file through iconv() to make
sure it's UTF-8 data. We can't do the escapes ourselves either, because
libxml1 will escape the escapes.

I'm thinking about doing this in the write step (libxml1 code only):
   * Write into a temporary "XML" file, encoded in UTF-8 with Bad Exuberant
     Escapes (BEE); no gzipping at that point.
   * Read the file back and unescape everything that should be: basically,
     everything of the form &#NNN; where NNN > 0x80. (Fortunately, for the
     rare cases below 0x80, libxml1 generates entity names.)
   * Write the result into the final file, perhaps gzipping it in the process.
Uh. We can use xmlDocDumpMemory, handle the BEE-removal step in memory, and
avoid the temp file (but then we have to write/gzip ourselves).
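
The unescaping step described above could look roughly like the sketch below. It is an illustration only, not the code that went into dia: the function name unescape_high_bytes is hypothetical, it assumes the escapes are decimal (&#195; rather than &#xC3;), and it treats 0x80 itself as a high byte. The input is assumed to be a memory buffer of the kind xmlDocDumpMemory produces.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of dia or libxml): turn per-byte escapes
 * such as "&#195;&#169;" back into the raw bytes 0xC3 0xA9.
 * Assumption: escapes are decimal and at most three digits long.
 * Only values 128..255 are unescaped; anything else is copied unchanged.
 * Returns a malloc'd, NUL-terminated buffer (caller frees), or NULL. */
char *unescape_high_bytes(const char *in, size_t len, size_t *out_len)
{
    char *out = malloc(len + 1);    /* the output is never longer than the input */
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;

    while (i < len) {
        if (in[i] == '&' && i + 1 < len && in[i + 1] == '#') {
            size_t j = i + 2;
            unsigned int value = 0;
            int digits = 0;

            /* Parse up to three decimal digits after "&#". */
            while (j < len && digits < 3 && isdigit((unsigned char)in[j])) {
                value = value * 10 + (unsigned int)(in[j] - '0');
                j++;
                digits++;
            }
            /* Accept only a terminated reference to a single high byte. */
            if (digits > 0 && j < len && in[j] == ';' &&
                value >= 0x80 && value <= 0xFF) {
                out[o++] = (char)value;
                i = j + 1;
                continue;
            }
        }
        out[o++] = in[i++];         /* ordinary byte: copy through */
    }

    out[o] = '\0';
    if (out_len != NULL)
        *out_len = o;
    return out;
}
```

Because each escape shrinks from up to six characters to one byte, a buffer the size of the input is always large enough. References below 0x80 (for example &#38;) are left alone, so markup-significant characters stay escaped.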
We can't release with that (we'd remove !latin1 from dia's user base); I'll
cobbl^H^H^He up something along these lines tomorrow afternoon, European
time (14 to 16 hours from now).
Comment 1 Cyrille Chépélov 2001-09-19 04:58:10 UTC
done, and somewhat stabilised