GNOME Bugzilla – Bug 350208
Unnecessary character reference generation when encoding is specified in the contents only
Last modified: 2006-10-20 15:49:26 UTC
Consider the following XML file encoded in UTF-8:

  <?xml version="1.0"?>
  <root>abcdéè</root>

When running xmllint (with no options) on this file, I get:

  <?xml version="1.0"?>
  <root>abcd&#xE9;&#xE8;</root>

instead of getting the same file back (using the UTF-8 encoding). It seems that libxml2 regards the file as encoded in ASCII, though it is really encoded in UTF-8 here.

This problem also leads to an inconsistency in Perl with XML::LibXML and the toString method: when toString is applied to the document, the encoding is ASCII, as above, but when it is applied to the root node with the docencoding flag set, the encoding is UTF-8, though in both cases the XML::LibXML documentation says that the document encoding is used.
Note that these are not entities but character references, which is very different in practice: a character reference is strictly identical to the character at that code point.

Libxml2 doesn't 'regard' the file as encoded in ASCII; it saves it in the most compatible encoding possible, i.e. the one least likely to generate incompatibilities, for example if you feed the output to an HTML parser which won't look at the XML rules but may use the (usually) completely random Content-Type encoding provided by the HTTP server. Saving that way avoids a full class of errors in cases where documents are saved in those mixed contexts.

So yes, purely from the viewpoint of the XML spec, changing back to pure UTF-8 generation may look sensible. However, still purely from an XML perspective, the generated document is exactly equivalent; the difference only shows up in 'other' kinds of mixed environments, and I'm a bit afraid that changing it would generate obscure problems.

What you can do is call xmlSaveSetEscape(ctxt, NULL) on the xmlSaveCtxt used by the Perl bindings' toString method. That way you have complete control. That sounds like a way around your problem that is available with the currently published APIs.

Daniel
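The equivalence claimed above (a character reference and the raw character are the same content, only the serialization differs) can be checked outside libxml2. The following is a minimal sketch using Python's standard xml.etree.ElementTree, chosen here purely for illustration; it is not libxml2, xmllint, or the Perl bindings, and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# The same document written two ways: raw UTF-8 bytes vs. character references.
raw_utf8 = "<root>abcd\u00e9\u00e8</root>".encode("utf-8")
char_refs = b"<root>abcd&#xE9;&#xE8;</root>"

# Both inputs parse to the same text content.
text_a = ET.fromstring(raw_utf8).text
text_b = ET.fromstring(char_refs).text
same_content = text_a == text_b

# Serializing the same tree to ASCII forces character references,
# while serializing to UTF-8 keeps the characters as raw bytes.
root = ET.fromstring(raw_utf8)
as_ascii = ET.tostring(root, encoding="us-ascii")
as_utf8 = ET.tostring(root, encoding="utf-8")

print(same_content)
print(as_ascii)
print(as_utf8)
```

ElementTree writes the references in decimal form, so the exact bytes differ from the xmllint output in the report, but the principle is the same: the ASCII output is pure 7-bit and safe in mixed contexts, and it parses back to identical content.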
I agree with your viewpoint, but there's still an inconsistency in xmllint (this may be a lack of documentation, though). I've corrected the summary line.

I don't know how encodings are specified internally; perhaps the problem is at the application level (e.g. xmllint and my Perl program) and not in the library itself. Indeed, you're saying:

  "Libxml2 doesn't 'regard' the file as encoded in ASCII, it saves it in the most compatible encoding possible"

But on the following file:

  <?xml version="1.0" encoding="utf-8"?>
  <root>abcdéè</root>

xmllint uses the utf-8 encoding in the output:

  <?xml version="1.0" encoding="utf-8"?>
  <root>abcdéè</root>

whereas ASCII is the most compatible encoding possible. So, from your viewpoint, there's a bug. If the goal of xmllint is to keep the same encoding as the input file, then there's a bug too (with the example I gave in the bug report, as the encoding is specified by the contents of the file). I don't know if the current behavior of xmllint is what you expect, but in any case, the xmllint man page should specify which encoding is used.

Moreover, if you assume that the current behavior of xmllint is correct, how about an option to preserve the encoding (which may be given either by encoding="..." in the XML declaration or by the contents of the XML file), e.g. with "--encode preserve"? If there are several possible encodings[*], then the most compatible one could be chosen.

[*] For instance with the following file:

  <?xml version="1.0"?>
  <root>abcdéè</root>
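To make the "preserve" suggestion concrete, here is a rough sketch of what such a mode could do: take the encoding from the XML declaration when there is one, fall back to UTF-8 otherwise (the XML default when there is neither a declaration nor a byte order mark), and re-serialize in that encoding. This is only an illustration in Python's standard library; `input_encoding` and `reserialize_preserving` are hypothetical helpers, not anything xmllint or libxml2 provides, and BOM and UTF-16 detection are deliberately left out.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of a leading XML declaration, if any.
_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def input_encoding(data: bytes) -> str:
    """Return the declared encoding, or UTF-8 when none is declared.

    Simplification: byte order marks and UTF-16/UTF-32 are ignored.
    """
    match = _DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else "utf-8"


def reserialize_preserving(data: bytes) -> bytes:
    """Parse the document and write it back in its own input encoding."""
    encoding = input_encoding(data)
    root = ET.fromstring(data)
    return ET.tostring(root, encoding=encoding)


if __name__ == "__main__":
    # No encoding declared: treated as UTF-8, non-ASCII characters stay raw.
    doc = '<?xml version="1.0"?><root>abcd\u00e9\u00e8</root>'.encode("utf-8")
    print(input_encoding(doc))
    print(reserialize_preserving(doc))
```

With this approach the file from the original report would come back with its non-ASCII characters as raw UTF-8 bytes rather than as character references.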
That's the intended behaviour. If there is an encoding declaration then it's relatively clear and there is no real doubt. The problem is more when the document is just

  <root>abcdéè</root>

without any XML declaration: then it's likely to be in such a mixed context, possibly coming from HTML or going to HTML. Simply adding an encoding declaration is really the cleanest way.

W.r.t. xmllint, I'm not sure the feature is really useful. At the API level it's worth discussing the default behaviour; your point was about an API issue in the Perl bindings, and, fine, there is a workaround. xmllint is a tool which covers some of the capabilities of libxml2 but not everything. There is already an awful lot of options, and I don't really see the use case for what you're suggesting.

Daniel
(In reply to comment #3)
> Simply put an encoding it's really the cleanest way.

I'm not the author of all these XML files (e.g., Firefox's session.rdf doesn't have a declared encoding, though in practice, it is utf-8).

> w.r.t. xmllint, I'm not sure the feature is really useful.

Perhaps. I don't remember when I needed it, and it could have been a hack. In fact, the main problem is that I find the current behavior unintuitive (and insufficiently documented), but this is a personal opinion.
> I'm not the author of all these XML files (e.g., Firefox's session.rdf doesn't
> have a declared encoding, though in practice, it is utf-8).

Well, in that case using --encode UTF-8 just makes sense and you get the expected output; at least that works in my CVS checkout.

Daniel