Bug 350208 – Unnecessary character reference generation when encoding is specified in the contents only





Status:	RESOLVED WONTFIX

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.x
Hardware:	Other Linux

Importance:	Normal minor
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

Reported:	2006-08-07 00:08 UTC by Vincent Lefevre
Modified:	2006-10-20 15:49 UTC

Description Vincent Lefevre 2006-08-07 00:08:39 UTC

Consider the following XML file encoded in UTF-8:

<?xml version="1.0"?>
<root>abcdéè</root>

When running xmllint (with no options) on this file, I get:

<?xml version="1.0"?>
<root>abcd&#xE9;&#xE8;</root>

instead of getting the same file (using the UTF-8 encoding).

It seems that libxml2 regards the file as encoded in ASCII, though it is really encoded in UTF-8 here. This problem also leads to an inconsistency in Perl with XML::LibXML and the toString method (when toString is applied on the document, the encoding is ASCII, as above, but when it is applied on the root node with the docencoding flag set, the encoding is UTF-8, though in both cases the XML::LibXML documentation says that the document encoding is used).
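The escaping seen in the output above can be imitated in a few lines. This is only an illustrative sketch in Python, not libxml2's implementation: the helper name escape_non_ascii is hypothetical, and it simply replaces every non-ASCII character with a hexadecimal character reference, which is what an ASCII-only serialization has to do.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )


# "abcd" followed by U+00E9 (e acute) and U+00E8 (e grave), as in the report.
original = "abcd\u00e9\u00e8"
escaped = escape_non_ascii(original)
print(escaped)  # abcd&#xE9;&#xE8;
```

ASCII characters pass through unchanged, so the result is a pure-ASCII string that any XML parser reads back as the original text.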

Comment 1 Daniel Veillard 2006-10-20 13:10:45 UTC

Note that these are not entities but character references, which are very
different in practice: they are really identical to the character at that code point.
Libxml2 doesn't 'regard' the file as encoded in ASCII; it saves it in
the most compatible encoding possible, i.e. the one which is the least
likely to generate incompatibilities, for example if you feed it to an
HTML parser which won't look at the XML rules but may use the (usually)
completely random Content-Type encoding provided by the HTTP server.
Saving that way avoids a whole class of errors in cases where documents
are saved in those mixed contexts.
So yes, purely from the viewpoint of the XML spec, changing back to pure UTF-8
generation may look sensible. However, still purely from an XML perspective,
the generated document is exactly equivalent; the difference shows up
in 'other' kinds of mixed environments. And I'm a bit afraid that changing
that will generate obscure problems.
What you can do is use xmlSaveSetEscape(ctxt, NULL) on the xmlSaveCtxt used
by the Perl bindings' toString method. That way you have complete control.
That sounds like a way around your problem that is available with the currently published APIs.

Daniel
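The equivalence claimed in this comment (a character reference denotes exactly the same character as the raw code point) can be checked with any conforming XML parser. The sketch below uses Python's standard-library xml.etree.ElementTree rather than libxml2; that substitution is an assumption made for the sake of a self-contained example, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same document written two ways: raw UTF-8 bytes vs. character references.
raw_utf8 = "<root>abcd\u00e9\u00e8</root>".encode("utf-8")
with_refs = b"<root>abcd&#xE9;&#xE8;</root>"

text_from_utf8 = ET.fromstring(raw_utf8).text
text_from_refs = ET.fromstring(with_refs).text

# Both parses yield the identical sequence of code points.
print(text_from_utf8 == text_from_refs)
```

After parsing, nothing in the tree records which of the two spellings the input used, which is why the serializer is free to choose either one on output.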

Comment 2 Vincent Lefevre 2006-10-20 14:28:10 UTC

I agree with your viewpoint, but there's still an inconsistency in xmllint (this may be a lack of documentation, though). I've corrected the summary line. I don't know how encodings are specified internally; perhaps the problem is at the application level (e.g. xmllint and my Perl program) and not in the library itself.

Indeed, you're saying:

  "Libxml2 doesn't 'regard' the file as encoded in ASCII, it saves it in
   the most compatible encoding possible"

But on the following file:

<?xml version="1.0" encoding="utf-8"?>
<root>abcdéè</root>

xmllint uses the utf-8 encoding in the output:

<?xml version="1.0" encoding="utf-8"?>
<root>abcdéè</root>

whereas ASCII is the most compatible encoding possible. So, from your viewpoint, there's a bug. If the goal of xmllint is to keep the same encoding as the input file, then there's a bug too (with the example I gave in the bug report, where the encoding is determined by the contents of the file).

I don't know if the current behavior of xmllint is what you expect, but in any case, the xmllint man page should specify which encoding is used.

Moreover, if you assume that the current behavior of xmllint is correct, how about an option to preserve the encoding (that may be given either by encoding="..." in the XML declaration or by the contents of the XML file), e.g. with "--encode preserve"? If there are several possible encodings[*], then the most compatible one could be chosen.

[*] For instance with the following file:

<?xml version="1.0"?>
<root>abcd&#xE9;&#xE8;</root>

Comment 3 Daniel Veillard 2006-10-20 14:39:10 UTC

That's the intended behaviour. If there is an encoding declaration then
it's relatively clear, there is no real doubt.

The problem is more if the document is

<root>abcdéè</root>

without any XML declaration; then it's likely to be in such a mixed context,
possibly coming from HTML or going to HTML. Simply put an encoding:
it's really the cleanest way.

w.r.t. xmllint, I'm not sure the feature is really useful.
At the API level it's worth discussing the default behaviour;
your point was about an API issue in the Perl bindings. Fine, there
is a workaround. xmllint is a tool which covers some of the capabilities
of libxml2 but not everything. There is already an awful lot of
options, and I don't really see the use case for what you're suggesting.

Daniel

Comment 4 Vincent Lefevre 2006-10-20 15:13:09 UTC

(In reply to comment #3)
> Simply put an encoding it's really the cleanest way.

I'm not the author of all these XML files (e.g., Firefox's session.rdf doesn't have a declared encoding, though in practice, it is utf-8).

> w.r.t. xmllint, I'm not sure the feature is really useful.

Perhaps. I don't remember when I needed it, and it could have been a hack. In fact, the main problem is that I find the current behavior unintuitive (and insufficiently documented), but this is a personal opinion.

Comment 5 Daniel Veillard 2006-10-20 15:49:26 UTC

> I'm not the author of all these XML files (e.g., Firefox's session.rdf doesn't
> have a declared encoding, though in practice, it is utf-8).

  Well, in that case using --encode UTF-8 just makes sense and you get
the expected output; at least that works on my CVS checkout.

Daniel
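The effect of naming the output encoding explicitly can be sketched outside xmllint as well. The example below uses Python's standard-library xml.etree.ElementTree, which is an assumption made for illustration and not the libxml2 code path; its serializer also defaults to ASCII output, so it shows the same contrast between the default and an explicit UTF-8 request.

```python
import xml.etree.ElementTree as ET

# A document with no XML declaration, parsed from UTF-8 bytes.
root = ET.fromstring("<root>abcd\u00e9\u00e8</root>".encode("utf-8"))

# Default serialization is ASCII-only, so non-ASCII text is written as
# (decimal) character references.
ascii_bytes = ET.tostring(root)

# Naming the output encoding keeps the characters as raw UTF-8 bytes.
utf8_bytes = ET.tostring(root, encoding="utf-8")

print(ascii_bytes)
print(utf8_bytes)
```

Both byte strings parse back to the same text; only the on-disk spelling differs, which is the distinction this thread is about.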