Bug 135006 – XHTML dump mode should be avoidable




Summary:	XHTML dump mode should be avoidable


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2004-02-20 23:34 UTC by Nicolas George
Modified:	2006-09-30 13:37 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Nicolas George 2004-02-20 23:34:19 UTC

When dumping an XML tree, if the doctype is detected as XHTML, a special
dump mode is used, in which namespace information is added, id attributes
are inserted alongside name attributes, and so on.

It should be possible to disable that feature.

Comment 1 John Fleck 2004-02-21 14:49:30 UTC

Some more information, please, per: http://www.xmlsoft.org/bugs.html

# Make sure you can reproduce the bug with xmllint or one of the test
programs found in source in the distribution.
# Please send the command showing the error as well as the input (as
an attachment)

I ask for these because we can only guess at the details of what
you're trying to accomplish and how you're attempting to do it. It
appears that the "--html" flag to xmllint may do what you're asking
for, but it's not clear whether you're using xmllint.

Comment 2 Nicolas George 2004-02-21 15:13:14 UTC

Consider the following XML file:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>Sample</title>
  </head>
  <body>
    <p id="foobar"><a name="foobar">This is an anchor.</a></p>
  </body>
</html>

According to validator.w3.org, this is a perfectly valid XHTML file.
If I feed it to xmllint *without* the --html or --htmlout flags,
xmllint transforms it by:
- adding an xmlns attribute to the html element;
- adding a meta element at the beginning of the head element;
- adding an id attribute to the a element.

Worst of all, the resulting XML is not valid, since the p and a
elements end up with the same id attribute value.

According to the source code (tree.c, around lines 7690 and 7773 in
today's CVS, look for is_xhtml), this transformation occurs at the
output stage (xmlDocDump), and is triggered by the fact that the
public identifier of the DTD is "-//W3C//DTD XHTML 1.0 Strict//EN" or
that the system identifier is
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd".

As far as I can see, this mode wants to enforce HTML compatibility
guidelines, but sometimes the programmer knows better.
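
The trigger described above can be sketched in a few lines of
standalone C. This is only an illustration of the check as described
in this comment, not the code from tree.c: the function name
looks_like_xhtml_strict is hypothetical, and only the two XHTML 1.0
Strict identifiers quoted above are compared.

```c
#include <stddef.h>
#include <string.h>

/* The two identifiers quoted in the report. */
static const char XHTML_STRICT_PUBLIC[] =
    "-//W3C//DTD XHTML 1.0 Strict//EN";
static const char XHTML_STRICT_SYSTEM[] =
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd";

/*
 * Hypothetical stand-in for the is_xhtml test: returns 1 when either
 * the public or the system identifier of the DTD matches XHTML 1.0
 * Strict, 0 otherwise. Either argument may be NULL.
 */
int looks_like_xhtml_strict(const char *public_id, const char *system_id)
{
    if (public_id != NULL && strcmp(public_id, XHTML_STRICT_PUBLIC) == 0)
        return 1;
    if (system_id != NULL && strcmp(system_id, XHTML_STRICT_SYSTEM) == 0)
        return 1;
    return 0;
}
```

Note that a match on either identifier is enough, which is why the
sample file above switches the dump mode even though it never asks
for it.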

Comment 3 André Klapper 2006-09-29 17:22:04 UTC

George, John, Daniel: any news on this? Can someone please update this bug report and its status? Thanks in advance.

Comment 4 Daniel Veillard 2006-09-30 13:37:17 UTC

Defaulted namespaces are systematically added back by libxml2;
that's normal. Relying on an external resource to be able to
find the real element type is just completely wrong, so sorry,
that part is NOTABUG.
The meta with the encoding follows the suggestion from
the spec:
   http://www.w3.org/TR/xhtml1/#C_9
W.r.t. id and name, this is also a suggestion from the spec:
   http://www.w3.org/TR/xhtml1/#C_8
The error occurs because the input document uses the same identifier
on *different* elements. The serialization result is invalid because
it is uncertain where the anchor 'foobar' should be placed: that
depends on how the user agent processes the input document. Basically,
the input document is already invalid in some way, as there are
potentially 2 different interpretations of #foobar depending on the
mime-type used to serve the document. It's wrong from the start.

That said, yes, a way to disable the XHTML-specific mode has been added
(with it, <br> is serialized as <br/> and not <br />, etc.). It
is available as the xmlSaveOption XML_SAVE_NO_XHTML when using the
xmlSave API:
   http://xmlsoft.org/html/libxml-xmlsave.html#xmlSaveOption

Daniel