GNOME Bugzilla – Bug 135006
XHTML dump mode should be avoidable
Last modified: 2006-09-30 13:37:17 UTC
When dumping an XML tree, if the doctype is detected as XHTML, a special dump mode is used, where namespace information is added, id attributes are inserted alongside name attributes, and so on. It should be possible to disable that feature.
Some more information, please, per http://www.xmlsoft.org/bugs.html:

# Make sure you can reproduce the bug with xmllint or one of the test programs found in source in the distribution.
# Please send the command showing the error as well as the input (as an attachment)

I ask for these because we can only guess at the details of what you're trying to accomplish and how you're attempting to do it. It appears that the "--html" flag to xmllint may do what you're asking for, but it's not clear whether you're using xmllint.
Consider the following XML file:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Sample</title>
</head>
<body>
<p id="foobar"><a name="foobar">This is an anchor.</a></p>
</body>
</html>

According to validator.w3.org, this is a perfectly valid XHTML file. If I feed it to xmllint *without* the --html or --htmlout flags, xmllint transforms it by:

- adding an xmlns attribute to the html element;
- adding a meta element at the beginning of the head element;
- adding an id attribute to the a element.

The worst part is that the resulting XML is not valid, since the p and a elements have the same id attribute.

According to the source code (tree.c, around lines 7690 and 7773 in today's CVS, look for is_xhtml), this transformation occurs at the output stage (xmlDocDump), and is triggered by the fact that the public identifier of the DTD is "-//W3C//DTD XHTML 1.0 Strict//EN" or that the system identifier is "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd". As far as I can see, this mode wants to enforce the HTML compatibility guidelines, but sometimes the programmer knows better.
George, John, Daniel, any news on this? Can someone please update this bug report and its status? Thanks in advance.
Defaulted namespaces are systematically added back by libxml2; that's normal. Relying on an external resource to be able to find the real element type is just completely wrong. Sorry, that is NOTABUG.

The meta with the encoding follows the suggestion from the spec: http://www.w3.org/TR/xhtml1/#C_9

W.r.t. id and name, this is also a suggestion from the spec: http://www.w3.org/TR/xhtml1/#C_8
The error occurs because the input document uses the same identifier on *different* elements. The invalid result from the serialization is due to the uncertainty about where the anchor 'foobar' should be placed, which depends on how the user agent processes the input document. Basically the input document is already invalid in some way, as there are potentially 2 different interpretations for #foobar depending on the mime-type used to serve the document; it's wrong from the start.

That said, yes, disabling the XHTML-specific mode has been added; with it, <br> is also serialized as <br/> and not <br />, etc. This is available as the xmlSaveOption XML_SAVE_NO_XHTML when using the xmlSave API: http://xmlsoft.org/html/libxml-xmlsave.html#xmlSaveOption

Daniel