GNOME Bugzilla – Bug 345147
xsltproc doesn't honor disable-output-escaping in XHTML 1.0 style element
Last modified: 2006-08-10 12:06:44 UTC
Please describe the problem:
Although I'm using disable-output-escaping, xsltproc wraps the content of the style element in a CDATA section, which changes the meaning of the serialized file. Consider the following example:

  <?xml version="1.0"?>
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xml" encoding="iso-8859-1"
        doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
        doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
        indent="yes"/>
    <xsl:template match="/">
      <html>
        <head>
          <title>.</title>
          <style type="text/css">
            <xsl:text disable-output-escaping="yes">/*&lt;![CDATA[*/ body { } /*]]&gt;*/</xsl:text>
          </style>
        </head>
        <body><p>.</p></body>
      </html>
    </xsl:template>
  </xsl:stylesheet>

Sablotron and Xalan do it right, but not xsltproc.

Steps to reproduce:
1. Save the above XSLT file as test.xsl
2. Run "xsltproc test.xsl test.xsl"

Actual results:
I get:
  [...]
  <style type="text/css"><![CDATA[/*<![CDATA[*/ body { } /*]]]]><![CDATA[>*/]]></style>
  [...]

Expected results:
I should have got:
  <style type="text/css">/*<![CDATA[*/ body { } /*]]>*/</style>
This is what both Sablotron and Xalan give, and the generated file fully complies with the W3C specs.

Does this happen every time?
Yes.

Other information:
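The shape of the "Actual results" line can be reproduced with a small sketch. This is only an illustration of the usual way a serializer wraps arbitrary text in a CDATA section, not libxml2's actual code; the helper name wrap_in_cdata is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper, not libxml2 code)."""
    # ']]>' may not appear inside a CDATA section, so each occurrence is split:
    # the current section ends right after ']]' and a new one starts with '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Text node produced by the xsl:text element in the report.
style_text = "/*<![CDATA[*/ body { } /*]]>*/"
wrapped = wrap_in_cdata(style_text)
print(wrapped)

# An XML parser recovers the original text node from the wrapped form.
round_trip = ET.fromstring("<style>" + wrapped + "</style>").text
print(round_trip == style_text)
```

An XML parser sees the same text either way, which is why the wrapping is harmless for XML consumers; the complaint in this report is about consumers that read the serialized bytes as HTML.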
http://www.w3.org/TR/xhtml1/#h-4.8 Libxml2 follows the suggestion from the XHTML1 spec about the serialization of Script and Style elements. Not a bug, a feature. Daniel
http://www.w3.org/TR/xhtml1/#C_4 of interest too in your case: "Note that XML parsers are permitted to silently remove the contents of comments. Therefore, the historical practice of "hiding" scripts and style sheets within "comments" to make the documents backward compatible is likely to not work as expected in XML-based user agents." Daniel
You misread the specs: in any case, the XHTML1 serialization must not change the contents of text nodes. What the specs mean is that &lt; and <![CDATA[<]]> are equivalent, and that for compatibility with HTML4 the latter form should be used, enclosing the whole text node in a CDATA section. Now, when disable-output-escaping is used, the rules are special; in fact disable-output-escaping means that the normal XML rules should no longer be used. <xsl:text disable-output-escaping="yes">&lt;foo&gt;</xsl:text> should generate, once serialized, <foo>, while &lt;foo&gt; and <![CDATA[<foo>]]> are incorrect. Concerning the comments, I didn't use any XML comments. I just used CSS comments, and an XML parser must not remove them.
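The equivalence claimed above can be checked with any conforming XML parser. A minimal sketch using Python's standard xml.etree.ElementTree (the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET

# The same text node written with an entity reference and with a CDATA section.
escaped = ET.fromstring("<style>a &lt; b</style>").text
cdata = ET.fromstring("<style><![CDATA[a < b]]></style>").text

# Both serializations yield the identical text node.
print(escaped == cdata)
```

The parser reports the same text for both forms; the choice between them only matters to a consumer that does not parse the document as XML.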
"You misread the specs:" "What the specs mean is that &lt; and <![CDATA[<]]> are equivalent" Somehow, I'm not convinced... Maybe I'm wrong; a few people have complained about this, but I would need a more convincing argument. If I'm right it's a NOTABUG, if I'm wrong it's a WONTFIX at this point. Daniel
Several additional points:

* First note that XHTML1 Section 4 is informative (not normative). It's about differences with HTML4, and what it says is not even a set of recommendations, i.e. following it may break compatibility with HTML4 (and let's recall that XHTML1 was designed to allow compatibility with HTML parsers). For instance, if you take the example in Section 4.8:

    <script type="text/javascript">
    <![CDATA[
    ... unescaped script content ...
    ]]>
    </script>

  it will probably be wrong with HTML parsers. Indeed, HTML parsers will treat <![CDATA[ as plain data of the text node (i.e. not markup), and this will generate Javascript errors (with CSS, I'm not sure this will be much of a problem since errors are generally ignored, but this is not clean and not future-proof). The solution is to use a Javascript comment to hide this string from HTML parsers:

    <script type="text/javascript">
    //<![CDATA[
    ... unescaped script content ...
    //]]>
    </script>

  or

    <script type="text/javascript">
    /* <![CDATA[ */
    ... unescaped script content ...
    /* ]]> */
    </script>

  As you can see, this will be parsed by both HTML and XML parsers as we want.

  Note: The script must contain neither </script> (as HTML parsers will recognize an end of script element) nor ]]> (as XML parsers will recognize an end of CDATA section); if it does, an external script or a further workaround is needed.

  There's a discussion in French on:
    http://www.ljouanneau.com/blog/2004/04/06/262-la-section-cdata-en-xml
  Or in English:
    http://javascript.about.com/library/blxhtml.htm

  The same trick can be used for CSS, with CSS comments only (/* ... */).

  Also note that Section 4.8 does not say that one must use a CDATA section; it just suggests using one if one doesn't want to use entities such as &lt; instead of <. This is the case when one wants to convert an HTML4 document into an XHTML1 one (without considering HTML4 compatibility; otherwise one also needs to add JS comments as said above).

* Bug 302529 is different. Indeed libxslt adds a (useless) CDATA section where it is not needed. This won't change anything with XML parsers. But this will break HTML parsers (as said above). That's why I think this one is also a real bug.

* In any case, when disable-output-escaping="yes" is used, this must disable the "normal" serialization rules, when supported. In particular, this can generate XML that is not well-formed. As said in the XSLT spec, an XSLT processor is not required to support disabling output escaping, but libxslt does support it in other places. And here, disabling output escaping is used for a good reason: compatibility with HTML parsers (i.e. old browsers, but also new ones when the media type is wrongly guessed, e.g. for local files...).
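The point about the comment trick being read sensibly by both kinds of parser can be illustrated with a short sketch. It assumes Python's standard html.parser as a stand-in for an HTML4-style parser (real browsers may differ in details) and xml.etree.ElementTree as the XML parser; the class name StyleText is hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StyleText(HTMLParser):
    """Collect the raw character data found inside <style> elements."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.chunks.append(data)


# The "Expected results" serialization from the report.
markup = '<style type="text/css">/*<![CDATA[*/ body { } /*]]>*/</style>'

html_parser = StyleText()
html_parser.feed(markup)
html_parser.close()
html_view = "".join(html_parser.chunks)  # CDATA markers stay as plain data

xml_view = ET.fromstring(markup).text    # CDATA markers are consumed as markup

print(html_view)
print(xml_view)
```

In the HTML view the CDATA markers sit inside CSS comments, and in the XML view they disappear and leave two empty comments, so the stylesheet text handed to the CSS parser is valid in both cases.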
The behaviour was changed according to your proposal. For the content of the elements "style" and "script":
- no implicit CDATA section is generated
- text nodes marked with "disable-output-escaping" are serialized like any other text nodes with that XSLT-specific semantic

Committed to CVS HEAD, xmlsave.c, revision 1.34.

Thanks for the report!