GNOME Bugzilla – Bug 114557
Incorrect Handling of CDATA in <script>
Last modified: 2009-08-15 18:40:50 UTC
The behaviour of handling CDATA sections changed somewhere between 2.4.x and 2.5.x. The 'error' is triggered by supplying a doctype to the XML document. Using the libxml2 functionality within PHP, the example code shows the problem. The two XML strings are otherwise identical; the only difference is that the second one has no doctype.

<?PHP
$xml=<<<EOF
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
some markup...
<script>//<![CDATA[ .. some js code ]]></script>
some more markup
</body>
</html>
EOF;
$dom=domxml_open_mem($xml);
echo $dom->dump_mem(true, 'UTF-8');

$xml2=<<<EOF
<?xml version="1.0" encoding="iso-8859-1"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
some markup...
<script>//<![CDATA[ .. some js code ]]></script>
some more markup
</body>
</html>
EOF;
$dom=domxml_open_mem($xml2);
echo $dom->dump_mem(true, 'UTF-8');
?>

For whatever reason, libxml2 wraps the leading // in another CDATA block when the XHTML 1.0 Transitional doctype is used. Even though that doesn't create an invalid XML document, it is not the expected behaviour, and according to the XHTML 1.0 Transitional DTD <script> is NOT required to contain CDATA only. The // should therefore be left untouched, as it was in 2.4.x.
Seems you didn't read the right section of the XHTML1 spec: http://www.w3.org/TR/xhtml1/#h-4.8

"In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and entities such as &lt; and &amp; will be recognized as entity references by the XML processor to < and & respectively. Wrapping the content of the script or style element within a CDATA marked section avoids the expansion of these entities."

libxml2 follows the practice suggested by the spec. It's not a bug, it's a feature, really.

Daniel
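To illustrate the passage quoted from the spec, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree parser rather than libxml2, and the script body is made up for the example. It only shows that an XML parser delivers the same characters whether the script text is entity-escaped or wrapped in a CDATA section.

```python
import xml.etree.ElementTree as ET

# The same (made-up) script body written two ways:
# entity-escaped as #PCDATA, and wrapped in a CDATA section.
escaped = "<script>if (a &lt; b &amp;&amp; c) { run(); }</script>"
wrapped = "<script><![CDATA[if (a < b && c) { run(); }]]></script>"

escaped_text = ET.fromstring(escaped).text
wrapped_text = ET.fromstring(wrapped).text

# An XML parser hands the application identical text in both cases.
print(escaped_text)
print(wrapped_text)
print(escaped_text == wrapped_text)
```

The difference only matters to consumers that do not parse the page as XML, which is where the browser problems discussed in this report come from.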
I disagree with you. Even though you might consider it a feature, it clearly is a problem/bug to me. Why?

1. The original document did not contain the CDATA around the //
2. The XML document is valid without the // being wrapped
3. The specs do NOT forbid non-CDATA content, not even the part you quoted
4. The wrapped // breaks the JavaScript processing of current browser implementations (I agree that one can argue that this is actually a problem of the browser and not of libxml2.)

Last but not least, MHO: I didn't write the // wrapped in CDATA tags, the document is valid XML/XHTML without them, and I don't expect software to make up tags I didn't write just by parsing it, and, to make it worse, break my software/website by doing so.

The only options left for me are to stay at libxml2 2.4.x for the time being, to dump my document as HTML and not XML, or to manually regex your 'feature' out of a dumped XML document.

Feel free to correct me, but if I open a document and dump it right away, then apart from better/different indenting the source and the dumped document should be the same, as long as the source was valid XML. I don't see *ANY* reason for doing modifications like the one your feature is doing.
The XHTML1 spec is about 10 pages long. Libxml2 follows the recommended handling from it. You found a piece of software that breaks because libxml2 follows the recommended handling. Result: you blame libxml2 ... cool :-(

Simply don't use the XHTML1 DOCTYPE if you don't want libxml2 to follow the XHTML1 spec when serializing that document.

BTW:
1/ CDATA sections are *not* tags; they have none of the properties of tags and act mostly as a specific kind of text fragment.
2/ If it were just a question of validity, <br/> would be just as fine as <br></br> or <br />, while only one of those serializations is right from an XHTML1 point of view.

The "*ANY* reason for doing modifications like the one your feature is doing" is both prose and an example in a section of the spec I'm supposed to follow in that case!

Either:
- ask browser implementors to learn how to read a 10-page spec and conform to it, or
- ask the XHTML working group to publish an erratum for that section of the spec so it gets removed.

But do not blame my software for being conformant. I understand you're frustrated, but there is NO reason to take it out on the person who actually provides you with the correct and free software part :-(

Daniel
I don't blame libxml2 for broken 3rd party software. I blame it for modifying my XML document in a way that breaks that software, which is NOT the same. As I already stated in my 2nd comment, I do see the problem with current browsers. But that is kind of like the Evolution team ignoring M$ subject encoding problems in emails because they are not RFC-compliant. They might be correct about that, but it is *NOT* getting us anywhere, especially since M$ is prolly not going to change their code. So if nobody moves, the problem won't go away on its own. And believe me, I had to do enough 'hacks' to get M$ software (especially IE) working the way I want, so I know at least somewhat what I am talking about.

To come back to libxml2 and the feature of changing the XML: I just reread the script part you pasted from the specs and, call me picky ;/, I still don't get the reason why you *HAVE* to modify it. The section you pasted and all others I found regarding <script> are clearly more of a warning: using & and < will break the XML, thus you have to wrap those in CDATA. Interpreting that as a general "you must use CDATA ONLY in <script>" is *not* covered, at least IMHO. And yes, doing it anyway shouldn't be a problem. Agreed. But the world out there suxx, and using // is a way to keep the JavaScript engine happy AND meet the specs. Since, again the way I read and understand the specs, using // in there is valid XML and CDATA is not enforced by the XHTML specs either, I don't agree with modifying the document automagically. It used to be correct (IMHO) in libxml2 2.4.x and was changed for 2.5.x.

I do see the idea behind your feature, but as with most implementations that try to be smart, they sometimes are TOO smart. I am not complaining (and continuing to do so) because I want to piss you off. In fact, I rely on your code so much in my current implementations that your modification really IS a problem for me.

As a workaround I run a regex on the output to get 'my' way of // back into the code. Ugly but works.

Regards, Arne Blankerts
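The regex workaround mentioned above might look roughly like the following Python sketch. The comment does not show the actual pattern or the exact serializer output, so the output shape is an assumption: the leading // is taken to be emitted as its own CDATA section directly before the original one. The name restore_comment_prefix is hypothetical.

```python
import re

# ASSUMED shape of the unwanted output: the leading "//" emitted as its own
# CDATA section, immediately followed by the author's original CDATA section.
SPLIT_CDATA = re.compile(r"<!\[CDATA\[//\]\]>(?=<!\[CDATA\[)")

def restore_comment_prefix(serialized: str) -> str:
    """Turn '<![CDATA[//]]><![CDATA[' back into '//<![CDATA['."""
    return SPLIT_CDATA.sub("//", serialized)

# Hypothetical dumped fragment, not copied from real libxml2 output.
dumped = "<script><![CDATA[//]]><![CDATA[ run(); ]]></script>"
print(restore_comment_prefix(dumped))
```

The lookahead restricts the replacement to a wrapped // that sits directly in front of another CDATA section, so other CDATA sections in the document are left alone.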
Show me a *complete* example of the input XML file, not some PHP script, and I will see if the result through xmllint needs fixing. Daniel
Created attachment 17379: Example XHTML document with embedded JavaScript code
The previously sent attachment should be of MIME type "application/xhtml+xml", which Bugzilla did not accept, so it was saved as HTML. The MIME type is actually a key part of the problem. If I remove the // from the file and serve it as 'application/xhtml+xml', Mozilla can handle it just fine, but IE chokes on it: it doesn't even render the page. If it is sent in an 'IE-compliant' way using 'text/html', I get JS errors in both Mozilla and IE. Adding the // fixes it for both browsers, even if the page is sent as 'application/xhtml+xml' to Mozilla and as 'text/html' to IE.
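Serving one media type to Mozilla and another to IE, as described above, could be sketched like this in Python. The comment does not say how the type is chosen per browser; selecting on the Accept request header is an assumption, and pick_content_type is a hypothetical helper. The sketch ignores q-values for simplicity.

```python
def pick_content_type(accept_header: str) -> str:
    """Return the XHTML media type only when the client lists it explicitly."""
    # Drop parameters such as ";q=0.9" and normalise case and whitespace.
    offered = [part.split(";")[0].strip().lower()
               for part in accept_header.split(",")]
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    # Fall back to text/html for clients that do not advertise XHTML support.
    return "text/html"

print(pick_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(pick_content_type("*/*"))
```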
Hi Daniel, I agree with you that you're following the spec and that the browsers should fix themselves. However ... unfortunately the current reality is that, at least on Mozilla versions up to Firefox 2, this is still broken, and it makes it impossible to use inline <script> elements with output = xhtml. And that kind of sucks ....

Is there some kind of workaround? For example, could you scan the script content for & or <, and if they're not there, not wrap it in a CDATA section? I ask because XHTML is genuinely useful, and using libxslt is a great way to generate truly valid XHTML. But this particular bug in the browsers is a big obstacle to that.

Thanks, Simon.
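The check proposed above could be sketched as follows. This is a minimal Python illustration of the idea only, not libxml2's serializer code; serialize_script_text is a hypothetical name, and the extra test for "]]>" is an addition, since that sequence may not appear literally in XML text content either.

```python
def serialize_script_text(text: str) -> str:
    """Wrap script text in CDATA only when it contains characters that need it."""
    if "<" not in text and "&" not in text and "]]>" not in text:
        return text  # nothing markup-significant, emit the text verbatim
    # A literal "]]>" would end the section early, so split it across two sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

print(serialize_script_text("init();"))
print(serialize_script_text("if (a < b) go();"))
```

With this rule, a script body consisting only of function calls and assignments would pass through unchanged, while one using < or & would still be protected.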
Please fix that behavior. Sorry, but it is not a question of the "recommended" handling. The problem is that it tries to be too smart. I try to create conforming pages, so I use XHTML output and XSLT as my template system. The <script> tags in the pages contain only function calls and sometimes variable assignments. There are no & or < > characters in them, so there is no need for CDATA. However, I found a workaround for XSLT: <script type="text/javascript"><xsl:comment> ... js code here ... //</xsl:comment></script> I hope this will still work in the next version. :-(
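For readers who do not use XSLT, the shape of the output this workaround aims for can be reproduced with a short Python sketch. It uses the standard library's xml.etree.ElementTree, not libxslt; script_with_comment is a hypothetical helper, and the exact whitespace inside the comment is an assumption.

```python
import xml.etree.ElementTree as ET

def script_with_comment(js_code: str) -> str:
    """Build a <script> element whose code is carried inside an XML comment."""
    if "--" in js_code:
        # XML comments may not contain "--" (for example a JS decrement).
        raise ValueError("code containing '--' cannot be placed in a comment")
    script = ET.Element("script", {"type": "text/javascript"})
    # The trailing "//" hides the closing "-->" from the JavaScript engine.
    script.append(ET.Comment("\n" + js_code + "\n//"))
    return ET.tostring(script, encoding="unicode")

print(script_with_comment("init();"))
```

Note that a consumer parsing the page strictly as XML treats the comment as a comment, so the code inside it is not part of the script element's text; the approach relies on the page being handled by an HTML-style parser.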