GNOME Bugzilla – Bug 316041
libxml2 xhtmlNodeDumpOutput breaks HTML <script> with obnoxious CDATA section
Last modified: 2005-09-12 14:03:39 UTC
Consider the following legitimate XHTML file: <?xml version="1.0" encoding="iso-8859-1"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>Test</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <meta http-equiv="Content-Language" content="en" /> <script type="text/javascript" defer="defer"> alert("Hello, world!"); </script> </head> <body> <h1>Test</h1> </body> </html> now pass it through xmllint with no arguments (this is merely to demonstrate the effect of the xhtmlNodeDumpOutput() function): it inserts a CDATA section around the <script> element's content, thereby making the JavaScript essentially broken (recall that, while the output XHTML is valid XML, it will be read in HTML4 / tagsoup mode on essentially all browsers, and in HTML4 the <script> element is already CDATA so the addition of a CDATA header just breaks the JavaScript). The relevant function is xmlsave.c/xhtmlNodeDumpOutput, somewhere around line 1280 (section 4.8, "Script and Style elements"). Someone took the care to add a fix (which, IMHO, is again bugware, but still better than nothing!) for the <style> element (which consists of checking for an unescaped '>', '<' or '&' character): now exactly the same remarks hold for the <script> element, so at least the same fix should be applied there. Better would be to add a fix which escapes the CDATA properly for JavaScript, something like (the very standard): <script type="text/javascript" defer="defer"> // <![CDATA[ alert("Hello, world!"); // ]]> </script> (note the '//' comment separators: for <style> it should be /* ... */ instead). Even better would be to do... absolutely nothing: I believe it is the programmer's responability to assure that upon calling the xhtmlNodeDumpOutput function the <script> and <style> elements have correct content (e.g., a CDATA section where appropriate). But at any case, in the present state of affairs, <script> is broken and should be fixed.
Created attachment 52110 [details] [review] patch to fix handling of <style> elements Tries to do the most unintrusive bugware possible. Probably still broken in corner cases, but at least now it should work in nearly every common situation.
Have you read the XHTML1 spec ? Libxml2 serialization just follows the recommended best practice from the spec. http://www.w3.org/TR/xhtml1/#h-4.8 The spec exists precisely to get older browsers to cope with XML serialization and is just a set of serialization rules in practice. So removing the special processing is out of order. Now explain why you consider serializing an XHTML1 document according to the XHTML1 specification "broken" ? Can you point me to the "standard" suggesting // before the CDATA. I assume this is language specific, the XHTML1 REC is not. Your patch blindly make without even testing the script language, // is not a comment in all potential script language ! That looks way more broken than libxml2 conformant behaviour. Please read the XHTML spec, then reread your comment, see the irony ? "<script> is broken and should be fixed" hum, no it just follows the published specification standard, the broken piece really is the browser or the spec. Daniel
BTW this really should be closed at NOTABUG, but since you spent time to make a patch I just don't want to close it abruptly. Daniel
Sorry about the tone of that bug report. Here are some details. First, the background (which you, Daniel, are familiar with, but I recall for possible other readers' convenience): in older (SGML-based) HTML, the <script> and <style> elements are #CDATA, which means that it is read as such (as character data) until the closing tag, entities are not expanded, comments are not removed, etc. In XML, there is no longer such a thing as #CDATA, and #PCDATA is used instead. Now XHTML 1.0 is generally served with the MIME content-type of "text/html", in which case (as per http://www.w3.org/TR/xhtml-media-types/ recommendation) it should be HTML-compatible (follow the recommendations of http://www.w3.org/TR/xhtml1/#guidelines - "appendix C"): this is because many browsers will treat it as "tag soup", so the programmer must ensure that the XHTML is _simultaneously_ valid XHTML _and_ "valid" tag soup for the intended meaning. Upon <style> and <script> elements this imposes the stringent requirement that their content be interpretable both as #CDATA and as #PCDATA with the same meaning! As this is difficult, the official recommendation in http://www.w3.org/TR/xhtml1/#C_4 is to use external stylesheets or avoid using certain characters ("<", "&", "]]>" or "--"). Another frequent practice is to use a CDATA section, but since even the CDATA beginning and end markers themselves will be interpreted as CDATA in HTML "tag soup" mode, they must be escaped as stylesheet or script comments. Hence the frequent "//<![CDATA[" found at the beginning of <script> elements in XHTML: this makes them work whether the <script> is read as #CDATA or as #PCDATA. Phew... Now, back to libxml2. It seems to me that the xhtmlNodeDumpOutput() function tries to produce HTML-compatible (that is "appendix C compliant") XHTML: indeed, all the XHTML doctypes which are automagically detected at the begining of xmlsave.c (in the xmlIsXHTML() function) are XHTML 1.0 doctypes, and the compatibility measures taken (such as inserting a space before '/' in empty elements) are aimed at HTML ("tag soup" mode) compatibility. My grudge is that the behavior on <script> elements, while fully XML-compliant (of course), breaks this HTML compatibility. There are various approaches around this. One (and IMHO the best) is to say that it is not libxml2's job to add a CDATA section at all: it is the (upstream) programmer's responsibility to ensure that the tree passed at this point contains the CDATA sections where necessary. Another (as my patch tries to do) is to insert CDATA sections but escape them as seems most likely to work: naturally, if the (upstream) programmer does not agree with this escaping, he is free to work around them by inserting his own CDATA section in the tree; so if the language is not JavaScript and requires, say, '#' as a comment, then the programmer would use '#<![CDATA[' and my patch would not introduce any further CDATA section and all would be fine. The bottom line is that blindly inserting CDATA sections is sure to break things (the XHTML spec says to use CDATA sections, yes, but it does not say they should be blindly inserted, it merely says that "Wrapping the content of the script or style element within a CDATA marked section avoids the expansion of these entities"). In any case (THIS IS THE IMPORTANT BIT, sorry I'm being so verbose) when the <script> or <style>'s element content does not have any '<' or '&' (or '--' or ']]>') content or when they have been already properly protected (in CDATA sections) by the programmer, I believe it is wrong for libxml2 to try to add its own CDATA sections. The example I gave in my initial bug report had no need for CDATA escaping (the code is merely alert("Hello, world!");) and according to http://www.w3.org/TR/xhtml1/#h-4.8 there is no need to create a CDATA section in this case. The same holds if the programmer already took care to create appropriate CDATA sections. I will shortly submit a second patch that proposes a minimal (and hopefully more acceptable) change of behavior.
Created attachment 52125 [details] [review] hopefully more acceptable fix Here is another patch that should, hopefully, be more acceptable: it inserts a CDATA section only when absolutely necessary (that is, not when there is no character needing to be escaped, nor when the programmer already put in a CDATA section). But note that I still think the original patch was better.
Okay, this looks fine then. This should cover the cases where the user didn't preprocessed the script to make it okay for direct serialization, and still allow those who want to handle it directly to do so. This is applied, I had to fix the regression test covering it a bit, but that's fine IMHO. I will put an xmlSaveOption to avoid XHTML1 serialization altogether at some point, but this also mean no special closing tag processing etc ... mostly useful when libxml2 serialization is not the last step in the processing pipeline. thanks, Daniel