Bug 316041 – libxml2 xhtmlNodeDumpOutput breaks HTML <script> with obnoxious CDATA section

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 316041 - libxml2 xhtmlNodeDumpOutput breaks HTML <script> with obnoxious CDATA section


Summary:	libxml2 xhtmlNodeDumpOutput breaks HTML <script> with obnoxious CDATA section


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.21
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-09-12 01:53 UTC by David Madore
Modified:	2005-09-12 14:03 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch to fix handling of <style> elements (2.96 KB, patch) 2005-09-12 02:47 UTC, David Madore	rejected	Details \| Review
hopefully more acceptable fix (1.91 KB, patch) 2005-09-12 12:42 UTC, David Madore	none	Details \| Review

Description David Madore 2005-09-12 01:53:30 UTC

Consider the following legitimate XHTML file:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta http-equiv="Content-Language" content="en" />
<script type="text/javascript" defer="defer">
alert("Hello, world!");
</script>
</head>
<body>
<h1>Test</h1>
</body>
</html>

now pass it through xmllint with no arguments (this is merely to demonstrate the
effect of the xhtmlNodeDumpOutput() function): it inserts a CDATA section around
the <script> element's content, thereby making the JavaScript essentially broken
(recall that, while the output XHTML is valid XML, it will be read in HTML4 /
tagsoup mode on essentially all browsers, and in HTML4 the <script> element is
already CDATA so the addition of a CDATA header just breaks the JavaScript). 
The relevant function is xmlsave.c/xhtmlNodeDumpOutput, somewhere around line
1280 (section 4.8, "Script and Style elements").

Someone took the care to add a fix (which, IMHO, is again bugware, but still
better than nothing!) for the <style> element (which consists of checking for an
unescaped '>', '<' or '&' character): now exactly the same remarks hold for the
<script> element, so at least the same fix should be applied there.  Better
would be to add a fix which escapes the CDATA properly for JavaScript, something
like (the very standard):

<script type="text/javascript" defer="defer">
// <![CDATA[
alert("Hello, world!");
// ]]>
</script>

(note the '//' comment separators: for <style> it should be /* ... */ instead).

Even better would be to do... absolutely nothing: I believe it is the
programmer's responability to assure that upon calling the xhtmlNodeDumpOutput
function the <script> and <style> elements have correct content (e.g., a CDATA
section where appropriate).

But at any case, in the present state of affairs, <script> is broken and should
be fixed.

Comment 1 David Madore 2005-09-12 02:47:10 UTC

Created attachment 52110 [details] [review]
patch to fix handling of <style> elements

Tries to do the most unintrusive bugware possible.  Probably still broken in
corner cases, but at least now it should work in nearly every common situation.

Comment 2 Daniel Veillard 2005-09-12 08:40:35 UTC

Have you read the XHTML1 spec ? Libxml2 serialization just follows the
recommended best practice from the spec.

http://www.w3.org/TR/xhtml1/#h-4.8

The spec exists precisely to get older browsers to cope with XML serialization
and is just a set of serialization rules in practice. So removing the special
processing is out of order.

Now explain why you consider serializing an XHTML1 document according to
the XHTML1 specification "broken" ? 

Can you point me to the "standard" suggesting // before the CDATA. I assume
this is language specific, the XHTML1 REC is not. Your patch blindly make 
without even testing the script language, // is not a comment in all potential
script language ! That looks way more broken than libxml2 conformant behaviour.

Please read the XHTML spec, then reread your comment, see the irony ? 
"<script> is broken and should be fixed" hum, no it just follows the 
published specification standard, the broken piece really is the browser or
the spec.

Daniel

Comment 3 Daniel Veillard 2005-09-12 08:43:53 UTC

BTW this really should be closed at NOTABUG, but since you spent time
to make a patch I just don't want to close it abruptly.

Daniel

Comment 4 David Madore 2005-09-12 12:27:03 UTC

Sorry about the tone of that bug report.  Here are some details.

First, the background (which you, Daniel, are familiar with, but I recall for
possible other readers' convenience): in older (SGML-based) HTML, the <script>
and <style> elements are #CDATA, which means that it is read as such (as
character data) until the closing tag, entities are not expanded, comments are
not removed, etc.  In XML, there is no longer such a thing as #CDATA, and
#PCDATA is used instead.  Now XHTML 1.0 is generally served with the MIME
content-type of "text/html", in which case (as per
http://www.w3.org/TR/xhtml-media-types/ recommendation) it should be
HTML-compatible (follow the recommendations of
http://www.w3.org/TR/xhtml1/#guidelines - "appendix C"): this is because many
browsers will treat it as "tag soup", so the programmer must ensure that the
XHTML is _simultaneously_ valid XHTML _and_ "valid" tag soup for the intended
meaning.  Upon <style> and <script> elements this imposes the stringent
requirement that their content be interpretable both as #CDATA and as #PCDATA
with the same meaning!  As this is difficult, the official recommendation in
http://www.w3.org/TR/xhtml1/#C_4 is to use external stylesheets or avoid using
certain characters ("<", "&", "]]>" or "--").  Another frequent practice is to
use a CDATA section, but since even the CDATA beginning and end markers
themselves will be interpreted as CDATA in HTML "tag soup" mode, they must be
escaped as stylesheet or script comments.  Hence the frequent "//<![CDATA["
found at the beginning of <script> elements in XHTML: this makes them work
whether the <script> is read as #CDATA or as #PCDATA.  Phew...

Now, back to libxml2.  It seems to me that the xhtmlNodeDumpOutput() function
tries to produce HTML-compatible (that is "appendix C compliant") XHTML: indeed,
all the XHTML doctypes which are automagically detected at the begining of
xmlsave.c (in the xmlIsXHTML() function) are XHTML 1.0 doctypes, and the
compatibility measures taken (such as inserting a space before '/' in empty
elements) are aimed at HTML ("tag soup" mode) compatibility.  My grudge is that
the behavior on <script> elements, while fully XML-compliant (of course), breaks
this HTML compatibility.

There are various approaches around this.  One (and IMHO the best) is to say
that it is not libxml2's job to add a CDATA section at all: it is the (upstream)
programmer's responsibility to ensure that the tree passed at this point
contains the CDATA sections where necessary.  Another (as my patch tries to do)
is to insert CDATA sections but escape them as seems most likely to work:
naturally, if the (upstream) programmer does not agree with this escaping, he is
free to work around them by inserting his own CDATA section in the tree; so if
the language is not JavaScript and requires, say, '#' as a comment, then the
programmer would use '#<![CDATA[' and my patch would not introduce any further
CDATA section and all would be fine.  The bottom line is that blindly inserting
CDATA sections is sure to break things (the XHTML spec says to use CDATA
sections, yes, but it does not say they should be blindly inserted, it merely
says that "Wrapping the content of the script or style element within a CDATA
marked section avoids the expansion of these entities").

In any case (THIS IS THE IMPORTANT BIT, sorry I'm being so verbose) when the
<script> or <style>'s element content does not have any '<' or '&' (or '--' or
']]>') content or when they have been already properly protected (in CDATA
sections) by the programmer, I believe it is wrong for libxml2 to try to add its
own CDATA sections.  The example I gave in my initial bug report had no need for
CDATA escaping (the code is merely alert("Hello, world!");) and according to
http://www.w3.org/TR/xhtml1/#h-4.8 there is no need to create a CDATA section in
this case.  The same holds if the programmer already took care to create
appropriate CDATA sections.

I will shortly submit a second patch that proposes a minimal (and hopefully more
acceptable) change of behavior.

Comment 5 David Madore 2005-09-12 12:42:22 UTC

Created attachment 52125 [details] [review]
hopefully more acceptable fix

Here is another patch that should, hopefully, be more acceptable: it inserts a
CDATA section only when absolutely necessary (that is, not when there is no
character needing to be escaped, nor when the programmer already put in a CDATA
section).  But note that I still think the original patch was better.

Comment 6 Daniel Veillard 2005-09-12 14:03:39 UTC

Okay, this looks fine then. This should cover the cases where the user didn't
preprocessed the script to make it okay for direct serialization, and still
allow those who want to handle it directly to do so.
This is applied, I had to fix the regression test covering it a bit, but that's
fine IMHO.

I will put an xmlSaveOption to avoid XHTML1 serialization altogether at some
point, but this also mean no special closing tag processing etc ... mostly
useful when libxml2 serialization is not the last step in the processing
pipeline.

  thanks,

Daniel