Bug 114557 – Incorrect Handling of CDATA in <script>




Summary:	Incorrect Handling of CDATA in <script>


Status:	VERIFIED NOTABUG

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.5.2
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	Daniel Veillard

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2003-06-06 11:06 UTC by theseer
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Example XHTML document with embedded javascript code (572 bytes, text/html) 2003-06-10 10:51 UTC, theseer

Description theseer 2003-06-06 11:06:27 UTC

The handling of CDATA sections changed somewhere between 2.4.x and 2.5.x.

The 'error' is triggered by supplying a doctype in the XML document.

The example code below, which uses libxml2 through PHP, shows the
problem: the two XML strings are otherwise identical, the only difference
being the missing doctype in the second one.

<?PHP

 $xml=<<<EOF
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  some markup...
  <script>//<![CDATA[ .. some js code ]]></script>
  some more markup
 </body>
</html>
EOF;

 $dom=domxml_open_mem($xml);
 echo $dom->dump_mem(true, 'UTF-8');

 $xml2=<<<EOF
<?xml version="1.0" encoding="iso-8859-1"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  some markup...
  <script>//<![CDATA[ .. some js code ]]></script>
  some more markup
 </body>
</html>
EOF;

 $dom=domxml_open_mem($xml2);
 echo $dom->dump_mem(true, 'UTF-8');

?>

For whatever reason, libxml2 wraps the leading // in another CDATA
section when the XHTML 1.0 Transitional doctype is used.

Even though that doesn't create an invalid XML document, it is still not
the expected behaviour. According to the XHTML 1.0 Transitional DTD, <script>
is NOT required to contain CDATA only, so the // should be left untouched,
as it used to be in 2.4.x.
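
To make the shape of the input concrete, here is a minimal sketch of how an XML parser sees the <script> content from the report. It uses Python's standard xml.dom.minidom rather than libxml2, and the one-line SOURCE string is a hypothetical stand-in for the report's documents; the point is only that the leading // is an ordinary text node sitting in front of the CDATA section.

```python
from xml.dom import minidom

# Hypothetical stand-in for the <script> line in the report's documents.
SOURCE = "<script>//<![CDATA[ var a = 1; ]]></script>"


def child_nodes(xml_text):
    """Return (node name, data) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    return [(node.nodeName, node.data) for node in root.childNodes]


# The root element has two children: a text node and a CDATA section.
for name, data in child_nodes(SOURCE):
    print(name, repr(data))
```

A serializer that treats every text child of <script> the same way will therefore handle the // separately from the CDATA section that follows it.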

Comment 1 Daniel Veillard 2003-06-06 13:08:13 UTC

Seems you didn't read the right section of the XHTML1 spec:
  http://www.w3.org/TR/xhtml1/#h-4.8

"In XHTML, the script and style elements are declared as having
#PCDATA content. As a result, < and & will be treated as the start of
markup, and entities such as &lt; and &amp; will be recognized as
entity references by the XML processor to < and & respectively.
Wrapping the content of the script or style element within a CDATA
marked section avoids the expansion of these entities."

 libxml2 does the suggested practice from the spec.
It's not a bug it's a feature, really.

Daniel
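
The quoted passage can be illustrated with a small sketch, again using Python's standard xml.dom.minidom as a stand-in for any XML processor (this is not libxml2 itself, and the script content is made up). In #PCDATA content the entity references are expanded; inside a CDATA section the same characters are passed through literally.

```python
from xml.dom import minidom


def script_text(xml_text):
    """Return the concatenated character data of the root element's children."""
    root = minidom.parseString(xml_text).documentElement
    return "".join(node.data for node in root.childNodes)


# Entity references in plain #PCDATA content are expanded by the parser.
plain = script_text("<script>if (a &lt; b &amp;&amp; c) run();</script>")

# Inside a CDATA section nothing is expanded.
wrapped = script_text(
    "<script><![CDATA[if (a &lt; b &amp;&amp; c) run();]]></script>"
)

print(plain)
print(wrapped)
```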

Comment 2 theseer 2003-06-09 23:28:03 UTC

I disagree with you. Even though you might consider it a feature it
clearly is a problem/bug to me. 

Why?

1. The original document did not contain the CDATA around the //
2. The XML document is valid without the // being wrapped
3. The spec does NOT forbid non-CDATA content, not even the part you
quoted
4. The wrapped // breaks the JavaScript processing of current
browser implementations (I agree that one can argue that this is
actually a problem of the browser and not of libxml2.)

last but not least - MHO:

I didn't write the // wrapped in CDATA tags, the document is valid
XML/XHTML without them, and I don't expect software to make up tags I
didn't write just by parsing it, and, to make it worse, to break my
software/website by doing so.

The only options left for me are to stay at libxml2 2.4.x for
the time being, to dump my document as HTML and not XML, or to manually
regex your 'feature' out of a dumped XML document.

Feel free to correct me, but if I open a document and dump it right
away, then apart from better/different indenting the source and the dumped
document should be the same, as long as the source was valid XML.

I don't see *ANY* reason for doing modifications like the one your
feature is doing.

Comment 3 Daniel Veillard 2003-06-09 23:59:55 UTC

The XHTML1 spec is like 10 pages long.
Libxml2 follows the recommended handling from it.
You found software breaking because libxml2 follows the
recommended handling.

Result: you blame libxml2 ... cool :-(

Simply don't use the XHTML1 DOCTYPE if you don't want libxml2
to follow the XHTML1 spec when serializing that document.

BTW 
 1/ CDATA sections are *not* tags, they have none of the properties
    of tags, they act mostly as a specific text fragment.
 2/ if it were just a question of validity, <br/> would be just
    as fine as <br></br> or <br />, while only one of those
    serializations is right from an XHTML1 point of view

The "*ANY* reason for doing modifications like the one your
feature is doing" is both prose and an example in a section of
the spec I'm supposed to follow in that case !
Either:
   - ask browser implementors to learn how to read a 10-page spec
     and conform to it
  or
   - ask the XHTML working group to issue an erratum for that section
     of the spec so it gets removed

But do not blame my software for being conformant. I understand you're
frustrated, but there is NO reason to take it out on the person who
actually provides you with the correct and free software part :-(

Daniel

Comment 4 theseer 2003-06-10 07:32:45 UTC

I don't blame libxml2 for a broken 3rd party software. I blame it for
modifying my xml-document in a way that it breaks that software, which
is NOT the same.

As I already stated in my second comment, I do see the problem with
current browsers. But that is rather like the Evolution team ignoring
M$ subject-encoding problems in emails because they are not
RFC-compliant. They might be correct about that, but it is *NOT*
getting us anywhere, especially since M$ is probably not going to change
their code.
So if nobody moves, the problem won't go away on its own. And
believe me, I have had to do enough 'hacks' to get M$ software (especially
IE) working the way I want, so I know at least somewhat what I am
talking about.

To come back to libxml and the feature of changing the XML: I just
reread the script part you pasted from the spec and - call me picky
;/ - I still don't get the reason why you *HAVE* to modify it.

The section you pasted, and all the others I found regarding <script>, are
clearly more of a warning: using & and < will break the XML, so
you have to wrap those in CDATA. Reading that as a general rule that you
must use ONLY CDATA in <script> is *not* covered - at least IMHO.

And yes, doing it anyway shouldn't be a problem. Agreed. But the
world out there suxx. And using // is a way to keep the
JavaScript engine happy AND meet the spec. Since - again, the way I
read and understand the spec - using // in there is valid XML and
CDATA is not enforced by the XHTML spec either, I don't agree with
modifying the document automagically.

It used to be correct (IMHO) in libxml2 2.4.x and was changed for
2.5.x.
I do see the idea behind your feature, but as with most implementations
that try to be smart, they are sometimes TOO smart.

I am not complaining (and continuing to do so) because I want to
piss you off. In fact, I rely on your code so much in my current
implementations that your modification really IS a problem for me.

As a workaround I run a regex on the output to get 'my' way of // back
into the code. Ugly, but it works.

Regards,
 Arne Blankerts
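
The regex workaround mentioned above is not shown in the thread. A minimal sketch of what such a post-processing step might look like follows; the shape of the dumped output (a separate CDATA section holding only //) is an assumption based on the description in the report, and the function name is hypothetical.

```python
import re

# Assumed shape of the dumped output: the leading // ends up in its own
# CDATA section directly after the opening <script> tag.
WRAPPED_SLASHES = re.compile(r"(<script\b[^>]*>)<!\[CDATA\[//\]\]>")


def unwrap_leading_slashes(xml_text):
    """Turn '<script><![CDATA[//]]>' back into '<script>//'."""
    return WRAPPED_SLASHES.sub(r"\1//", xml_text)


dumped = "<script><![CDATA[//]]><![CDATA[ run(); ]]></script>"
print(unwrap_leading_slashes(dumped))
```

Text that does not match the assumed pattern is returned unchanged, so the step is harmless on documents without inline scripts.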

Comment 5 Daniel Veillard 2003-06-10 07:43:52 UTC

Show me a *complete* example of the input XML file, not some PHP
script, and I will see if the result through xmllint needs fixing.

Daniel

Comment 6 theseer 2003-06-10 10:51:51 UTC

Created attachment 17379
Example XHTML document with embedded javascript code

Comment 7 theseer 2003-06-10 10:57:31 UTC

The previously sent attachment should have the MIME type
"application/xhtml+xml", which Bugzilla did not accept, so it was saved
as HTML.

The mimetype is actually a key-part of the problem. If I remove the //
from the file and serve it as 'application/xhtml+xml' mozilla can
handle it just fine - but IE chokes on it: it doesn't even render the
page.

If sent in an 'IE-compliant' way using 'text/html' i get js-errors in
both mozilla and IE.

Adding the // fixes it for both browsers even if sent as
'application/xhtml+xml' to mozilla and 'text/html' to IE.

Comment 8 S Woodside 2007-02-08 05:11:31 UTC

Hi Daniel, I agree with you that you're following the spec and the browsers should fix themselves. However ... unfortunately the current reality is that at least on Mozilla versions up to firefox 2 this is still broken and it makes it impossible to use inline <script> elements with output = xhtml. And that kind of sucks .... Is there some kind of workaround?

For example, could you scan the script content for & or < and, if they're not there, not wrap it in a CDATA section?

I ask because XHTML is genuinely useful, and using libxslt is a great way to generate truly valid XHTML. But this particular bug in the browsers is a big block in the way of that. Thanks, Simon.
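
The conditional wrapping suggested in this comment could be sketched as follows. This is an illustration of the idea only, not libxml2's code: the function name is hypothetical, and the extra check for "]]>" is an addition here, since that sequence may not appear literally in XML character data and has to be split across two CDATA sections.

```python
def serialize_script_text(text):
    """Serialize script text, using CDATA only when the content needs it."""
    needs_wrapping = "<" in text or "&" in text or "]]>" in text
    if not needs_wrapping:
        # Nothing that XML would treat as markup: emit the text unchanged.
        return text
    # A CDATA section cannot contain "]]>", so close it after "]]" and
    # reopen a new section before ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


print(serialize_script_text("init();"))
print(serialize_script_text("if (a < b) run();"))
```

With this rule, scripts that contain only function calls and simple assignments pass through untouched.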

Comment 9 Thomas Weinert 2007-08-06 09:18:13 UTC

Please fix that behavior. Sorry, but it is not a question of the "recommended" handling. The problem is that it tries to be too smart.

I try to create conforming pages, so I use XHTML output and XSLT as my template system. The <script> tags in the pages contain only function calls and sometimes variable assignments. There are no & or < > characters, so there is no need for CDATA.

However, I found a workaround for XSLT.

<script type="text/javascript"><xsl:comment>
  ... js code here ...
//</xsl:comment></script>

I hope this will still work in the next version. :-(
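
For reference, the XSLT fragment above puts the script code inside a comment node instead of a text node. A minimal sketch of the resulting markup, built with Python's standard xml.dom.minidom rather than libxslt (the helper name and the sample script are hypothetical):

```python
from xml.dom import minidom


def script_with_comment(js_code):
    """Build a <script> element whose only child is a comment node."""
    doc = minidom.Document()
    script = doc.createElement("script")
    script.setAttribute("type", "text/javascript")
    # Same layout as the XSLT fragment: code first, then a closing "//".
    script.appendChild(doc.createComment("\n  " + js_code + "\n//"))
    return script.toxml()


print(script_with_comment("run();"))
```

Because the element has no text child, there is no text for a serializer to wrap in a CDATA section.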