Bug 514181 – empty CDATA sections are discarded

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 514181 - empty CDATA sections are discarded


Summary:	empty CDATA sections are discarded


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Mac OS

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-02-03 22:46 UTC by Darin Adler
Modified:	2008-03-07 17:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch which resolves the problem in both modes (1.12 KB, patch) 2008-02-26 22:15 UTC, Mark Rowe	none	Details \| Review

Description Darin Adler 2008-02-03 22:46:17 UTC

Take the following test file:

    <?xml version="1.0"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <![CDATA[]]>
    </html>

Run it through xmllint. Note that the output of xmllint doesn't include the CDATA section. That's because it's empty. It gets discarded by libxml2. This causes problems for the WebKit project in various test cases and possibly some day in real world uses. Here's an example of a test that fails in WebKit because of this:

    http://www.hixie.ch/tests/adhoc/dom/traversal/node-iterator/001.xml

Comment 1 Mark Rowe 2008-02-04 06:38:50 UTC

There appears to be two issues here:  serialising a zero-length CDATA section omits the entire section, and the CDATA callback is never fired for a zero-length CDATA section when using a push parser.  

If I run "xmllint empty-cdata.xml" under GDB and set a breakpoint on xmlSAX2CDataBlock, I see that it is triggered with a zero-length string.  The CDATA section is not written in the output.  If I run "xmllint --push empty-cdata.xml" then the breakpoint on xmlSAX2CDataBlock is never triggered, and no CDATA section is written.  WebKit is using the push parser mode of libxml2 so it also does not receive the callback.

The following tiny patch fixes the non-push case, but does not address the push case as without the call to xmlSAX2CDataBlock there is no CDATA node in the tree for it to save when it comes time.  It doesn't address WebKit's use for a similar reason.

diff --git a/libxml2/xmlsave.c b/libxml2/xmlsave.c
index cbabd78..487ca02 100644
--- a/libxml2/xmlsave.c
+++ b/libxml2/xmlsave.c
@@ -727,7 +727,7 @@ xmlNodeDumpOutputInternal(xmlSaveCtxtPtr ctxt, xmlNodePtr cur) {
        return;
     }
     if (cur->type == XML_CDATA_SECTION_NODE) {
-       if (cur->content == NULL) {
+       if (cur->content == NULL || *cur->content == '\0') {
                xmlOutputBufferWrite(buf, 12, "<![CDATA[]]>");
        } else {
            start = end = cur->content;

Comment 2 Darin Adler 2008-02-04 17:35:43 UTC

Please note that xhtmlNodeDumpOutput has code similar to what Mark quotes above, and presumably it would also need to be patched.

For the push case, I think the code that needs patching is this excerpt of xmlParseTryOrFinish:

		    if ((ctxt->sax != NULL) && (base > 0) &&
			(!ctxt->disableSAX)) {

I think the base > 0 check should be removed.

Comment 3 Mark Rowe 2008-02-04 18:17:29 UTC

I agree that it's that piece of code which is responsible for preventing the callback in this case, but removing the base > 0 check can result in an extra callback with a zero length of data being generated at the end of parsing a CDATA section with content.  When base == 0 the callback should only be generated if we have not yet made any callbacks for the current CDATA section.  I'm not entirely sure how to express that in the code though.

Comment 4 Mark Rowe 2008-02-26 22:15:26 UTC

Created attachment 106028 [details] [review]
Patch which resolves the problem in both modes

Comment 5 Mark Rowe 2008-02-26 22:16:42 UTC

Daniel, can you please review and land this fix?

Comment 6 Daniel Veillard 2008-03-03 08:13:53 UTC

Darin, Mark, is that really a bug. if you take the XPath data model,
empty text nodes are not expected to be reported (and text and CData
nodes are merged) it may be considered a specific model still changing
this may have significant impact in other parts, you really
can't look at this in isolation just looking at the specifics of the
DOM model (which BTW is one of the few W3C specs I don't claim libxml2
implements)

http://www.w3.org/TR/xpath#section-Text-Nodes

The first part of the patch makes sense, because if the CDATA node is in
the tree well, it should be saved back. The second one changing the
behaviour would need at least some exposure on the mailing-list,

Daniel

Comment 7 Darin Adler 2008-03-03 17:39:15 UTC

(In reply to comment #6)
> Darin, Mark, is that really a bug? if you take the XPath data model,
> empty text nodes are not expected to be reported (and text and CData
> nodes are merged)

We'd like to continue to use libxml2 in WebKit, which is used in web browsers. All other web browsers preserve these empty CDATA sections, and it's required for various tests written by Ian Hixson.

So if this can't be the default behavior, we'll need a mode that has this behavior. You've suggested this be discussed on the mailing list, so lets do that.

Comment 8 Darin Adler 2008-03-03 17:47:13 UTC

http://mail.gnome.org/archives/xml/2008-March/msg00014.html

Comment 9 Daniel Veillard 2008-03-06 16:40:56 UTC

w.r.t. comment #7
Well this is a bad case of DOM data model being different from XPath data model.
XPath processing already requires for example to resolve and substitute entities
which is not the default for libxml2. I don't know what kind of parsing flags
you use for Webkit, I assume you build you own tree model on top of SAX2, so
what really matters is that we get the SAX2 callbacks then the way libxml2
builds the tree possibly ignoring such empty CDATA sections and hence not
modifying the default tree would be fine. If yes that's probably the best 
approach for everybody.
Are you using libxml2 tree, or your own on top of SAX2 ? maybe there is an
easy way,

Daniel

Comment 10 Darin Adler 2008-03-06 17:12:43 UTC

(In reply to comment #9)
> Are you using libxml2 tree, or your own on top of SAX2?

WebKit currently uses libxml2 in two ways:

    1) SAX2 to create the WebKit DOM tree, for things like XHTML, SVG, and XMLHttpRequest
    2) xmlReadMemory and the libxml2 tree for XSL and documents processed by XSLT

We don't use libxml2 at all for our XPath implementation.

Naturally we do resolve and substitute entities for (1).

I think the CDATA thing may be is an issue both for (1) and (2), but it's far more critical for (1), and thus a SAX2-only solution to the problem would be a helpful first step.

Given what you say about XPath it sounds like browsers with XPath implementations are going to have some trouble here, because the browsers have only a single DOM tree, and can't have it both ways.

Comment 11 Daniel Veillard 2008-03-07 17:06:05 UTC

Okay, I re-reviewed the existing parsing and saving behaviour of
the parser, and first:
   - changing the push parsing code for CDATA section just make it
     compatible with the normal parsing one. I guess fundamentally
     one need to provide as much information as possible at the SAX
     level, then based on the tree builder, one may adapt. XPath
     requiring 1/ coalesced adjacent text nodes and 2/ no CDATA nodes
     I think the existing flags for XPath covers the need already.
   - as stated previously if the CDATA node is in the tree, then it should
     be saved, so I modified the xml and xhtml saving routines.
I patched the relevant areas based on the suggested diff, but changing 
things slightly, added the test to the regression suite and made sure
all parsing interface behaved in a coordinated way on such input (including
the readers, walkers too).

So has been commited in SVN revision 3701,

  thanks for the feedback,

Daniel

P.S.: w.r.t. DOM and XPath incompatibilities, there is quite a few, and
  unless tweaking the XPath engine quite a lot for cdata, text nodes,
  namespace nodes etc., there is such disparities it's nearly impossible
  to have a really compliant XPath engine on top of a normal DOM tree.