After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 610802 - In push mode, libxml2 may create multiple CDATA sections for original single CDATA section, or merger multiple adjacent CDATA sections to single one
In push mode, libxml2 may create multiple CDATA sections for original single ...
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
2.7.6
Other All
: Normal major
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2010-02-23 12:06 UTC by johnnyding
Modified: 2021-07-05 13:22 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
xml file only contains two adjacent CDATA sections (8.04 KB, application/xml)
2010-02-23 12:07 UTC, johnnyding
Details

Description johnnyding 2010-02-23 12:06:42 UTC
When using libxml2 to parse XML document in push mode, I found libxml2 may create multiple CDATA sections for original single CDATA section, or merger multiple CDATA sections to single one.

At first, I use xmllint(libxml2 v2.7.6) to test the attached test4.xml, which only contains two adjacent CDATA sections. 

In libxml2 default SAX cdataBlock handler (SAX2.c, line 2679), it checked whether last child node was CDATA section. If yes, it appended the current input to the last CDATA section node. If not, it created a new CDATA section node to contain the current input. This behavior would cause that multiple adjacent CDATA sections automatically was combined to one single CDATA sections, which did not exactly parse out the original xml structure.

Some other implementation of SAX cdataBlock handler, like WebKit XMLTokenizer (XMLTokenizer::cdataBlock, XMLTokenizerlibxml2.cpp, line 956), didn't check
whether the previous node (last child node) was CDATA section and try to merge
them if the previous node is CDATA section. It alwasy created a CDATA section to contains input data. Which would cause that one single CDATA section was parsed to multiple CDATA sections. 
For example, the contents of first CDATA section in test4.xml are almost 7K, in WebKit(push mode), the whole CDATA section may not one-time be sent to libxml2 parser, so the libxml2 parser created one or a few CDATA section(s) which is/are 300(XML_PARSER_BUG_BUFFER_SIZE) characters since it could NOT find the valid end tag of CDATA section. Until the last part data came, the libxml parser found the the valid end tag "]]>", then the rest contents of the original CDATA section were put into another CDATA section.
(please refer to parse.c, code: base =xmlParseLookupSequence(ctxt,']',']','>') )
Now we got multiple CDATA sections instead of single CDATA section
like the original xml data. Which may break some apps.


This issues is because in push mode, we use small buffer(XML_PARSER_BUG_BUFFER_SIZE) to continuously parse big xml data. So my proposal is to add a flag to indicate whether the data of CDATA section is terminated or not, then the parser context can use this flag to decide whether current input can be appended to last CDATA section node or not. The pseudo code lists below

typedef void (*cdataBlockSAXFunc) (void *ctx,
                                   const xmlChar *value,
                                   int len,
                                   int terminate);


xmlSAX2CDataBlock(void *ctx, const xmlChar *value, int len, int terminate)
{
    ...
    lastChild = xmlGetLastChild(ctxt->node);
    ...
    if ((lastChild != NULL) &&
        (lastChild->type == XML_CDATA_SECTION_NODE) &&
        !ctx->lastCDATAIsTerminated) {
        xmlTextConcat(lastChild, value, len);
    } else {
        ret = xmlNewCDataBlock(ctxt->myDoc, value, len);
        xmlAddChild(ctxt->node, ret);
    }
    ctx->lastCDATAIsTerminated = terminate;
}
Comment 1 johnnyding 2010-02-23 12:07:56 UTC
Created attachment 154489 [details]
xml file only contains two adjacent CDATA sections
Comment 2 Xianzhu Wang 2010-02-24 07:15:22 UTC
I think an ugly workaround without changing the API might be:

void handleCDataBlock(void *ctx, const xmlChar *value, int len) {
   int terminated = (value[len] == ']' && value[len+1] == ']' && value[len+2] == '>');
   ....
}
Comment 3 johnnyding 2010-02-25 02:30:55 UTC
another quick and dirty solution might be skipping to the step of creating a temporary CDATA section(XML_PARSER_BIG_BUFFER_SIZE length).

In parse.c line 11041 (v2.7.6)
base = xmlParseLookupSequence(ctxt, ']', ']', '>');
if (base < 0) {
  goto done;
} else {
  ...
}
Comment 4 johnnyding 2010-02-25 02:34:47 UTC
Continue with the previous comments.
With that workaround, we don't need to check whether last node was CDATA section or not, just create a new CDATA section like WebKit did
Comment 5 apavlov 2010-07-26 16:25:38 UTC
The behaviour described has been adversely affects XML parsing in the Safari and Chromium/Google Chrome browsers for quite a while, breaking multiple sites (see http://code.google.com/p/chromium/issues/detail?id=36431 for an example) as well as the built-in Web Inspector. Is there any way I can contribute to get this bug fixed? Is any of the solutions suggested above acceptable?
Comment 6 GNOME Infrastructure Team 2021-07-05 13:22:37 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.