GNOME Bugzilla – Bug 610802
In push mode, libxml2 may create multiple CDATA sections for original single CDATA section, or merger multiple adjacent CDATA sections to single one
Last modified: 2021-07-05 13:22:37 UTC
When using libxml2 to parse XML document in push mode, I found libxml2 may create multiple CDATA sections for original single CDATA section, or merger multiple CDATA sections to single one. At first, I use xmllint(libxml2 v2.7.6) to test the attached test4.xml, which only contains two adjacent CDATA sections. In libxml2 default SAX cdataBlock handler (SAX2.c, line 2679), it checked whether last child node was CDATA section. If yes, it appended the current input to the last CDATA section node. If not, it created a new CDATA section node to contain the current input. This behavior would cause that multiple adjacent CDATA sections automatically was combined to one single CDATA sections, which did not exactly parse out the original xml structure. Some other implementation of SAX cdataBlock handler, like WebKit XMLTokenizer (XMLTokenizer::cdataBlock, XMLTokenizerlibxml2.cpp, line 956), didn't check whether the previous node (last child node) was CDATA section and try to merge them if the previous node is CDATA section. It alwasy created a CDATA section to contains input data. Which would cause that one single CDATA section was parsed to multiple CDATA sections. For example, the contents of first CDATA section in test4.xml are almost 7K, in WebKit(push mode), the whole CDATA section may not one-time be sent to libxml2 parser, so the libxml2 parser created one or a few CDATA section(s) which is/are 300(XML_PARSER_BUG_BUFFER_SIZE) characters since it could NOT find the valid end tag of CDATA section. Until the last part data came, the libxml parser found the the valid end tag "]]>", then the rest contents of the original CDATA section were put into another CDATA section. (please refer to parse.c, code: base =xmlParseLookupSequence(ctxt,']',']','>') ) Now we got multiple CDATA sections instead of single CDATA section like the original xml data. Which may break some apps. This issues is because in push mode, we use small buffer(XML_PARSER_BUG_BUFFER_SIZE) to continuously parse big xml data. So my proposal is to add a flag to indicate whether the data of CDATA section is terminated or not, then the parser context can use this flag to decide whether current input can be appended to last CDATA section node or not. The pseudo code lists below typedef void (*cdataBlockSAXFunc) (void *ctx, const xmlChar *value, int len, int terminate); xmlSAX2CDataBlock(void *ctx, const xmlChar *value, int len, int terminate) { ... lastChild = xmlGetLastChild(ctxt->node); ... if ((lastChild != NULL) && (lastChild->type == XML_CDATA_SECTION_NODE) && !ctx->lastCDATAIsTerminated) { xmlTextConcat(lastChild, value, len); } else { ret = xmlNewCDataBlock(ctxt->myDoc, value, len); xmlAddChild(ctxt->node, ret); } ctx->lastCDATAIsTerminated = terminate; }
Created attachment 154489 [details] xml file only contains two adjacent CDATA sections
I think an ugly workaround without changing the API might be: void handleCDataBlock(void *ctx, const xmlChar *value, int len) { int terminated = (value[len] == ']' && value[len+1] == ']' && value[len+2] == '>'); .... }
another quick and dirty solution might be skipping to the step of creating a temporary CDATA section(XML_PARSER_BIG_BUFFER_SIZE length). In parse.c line 11041 (v2.7.6) base = xmlParseLookupSequence(ctxt, ']', ']', '>'); if (base < 0) { goto done; } else { ... }
Continue with the previous comments. With that workaround, we don't need to check whether last node was CDATA section or not, just create a new CDATA section like WebKit did
The behaviour described has been adversely affects XML parsing in the Safari and Chromium/Google Chrome browsers for quite a while, breaking multiple sites (see http://code.google.com/p/chromium/issues/detail?id=36431 for an example) as well as the built-in Web Inspector. Is there any way I can contribute to get this bug fixed? Is any of the solutions suggested above acceptable?
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.