GNOME Bugzilla – Bug 760367
REGRESSION (v2.9.3): Entity is parsed and expanded twice under certain conditions
Last modified: 2017-06-20 10:07:34 UTC
WebKit defines a xmlSAXHandler.getEntity handler method that returns an xmlEntityPtr to a custom xmlEntity. The xmlEntity looks like this (at line 7281 of parser.c in v2.9.3 the first time it is encountered): (lldb) p *ent (xmlEntity) $0 = { _private = 0x0000000000000000 type = XML_ENTITY_DECL name = 0x00007f9e9b8054b5 "nbsp" children = 0x0000000000000000 last = 0x0000000000000000 parent = 0x0000000000000000 next = 0x0000000000000000 prev = 0x0000000000000000 doc = 0x0000000000000000 orig = 0x000000010571f4ec " " content = 0x000000010571f4ec " " length = 2 etype = XML_INTERNAL_GENERAL_ENTITY ExternalID = 0x0000000000000000 SystemID = 0x0000000000000000 nexte = 0x0000000000000000 URI = 0x0000000000000000 owner = 0 checked = 0 } However, with libxml2 v2.9.3, this entity now gets expanded twice, which causes regressions in WebKit's layout test suite, and is (obviously) undesired behavior. Using git-bisect, I found this regressed with the fix for Bug 738805 in v2.9.3: Regression in 2.9.2: entity is not parsed if used in another one, which has been previously parsed <https://bugzilla.gnome.org/show_bug.cgi?id=738805> <https://git.gnome.org/browse/libxml2/commit/?id=df23f584fda15955a0811bd768a8925eb98741c9> For reference, here's what the "&a;" and "&b;" entities from test/ent_738805.xml look like the first time they are encountered: (lldb) p *ent (xmlEntity) $0 = { _private = 0x0000000000000000 type = XML_ENTITY_DECL name = 0x000000010100be5f "a" children = 0x0000000000000000 last = 0x0000000000000000 parent = 0x0000000100400f70 next = 0x00000001004010e0 prev = 0x0000000000000000 doc = 0x0000000100400ec0 orig = 0x0000000100400ff0 "something" content = 0x0000000100401240 "something" length = 9 etype = XML_INTERNAL_GENERAL_ENTITY ExternalID = 0x0000000000000000 SystemID = 0x0000000000000000 nexte = 0x0000000000000000 URI = 0x0000000000000000 owner = 0 checked = 2 } (lldb) p *ent (xmlEntity) $2 = { _private = 0x0000000000000000 type = XML_ENTITY_DECL name = 0x000000010100be61 "b" children = 0x0000000000000000 last = 0x0000000000000000 parent = 0x0000000100400f70 next = 0x0000000000000000 prev = 0x00000001004011b0 doc = 0x0000000100400ec0 orig = 0x0000000100401060 "&a;" content = 0x000000010100be63 "&a;" length = 3 etype = XML_INTERNAL_GENERAL_ENTITY ExternalID = 0x0000000000000000 SystemID = 0x0000000000000000 nexte = 0x0000000000000000 URI = 0x0000000000000000 owner = 0 checked = 0 }
Note that it's possible WebKit is misusing (or abusing) the contract for xmlSAXHandler.getEntity, so any additional info would be appreciated. This likely affects Google Chrome (Blink engine) as well. I don't think their code has changed in this area since forking from WebKit.
Note that WebKit parses XML content using UTF-16, which is why xmlEntity.length = 2 in Comment #0.
Created attachment 318589 [details] [review] Patch v1 This patch fixes the regression for WebKit. I also considered adding a more restrictive check that the entity's document matched the current context's document instead (which also fixes the WebKit regression): - ((ent->children == NULL) && (ctxt->options & XML_PARSE_NOENT))) && + ((ent->children == NULL) && (ent->doc == ctxt->myDoc) && (ctxt->options & XML_PARSE_NOENT))) && I'm not sure whether that's a better fix or not, though. Note that both of these changes also preserve the fix for Bug 738805.
I don't think my explanation above is the whole story. I have another WebKit test case which defines an entity in an SVG document (not from the xmlSAXHandler.getEntity callback), and that entity is expanded twice as well. Here's the plaintext version of the test file: <http://trac.webkit.org/browser/trunk/LayoutTests/svg/custom/viewbox-syntax.svg?format=txt> The issue is that the "&Smile;" entity is expanded twice (each time it appears) when parsing the document, even though xmlEntity.checked is non-zero. I believe these are the two sources of double expansion inside xmlParseReference(): 1. This code now gets executed due to the change from Bug 738805: /* * Check that this entity is well formed * 4.3.2: An internal general parsed entity is well-formed * if its replacement text matches the production labeled * content. */ if (ent->etype == XML_INTERNAL_GENERAL_ENTITY) { ctxt->depth++; ret = xmlParseBalancedChunkMemoryInternal(ctxt, ent->content, user_data, &list); ctxt->depth--; 2. This is the code that was running before the change to Bug 738805: /* * Now that the entity content has been gathered * provide it to the application, this can take different forms based * on the parsing modes. */ if (ent->children == NULL) { /* * Probably running in SAX mode and the callbacks don't * build the entity content. So unless we already went * though parsing for first checking go though the entity * content to generate callbacks associated to the entity */ if (was_checked != 0) { void *user_data; /* * This is a bit hackish but this seems the best * way to make sure both SAX and DOM entity support * behaves okay. */ if (ctxt->userData == ctxt) user_data = NULL; else user_data = ctxt->userData; if (ent->etype == XML_INTERNAL_GENERAL_ENTITY) { ctxt->depth++; ret = xmlParseBalancedChunkMemoryInternal(ctxt, ent->content, user_data, NULL); I'm trying to figure out how to create a stand-alone reproducible test case now.
(In reply to David Kilzer from comment #0) > Using git-bisect, I found this regressed with the fix for Bug 738805 in > v2.9.3: > > Regression in 2.9.2: entity is not parsed if used in another one, which has > been previously parsed > <https://bugzilla.gnome.org/show_bug.cgi?id=738805> > <https://git.gnome.org/browse/libxml2/commit/ > ?id=df23f584fda15955a0811bd768a8925eb98741c9> Oops, the above commit was for the test case. The behavior regressed with: <https://git.gnome.org/browse/libxml2/commit/?id=72a46a519ce7326d9a00f0b6a7f2a8e958cd1675>
(In reply to David Kilzer from comment #4) > I don't think my explanation above is the whole story. > > I have another WebKit test case which defines an entity in an SVG document > (not from the xmlSAXHandler.getEntity callback), and that entity is expanded > twice as well. Here's the plaintext version of the test file: > > <http://trac.webkit.org/browser/trunk/LayoutTests/svg/custom/viewbox-syntax. > svg?format=txt> > > The issue is that the "&Smile;" entity is expanded twice (each time it > appears) when parsing the document, even though xmlEntity.checked is > non-zero. > > I believe these are the two sources of double expansion inside > xmlParseReference(): Oops, I think this just causes the entity to be parsed twice. (That's still bad, but not the cause of the double expansion.) Still investigating which code is doing the double expansion.
Updated title. So I'm seeing double parsing and double expansion of entities in two cases where there are "error conditions": 1. When a custom xmlEntity is returned from xmlSAXHandler.getEntity (as is what happens with WebKit, and probably Chrome/Blink). This was the original issue that I filed this bug for, and it probably only happens in WebKit (and Blink). We're basically trying to get libxml2 to do the entity expansion for us for certain entities as if they were built-in. (If we're doing something bad/wrong here, let us know.) 2. When loading particular test cases in WebKit like LayoutTests/svg/custom/viewbox-syntax.svg that generate warning messages like "Namespace default prefix was not found": $ xmllint --noent LayoutTests/svg/custom/viewbox-syntax.svg namespace warning : Namespace default prefix was not found <rect x='.5' y='.5' width='29' height='39' fill='black' stroke='red'/> ^ namespace warning : Namespace default prefix was not found <g transform='translate(0, 5)'> ^ namespace warning : Namespace default prefix was not found <circle cx='15' cy='15' r='10' fill='yellow'/> ^ namespace warning : Namespace default prefix was not found <circle cx='12' cy='12' r='1.5' fill='black'/> ^ namespace warning : Namespace default prefix was not found <circle cx='17' cy='12' r='1.5' fill='black'/> ^ namespace warning : Namespace default prefix was not found <path d='M 10 19 L 15 23 20 19' stroke='black' stroke-width='2'/> ^ [...content with entities substituted...] When I fix the ENTITIES defined in the viewbox-syntax.svg test case to contain proper <svg></svg> tags, the bad behavior goes away. Interestingly, both #1 and #2 can be fixed by the same change to update a state variable (was_checked) inside xmlParseReference(), which is not Attachment 318589 [details]. I will post a patch for that next.
Created attachment 318993 [details] [review] Patch v2 Fixes both issues noted in Comment #7 by updating a state variable (was_checked) internal to xmlParseReference() to prevent secondary parsing and expansion later in the same method. I have no idea if this is correct, but 'make tests' shows no regressions (including the fix for Bug 738805), and this fixes the issues seen in WebKit's layout tests.
This regression bug hits downstream projects that use Qt's webkit as well. Sigil is a cross platform gpl opensource epub editor and has received bug reports about entities being duplicated when loaded into a Qt QWebView widget (primarily from Linux users using libxml2 2.9.3 and later). We tracked it down to the exact same change introduced in 2.9.3. If the submitted v2 patch fixes the regression issue, please include it into your next release so we can point people to the latest version of libxml2 as a solution. As of right now we can only push people back to libxml2 2.9.2 to get back correct behaviour. This bug will impact all projects that use Qt 4, and 5) and that allow people to use html entities in xhtml so this is a significant regression. Please see: https://github.com/Sigil-Ebook/Sigil/issues/232
Dear Daniel, would you perhaps have the time to look at this bug? It's very boring, and furthermore is a regression... Thanks in advance.
I ran into this as well, when trying to upgrade the libxml version used in WebKitGTK+ layout tests. Chromium added this patch after upgrading to 2.9.4: https://codereview.chromium.org/2539003002 I haven't looked at it in detail, but it might be a fix for the same issue.
Here's a way to reproduce this directly with libxml2. ------------------------------ $ cat test.xml <!DOCTYPE doc [ <!ENTITY ent "<value/>"> ]> <doc>&ent;&ent;</doc> $ ./testSAX --sax2 --noent test.xml SAX.setDocumentLocator() SAX.startDocument() SAX.internalSubset(doc, , ) SAX.entityDecl(ent, 1, (null), (null), <value/>) SAX.getEntity(ent) SAX.externalSubset(doc, , ) SAX.startElementNs(doc, NULL, NULL, 0, 0, 0) SAX.getEntity(ent) SAX.startElementNs(value, NULL, NULL, 0, 0, 0) SAX.endElementNs(value, NULL, NULL) SAX.getEntity(ent) SAX.startElementNs(value, NULL, NULL, 0, 0, 0) SAX.endElementNs(value, NULL, NULL) SAX.startElementNs(value, NULL, NULL, 0, 0, 0) SAX.endElementNs(value, NULL, NULL) SAX.endElementNs(doc, NULL, NULL) SAX.endDocument() ------------------------------ Note the duplicate SAX callbacks for the second entity. David's patch looks OK, but with a separate flag in struct xmlEntity the whole issue could be solved in a more maintainable way. Also note that this only happens with XML_PARSE_NOENT which is why this wasn't caught by the regression tests. Fixed with the following commit: https://git.gnome.org/browse/libxml2/commit/?id=3f0627a1ee854dc965bf61dd6466a798b80cf088 Regression tests for SAX2 callbacks with entity substitution added here: https://git.gnome.org/browse/libxml2/commit/?id=dbaab1f3693bfa3bf27772371542f321f47539fd I also checked that there aren't any other regressions since 2.9.0.
(In reply to Nick Wellnhofer from comment #12) > Here's a way to reproduce this directly with libxml2. > > ------------------------------ > > $ cat test.xml > <!DOCTYPE doc [ > <!ENTITY ent "<value/>"> > ]> > <doc>&ent;&ent;</doc> > > $ ./testSAX --sax2 --noent test.xml > SAX.setDocumentLocator() > SAX.startDocument() > SAX.internalSubset(doc, , ) > SAX.entityDecl(ent, 1, (null), (null), <value/>) > SAX.getEntity(ent) > SAX.externalSubset(doc, , ) > SAX.startElementNs(doc, NULL, NULL, 0, 0, 0) > SAX.getEntity(ent) > SAX.startElementNs(value, NULL, NULL, 0, 0, 0) > SAX.endElementNs(value, NULL, NULL) > SAX.getEntity(ent) > SAX.startElementNs(value, NULL, NULL, 0, 0, 0) > SAX.endElementNs(value, NULL, NULL) > SAX.startElementNs(value, NULL, NULL, 0, 0, 0) > SAX.endElementNs(value, NULL, NULL) > SAX.endElementNs(doc, NULL, NULL) > SAX.endDocument() > > ------------------------------ > > Note the duplicate SAX callbacks for the second entity. David's patch looks > OK, but with a separate flag in struct xmlEntity the whole issue could be > solved in a more maintainable way. Wouldn't changing struct _xmlEntity cause binary compatibility issues with existing binaries since it's defined in the public header file, entities.h? > Also note that this only happens with > XML_PARSE_NOENT which is why this wasn't caught by the regression tests. > > Fixed with the following commit: > > https://git.gnome.org/browse/libxml2/commit/ > ?id=3f0627a1ee854dc965bf61dd6466a798b80cf088 > > Regression tests for SAX2 callbacks with entity substitution added here: > > https://git.gnome.org/browse/libxml2/commit/ > ?id=dbaab1f3693bfa3bf27772371542f321f47539fd > > I also checked that there aren't any other regressions since 2.9.0. Thanks for landing my patch, and adding more tests!
> Wouldn't changing struct _xmlEntity cause binary compatibility issues with > existing binaries since it's defined in the public header file, entities.h? In libxml2, it's always safe to append struct fields. That's because users are expected to always allocate structs using libxml2 functions, so that a wrong struct size is never compiled into client code. If you look at the history of structs like xmlParserCtxt, this has been done many times in the past.