After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 760367 - REGRESSION (v2.9.3): Entity is parsed and expanded twice under certain conditions
REGRESSION (v2.9.3): Entity is parsed and expanded twice under certain condit...
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
git master
Other All
: Normal normal
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on: 738805
Blocks:
 
 
Reported: 2016-01-09 16:27 UTC by David Kilzer
Modified: 2017-06-20 10:07 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch v1 (1.21 KB, patch)
2016-01-09 17:05 UTC, David Kilzer
none Details | Review
Patch v2 (932 bytes, patch)
2016-01-14 01:00 UTC, David Kilzer
none Details | Review

Description David Kilzer 2016-01-09 16:27:32 UTC
WebKit defines a xmlSAXHandler.getEntity handler method that returns an xmlEntityPtr to a custom xmlEntity.  The xmlEntity looks like this (at line 7281 of parser.c in v2.9.3 the first time it is encountered):

(lldb) p *ent
(xmlEntity) $0 = {
  _private = 0x0000000000000000
  type = XML_ENTITY_DECL
  name = 0x00007f9e9b8054b5 "nbsp"
  children = 0x0000000000000000
  last = 0x0000000000000000
  parent = 0x0000000000000000
  next = 0x0000000000000000
  prev = 0x0000000000000000
  doc = 0x0000000000000000
  orig = 0x000000010571f4ec " "
  content = 0x000000010571f4ec " "
  length = 2
  etype = XML_INTERNAL_GENERAL_ENTITY
  ExternalID = 0x0000000000000000
  SystemID = 0x0000000000000000
  nexte = 0x0000000000000000
  URI = 0x0000000000000000
  owner = 0
  checked = 0
}

However, with libxml2 v2.9.3, this entity now gets expanded twice, which causes regressions in WebKit's layout test suite, and is (obviously) undesired behavior.

Using git-bisect, I found this regressed with the fix for Bug 738805 in v2.9.3:

Regression in 2.9.2: entity is not parsed if used in another one, which has been previously parsed
<https://bugzilla.gnome.org/show_bug.cgi?id=738805>
<https://git.gnome.org/browse/libxml2/commit/?id=df23f584fda15955a0811bd768a8925eb98741c9>

For reference, here's what the "&a;" and "&b;" entities from test/ent_738805.xml look like the first time they are encountered:

(lldb) p *ent
(xmlEntity) $0 = {
  _private = 0x0000000000000000
  type = XML_ENTITY_DECL
  name = 0x000000010100be5f "a"
  children = 0x0000000000000000
  last = 0x0000000000000000
  parent = 0x0000000100400f70
  next = 0x00000001004010e0
  prev = 0x0000000000000000
  doc = 0x0000000100400ec0
  orig = 0x0000000100400ff0 "something"
  content = 0x0000000100401240 "something"
  length = 9
  etype = XML_INTERNAL_GENERAL_ENTITY
  ExternalID = 0x0000000000000000
  SystemID = 0x0000000000000000
  nexte = 0x0000000000000000
  URI = 0x0000000000000000
  owner = 0
  checked = 2
}

(lldb)  p *ent
(xmlEntity) $2 = {
  _private = 0x0000000000000000
  type = XML_ENTITY_DECL
  name = 0x000000010100be61 "b"
  children = 0x0000000000000000
  last = 0x0000000000000000
  parent = 0x0000000100400f70
  next = 0x0000000000000000
  prev = 0x00000001004011b0
  doc = 0x0000000100400ec0
  orig = 0x0000000100401060 "&a;"
  content = 0x000000010100be63 "&a;"
  length = 3
  etype = XML_INTERNAL_GENERAL_ENTITY
  ExternalID = 0x0000000000000000
  SystemID = 0x0000000000000000
  nexte = 0x0000000000000000
  URI = 0x0000000000000000
  owner = 0
  checked = 0
}
Comment 1 David Kilzer 2016-01-09 16:39:24 UTC
Note that it's possible WebKit is misusing (or abusing) the contract for xmlSAXHandler.getEntity, so any additional info would be appreciated.

This likely affects Google Chrome (Blink engine) as well.  I don't think their code has changed in this area since forking from WebKit.
Comment 2 David Kilzer 2016-01-09 16:41:10 UTC
Note that WebKit parses XML content using UTF-16, which is why xmlEntity.length = 2 in Comment #0.
Comment 3 David Kilzer 2016-01-09 17:05:15 UTC
Created attachment 318589 [details] [review]
Patch v1

This patch fixes the regression for WebKit.

I also considered adding a more restrictive check that the entity's document matched the current context's document instead (which also fixes the WebKit regression):

-         ((ent->children == NULL) && (ctxt->options & XML_PARSE_NOENT))) &&
+         ((ent->children == NULL) && (ent->doc == ctxt->myDoc) && (ctxt->options & XML_PARSE_NOENT))) &&

I'm not sure whether that's a better fix or not, though.

Note that both of these changes also preserve the fix for Bug 738805.
Comment 4 David Kilzer 2016-01-11 20:04:17 UTC
I don't think my explanation above is the whole story.

I have another WebKit test case which defines an entity in an SVG document (not from the xmlSAXHandler.getEntity callback), and that entity is expanded twice as well.  Here's the plaintext version of the test file:

<http://trac.webkit.org/browser/trunk/LayoutTests/svg/custom/viewbox-syntax.svg?format=txt>

The issue is that the "&Smile;" entity is expanded twice (each time it appears) when parsing the document, even though xmlEntity.checked is non-zero.

I believe these are the two sources of double expansion inside xmlParseReference():

1. This code now gets executed due to the change from Bug 738805:

	/*
	 * Check that this entity is well formed
	 * 4.3.2: An internal general parsed entity is well-formed
	 * if its replacement text matches the production labeled
	 * content.
	 */
	if (ent->etype == XML_INTERNAL_GENERAL_ENTITY) {
	    ctxt->depth++;
	    ret = xmlParseBalancedChunkMemoryInternal(ctxt, ent->content,
	                                              user_data, &list);
	    ctxt->depth--;

2. This is the code that was running before the change to Bug 738805:

    /*
     * Now that the entity content has been gathered
     * provide it to the application, this can take different forms based
     * on the parsing modes.
     */
    if (ent->children == NULL) {
	/*
	 * Probably running in SAX mode and the callbacks don't
	 * build the entity content. So unless we already went
	 * though parsing for first checking go though the entity
	 * content to generate callbacks associated to the entity
	 */
	if (was_checked != 0) {
	    void *user_data;
	    /*
	     * This is a bit hackish but this seems the best
	     * way to make sure both SAX and DOM entity support
	     * behaves okay.
	     */
	    if (ctxt->userData == ctxt)
		user_data = NULL;
	    else
		user_data = ctxt->userData;

	    if (ent->etype == XML_INTERNAL_GENERAL_ENTITY) {
		ctxt->depth++;
		ret = xmlParseBalancedChunkMemoryInternal(ctxt,
				   ent->content, user_data, NULL);

I'm trying to figure out how to create a stand-alone reproducible test case now.
Comment 5 David Kilzer 2016-01-11 20:05:44 UTC
(In reply to David Kilzer from comment #0)
> Using git-bisect, I found this regressed with the fix for Bug 738805 in
> v2.9.3:
> 
> Regression in 2.9.2: entity is not parsed if used in another one, which has
> been previously parsed
> <https://bugzilla.gnome.org/show_bug.cgi?id=738805>
> <https://git.gnome.org/browse/libxml2/commit/
> ?id=df23f584fda15955a0811bd768a8925eb98741c9>

Oops, the above commit was for the test case.  The behavior regressed with:

<https://git.gnome.org/browse/libxml2/commit/?id=72a46a519ce7326d9a00f0b6a7f2a8e958cd1675>
Comment 6 David Kilzer 2016-01-11 23:37:51 UTC
(In reply to David Kilzer from comment #4)
> I don't think my explanation above is the whole story.
> 
> I have another WebKit test case which defines an entity in an SVG document
> (not from the xmlSAXHandler.getEntity callback), and that entity is expanded
> twice as well.  Here's the plaintext version of the test file:
> 
> <http://trac.webkit.org/browser/trunk/LayoutTests/svg/custom/viewbox-syntax.
> svg?format=txt>
> 
> The issue is that the "&Smile;" entity is expanded twice (each time it
> appears) when parsing the document, even though xmlEntity.checked is
> non-zero.
> 
> I believe these are the two sources of double expansion inside
> xmlParseReference():

Oops, I think this just causes the entity to be parsed twice.  (That's still bad, but not the cause of the double expansion.)

Still investigating which code is doing the double expansion.
Comment 7 David Kilzer 2016-01-13 23:50:47 UTC
Updated title.  So I'm seeing double parsing and double expansion of entities in two cases where there are "error conditions":

1. When a custom xmlEntity is returned from xmlSAXHandler.getEntity (as is what happens with WebKit, and probably Chrome/Blink).  This was the original issue that I filed this bug for, and it probably only happens in WebKit (and Blink).  We're basically trying to get libxml2 to do the entity expansion for us for certain entities as if they were built-in.  (If we're doing something bad/wrong here, let us know.)

2. When loading particular test cases in WebKit like LayoutTests/svg/custom/viewbox-syntax.svg that generate warning messages like "Namespace default prefix was not found":

$ xmllint --noent LayoutTests/svg/custom/viewbox-syntax.svg
namespace warning : Namespace default prefix was not found
<rect x='.5' y='.5' width='29' height='39' fill='black' stroke='red'/>
                                                                    ^
namespace warning : Namespace default prefix was not found
<g transform='translate(0, 5)'>
                              ^
namespace warning : Namespace default prefix was not found
<circle cx='15' cy='15' r='10' fill='yellow'/>
                                            ^
namespace warning : Namespace default prefix was not found
<circle cx='12' cy='12' r='1.5' fill='black'/>
                                            ^
namespace warning : Namespace default prefix was not found
<circle cx='17' cy='12' r='1.5' fill='black'/>
                                            ^
namespace warning : Namespace default prefix was not found
<path d='M 10 19 L 15 23 20 19' stroke='black' stroke-width='2'/>
                                                               ^
[...content with entities substituted...]

When I fix the ENTITIES defined in the viewbox-syntax.svg test case to contain proper <svg></svg> tags, the bad behavior goes away.

Interestingly, both #1 and #2 can be fixed by the same change to update a state variable (was_checked) inside xmlParseReference(), which is not Attachment 318589 [details].  I will post a patch for that next.
Comment 8 David Kilzer 2016-01-14 01:00:22 UTC
Created attachment 318993 [details] [review]
Patch v2

Fixes both issues noted in Comment #7 by updating a state variable (was_checked) internal to xmlParseReference() to prevent secondary parsing and expansion later in the same method.

I have no idea if this is correct, but 'make tests' shows no regressions (including the fix for Bug 738805), and this fixes the issues seen in WebKit's layout tests.
Comment 9 Kevin Hendricks 2016-07-13 13:01:43 UTC
This regression bug hits downstream projects that use Qt's webkit as well.  Sigil is a cross platform gpl opensource epub editor and has received bug reports about entities being duplicated when loaded into a Qt QWebView widget (primarily from Linux users using libxml2 2.9.3 and later).

We tracked it down to the exact same change introduced in 2.9.3.   If the submitted v2 patch fixes the regression issue, please include it into your next release so we can point people to the latest version of libxml2 as a solution.  As of right now we can only push people back to libxml2 2.9.2 to get back correct behaviour.

This bug will impact all projects that use Qt 4, and 5) and that allow people to use html entities in xhtml so this is a significant regression.

Please see:  https://github.com/Sigil-Ebook/Sigil/issues/232
Comment 10 Mattia Rizzolo 2016-09-04 19:32:07 UTC
Dear Daniel,

would you perhaps have the time to look at this bug?
It's very boring, and furthermore is a regression...

Thanks in advance.
Comment 11 Carlos Garcia Campos 2017-04-10 10:25:42 UTC
I ran into this as well, when trying to upgrade the libxml version used in WebKitGTK+ layout tests. Chromium added this patch after upgrading to 2.9.4:

https://codereview.chromium.org/2539003002

I haven't looked at it in detail, but it might be a fix for the same issue.
Comment 12 Nick Wellnhofer 2017-06-16 19:49:28 UTC
Here's a way to reproduce this directly with libxml2.

------------------------------

$ cat test.xml
<!DOCTYPE doc [
  <!ENTITY ent "<value/>">
]>
<doc>&ent;&ent;</doc>

$ ./testSAX --sax2 --noent test.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.internalSubset(doc, , )
SAX.entityDecl(ent, 1, (null), (null), <value/>)
SAX.getEntity(ent)
SAX.externalSubset(doc, , )
SAX.startElementNs(doc, NULL, NULL, 0, 0, 0)
SAX.getEntity(ent)
SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
SAX.endElementNs(value, NULL, NULL)
SAX.getEntity(ent)
SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
SAX.endElementNs(value, NULL, NULL)
SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
SAX.endElementNs(value, NULL, NULL)
SAX.endElementNs(doc, NULL, NULL)
SAX.endDocument()

------------------------------

Note the duplicate SAX callbacks for the second entity. David's patch looks OK, but with a separate flag in struct xmlEntity the whole issue could be solved in a more maintainable way. Also note that this only happens with XML_PARSE_NOENT which is why this wasn't caught by the regression tests.

Fixed with the following commit:

https://git.gnome.org/browse/libxml2/commit/?id=3f0627a1ee854dc965bf61dd6466a798b80cf088

Regression tests for SAX2 callbacks with entity substitution added here:

https://git.gnome.org/browse/libxml2/commit/?id=dbaab1f3693bfa3bf27772371542f321f47539fd

I also checked that there aren't any other regressions since 2.9.0.
Comment 13 David Kilzer 2017-06-19 19:47:06 UTC
(In reply to Nick Wellnhofer from comment #12)
> Here's a way to reproduce this directly with libxml2.
> 
> ------------------------------
> 
> $ cat test.xml
> <!DOCTYPE doc [
>   <!ENTITY ent "<value/>">
> ]>
> <doc>&ent;&ent;</doc>
> 
> $ ./testSAX --sax2 --noent test.xml
> SAX.setDocumentLocator()
> SAX.startDocument()
> SAX.internalSubset(doc, , )
> SAX.entityDecl(ent, 1, (null), (null), <value/>)
> SAX.getEntity(ent)
> SAX.externalSubset(doc, , )
> SAX.startElementNs(doc, NULL, NULL, 0, 0, 0)
> SAX.getEntity(ent)
> SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
> SAX.endElementNs(value, NULL, NULL)
> SAX.getEntity(ent)
> SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
> SAX.endElementNs(value, NULL, NULL)
> SAX.startElementNs(value, NULL, NULL, 0, 0, 0)
> SAX.endElementNs(value, NULL, NULL)
> SAX.endElementNs(doc, NULL, NULL)
> SAX.endDocument()
> 
> ------------------------------
> 
> Note the duplicate SAX callbacks for the second entity. David's patch looks
> OK, but with a separate flag in struct xmlEntity the whole issue could be
> solved in a more maintainable way.

Wouldn't changing struct _xmlEntity cause binary compatibility issues with existing binaries since it's defined in the public header file, entities.h?

> Also note that this only happens with
> XML_PARSE_NOENT which is why this wasn't caught by the regression tests.
> 
> Fixed with the following commit:
> 
> https://git.gnome.org/browse/libxml2/commit/
> ?id=3f0627a1ee854dc965bf61dd6466a798b80cf088
> 
> Regression tests for SAX2 callbacks with entity substitution added here:
> 
> https://git.gnome.org/browse/libxml2/commit/
> ?id=dbaab1f3693bfa3bf27772371542f321f47539fd
> 
> I also checked that there aren't any other regressions since 2.9.0.

Thanks for landing my patch, and adding more tests!
Comment 14 Nick Wellnhofer 2017-06-20 10:07:34 UTC
> Wouldn't changing struct _xmlEntity cause binary compatibility issues with
> existing binaries since it's defined in the public header file, entities.h?

In libxml2, it's always safe to append struct fields. That's because users are expected to always allocate structs using libxml2 functions, so that a wrong struct size is never compiled into client code. If you look at the history of structs like xmlParserCtxt, this has been done many times in the past.