GNOME Bugzilla – Bug 553511
entity in attribute parse error
Last modified: 2021-07-05 13:27:07 UTC
Please describe the problem:
When parsing a document with an attribute value containing an entity reference (like attr="&eacute;"), libxml2 builds a wrong DOM if no special options are given: the entity is removed from the attribute and placed before the element.

Steps to reproduce:
1. Create this valid XHTML file (libxml2.txt):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Example</title></head>
<body><p title="&eacute;">.</p></body>
</html>

2. Run xmllint with no options on this file (xmllint libxml2.txt)
3. Observe that an error is reported (for a valid document!) and that a result is still produced, but with &eacute;<p title="">.</p> instead of <p title="&eacute;">.</p>

Actual results:
06:24:53 marc@kameha /tmp xmllint libxml2.txt
libxml2.txt:5: parser error : Entity 'eacute' not defined
<body><p title="&eacute;">.</p></body>
^
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Example</title></head>
<body>&eacute;<p title="">.</p></body>
</html>
06:25:03 marc@kameha /tmp

Expected results:
06:24:53 marc@kameha /tmp xmllint libxml2.txt
^
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Example</title></head>
<body><p title="&eacute;">.</p></body>
</html>
06:25:03 marc@kameha /tmp

Does this happen every time?
Yes

Other information:
xmllint is the easiest way to see and reproduce the bug, but the bug itself is in libxml2.
I'm seeing the same problem using the lxml library.
If xmlParseEntityRef finds an undeclared entity and [ WFC: Entity Declared ] isn't violated, it must not call sax->reference (this doesn't work in attributes). Instead, either
1. it should create a dummy extSubset in the document and add an entity without content, or
2. the callers should check for XML_WAR_UNDECLARED_ENTITY and handle the situation themselves.

Solution 1 might confuse existing users who don't expect such dummy external subsets. For solution 2, the callers of xmlParseEntityRef have to get the name of the entity somehow, but xmlParseEntityRef throws the name away if it can't resolve the entity. I'm not sure which approach is best.
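To make solution 2 a bit more concrete, here is a toy model in Python. It is not libxml2 code: EntityRef, parse_entity_ref, parse_attribute_value and serialize are hypothetical names invented for this sketch, and the regular expression only approximates XML names. The point it illustrates is that the lookup hands the entity name back to its caller even when resolution fails, so the attribute parser can keep the reference inside the attribute value rather than reporting it to the enclosing element.

```python
import re
from dataclasses import dataclass

# Toy model only: none of these names exist in libxml2.
# Simplified pattern for "&name;" (real XML names allow more characters).
REF = re.compile(r"&([A-Za-z_][\w.-]*);")


@dataclass
class EntityRef:
    """Placeholder node for a reference to an undeclared entity."""
    name: str


def parse_entity_ref(name, declared):
    """Return (replacement text or None, name); the name is never discarded."""
    return declared.get(name), name


def parse_attribute_value(raw, declared):
    """Split an attribute value into text pieces and unresolved EntityRef nodes."""
    children, pos = [], 0
    for m in REF.finditer(raw):
        if m.start() > pos:
            children.append(raw[pos:m.start()])
        text, name = parse_entity_ref(m.group(1), declared)
        # Undeclared entity: keep the reference as a child of the attribute
        # instead of handing it to the element that is still being built.
        children.append(text if text is not None else EntityRef(name))
        pos = m.end()
    if pos < len(raw):
        children.append(raw[pos:])
    return children


def serialize(children):
    """Write the children back out; text pieces are not re-escaped in this toy."""
    return "".join(
        f"&{c.name};" if isinstance(c, EntityRef) else c for c in children
    )
```

With this shape, an attribute such as title="&eacute;" round-trips unchanged when eacute is undeclared, which matches the expected result in the original report. Whether a similar change to the return value of xmlParseEntityRef is acceptable for existing callers is exactly the open question above.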
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.