GNOME Bugzilla – Bug 727935
HTML parsing with early </html> discards elements in lxml.
Last modified: 2021-07-05 13:23:00 UTC
I'm using lxml to parse some HTML soup. If the document contains </html> in the middle, the rest of the document is discarded in lxml. (The original lxml bug report at https://bugs.launchpad.net/lxml/+bug/1305381 was marked invalid.)

Example 1:

    <!DOCTYPE html>
    <html><body>1<a href="2"></a></body><img src="3"></html><hr>4

Using xmllint --html gives:

    <!DOCTYPE html>
    <html>
    <body>1<a href="2"></a>
    </body>
    <img src="3">
    </html><html>
    <hr>
    <p>4
    </p>
    </html>

Example 2 (no doctype):

    <html><body>1<a href="2"></a></body><img src="3"></html><hr>4

Using xmllint --html:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <body>1<a href="2"></a>
    </body>
    <img src="3"><html>
    <hr>
    <p>4
    </p>
    </html>
    </html>

For Example 1, lxml will not return the <hr> and <p> elements. For Example 2, lxml will return the tree ok with the extra <html>.

xmllint --version

    xmllint: using libxml version 20901
    compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
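To find out whether a given input is likely to hit this, it may help to check for markup that follows the first </html> before handing the document to lxml. The following is a minimal sketch that uses only Python's standard-library html.parser, not lxml or libxml2, so it does not reproduce the bug itself. The names TrailingContentDetector and content_after_html_close are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class TrailingContentDetector(HTMLParser):
    """Collects start tags and non-blank text seen after the first </html>.

    Hypothetical helper: it only flags suspicious input and makes no
    attempt to repair it or to mimic libxml2's recovery behaviour.
    """

    def __init__(self):
        super().__init__()
        self.html_closed = False
        self.trailing = []

    def handle_endtag(self, tag):
        # Remember that the root element has been closed at least once.
        if tag == "html":
            self.html_closed = True

    def handle_starttag(self, tag, attrs):
        if self.html_closed:
            self.trailing.append(tag)

    def handle_data(self, data):
        # Ignore whitespace-only text such as a final newline.
        if self.html_closed and data.strip():
            self.trailing.append(data.strip())


def content_after_html_close(markup):
    """Return the tag names and text fragments found after the first </html>."""
    detector = TrailingContentDetector()
    detector.feed(markup)
    detector.close()  # flush any text still buffered at end of input
    return detector.trailing


if __name__ == "__main__":
    example_1 = (
        '<!DOCTYPE html>\n'
        '<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'
    )
    print(content_after_html_close(example_1))
```

A non-empty result means the document has content after an early </html>, which is the situation described in Example 1 above.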
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org that have not seen updates for a long time (resources are unfortunately quite limited, so not every ticket can be handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.