Bug 727935 - HTML parsing with early </html> discards elements in lxml.
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: htmlparser
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
Reported: 2014-04-10 07:02 UTC by chris.foo
Modified: 2021-07-05 13:23 UTC
See Also:
GNOME target: ---
GNOME version: ---

Description chris.foo 2014-04-10 07:02:01 UTC
I'm using lxml to parse some HTML soup. If the document contains </html> in the middle, the rest of the document is discarded in lxml. (The original lxml bug report at https://bugs.launchpad.net/lxml/+bug/1305381 was marked invalid.)
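
To check whether a given document has this shape before handing it to lxml, something like the following standard-library sketch can list what follows the first </html>. The names TrailingContentFinder and content_after_html_close are hypothetical, and the sketch uses Python's html.parser rather than libxml2, so it only shows what the input contains, not how libxml2 treats it.

```python
from html.parser import HTMLParser

# Markup from Example 1 of this report.
EXAMPLE_1 = (
    '<!DOCTYPE html>\n'
    '<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'
)


class TrailingContentFinder(HTMLParser):
    """Collect start tags and text that appear after the first </html>."""

    def __init__(self):
        super().__init__()
        self.html_closed = False
        self.trailing = []

    def handle_endtag(self, tag):
        # html.parser lowercases tag names before calling the handlers.
        if tag == "html":
            self.html_closed = True

    def handle_starttag(self, tag, attrs):
        if self.html_closed:
            self.trailing.append("<" + tag + ">")

    def handle_data(self, data):
        # Ignore whitespace-only text such as a final newline.
        if self.html_closed and data.strip():
            self.trailing.append(data.strip())


def content_after_html_close(markup):
    """Return the start tags and text found after the first </html>."""
    finder = TrailingContentFinder()
    finder.feed(markup)
    finder.close()  # flush any text still buffered at end of input
    return finder.trailing


print(content_after_html_close(EXAMPLE_1))
```

An empty list means nothing follows the closing tag; a non-empty list marks a document that may lose elements as described here.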

Example 1:

    <!DOCTYPE html>
    <html><body>1<a href="2"></a></body><img src="3"></html><hr>4

Using xmllint --html gives:

    <!DOCTYPE html>
    <html>
    <body>1<a href="2"></a>
    </body>
    <img src="3">
    </html><html>
    <hr>
    <p>4
    </p>
    </html>

Example 2 (no doctype):

    <html><body>1<a href="2"></a></body><img src="3"></html><hr>4

Using xmllint --html:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <body>1<a href="2"></a>
    </body>
    <img src="3"><html>
    <hr>
    <p>4
    </p>
    </html>
    </html>

For Example 1, lxml does not return the <hr> and <p> elements. For Example 2, lxml returns the complete tree, including the extra nested <html> element.
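
The lxml side can be checked with a short script along these lines. This is only a sketch: it assumes lxml is installed, element_tags is a hypothetical helper, and the tags it prints depend on the libxml2 version lxml was built against, so no expected output is given.

```python
# Markup from Example 1 and Example 2 of this report.
EXAMPLE_1 = (
    '<!DOCTYPE html>\n'
    '<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'
)
EXAMPLE_2 = '<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'


def element_tags(markup):
    """Return the tag names lxml yields for markup, in document order.

    Returns None when lxml is not installed, so the script still runs.
    """
    try:
        from lxml import etree  # third-party dependency
    except ImportError:
        return None
    root = etree.HTML(markup)  # parse with libxml2's HTML parser
    # Comments and processing instructions have a non-string .tag; skip them.
    return [el.tag for el in root.iter() if isinstance(el.tag, str)]


for name, markup in (("Example 1", EXAMPLE_1), ("Example 2", EXAMPLE_2)):
    print(name, element_tags(markup))
```

If the problem reproduces, the list for Example 1 lacks "hr" and "p" while the list for Example 2 contains them.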

Version information (xmllint --version):

    xmllint: using libxml version 20901
       compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma

Comment 1 GNOME Infrastructure Team 2021-07-05 13:23:00 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
that have not been updated for a long time (resources are unfortunately
quite limited, so not every ticket can be handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.