Bug 651165 - HTML parser produces weird output for invalid HTML
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: htmlparser
Version: 2.7.8
Hardware: Other Linux
Importance: Normal minor
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2011-05-26 16:15 UTC by Rainer Canavan
Modified: 2021-07-05 13:21 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Test File with invalid attribute name (75 bytes, application/octet-stream)
2011-05-26 16:15 UTC, Rainer Canavan

Description Rainer Canavan 2011-05-26 16:15:56 UTC
Created attachment 188680
Test File with invalid attribute name

A test case for our product started failing after the update from libxml2 2.7.6 to 2.7.8. The output of xmllint -html is considerably more broken than in previous versions, and more broken than the original file. Note that the attribute name starts with "_" followed by a vertical tab (character 0x0B). At least for this test case, it would be ideal if htmlParseContentInternal() in HTMLparser.c were fixed to simply discard all characters up to the first closing '>'.
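To make the request concrete, here is a minimal Python sketch of the "discard everything up to the first closing '>'" recovery. It is not libxml2 code and does not reflect how HTMLparser.c is structured. `SAMPLE` is an assumed reconstruction of the attachment, inferred from the error output below (it matches the reported 75 bytes and line numbers, but the real file may differ), and `skip_to_tag_end` and `recover` are hypothetical helper names.

```python
# Sketch only: illustrates the proposed recovery, not libxml2's implementation.

# Assumed reconstruction of the attached test file, inferred from the
# xmllint error messages; the actual attachment may differ.
SAMPLE = (
    b"<html>\n<body>\nHUHU\n"
    b'<div      _\x0b3"a_name">\n'
    b"HEHE\n</div>\nHAHA\n</body>\n</html>\n"
)


def skip_to_tag_end(data: bytes, pos: int) -> int:
    """Return the offset just past the first '>' at or after pos.

    If no '>' remains, the rest of the input is discarded.
    """
    end = data.find(b">", pos)
    return len(data) if end == -1 else end + 1


def recover(data: bytes) -> bytes:
    """Drop everything from the first vertical tab (0x0B) up to the
    closing '>' of the tag it occurs in, keeping the tag closed."""
    bad = data.find(b"\x0b")
    if bad == -1:
        return data  # nothing to repair
    resume = skip_to_tag_end(data, bad)
    return data[:bad] + b">" + data[resume:]


if __name__ == "__main__":
    print(recover(SAMPLE).decode("ascii"))
```

With this strategy the garbage after the vertical tab never reaches the document as text, and the `<div>` stays open until its own `</div>`, so the later end tags would no longer be unexpected.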


libxml 2.7.8:

$ xmllint  --version
xmllint: using libxml version 20708
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib 

$ xmllint  -html /tmp/html_unprintablechars_test_2.htmlZ 
/tmp/html_unprintablechars_test_2.htmlZ:4: HTML parser error : Couldn't find end of Start Tag div
<div      _
           3"a_name">
           ^
/tmp/html_unprintablechars_test_2.htmlZ:4: HTML parser error : Invalid char in CDATA 0xB
<div      _
           3"a_name">
           ^
/tmp/html_unprintablechars_test_2.htmlZ:6: HTML parser error : Unexpected end tag : div
</div>
      ^
/tmp/html_unprintablechars_test_2.htmlZ:8: HTML parser error : Unexpected end tag : body
</body>
       ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
HUHU
<div _></div>
</body>
<html><p>3"a_name"&gt;
HEHE

HAHA

</p></html>
</html>




libxml 2.7.4:

$ ./xmllint --version
.libs/lt-xmllint: using libxml version 20704
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug 
$ ./xmllint -html html_unprintablechars_test_2.htmlZ
html_unprintablechars_test_2.htmlZ:4: HTML parser error : Couldn't find end of Start Tag div
<div      _
           3"a_name">
           ^
html_unprintablechars_test_2.htmlZ:4: HTML parser error : Invalid char in CDATA 0xB
<div      _
           3"a_name">
           ^
html_unprintablechars_test_2.htmlZ:6: HTML parser error : Unexpected end tag : div
</div>
      ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
HUHU
<div _></div>3"a_name"&gt;
HEHE

HAHA
</body></html>
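
The difference between the two outputs can also be checked mechanically. The following Python sketch counts start tags in the two serialized documents shown above using only the standard library. The two strings are copied from this report, while `TagCounter` and `count_start_tags` are hypothetical names, and this is an assumption about how such a regression check could look, not the failing test case from our product.

```python
from html.parser import HTMLParser

# Serialized output copied from the two xmllint runs in this report.
OUT_2_7_8 = """<html>
<body>
HUHU
<div _></div>
</body>
<html><p>3"a_name"&gt;
HEHE

HAHA

</p></html>
</html>
"""

OUT_2_7_4 = """<html><body>
HUHU
<div _></div>3"a_name"&gt;
HEHE

HAHA
</body></html>
"""


class TagCounter(HTMLParser):
    """Collects how often each start tag occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_start_tags(markup: str) -> dict:
    """Return a mapping of tag name to number of start tags in markup."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    for label, out in (("2.7.8", OUT_2_7_8), ("2.7.4", OUT_2_7_4)):
        print(label, count_start_tags(out))
```

The 2.7.8 output contains a second, nested `html` element and an extra `p` element that the 2.7.4 output does not have, which is the regression described here.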
Comment 1 GNOME Infrastructure Team 2021-07-05 13:21:42 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a long time (resources are unfortunately
quite limited, so not every ticket can be handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.