GNOME Bugzilla – Bug 651165
HTML parser produces weird output for invalid HTML
Last modified: 2021-07-05 13:21:42 UTC
Created attachment 188680
Test file with invalid attribute name

A test case for our product has failed after the update from 2.7.6 to 2.7.8. The output of xmllint -html is considerably more broken than in previous versions (and more broken than the original file).

Note that the attribute name starts with "_" followed by a vertical tab (character 0x0B).

At least for this test case, it would be ideal if HTMLparser.c/htmlParseContentInternal() were fixed to simply discard all characters up to the first closing '>'.

libxml 2.7.8:

    $ xmllint --version
    xmllint: using libxml version 20708
       compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib

    $ xmllint -html /tmp/html_unprintablechars_test_2.htmlZ
    /tmp/html_unprintablechars_test_2.htmlZ:4: HTML parser error : Couldn't find end of Start Tag div
    <div _ 3"a_name">
    ^
    /tmp/html_unprintablechars_test_2.htmlZ:4: HTML parser error : Invalid char in CDATA 0xB
    <div _ 3"a_name">
    ^
    /tmp/html_unprintablechars_test_2.htmlZ:6: HTML parser error : Unexpected end tag : div
    </div>
    ^
    /tmp/html_unprintablechars_test_2.htmlZ:8: HTML parser error : Unexpected end tag : body
    </body>
    ^
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <body>
    HUHU
    <div _></div>
    </body>
    <html><p>3"a_name">
    HEHE
    HAHA
    </p></html>
    </html>

libxml 2.7.4:

    $ ./xmllint --version
    .libs/lt-xmllint: using libxml version 20704
       compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug

    $ ./xmllint -html html_unprintablechars_test_2.htmlZ
    html_unprintablechars_test_2.htmlZ:4: HTML parser error : Couldn't find end of Start Tag div
    <div _ 3"a_name">
    ^
    html_unprintablechars_test_2.htmlZ:4: HTML parser error : Invalid char in CDATA 0xB
    <div _ 3"a_name">
    ^
    html_unprintablechars_test_2.htmlZ:6: HTML parser error : Unexpected end tag : div
    </div>
    ^
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>
    HUHU
    <div _></div>3"a_name">
    HEHE
    HAHA
    </body></html>
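The recovery suggested in the description (discarding everything up to the first closing '>' once a start tag turns out to be malformed) can be illustrated with a small standalone sketch. This is not libxml2 code and does not reflect how htmlParseContentInternal() is implemented. The function names are hypothetical, and the sample input is only a guess at the attachment's contents, reconstructed from the line numbers in the error messages above.

```python
# Hypothetical sketch of the recovery proposed in the report.
# None of these names exist in libxml2; they are for illustration only.

VT = "\x0b"  # vertical tab (0x0B), the character reported after "_"

# ASSUMPTION: approximate reconstruction of the test file. The real
# attachment is not shown; the layout is inferred from the reported
# error line numbers (4, 6 and 8).
SAMPLE = (
    "<html>\n"
    "<body>\n"
    "HUHU\n"
    "<div _" + VT + '3"a_name">\n'
    "HEHE\n"
    "</div>\n"
    "HAHA\n"
    "</body>\n"
    "</html>\n"
)


def skip_to_tag_end(text: str, pos: int) -> int:
    """Return the index just past the first '>' at or after pos.

    If there is no '>' left, return len(text) so the caller stops at
    the end of the input.
    """
    end = text.find(">", pos)
    return len(text) if end == -1 else end + 1


def recover_malformed_tag(text: str, error_pos: int) -> str:
    """Close the start tag at error_pos and drop the junk up to '>'."""
    resume = skip_to_tag_end(text, error_pos)
    return text[:error_pos] + ">" + text[resume:]


# Treat the vertical tab as the point where the start tag becomes invalid.
recovered = recover_malformed_tag(SAMPLE, SAMPLE.index(VT))
print(recovered)
```

With this strategy the text following the broken tag stays inside the div instead of being pushed out after it, so the later </div> and </body> end tags would still find their matching start tags.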
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.