GNOME Bugzilla – Bug 534230
HTMLparser truncates encoded document
Last modified: 2021-07-05 13:26:24 UTC
Please describe the problem:

Parsing the attached (and valid) HTML file with libxml2's HTMLparser leads to a truncated document (try "xmllint --html utf8.html"). The same bug occurs on the corresponding iso-8859-* encoded document, too, so it isn't limited to a particular encoding.

Having a look at HTMLparser.c and friends didn't really help, however I tried to narrow it down a bit. The problem seems to be that on the first non-ASCII character read

    /*
     * Humm this is bad, do an automatic flow conversion
     */
    xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_8859_1);
    ctxt->charset = XML_CHAR_ENCODING_UTF8;
    return(xmlCurrentChar(ctxt, len));

(HTMLparser.c:386-391) is executed _before_ the <meta>-tag is parsed and

    if ((http) && (content != NULL))
        htmlCheckEncoding(ctxt, content);

(HTMLparser.c:3404-3405) is executed, which happens some characters later.

Accordingly each of the following makes the problem vanish:

* commenting out lines 3404-3405 of HTMLparser.c
* commenting out lines 389-390 of HTMLparser.c
* swapping lines 4 <-> 5 of utf8.html (which is effectively the same as the above)

Don't know if this is the right trace, any hints or ideas?

Steps to reproduce:

xmllint --html utf8.html

Actual results:

xmllint prints HTML code with truncated text between <p> and </p> tags.

Expected results:

xmllint should print the document with complete text between <p> and </p> tags.

Does this happen every time?

yes

Other information:
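The triggering condition described above (a non-ASCII byte that appears before the charset declaration in the <meta> tag) can be checked on an input buffer without involving the parser at all. The following is only a rough sketch under stated assumptions: the function names are hypothetical and not part of libxml2, the search for "charset=" is a naive case-sensitive byte match rather than real HTML parsing, and the check reports the byte ordering only, not whether a given libxml2 version actually truncates the document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not libxml2 API): offset of the first byte
 * with the high bit set, or -1 if the buffer is pure ASCII. */
static long first_non_ascii(const char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)buf[i] >= 0x80)
            return (long)i;
    }
    return -1;
}

/* Hypothetical helper: offset of the first literal "charset=",
 * or -1 if none is found. Naive, case-sensitive byte search. */
static long first_charset_decl(const char *buf, size_t len)
{
    static const char needle[] = "charset=";
    const size_t n = sizeof(needle) - 1;

    assert(buf != NULL);
    for (size_t i = 0; i + n <= len; i++) {
        if (memcmp(buf + i, needle, n) == 0)
            return (long)i;
    }
    return -1;
}

/* Returns 1 if a non-ASCII byte occurs before a charset declaration
 * that is present in the buffer, i.e. the ordering the report
 * associates with the truncation; returns 0 otherwise. */
int non_ascii_precedes_charset(const char *buf, size_t len)
{
    long na = first_non_ascii(buf, len);
    long cs = first_charset_decl(buf, len);

    return (na >= 0 && cs >= 0 && na < cs) ? 1 : 0;
}
```

Calling non_ascii_precedes_charset() on the contents of a file such as utf8.html would indicate whether the document has the ordering in question. Swapping the lines so that the <meta> tag comes first, as in the third workaround listed above, should make the function return 0.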
Created attachment 111288: HTML file that causes the problem
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.