GNOME Bugzilla – Bug 444994
HTML chunked parsing failure when attribute contains <>
Last modified: 2009-08-25 12:44:18 UTC
The input <td onmouseover="ChangeText('<b>Trouble at sea</b>')"> causes HTML parser errors when using htmlParseChunk if a chunk boundary falls at certain positions inside the attribute value. The attached source and HTML file reproduce the problem.

Parsing in 100-byte chunks succeeds:

$ ./html_chunk_test 100 test3.html
htmlParseChunk 55
htmlParseChunk 0
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
ELEMENT html
ELEMENT body
ELEMENT td
ATTRIBUTE onmouseover
TEXT content=ChangeText('<b>Trouble at sea</b>')

Parsing in 10-byte chunks fails:

$ ./html_chunk_test 10 test3.html
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
HTML parser error : AttValue: " expected
<td onmouseover="ChangeText('<b>Trouble
^
HTML parser error : Couldn't find end of Start Tag td
<td onmouseover="ChangeText('<b>Trouble
^
htmlParseChunk 10
HTML parser error : Unexpected end tag : b
<td onmouseover="ChangeText('<b>Trouble at sea</b>
^
htmlParseChunk 5
htmlParseChunk 0
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
ELEMENT html
ELEMENT body
ELEMENT td
ATTRIBUTE onmouseover
TEXT content=ChangeText('<b>Trouble
TEXT content=at sea')">
Created attachment 89529: Source that tests htmlParseChunk with a given file and chunk size
Created attachment 89530: HTML file that triggers the parse error
I can confirm this; it is still present in libxml2 2.6.32. It happens when chunks are split at or after a '>' in an attribute value and before the tag's closing '>'. I guess it's because htmlParseLookupSequence(ctxt, '>', 0, 0, 0), which htmlParseTryOrFinish() uses to scan forward to the end of the tag, ignores quoting.
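The guess above can be illustrated with a small stand-alone C sketch. This is not libxml2 code: find_gt_naive and find_gt_quoted are hypothetical helpers, and the quoting rule (a value delimited by ' or ") is a simplifying assumption. It only shows why a scan for '>' that ignores quoting can mistake the '>' of <b> for the end of the <td> tag when the buffer stops short of the real closing '>'.

```c
#include <assert.h>
#include <stddef.h>

/* Naive lookup: index of the first '>' in buf[0..len), ignoring quotes.
 * Returns -1 if there is none. Hypothetical helper, not a libxml2 API. */
static long find_gt_naive(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (buf[i] == '>')
            return (long)i;
    }
    return -1;
}

/* Quote-aware lookup: a '>' inside '...' or "..." is skipped.
 * Returns -1 if no unquoted '>' is present yet (more input needed). */
static long find_gt_quoted(const char *buf, size_t len)
{
    char quote = 0; /* active quote character, 0 when outside a value */
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        char c = buf[i];

        if (quote != 0) {
            if (c == quote)
                quote = 0;          /* closing quote */
        } else if (c == '"' || c == '\'') {
            quote = c;              /* opening quote */
        } else if (c == '>') {
            return (long)i;         /* real end of the tag */
        }
    }
    return -1;
}
```

Applied to the first 40 bytes of the test input (the state after four 10-byte chunks), find_gt_naive returns 31, the '>' of <b>, whereas find_gt_quoted returns -1. On the complete 54-byte tag, find_gt_quoted returns 53, the closing '>' of the tag.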
Created attachment 113856: C++ program to show the bug. This C++ program uses the same simple HTML document and parses it multiple times in two chunks, with the split traversing the critical section.
I am experiencing this behavior, as well. One additional side-effect is that the SAX startElement callback will be invoked with an incomplete attribute list in this case.
Okay, found it. The problem was in htmlParseLookupSequence(): if the chunk ended inside an attribute value, ctxt->checkIndex was still saved, but without recording that the scan had stopped inside an attribute value. On the next call we would resume scanning from within the attribute value without that knowledge.

paphio:~/XML -> ./tst 100 test3.html
htmlParseChunk 55
htmlParseChunk 0
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
ELEMENT html
ELEMENT body
ELEMENT td
ATTRIBUTE onmouseover
TEXT content=ChangeText('<b>Trouble at sea</b>')
paphio:~/XML -> ./tst 10 test3.html
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 5
htmlParseChunk 0
HTML DOCUMENT
standalone=true
DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
ELEMENT html
ELEMENT body
ELEMENT td
ATTRIBUTE onmouseover
TEXT content=ChangeText('<b>Trouble at sea</b>')
paphio:~/XML ->

Thanks for html_chunk.c and the reproducer!

Daniel
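The failure mode described in this comment can be sketched with a toy resumable scanner. This is an illustration under stated assumptions, not the actual libxml2 change: ScanState, scan_tag_end and feed_in_chunks are hypothetical names, and the keep_quote flag exists only to switch between forgetting the quote state between calls (the behaviour described above) and carrying it over together with the resume index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical resumable scan state; not libxml2's real structure. */
typedef struct {
    size_t check_index; /* offset at which the next call resumes */
    char quote;         /* active quote character, 0 when outside a value */
} ScanState;

/* Look for an unquoted '>' in buf[0..len), resuming at st->check_index.
 * Returns its index, or -1 if more input is needed.
 * keep_quote == 0 drops the quote state between calls (the faulty case);
 * keep_quote != 0 restores it along with the resume index. */
static long scan_tag_end(const char *buf, size_t len, ScanState *st,
                         int keep_quote)
{
    char quote = keep_quote ? st->quote : 0;
    size_t i;

    assert(buf != NULL && st != NULL);
    for (i = st->check_index; i < len; i++) {
        char c = buf[i];

        if (quote != 0) {
            if (c == quote)
                quote = 0;
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '>') {
            st->check_index = 0;    /* reset for the next lookup */
            st->quote = 0;
            return (long)i;
        }
    }
    st->check_index = len;          /* resume here on the next call */
    st->quote = quote;              /* only honoured when keep_quote is set */
    return -1;
}

/* Make the input visible chunk bytes at a time and return the offset the
 * scanner first reports as the end of the tag, or -1 if it never does. */
static long feed_in_chunks(const char *input, size_t chunk, int keep_quote)
{
    ScanState st = { 0, 0 };
    size_t total = strlen(input);
    size_t avail = 0;

    assert(chunk > 0);
    while (avail < total) {
        long end;

        avail = (total - avail < chunk) ? total : avail + chunk;
        end = scan_tag_end(input, avail, &st, keep_quote);
        if (end >= 0)
            return end;
    }
    return -1;
}
```

With the 54-byte tag from the report, feed_in_chunks(input, 10, 0) returns 31: the fourth chunk resumes inside the attribute value with the quote state lost, so the '>' of <b> is taken for the end of the tag. feed_in_chunks(input, 10, 1) and feed_in_chunks(input, 100, 0) both return 53, which mirrors the observation that 100-byte chunks parse correctly while 10-byte chunks fail unless the quote state survives between calls.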