GNOME Bugzilla – Bug 169834
[PATCH] Add option to HTML parser to behave more like web browsers
Last modified: 2009-08-15 18:40:50 UTC
When the libxml2 HTML parser encounters a SCRIPT or STYLE tag, it continues parsing the content of the tag and only handles it as CDATA if the tag is directly followed by a comment (i.e. "<script><!-- "). As far as I can see, this follows the HTML4 recommendation. However, many web browsers seem to implicitly add comments to script and style tags, and treat the data between <script> and </script> as CDATA without parsing it. It would be nice if the HTML parser in libxml2 had an option to behave that way too.
Created attachment 45216
Patch for more relaxed parsing of script blocks

Suggested patch that adds an "HTML_PARSE_RELAXED" option flag for the HTMLparser. When enabled, other end tags are ignored inside a script/style block.
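To make the requested behaviour concrete, here is a minimal, self-contained sketch of treating script/style content as raw text: it scans forward and stops only at the matching end tag, so a stray "</foo>" inside the block is left alone. This is not the code from the attached patch or from libxml2. The function names (`starts_with_end_tag`, `raw_text_length`) are hypothetical, and real browsers apply additional rules that are not modelled here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: does s begin with "</" + name (case-insensitive),
 * followed by whitespace, '/' or '>'? The trailing check keeps
 * "</scripts>" from matching the name "script". */
static int starts_with_end_tag(const char *s, const char *name)
{
    size_t i;

    if (s[0] != '<' || s[1] != '/')
        return 0;
    s += 2;
    for (i = 0; name[i] != '\0'; i++) {
        /* A NUL in s never equals a character of name, so this also
         * stops safely at the end of the input. */
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return s[i] == '>' || s[i] == '/' || isspace((unsigned char)s[i]);
}

/* Hypothetical helper: number of bytes of raw text before the matching
 * end tag. Any other end tag inside the block is treated as plain text.
 * Returns the full length of content if no matching end tag is found. */
size_t raw_text_length(const char *content, const char *name)
{
    size_t i;

    for (i = 0; content[i] != '\0'; i++) {
        if (starts_with_end_tag(content + i, name))
            break;
    }
    return i;
}
```

Called on the text that follows an opening <script> tag, `raw_text_length(text, "script")` gives the length of the span a caller could hand over as a single CDATA/text node.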
Created attachment 47036
Patch to allow attributes on end-tags

Some web sites also put attributes on the end tags in their HTML. If the HTML_PARSE_RELAXED option is set, this patch ignores these attributes and skips to the '>', instead of including the attributes and the '>' of the end tag as a text node.
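The skip-to-'>' idea can be sketched as below. This is an illustration only, not the attached patch: `skip_end_tag_rest` is a hypothetical name, and quoted attribute values containing '>' are deliberately not handled. The `had_junk` output lets a caller still report an error rather than silently accept the input.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: s points just past the name of an end tag
 * (e.g. right after "</p"). Skips everything up to and including the
 * next '>' and returns the number of bytes consumed. *had_junk is set
 * to 1 if anything other than whitespace was skipped, so the caller
 * can emit a parser error. If no '>' exists, the rest of the input is
 * consumed. */
size_t skip_end_tag_rest(const char *s, int *had_junk)
{
    size_t i = 0;

    *had_junk = 0;
    while (s[i] != '\0' && s[i] != '>') {
        if (!isspace((unsigned char)s[i]))
            *had_junk = 1;
        i++;
    }
    if (s[i] == '>')
        i++; /* consume the closing '>' itself */
    return i;
}
```

One consequence of this approach: when the end tag is simply missing its '>', as in "</p <hr />", the scan runs on to the '>' of the next tag and consumes that tag as well.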
Okay, I looked at both patches. They are not acceptable as-is because they dismiss the errors instead of reporting them. The first patch also adds a field in the middle of a public structure, which breaks the ABI and is unacceptable as-is. I also did not want to "invent" a new option when there is already a RECOVER one in the XML parser. I reused your patches in the following way:

- use the recovery ctxt field
- create an HTML_PARSE_RECOVER flag using the same value as its XML counterpart
- rewrite the patches to use those and always emit an error message

Note that your second patch may be worse than the current behaviour, as you may lose the following tag. Example:

    paphio:~/XML -> cat tst.html
    <html>
    <head>
    <script>
    "</foo>"
    </script>
    </head>
    <body>
    <p> this is really </p <hr />
    </body>
    </html>
    paphio:~/XML ->

When parsed with recovery:

    paphio:~/XML -> xmllint --recover --html tst.html
    tst.html:4: HTML parser error : Element script embbeds close tag
    "</foo>"
    ^
    tst.html:8: HTML parser error : End tag : expected '>'
    <p> this is really </p <hr />
    ^
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head><script>
    "</foo>"
    </script></head>
    <body><p> this is really </p></body>
    </html>
    paphio:~/XML ->

At least the errors are signalled. The default behaviour of the parser remains the same:

    paphio:~/XML -> xmllint --html tst.html
    tst.html:4: HTML parser error : Unexpected end tag : foo
    "</foo>"
    ^
    tst.html:8: HTML parser error : End tag : expected '>'
    <p> this is really </p <hr />
    ^
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html>
    <head><script>
    ""
    </script></head>
    <body>
    <p> this is really </p>
    <hr>
    </body>
    </html>
    paphio:~/XML ->

The changes are in CVS,

Daniel
This should be closed by the release of libxml2-2.6.21.

thanks,

Daniel