GNOME Bugzilla – Bug 474205
libxml2 stops parsing html pages on special chars
Last modified: 2008-01-11 07:42:26 UTC
Hi,

libxml2 stops parsing HTML pages on non-ASCII characters.

markus@thekorn:/# xmllint --html test.xml
test.xml:1: HTML parser error : Tag foo invalid
<foo>
^
test.xml:2: HTML parser error : Tag bar invalid
<bar>this is a test</bar>
^
test.xml:3: HTML parser error : Tag bar invalid
<bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
^
test.xml:3: HTML parser error : detected an error in element content
<bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:</bar></foo></body></html>
markus@thekorn:/# xmllint --version
xmllint: using libxml version 20629
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib
markus@thekorn:/#

I would expect libxml2 either to raise an error or to just ignore un-parseable characters.

Thanks for looking at this issue,
Markus
Created attachment 95048
xml file to reproduce this issue

I used this xml file
Created attachment 95049
upload as plain text

I used this file to run xmllint --html
Okay, I think I found and fixed the problem. Now we get:

laptop:~/XML -> ./xmllint --html ../test.html
../test.html:1: HTML parser error : Tag foo invalid
<foo>
^
../test.html:2: HTML parser error : Tag bar invalid
<bar>this is a test</bar>
^
../test.html:3: HTML parser error : Tag bar invalid
<bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
^
../test.html:3: HTML parser error : Invalid char in CDATA 0xF
<bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
^
../test.html:4: HTML parser error : Tag bar invalid
<bar>next line</bar>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:�� () from /lib/tls/i686/c</bar><bar>next line</bar></foo></body></html>
laptop:~/XML ->

i.e. the error is reported, but parsing continues. Committed in SVN revision 3675.

Thanks for the report,
Daniel
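
For anyone who wants to re-check this against their own libxml2 build, here is a rough sketch that rebuilds a test file resembling the one in this report and passes it to xmllint --html. The exact bytes in the attachment are not shown in this thread, so BAD_BYTES is an assumption (0x0F is taken from the "Invalid char in CDATA 0xF" message; 0xFF is only a stand-in), as is the closing </foo> line. The helper names build_repro and run_xmllint are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# ASSUMPTION: the attachment's real bytes are not visible in the report.
# 0x0F comes from the "Invalid char in CDATA 0xF" message; 0xFF is a stand-in
# for a second byte that is not valid UTF-8.
BAD_BYTES = b"\x0f\xff"


def build_repro() -> bytes:
    """Return file contents modelled on the lines quoted in the parser output."""
    lines = [
        b"<foo>",
        b"<bar>this is a test</bar>",
        b"<bar>StacktraceTop:" + BAD_BYTES + b" () from /lib/tls/i686/c</bar>",
        b"<bar>next line</bar>",
        b"</foo>",  # ASSUMPTION: the closing tag is not shown in the report
    ]
    return b"\n".join(lines) + b"\n"


def run_xmllint(data: bytes):
    """Run `xmllint --html` on the data; return None if xmllint is not on PATH."""
    exe = shutil.which("xmllint")
    if exe is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "test.html"
        path.write_bytes(data)
        return subprocess.run([exe, "--html", str(path)], capture_output=True)


if __name__ == "__main__":
    result = run_xmllint(build_repro())
    if result is None:
        print("xmllint not found on PATH")
    else:
        # Serialized document goes to stdout, parser errors to stderr.
        print(result.stdout.decode("utf-8", errors="replace"))
        print(result.stderr.decode("utf-8", errors="replace"))
```

If the build includes the fix, the serialized output should still contain the "next line" element after the bad bytes; on an affected build the output would be expected to stop at "StacktraceTop:" as in the original report.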