Bug 474205 - libxml2 stops parsing html pages on special chars
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.6.29
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
Reported: 2007-09-06 11:53 UTC by Markus Korn
Modified: 2008-01-11 07:42 UTC


Attachments
xml file to reproduce this issue (129 bytes, application/xml)
2007-09-06 11:54 UTC, Markus Korn
upload as plain text (129 bytes, text/plain)
2007-09-06 11:57 UTC, Markus Korn

Description Markus Korn 2007-09-06 11:53:15 UTC
Hi,
libxml2 stops parsing html pages on non-ASCII characters.

markus@thekorn:/# xmllint --html test.xml 
test.xml:1: HTML parser error : Tag foo invalid
<foo>
    ^
test.xml:2: HTML parser error : Tag bar invalid
    <bar>this is a test</bar>
        ^
test.xml:3: HTML parser error : Tag bar invalid
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
        ^
test.xml:3: HTML parser error : detected an error in element content
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
                       ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:</bar></foo></body></html>
markus@thekorn:/# xmllint --version
xmllint: using libxml version 20629
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib 
markus@thekorn:/#

I would expect libxml2 either to raise an error or to just ignore unparseable characters.

Thanks for looking at this issue,
Markus
Comment 1 Markus Korn 2007-09-06 11:54:35 UTC
Created attachment 95048
xml file to reproduce this issue

I used this xml file
Comment 2 Markus Korn 2007-09-06 11:57:51 UTC
Created attachment 95049
upload as plain text

I used this file to run xmllint --html
Comment 3 Daniel Veillard 2008-01-11 07:42:26 UTC
Okay, I think I found and fixed the problem, now we get:

laptop:~/XML -> ./xmllint --html ../test.html 
../test.html:1: HTML parser error : Tag foo invalid
<foo>
    ^
../test.html:2: HTML parser error : Tag bar invalid
    <bar>this is a test</bar>
        ^
../test.html:3: HTML parser error : Tag bar invalid
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
        ^
../test.html:3: HTML parser error : Invalid char in CDATA 0xF
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
                       ^
../test.html:4: HTML parser error : Tag bar invalid
    <bar>next line</bar>
        ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:&iuml;&iquest;&frac12;&iuml;&iquest;&frac12; () from /lib/tls/i686/c</bar><bar>next line</bar></foo></body></html>
laptop:~/XML ->

  i.e. the error is reported, but parsing continues. Committed in
SVN revision 3675.
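
For anyone puzzled by the `&iuml;&iquest;&frac12;` sequences in the serialized output above: they are what you would get if the two odd characters were U+FFFD (the Unicode replacement character) stored as UTF-8 and then read byte by byte as ISO-8859-1. That reading is an assumption, since the thread does not say what bytes the attachment actually contains, and the helper name `latin1_entities` below is made up for illustration. A minimal Python sketch of the byte arithmetic:

```python
from html.entities import codepoint2name


def latin1_entities(text: str) -> str:
    """Encode text as UTF-8, misread the bytes as ISO-8859-1, and
    write every non-ASCII character as a named HTML entity."""
    raw = text.encode("utf-8")       # U+FFFD becomes the bytes EF BF BD
    misread = raw.decode("latin-1")  # one character per byte
    out = []
    for ch in misread:
        cp = ord(ch)
        if cp > 127 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(ch)
    return "".join(out)


print(latin1_entities("\ufffd\ufffd"))
# prints &iuml;&iquest;&frac12;&iuml;&iquest;&frac12;
```

This only explains the shape of the serialized output; it says nothing about how the parser picks an encoding internally.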

    thanks for the report,

Daniel
