Bug 474205 - libxml2 stops parsing html pages on special chars
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.6.29
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
Reported: 2007-09-06 11:53 UTC by Markus Korn
Modified: 2008-01-11 07:42 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
xml file to reproduce this issue (129 bytes, application/xml)
2007-09-06 11:54 UTC, Markus Korn
upload as plain text (129 bytes, text/plain)
2007-09-06 11:57 UTC, Markus Korn

Description Markus Korn 2007-09-06 11:53:15 UTC
Hi,
libxml2 stops parsing html pages on non-ASCII characters.

markus@thekorn:/# xmllint --html test.xml 
test.xml:1: HTML parser error : Tag foo invalid
<foo>
    ^
test.xml:2: HTML parser error : Tag bar invalid
    <bar>this is a test</bar>
        ^
test.xml:3: HTML parser error : Tag bar invalid
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
        ^
test.xml:3: HTML parser error : detected an error in element content
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
                       ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:</bar></foo></body></html>
markus@thekorn:/# xmllint --version
xmllint: using libxml version 20629
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib 
markus@thekorn:/#

I would expect libxml2 either to raise an error or to just ignore un-parseable characters.

Thanks for looking at this issue,
Markus
Comment 1 Markus Korn 2007-09-06 11:54:35 UTC
Created attachment 95048
xml file to reproduce this issue

I used this xml file
Comment 2 Markus Korn 2007-09-06 11:57:51 UTC
Created attachment 95049
upload as plain text

I used this file to run xmllint --html
Comment 3 Daniel Veillard 2008-01-11 07:42:26 UTC
Okay, I think I found and fixed the problem, now we get:

laptop:~/XML -> ./xmllint --html ../test.html 
../test.html:1: HTML parser error : Tag foo invalid
<foo>
    ^
../test.html:2: HTML parser error : Tag bar invalid
    <bar>this is a test</bar>
        ^
../test.html:3: HTML parser error : Tag bar invalid
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
        ^
../test.html:3: HTML parser error : Invalid char in CDATA 0xF
    <bar>StacktraceTop:�� () from /lib/tls/i686/c</bar>
                       ^
../test.html:4: HTML parser error : Tag bar invalid
    <bar>next line</bar>
        ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><foo><bar>this is a test</bar><bar>StacktraceTop:&iuml;&iquest;&frac12;&iuml;&iquest;&frac12; () from /lib/tls/i686/c</bar><bar>next line</bar></foo></body></html>
laptop:~/XML ->

  i.e. the error is reported, but parsing continues. Committed in
SVN revision 3675.

    thanks for the report,

Daniel
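
Side note on the serialized output in Comment 3: the entity run `&iuml;&iquest;&frac12;` that now appears inside the second `<bar>` can be mapped back to raw bytes. The sketch below is only one reading of that output, not a diagnosis from the report. It assumes the serializer treated each input byte as an ISO-8859-1 character, and the helper name `entities_to_bytes` is made up for illustration.

```python
import html

# One of the two identical entity runs from the serialized output in Comment 3.
ENTITY_RUN = "&iuml;&iquest;&frac12;"


def entities_to_bytes(text: str) -> bytes:
    """Resolve HTML entities, then map each character back to one byte.

    Assumption: every character is in the ISO-8859-1 range, so the
    latin-1 codec gives a one-to-one character-to-byte mapping.
    """
    return html.unescape(text).encode("latin-1")


raw = entities_to_bytes(ENTITY_RUN)
print(raw.hex(" "))  # hex dump of the recovered bytes

# Reinterpret the same bytes as UTF-8 and show the resulting code points.
for ch in raw.decode("utf-8"):
    print(f"U+{ord(ch):04X}")
```

Under that assumption the three entities correspond to the bytes EF BF BD, which is the UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER). That would be consistent with the test file containing two already-replaced characters rather than arbitrary binary data, but the attachment itself is not shown here, so treat this as a guess.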