GNOME Bugzilla – Bug 579317
HTML Encoding detection failure
Last modified: 2009-08-12 18:26:11 UTC
Please describe the problem:
If an HTML file contains a meta tag hinting at the encoding, libxml2 will use the encoding in the meta tag *unless* there are strange characters before the meta tag. If there are strange characters before the meta tag, libxml2 will guess the encoding and use the guessed encoding for the rest of the document even though the meta tag reported the correct encoding. What's worse is that libxml2 will report that it used the encoding from the meta tag when outputting the content of the document indicates that it did not.

Steps to reproduce:
1. Try to parse a document with strange characters before the meta tag
2. Search for data in the document you expect to be encoded properly
3. Examine data returned from libxml2

Actual results:
libxml2 ignores the encoding in the meta tag. The data is not encoded using the encoding specified in the meta tag.

Expected results:
The document should be encoded using the encoding from the meta tag.

Does this happen every time?
Only when there are characters before the meta tag that fall outside ASCII.

Other information:
I've posted code that reproduces the problem here: http://gist.github.com/96641
Also, I emailed the list about the problem here: http://mail.gnome.org/archives/xml/2009-April/msg00023.html
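For readers without the gist at hand, here is a minimal sketch of the symptom using only the Python standard library. It does not call libxml2; the sample document, the `body_text` helper, and the choice of ISO-8859-1 as the "guessed" encoding are all assumptions made for illustration.

```python
# Hypothetical test document: a non-ASCII <title> appears before the <meta> tag.
TITLE = "こんにちは"
HTML = (
    "<html><head><title>" + TITLE + "</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    "</head><body><p>" + TITLE + "</p></body></html>"
)
raw = HTML.encode("utf-8")

# What the report expects: decode with the encoding named in the meta tag.
declared = raw.decode("utf-8")

# Stand-in for a wrong guess (ISO-8859-1 is an assumption here): every byte
# maps to some character, so decoding succeeds but the text is garbled.
guessed = raw.decode("iso-8859-1")


def body_text(doc):
    """Return the text between the first <p> and </p> (illustrative helper)."""
    start = doc.index("<p>") + len("<p>")
    return doc[start:doc.index("</p>")]


print(body_text(declared) == TITLE)  # text survives with the declared encoding
print(body_text(guessed) == TITLE)   # text is garbled with the guessed encoding
```

The point is that a wrong guess does not fail loudly: the parse still completes, and the damage only shows up when the extracted text is compared with what was expected.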
Done, I have added a new function triggered on encoding error trying to look up the <meta> encoding info in the current input buffer. Won't work in all cases but seems to work well with your test. Committed to git head (e176ba2adf6c07253dca132d15ac5e5ee32faa55), thanks, Daniel
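The general idea of that fallback can be sketched as follows. This is not the code from the commit (which is C inside libxml2) but a rough Python approximation under stated assumptions: the names `sniff_meta_charset` and `decode_with_fallback`, the regular expression, and the ASCII first guess are all hypothetical.

```python
import codecs
import re

# Matches charset=... inside a <meta ...> tag; quotes around the value are optional.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(buf):
    """Return the charset named in the first matching <meta> tag, or None."""
    m = META_CHARSET.search(buf)
    if m is None:
        return None
    name = m.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return None
    return name


def decode_with_fallback(buf, first_guess="ascii"):
    """Decode with first_guess; on an encoding error, retry with the <meta> charset."""
    try:
        return buf.decode(first_guess), first_guess
    except UnicodeDecodeError:
        declared = sniff_meta_charset(buf)
        if declared is None:
            raise  # nothing usable in the buffer, so the error stands
        return buf.decode(declared), declared
```

As in the comment above, this cannot work in all cases: if the buffer holds no `<meta>` tag, or names a charset with no available codec, the original error is re-raised.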
Thanks Daniel, you're the best!