Bug 579317 - HTML Encoding detection failure
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.7.3
Hardware: Other All
Importance: Normal critical
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Reported: 2009-04-17 16:35 UTC by Aaron Patterson
Modified: 2009-08-12 18:26 UTC

Description Aaron Patterson 2009-04-17 16:35:30 UTC
Please describe the problem:
If an HTML file contains a meta tag
hinting at the encoding, libxml2 will use the encoding in the meta tag
*unless* there are strange characters (bytes outside the ASCII range)
before the meta tag.

If there are strange characters before the meta tag, libxml2 will
guess the encoding and use the guessed encoding for the rest of the
document even though the meta tag reported the correct encoding.
What's worse, libxml2 will report that it used the encoding from the
meta tag, even though the content of the document it outputs shows
that it did not.

Steps to reproduce:
1. Try to parse a document with strange characters before the meta tag
2. Search for data in the document you expect to be encoded properly
3. Examine data returned from libxml2
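
As a rough illustration of step 1, the following Python sketch builds the kind of input described above: a document whose title contains non-ASCII bytes ahead of the meta tag. The sample text, the choice of Shift_JIS, and the helper name build_test_document are assumptions made for illustration; they are not taken from the linked gist, and the sketch does not call libxml2 itself.

```python
# Hypothetical test input: an HTML document encoded as Shift_JIS whose
# <title> (non-ASCII text) comes before the <meta> tag declaring the charset.
def build_test_document(encoding="Shift_JIS"):
    html = (
        "<html><head>"
        "<title>日本語</title>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">'
        "</head><body><p>こんにちは</p></body></html>"
    )
    return html.encode(encoding)


doc = build_test_document()

# Confirm that bytes outside ASCII really do occur before the meta tag.
meta_pos = doc.find(b"<meta")
has_non_ascii_before_meta = any(byte > 0x7F for byte in doc[:meta_pos])

# Decoding with the declared encoding recovers the original text.
decoded = doc.decode("Shift_JIS")
```

Feeding bytes like these to the HTML parser and comparing the extracted text against the decoded string is one way to carry out steps 2 and 3.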


Actual results:
libxml2 ignores the encoding in the meta tag.  The data is not encoded using the encoding specified in the meta tag.

Expected results:
The document should be encoded using the encoding from the meta tag.

Does this happen every time?
Only when there are characters before the meta tag that fall outside ASCII.

Other information:
I've posted code that reproduces the problem here:

http://gist.github.com/96641

Also, I emailed the list about the problem here:

http://mail.gnome.org/archives/xml/2009-April/msg00023.html
Comment 1 Daniel Veillard 2009-08-12 18:18:15 UTC
Done, I have added a new function, triggered on an encoding error, that
tries to look up the <meta> encoding info in the current input buffer.
It won't work in all cases but seems to work well with your test. Committed
to git head (e176ba2adf6c07253dca132d15ac5e5ee32faa55).

 thanks,

Daniel
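
The idea described in this comment (fall back to the <meta> declaration when decoding with the guessed encoding fails) can be sketched in Python roughly as follows. This is only an illustration of the approach under stated assumptions, not the code from the commit: the function names, the regular expression, and the UTF-8 default guess are all hypothetical.

```python
import re

# Hypothetical pattern: find a charset declaration inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def find_meta_encoding(buf):
    """Return the charset named in a <meta> tag of buf, or None."""
    match = META_CHARSET.search(buf)
    return match.group(1).decode("ascii") if match else None


def decode_html(buf, guessed="utf-8"):
    """Decode buf with the guessed encoding; on a decoding error,
    retry with the encoding declared in the <meta> tag, if any."""
    try:
        return buf.decode(guessed), guessed
    except UnicodeDecodeError:
        declared = find_meta_encoding(buf)
        if declared is None:
            raise
        # May raise LookupError if the declared charset is unknown.
        return buf.decode(declared), declared
```

As the comment notes for the real fix, a fallback like this will not work in all cases, for example when the bytes happen to be valid in the guessed encoding or when the meta tag lies outside the buffer being scanned.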
Comment 2 Aaron Patterson 2009-08-12 18:26:11 UTC
Thanks Daniel, you're the best!