Bug 579317 – HTML Encoding detection failure




Summary:	HTML Encoding detection failure


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.7.3
Hardware:	Other All

Importance:	Normal critical
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-04-17 16:35 UTC by Aaron Patterson
Modified:	2009-08-12 18:26 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Aaron Patterson 2009-04-17 16:35:30 UTC

Please describe the problem:
If an HTML file contains a meta tag
hinting at the encoding, libxml2 will use the encoding in the meta tag
*unless* there are non-ASCII characters before the meta tag.

If there are non-ASCII characters before the meta tag, libxml2 will
guess the encoding and use the guessed encoding for the rest of the
document, even though the meta tag declares the correct encoding.
What's worse, libxml2 reports that it used the encoding from the
meta tag, while the content it outputs shows that it did not.

Steps to reproduce:
1. Try to parse a document with non-ASCII characters before the meta tag
2. Search for data in the document you expect to be encoded properly
3. Examine data returned from libxml2


Actual results:
libxml2 ignores the encoding in the meta tag.  The data is not encoded using the encoding specified in the meta tag.

Expected results:
The document should be encoded using the encoding from the meta tag.

Does this happen every time?
Only when there are characters before the meta tag that fall outside ASCII.
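
For illustration only, here is a minimal Python sketch of the kind of input that triggers the problem. It is not the code from the gist linked below and it does not call libxml2; the sample markup, the choice of Shift_JIS, and the use of ISO-8859-1 as a stand-in for the parser's guess are all assumptions. It shows that the bytes ahead of the meta tag are not pure ASCII, and that decoding with a guessed encoding yields different text than decoding with the declared one.

```python
# Hypothetical sample document: non-ASCII text appears in <title>,
# before the <meta> tag that declares the real encoding (Shift_JIS here).
DECLARED = "shift_jis"
html = (
    "<html><head><title>\u65e5\u672c\u8a9e</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
    "</head><body><p>\u3053\u3093\u306b\u3061\u306f</p></body></html>"
)
raw = html.encode(DECLARED)

# The bytes before the meta tag are not pure ASCII, so a parser has to
# pick an encoding before it has seen the declaration.
meta_at = raw.index(b"<meta")
has_non_ascii_prefix = any(b > 0x7F for b in raw[:meta_at])

# ISO-8859-1 stands in for a wrong guess: it never fails to decode,
# but it produces different text than the declared encoding does.
guessed_text = raw.decode("iso-8859-1")
declared_text = raw.decode(DECLARED)

print(has_non_ascii_prefix)
print(guessed_text == declared_text)
```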

Other information:
I've posted code that reproduces the problem here:

http://gist.github.com/96641

Also, I emailed the list about the problem here:

http://mail.gnome.org/archives/xml/2009-April/msg00023.html

Comment 1 Daniel Veillard 2009-08-12 18:18:15 UTC

Done, I have added a new function, triggered on an encoding error, that tries to
look up the <meta> encoding info in the current input buffer. It won't
work in all cases but seems to work well with your test. Committed
to git head (e176ba2adf6c07253dca132d15ac5e5ee32faa55).

 thanks,

Daniel
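
The general idea described in this comment (on an encoding error, look for the <meta> encoding in the input buffer and retry) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code from the commit: the names `sniff_meta_charset` and `decode_html`, the regular expression, and the UTF-8 default guess are all hypothetical.

```python
import re

# Simplified, hypothetical pattern for charset=... inside a <meta ...> tag.
# It is not the matching logic libxml2 uses.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def sniff_meta_charset(buf):
    """Return the charset named in the first matching <meta> tag, or None."""
    m = META_CHARSET.search(buf)
    return m.group(1).decode("ascii") if m else None

def decode_html(buf, guess="utf-8"):
    """Decode with the guessed encoding; on error, retry with the <meta> charset."""
    try:
        return buf.decode(guess), guess
    except UnicodeDecodeError:
        declared = sniff_meta_charset(buf)
        if declared is None:
            raise  # nothing declared in the buffer, so the error stands
        # An unknown charset name raises LookupError here, which propagates.
        return buf.decode(declared), declared

# Usage: a Latin-1 document with a non-ASCII byte before the meta tag.
sample = (
    '<html><head><title>caf\u00e9</title>'
    '<meta charset="iso-8859-1"></head></html>'
).encode("iso-8859-1")
text, used = decode_html(sample)
print(used)
```

Like the fix described above, this fallback does not cover every case: it only helps when the guessed encoding fails outright, not when the guess decodes without error but produces the wrong text.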

Comment 2 Aaron Patterson 2009-08-12 18:26:11 UTC

Thanks Daniel, you're the best!