Bug 576485 - libxslt document() pays attention to charset in HTTP header
Status: RESOLVED NOTABUG
Product: libxslt
Classification: Platform
Component: general
Version: 1.1.x
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2009-03-23 20:49 UTC by Chuck Bearden
Modified: 2009-03-24 07:09 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
XSLT to exercise the bug; takes itself as input (1.71 KB, application/xslt+xml)
2009-03-23 20:52 UTC, Chuck Bearden

Description Chuck Bearden 2009-03-23 20:49:49 UTC
Please describe the problem:
It appears that libxslt1.1 honors the charset parameter of the Content-Type HTTP header when it retrieves XML files with MIME types of application/xml or text/xml via the document() function. If a misconfigured web server sends "Content-Type: text/xml; charset=iso-8859-15" but the XML file itself has no encoding declaration in the XML prolog (and is thus to be taken as UTF-8), libxslt treats the incoming file as ISO-8859-15 and so mangles the byte sequences that encode, for example, many common vowels with diacritics. libxslt does not exhibit this behavior when the MIME type is 'text/html'. Saxon 6.5.5 does not exhibit it with any MIME type/charset combination.

I am attaching a test stylesheet that takes itself as input, and retrieves a simple file in UTF-8 and Latin-9 encodings from a webserver, and outputs the results with MIME types and charsets noted.

Steps to reproduce:
1. Use the attached XSLT to transform itself, e.g. with 'xsltproc test.xsl test.xsl', and observe the output.


Actual results:
When the server gives the MIME type as 'application/xml' or 'text/xml' and the encoding as 'ISO-8859-15', the conversion from the ISO encoding to UTF-8 is applied to the UTF-8 document, resulting in mangled bytes in the 'text' element where the encodings differ.
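
The mangling described above can be reproduced outside libxslt with a minimal Python sketch. It assumes only that a consumer decodes UTF-8 bytes as ISO-8859-15; the sample string is hypothetical and is not taken from the attached test file.

```python
# Hypothetical sample text; not taken from the attached stylesheet.
original = "café"

# The file on disk is UTF-8, so these are the bytes the server sends.
utf8_bytes = original.encode("utf-8")

# A consumer that trusts "charset=iso-8859-15" decodes those bytes as Latin-9.
mangled = utf8_bytes.decode("iso-8859-15")

# Each non-ASCII character was two bytes in UTF-8, so it turns into
# two separate Latin-9 characters.
print(mangled)
```

ASCII characters survive unchanged because UTF-8 and ISO-8859-15 agree on that range, which is why only the accented characters in the 'text' element look wrong.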

Expected results:
I would expect the incoming files to be treated as UTF-8 always.

Does this happen every time?
Yes.

Other information:
Comment 1 Chuck Bearden 2009-03-23 20:52:05 UTC
Created attachment 131216
XSLT to exercise the bug; takes itself as input

Depends on my web server being set up as it presently is.
Comment 2 Daniel Veillard 2009-03-24 07:09:00 UTC
Wrong! See the XML 1.0 specification, appendix F. Encoding information
coming from the context (and HTTP headers are explicitly listed)
takes precedence over what may or may not be found in the XMLDecl
section of the document. Fix the server config, or what is being received
is not XML as a result!
  http://www.w3.org/TR/REC-xml/#sec-guessing-with-ext-info

Not a bug. Annoying, especially as Apache is often misconfigured, but fix
your config; there is no way around it!

Daniel
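
The precedence rule cited in comment 2 can be illustrated with a rough Python sketch. This is not how libxml2 or libxslt implement it; the function names are hypothetical, the header and declaration parsing is deliberately simplified, and byte order marks and media-type-specific defaults are ignored.

```python
import re


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Simplified parsing for illustration only.
    """
    m = re.search(r'charset\s*=\s*"?([^";\s]+)"?', header_value, re.IGNORECASE)
    return m.group(1) if m else None


def declared_encoding(xml_bytes):
    """Return the encoding named in the XML declaration, or None."""
    m = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        xml_bytes,
    )
    return m.group(1).decode("ascii") if m else None


def effective_encoding(content_type, xml_bytes):
    """External information wins, then the declaration, then UTF-8."""
    return (
        charset_from_content_type(content_type)
        or declared_encoding(xml_bytes)
        or "UTF-8"
    )


# The case from this report: header charset, no declaration in the file.
print(effective_encoding("text/xml; charset=iso-8859-15", b"<doc/>"))
```

Under this ordering, a document without an encoding declaration is read as UTF-8 only when the transport supplies no charset, which matches the advice to fix the server configuration rather than the parser.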