GNOME Bugzilla – Bug 576485
libxslt document() pays attention to charset in HTTP header
Last modified: 2009-03-24 07:09:00 UTC
Please describe the problem:
It appears that libxslt1.1 pays attention to the charset declaration in the Content-Type HTTP header when retrieving XML files with MIME types of application/xml or text/xml via the document() function. If a misconfigured web server sends "Content-Type: text/xml; charset=iso-8859-15" but the XML file itself has no encoding declaration in the XML prolog (and is thus to be taken as UTF-8), libxslt treats the incoming file as ISO-8859-15 and so mangles the multi-byte sequences that encode, for example, many common vowels with diacritics.

libxslt does not exhibit this behavior when the MIME type is 'text/html'. Saxon 6.5.5 does not exhibit it with any MIME type/charset combination.

I am attaching a test stylesheet that takes itself as input, retrieves a simple file in UTF-8 and Latin-9 encodings from a web server, and outputs the results with MIME types and charsets noted.

Steps to reproduce:
1. Use the attached XSLT to transform itself, e.g. with 'xsltproc test.xsl test.xsl', and observe the output.

Actual results:
When the server gives the MIME type as 'application/xml' or 'text/xml' and the encoding as 'ISO-8859-15', the conversion from the ISO encoding to UTF-8 is applied to the UTF-8 document, resulting in mangled bytes in the 'text' element where the encodings differ.

Expected results:
I would expect the incoming files to be treated as UTF-8 always.

Does this happen every time?
Yes.

Other information:
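The mangling described above can be reproduced outside libxslt. The following is a minimal Python sketch, not libxslt's actual code path: it decodes UTF-8 bytes as ISO-8859-15 and then re-encodes the result as UTF-8, which is the conversion the report describes. The sample string "café" and the function name misdecode are assumptions made for illustration; the content of the test file on the reporter's server is not shown in the report.

```python
def misdecode(text, wrong_charset="iso-8859-15"):
    """Encode text as UTF-8, then decode the bytes with the wrong charset.

    Hypothetical helper: it only imitates what happens when UTF-8 input
    is labeled as ISO-8859-15 by the HTTP Content-Type header.
    """
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode(wrong_charset)


# "é" is the two bytes C3 A9 in UTF-8; read as ISO-8859-15 they become
# the two characters "Ã" and "©".
mangled = misdecode("café")
print(mangled)  # cafÃ©

# Serializing the mangled text back to UTF-8 doubles the damage on the wire.
print(mangled.encode("utf-8"))  # b'caf\xc3\x83\xc2\xa9'
```

Plain ASCII characters pass through unchanged, which is why only the parts of the 'text' element where the two encodings differ come out wrong.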
Created attachment 131216
XSLT to exercise the bug; takes itself as input

Depends on my web server being set up as it presently is.
Wrong! See the XML 1.0 specification, Appendix F. Encoding information coming from the context (and HTTP headers are explicitly listed) takes precedence over what may or may not be found in the XMLDecl section of the document. Fix the server config, or what is being received is not XML as a result!

http://www.w3.org/TR/REC-xml/#sec-guessing-with-ext-info

Not a bug. Annoying, especially as Apache is often misconfigured. Fix your config, there is no way around it!

Daniel
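The precedence rule cited here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not libxml2's implementation: the function name effective_encoding and its parameters are hypothetical, and byte order mark and UTF-16 detection are deliberately left out.

```python
def effective_encoding(http_charset=None, xml_decl_encoding=None):
    """Pick the encoding a parser would use for an XML entity.

    Hypothetical, simplified model of the precedence discussed above:
    1. external information (the HTTP Content-Type charset) wins;
    2. otherwise the encoding named in the XML declaration is used;
    3. otherwise the default of UTF-8 applies.
    """
    if http_charset:
        return http_charset.lower()
    if xml_decl_encoding:
        return xml_decl_encoding.lower()
    return "utf-8"


# The reporter's case: header says ISO-8859-15, document declares nothing.
print(effective_encoding(http_charset="ISO-8859-15"))  # iso-8859-15

# With no charset in the header, the undeclared document defaults to UTF-8.
print(effective_encoding())  # utf-8
```

Under this model, removing the charset parameter from the server's Content-Type header, or making it match the file's real encoding, is what restores correct decoding.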