GNOME Bugzilla – Bug 331266
Support for user-specified encoding information for parsing
Last modified: 2021-07-05 13:26:19 UTC
According to Appendix F.2 [1] of the XML spec, there should be a way for the user to specify the encoding information and thus override any encoding information found in the XML entity.

Known use cases:
1) Parsing a DOMString; a DOMString is always UTF-16 encoded, regardless of what the encoding declaration says.
2) Parsing XML using the charset given in the HTTP Content-Type header.

Functions such as xmlCtxtReadIO() already take an @encoding argument, but this user-specified encoding is overridden by the BOM of the XML entity and by the encoding declaration. So @encoding is currently handled as a fallback encoding.

Proposals:
1) Add a parser option that explicitly instructs the parser to override any encoding information extracted from the XML entity with the specified @encoding. If the option is not set, @encoding continues to state a fallback encoding. See http://mail.gnome.org/archives/xml/2006-February/msg00063.html
2) Change the @encoding argument in the relevant functions to _override_ any encoding information extracted from the XML entity. See http://mail.gnome.org/archives/xml/2006-February/msg00064.html

Note that proposal 2) could break existing apps, since the @encoding argument has only been used as a _fallback_ encoding so far, i.e., it was used only when the parser could not extract any encoding information from the XML entity.

Eric Seidel requested a way of doing this with the push parser as well (see http://mail.gnome.org/archives/xml/2006-February/msg00052.html).

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-with-ext-info
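To make the difference between the fallback and override semantics concrete, here is a minimal sketch in plain C. It is not libxml2 code: the EncodingSources struct, resolve_fallback(), resolve() and the override_enc flag are hypothetical names used only for illustration, and the relative order of BOM and encoding declaration is an assumption.

```c
#include <stddef.h>

/* Hypothetical model (not libxml2 API): the places an encoding can come from. */
typedef struct {
    const char *user;      /* the @encoding argument, or NULL if not given */
    const char *bom;       /* encoding implied by a BOM, or NULL if no BOM */
    const char *declared;  /* encoding declaration of the entity, or NULL */
} EncodingSources;

/* Behaviour described in this report: @encoding is used only when the
 * entity itself yields no encoding information (fallback semantics).
 * Assumption: the BOM is consulted before the declaration. */
const char *resolve_fallback(const EncodingSources *s) {
    if (s->bom != NULL) return s->bom;
    if (s->declared != NULL) return s->declared;
    if (s->user != NULL) return s->user;
    return "UTF-8";  /* XML default when nothing else is known */
}

/* Proposal 1: a parser option (modelled here as override_enc) makes the
 * user-specified encoding win; without it, nothing changes. */
const char *resolve(const EncodingSources *s, int override_enc) {
    if (override_enc && s->user != NULL) return s->user;
    return resolve_fallback(s);
}
```

With a DOMString whose declaration says ISO-8859-1 and a user-specified UTF-16, resolve() without the flag picks the declared encoding, and with the flag it picks UTF-16. Proposal 2) would amount to always behaving as if the flag were set.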
It would be nice to preserve the value of the declared encoding even if the encoding is overridden. For DOM, this would make it possible to populate the Document.xmlEncoding [1] property.

[1] http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Document3-encoding
Correction regarding the BOM: I think that if a BOM exists and indicates that the actual encoding differs from the explicitly specified one, then this should produce an error.
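A rough sketch of such a check, again in plain C and not taken from libxml2: bom_encoding() and bom_conflicts() are hypothetical helpers, and treating a user-specified "UTF-16" as compatible with either byte order is an assumption about what "differs" should mean.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the encoding implied by a leading BOM, or NULL if there is none.
 * The 4-byte marks are tested first because the UTF-32LE BOM starts with
 * the same two bytes as the UTF-16LE BOM. */
const char *bom_encoding(const unsigned char *buf, size_t len) {
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
        return "UTF-32LE";
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
        return "UTF-32BE";
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;
}

/* ASCII case-insensitive equality of two encoding names. */
static int ieq(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* ASCII case-insensitive test whether s begins with prefix. */
static int istarts(const char *s, const char *prefix) {
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix)) return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Return 1 if the BOM contradicts the user-specified encoding, else 0.
 * Assumption: "UTF-16" and "UTF-32" without a byte order accept both. */
int bom_conflicts(const char *user, const char *bom) {
    if (user == NULL || bom == NULL) return 0;  /* nothing to compare */
    if (ieq(user, bom)) return 0;
    if ((ieq(user, "UTF-16") || ieq(user, "UTF-32")) && istarts(bom, user)) return 0;
    return 1;
}
```

A parser following the suggestion above would report an error whenever bom_conflicts() returns 1 for the user-specified encoding and the detected BOM, rather than silently picking one of the two.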
(In reply to comment #0)
> According to Appendix F.2 [1] of the XML spec, there should be a way for
> the user to specify the encoding information, thus override any encoding
> information found in the XML entity.
[ ... ]
> [1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-with-ext-info

Your reading of XML 1.0 (the third edition as well as the current fifth edition) is plucked out of thin air. (Or perhaps it relies on Daniel Veillard's claim that the spec says so. [0]) The appendix "F.2 Priorities in the Presence of External Encoding Information" does *not* say that there should be a way for the *user* to override the encoding information. This is plain wrong.

This is the correct reading:

* FIRSTLY: Appendix "F Autodetection of Character Encodings (Non-Normative)" is, as the heading says, non-normative. [1] It is non-normative because it only gives strategies for how to handle XML 1.0's *normative* section about encoding.

* SECONDLY: The normative section on encoding (which appendix F tries to help dealing with) is section "4.3.3 Character Encoding in Entities". [2]

* THIRDLY: Even if we look squarely at appendix F (including F.2), there is nothing there which supports the claim that the user should be able to override the encoding. The first sentence of F.2 says:

]] .... when the XML entity is accompanied by encoding information, as in some file systems and some network protocols ... [[

The keyword here is "accompanied". When an HTTP server serves an XML document, it can accompany the document with encoding info in the HTTP protocol's Content-Type: header. WebKit, or whichever parser parses the document, cannot "accompany" the document on the server with encoding info. And ditto for file systems: the parser cannot decide what encoding info the file system is supposed to accompany the file with.
(I think perhaps the word "external" is misinterpreted to mean "any external encoding override", but it is only "accompanying encoding info/override" that is meant.)

*BUT* I suppose that if the _user *agent*_ (rather than the *user*) internally somehow wants to "serve" the document (pass it along) to libxml2 as UTF-16, then you can transcode the document and accompany it with info which says that the document is UTF-16 encoded. And I suppose that libxml2 could be given a feature that lets it understand the encoding information that you accompany the document with.

[0] http://mail.gnome.org/archives/xml/2006-February/msg00060.html
[1] http://www.w3.org/TR/REC-xml/#sec-guessing
[2] http://www.w3.org/TR/REC-xml/#charencoding
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.