Bug 331266 - Support for user-specified encoding information for parsing
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
Reported: 2006-02-15 11:56 UTC by kbuchcik
Modified: 2021-07-05 13:26 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description kbuchcik 2006-02-15 11:56:00 UTC
According to Appendix F.2 [1] of the XML spec, there should be a way for
the user to specify the encoding information, thus overriding any encoding
information found in the XML entity.

Known use cases:
 1) Parsing a DOMString; a DOMString is always UTF-16 encoded, regardless of
    what the encoding declaration says.
 2) Parsing XML served over HTTP with a charset parameter in the
    Content-Type header

Functions like xmlCtxtReadIO() already take an @encoding argument, but this
user-specified encoding is overridden by the BOM of the XML entity and by
the encoding declaration. So @encoding is currently treated only as a
fallback encoding.
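
The fallback behavior described above can be sketched as follows. This is
only an illustration of the precedence, not libxml2 code: the sketch_*
names are hypothetical, the BOM table is reduced to UTF-8 and UTF-16, and
the assumption that a BOM is consulted before the encoding declaration is
mine (the report only says that both take priority over @encoding).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not libxml2 API: returns the encoding implied by a
 * byte order mark, or NULL if the buffer does not start with one.
 * Simplified: UTF-32 BOMs are not distinguished from UTF-16 ones. */
static const char *sketch_bom_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;
}

/* Fallback semantics: the caller's encoding is used only when the entity
 * itself carries no encoding information.
 * declared: value of the encoding declaration, or NULL if absent. */
static const char *sketch_resolve_fallback(const unsigned char *buf, size_t len,
                                           const char *declared,
                                           const char *user_encoding)
{
    const char *bom = sketch_bom_encoding(buf, len);

    if (bom != NULL)
        return bom;            /* assumed: BOM is consulted first */
    if (declared != NULL)
        return declared;       /* then the encoding declaration */
    return user_encoding;      /* finally the caller's @encoding, may be NULL */
}
```

With this logic, passing "UTF-16" as the user encoding has no effect on a
document that declares encoding="ISO-8859-1", which is the limitation this
report is about.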

Proposals:

1) Add a parser option that explicitly instructs the parser to override
   any encoding information extracted from the XML entity with the
   specified @encoding.
   If the option is not set, @encoding keeps its current meaning as a
   fallback encoding.
   See http://mail.gnome.org/archives/xml/2006-February/msg00063.html

2) Change the @encoding argument in relevant functions to _override_ any
   encoding information extracted from the XML entity.
   See http://mail.gnome.org/archives/xml/2006-February/msg00064.html

Note that proposal 2) could break existing apps, since the @encoding
argument has so far been used only as a _fallback_ encoding, i.e., only
when the parser could not extract any encoding information from the XML
entity.
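
To make the difference between the two proposals concrete, here is a
minimal sketch of proposal 1. The option name SKETCH_OPT_FORCE_ENCODING
and the function are hypothetical and are not part of the libxml2 API.
Proposal 2 would correspond to behaving as if the flag were always set,
which is why it could change results for existing callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bit, not a libxml2 xmlParserOption value. */
enum { SKETCH_OPT_FORCE_ENCODING = 1 << 0 };

/* detected: encoding extracted from the entity (BOM or encoding
 *           declaration), or NULL if nothing was found.
 * user_encoding: the caller's @encoding argument, or NULL. */
static const char *sketch_pick_encoding(const char *detected,
                                        const char *user_encoding,
                                        int options)
{
    /* This sketch knows exactly one option bit. */
    assert((options & ~SKETCH_OPT_FORCE_ENCODING) == 0);

    if ((options & SKETCH_OPT_FORCE_ENCODING) && user_encoding != NULL)
        return user_encoding;   /* proposal 1: explicit override */
    if (detected != NULL)
        return detected;        /* unchanged behavior without the option */
    return user_encoding;       /* @encoding as fallback */
}
```

Existing callers that pass no options keep the fallback semantics, so only
code that opts in sees the override.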

Eric Seidel requested a way of doing this with the push-parser as well
(see http://mail.gnome.org/archives/xml/2006-February/msg00052.html).

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-with-ext-info
Comment 1 kbuchcik 2006-02-15 13:02:31 UTC
It would be nice to preserve the value of the declared encoding, even if the encoding is overridden. For DOM, this would make it possible to populate the Document.xmlEncoding [1] property.

[1] http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Document3-encoding
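
One way to picture this request is to track two values per document
instead of one. The struct and function below are hypothetical and only
illustrate the idea; they do not reflect how libxml2 stores encoding
information.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record, not a libxml2 type. */
typedef struct {
    const char *declared_encoding; /* from the XML declaration, may be NULL */
    const char *actual_encoding;   /* encoding actually used for decoding */
} SketchDocEncoding;

/* forced: user-specified override, or NULL if none was given. */
static SketchDocEncoding sketch_record_encoding(const char *declared,
                                                const char *forced)
{
    SketchDocEncoding e;

    /* The sketch requires at least one source of encoding information. */
    assert(declared != NULL || forced != NULL);

    e.declared_encoding = declared;  /* preserved even when overridden */
    e.actual_encoding = (forced != NULL) ? forced : declared;
    return e;
}
```

A DOM binding could then report declared_encoding as Document.xmlEncoding
regardless of which encoding was used to decode the input.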
Comment 2 kbuchcik 2006-02-22 14:54:31 UTC
Correction regarding the BOM: I think that if a BOM exists and indicates an encoding that differs from the explicitly specified one, this should produce an error.
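
A rough sketch of such a consistency check, under stated assumptions: the
sketch_* functions are hypothetical, only UTF-8 and UTF-16 BOMs are
recognized, and encoding names are compared case-sensitively for brevity
(real encoding names are case-insensitive).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encoding implied by a BOM, or NULL if none. */
static const char *sketch_bom_encoding(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";
    return NULL;
}

/* Returns 0 if the BOM (if any) agrees with the explicitly specified
 * encoding, -1 if it contradicts it. */
static int sketch_check_bom(const unsigned char *buf, size_t len,
                            const char *forced)
{
    const char *bom;

    assert(forced != NULL);
    bom = sketch_bom_encoding(buf, len);
    if (bom == NULL)
        return 0;  /* no BOM, nothing to contradict */

    /* "UTF-16" without a byte order is compatible with either UTF-16 BOM. */
    if (strcmp(forced, "UTF-16") == 0 && strncmp(bom, "UTF-16", 6) == 0)
        return 0;

    return strcmp(forced, bom) == 0 ? 0 : -1;
}
```

A parser using this check would report an error instead of silently
choosing between the BOM and the user-specified encoding.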
Comment 3 Leif Halvard Silli 2011-08-11 18:00:55 UTC
(In reply to comment #0)
> According to Appendix F.2 [1] of the XML spec, there should be a way for
> the user to specify the encoding information, thus overriding any encoding
> information found in the XML entity.
   [ ... ]
> [1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing-with-ext-info

Your reading of XML 1.0 (the third edition and the current fifth edition alike) is plucked out of thin air. (Or perhaps it relies on Daniel Veillard's claim to that effect. [0]) The appendix "F.2 Priorities in the Presence of External Encoding Information" does *not* say that there should be a way for the *user* to override the encoding information. That is plain wrong.

This is the correct reading:

* FIRSTLY: Appendix "F Autodetection of Character Encodings (Non-Normative)" is, as the heading says, non-normative. [1] It is non-normative because it only gives strategies for how to handle XML 1.0's *normative* section about encoding.

* SECONDLY: The normative section on encoding (which appendix F tries to help with) is section "4.3.3 Character Encoding in Entities". [2]

* THIRDLY, even if we look squarely at appendix F (including F.2), there is nothing there which supports the claim that the user should be able to override the encoding. The first sentence of F.2 says: 
    ]] .... when the XML entity is accompanied by encoding information, as in some file systems and some network protocols ... [[
 The keyword here is "accompanied". When an HTTP server serves an XML document, it can accompany the document with encoding info in the HTTP protocol's Content-Type: header. WebKit, or whichever parser parses the document, cannot "accompany" the document on the server with encoding info. And ditto for file systems: the parser cannot decide what encoding info the file system is supposed to accompany the file with. (I think perhaps the word "external" is misinterpreted to mean "any external encoding override", but only "accompanying encoding info/override" is meant.)


*BUT* I suppose that if the _user *agent*_ (rather than the *user*) internally somehow wants to "serve" the document to libxml2 (pass it along) as UTF-16, then you can transcode the document and accompany it with info which says that the document is UTF-16 encoded. And I suppose that libxml2 could be given a feature that lets it understand the encoding information that you accompany the document with.

[0] http://mail.gnome.org/archives/xml/2006-February/msg00060.html
[1] http://www.w3.org/TR/REC-xml/#sec-guessing
[2] http://www.w3.org/TR/REC-xml/#charencoding
Comment 4 GNOME Infrastructure Team 2021-07-05 13:26:19 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.