Bug 599170 – Some way to disable auto-detection of encoding during parsing

Summary:	Some way to disable auto-detection of encoding during parsing


Status:	RESOLVED OBSOLETE

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

Reported:	2009-10-21 11:34 UTC by Philippe Normand
Modified:	2021-07-05 13:24 UTC

GNOME target:	---
GNOME version:	---

Description Philippe Normand 2009-10-21 11:34:54 UTC

Libxml2 tries to be clever when parsing XML data by detecting an encoding= parameter in the XML declaration. But if the application passes a UTF-16 string directly to the parser, this autodetection should not be needed. So it would be nice to have an API that allows disabling this feature, maybe?

Comment 1 Daniel Veillard 2009-10-21 16:31:36 UTC

It's not "tries to be clever", it's "follows the specification"!

Autodetection is definitely needed, especially with UTF-16, where
you could have big-endian or little-endian variants.
Most of the parsing APIs, like xmlRead*, allow the caller to pass a known
encoding. I suggest you read Appendix F of the XML 1.0 specification!

No bug reported, so closed as NOTABUG.

Daniel
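
For illustration, the byte-order detection that Appendix F describes for UTF-16 might be sketched as follows. This is only a minimal sketch covering a few rows of the Appendix F table. The names sketch_encoding and sketch_detect_encoding are hypothetical, are not part of libxml2, and do not reflect how libxml2 implements detection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a libxml2 type. */
typedef enum {
    SKETCH_ENC_UNKNOWN,
    SKETCH_ENC_UTF8,
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE
} sketch_encoding;

/* Guess the encoding from the first bytes of a document, using a subset
 * of the (non-normative) Appendix F table of XML 1.0. */
sketch_encoding sketch_detect_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-8 byte order mark. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return SKETCH_ENC_UTF8;

    /* FF FE 00 00 is a UCS-4 signature; UCS-4 is not handled here. */
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00)
        return SKETCH_ENC_UNKNOWN;

    /* UTF-16 byte order marks. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return SKETCH_ENC_UTF16LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return SKETCH_ENC_UTF16BE;

    /* No byte order mark: look for "<?" encoded as two UTF-16 units. */
    if (len >= 4 && buf[0] == 0x3C && buf[1] == 0x00 &&
        buf[2] == 0x3F && buf[3] == 0x00)
        return SKETCH_ENC_UTF16LE;
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x3C &&
        buf[2] == 0x00 && buf[3] == 0x3F)
        return SKETCH_ENC_UTF16BE;

    return SKETCH_ENC_UNKNOWN;
}
```

The same four-byte prefix "<?" yields two different byte patterns depending on endianness, which is why a parser cannot treat "UTF-16" as a single fixed layout without either a byte order mark or outside information.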

Comment 2 Philippe Normand 2009-10-22 07:44:54 UTC

Sorry I wasn't clear enough. This is in the context of WebKit but
other applications could do the same. WebKit takes care of
detecting the encoding of the data and decodes it to UTF-16 before
handing it to libxml2.

Since WebKit has already handled detecting the encoding and is
decoding the data itself we would like the ability to disable
libxml2's autodetection of encodings to prevent it from switching
to the encoding of the original source data rather than sticking
with UTF-16.

Comment 3 Daniel Veillard 2009-10-22 12:00:32 UTC

As I said, just use one of the APIs where the encoding is provided
when creating the parser, if you're 100% sure of what encoding the data
stream is in! Again, reread Appendix F of the spec: contextual encoding
information may override the detected or declared encoding!
 
Daniel

Comment 4 Philippe Normand 2009-10-22 12:22:31 UTC

By the way, we use xmlReadMemory() and we pass the correct UTF-16 encoding variant.

Comment 5 Mark Rowe 2009-10-22 19:42:27 UTC

Daniel, we’re looking for a way to take advantage of the aspect of Appendix F that you mention, but the libxml2 API seems to be lacking. WebKit creates a parser context via xmlCreatePushParserCtxt and feeds it data via xmlParseChunk. As far as I can see there is no way to specify the encoding when creating a push parser context. That’s what motivated the filing of this enhancement request. Is there something that we’re missing?

Comment 6 Daniel Veillard 2009-10-23 21:37:36 UTC

Mark, ah, you were using xmlCreatePushParserCtxt!
Right, with that specific API you can't do that directly.
I don't see how that squares with comment #4, though,
so who's right? You, I assume.

I guess what's needed is a slight variation on
xmlCreatePushParserCtxt() which would set the UTF-16 decoder on the
input buffer.
That is not much fun to write and test, especially since writing a
standalone test program in C for push mode with UTF-16 is not trivial.

One trick might be to call xmlCreatePushParserCtxt() with no first
data, then call xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF16LE)
(or BE if that's what you're using) then do the normal
xmlParseChunk() calls.

That might be sufficient for your case.

Daniel

Comment 7 Mark Rowe 2009-10-23 21:52:54 UTC

The trick you suggest is what we have been using up until recently. When doing that, libxml2 still attempts to switch to any encoding found within the XML declaration. We had workarounds for that switching, but they were foiled when the encoding switch was made more aggressive by <http://git.gnome.org/cgit/libxml2/commit/?id=a6c76a>. That’s what motivated us to request a more explicit way to override the encoding.

We do use xmlReadMemory in some situations related to XSLT, but the majority of our parsing is done via the push parser. I haven’t looked to see whether there’s an equivalent to xmlReadMemory that allows us to override the encoding.

Comment 8 Tobias Mueller 2010-05-06 16:11:10 UTC

Reopening as I can't see any open question.

Comment 9 GNOME Infrastructure Team 2021-07-05 13:24:24 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.