GNOME Bugzilla – Bug 599170
Some way to disable auto-detection of encoding during parsing
Last modified: 2021-07-05 13:24:24 UTC
Libxml2 tries to be clever when parsing XML data by detecting an encoding= parameter in the XML declaration. But if the application passes a UTF-16 string directly to the parser, this autodetection should not be needed. So it would be nice to have an API that allows disabling this feature.
It's not "tries to be clever", it's "follows the specification"! Autodetection is definitely needed, especially with UTF-16, where you could have big-endian or little-endian variants. Most of the parsing APIs, like xmlRead*, allow the caller to pass a known encoding. I suggest you read Appendix F of the XML 1.0 specification! No bug reported, so closed as NOTABUG. Daniel
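For readers who do not have Appendix F at hand, the byte-level autodetection it describes can be sketched roughly as follows. This is a minimal illustration in plain C, not libxml2's actual code: the enum and the function name sketch_detect_encoding are hypothetical, and only a subset of the Appendix F cases (UTF-8, UTF-16, and the ASCII-compatible "<?xm" prefix) is covered.

```c
#include <stddef.h>

/* Hypothetical names for this sketch; libxml2 uses its own xmlCharEncoding enum. */
typedef enum {
    ENC_UNKNOWN,       /* not enough data, or a case this sketch does not cover */
    ENC_UTF8,          /* UTF-8 byte order mark seen */
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_ASCII_COMPAT   /* starts with "<?xm": read the declaration to decide */
} sketch_encoding;

/* Guess the encoding family from the first bytes of an XML entity,
 * following a subset of the table in Appendix F of XML 1.0. */
static sketch_encoding sketch_detect_encoding(const unsigned char *p, size_t n)
{
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;

    if (n >= 2 && ((p[0] == 0xFF && p[1] == 0xFE) ||
                   (p[0] == 0xFE && p[1] == 0xFF))) {
        /* A 16-bit BOM followed by 00 00 is a UCS-4 form, out of scope here. */
        if (n >= 4 && p[2] == 0x00 && p[3] == 0x00)
            return ENC_UNKNOWN;
        return (p[0] == 0xFF) ? ENC_UTF16LE : ENC_UTF16BE;
    }

    if (n >= 4) {
        /* No BOM: look at how "<?" is laid out in memory. */
        if (p[0] == 0x00 && p[1] == 0x3C && p[2] == 0x00 && p[3] == 0x3F)
            return ENC_UTF16BE;
        if (p[0] == 0x3C && p[1] == 0x00 && p[2] == 0x3F && p[3] == 0x00)
            return ENC_UTF16LE;
        if (p[0] == 0x3C && p[1] == 0x3F && p[2] == 0x78 && p[3] == 0x6D)
            return ENC_ASCII_COMPAT;
    }
    return ENC_UNKNOWN;
}
```

The point relevant to this report is the last case: for ASCII-compatible input the parser has to read the encoding= value to know the exact encoding, which is the step the reporter wants to be able to skip when the data has already been transcoded.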
Sorry I wasn't clear enough. This is in the context of WebKit but other applications could do the same. WebKit takes care of detecting the encoding of the data and decodes it to UTF-16 before handing it to libxml2. Since WebKit has already handled detecting the encoding and is decoding the data itself we would like the ability to disable libxml2's autodetection of encodings to prevent it from switching to the encoding of the original source data rather than sticking with UTF-16.
As I said, just use one of the APIs where the encoding is provided when creating the parser, if you're 100% sure of what encoding the data stream is in! Again, reread Appendix F of the spec: contextual encoding information may override the detected or declared encoding! Daniel
By the way, we use xmlReadMemory() and we pass the correct UTF-16 encoding variant.
Daniel, we’re looking for a way to take advantage of the aspect of appendix F that you mention but the libxml2 API seems to be lacking. WebKit creates a parser context via xmlCreatePushParserCtxt and feeds it data via xmlParseChunk. As far as I can see there is no way to specify the encoding when creating a push parser context. That’s what motivated the filing of this enhancement request. Is there something that we’re missing?
Mark, ah, you were using xmlCreatePushParserCtxt! Well, with that specific API, right, you can't do that directly. I don't see how that would not contradict comment #4 though, so who's right? You, I assume. I guess what's needed is a slight variation on xmlCreatePushParserCtxt() which would set the UTF-16 decoder on the input buffer. Not very fun to write and test, especially since with push mode and UTF-16, having a standalone testing program in C is not trivial. One trick might be to call xmlCreatePushParserCtxt() with no initial data, then call xmlSwitchEncoding(ctxt, XML_CHAR_ENCODING_UTF16LE) (or BE if that's what you're using), then do the normal xmlParseChunk() calls. That might be sufficient for your case. Daniel
The trick you suggest is what we have been using up until recently. When doing that, libxml2 still attempts to switch to any encoding found within the XML declaration. We had workarounds for that switching, but they were foiled when the encoding switch was made more aggressive by <http://git.gnome.org/cgit/libxml2/commit/?id=a6c76a>. That’s what motivated us to request a more explicit way to override the encoding. We do use xmlReadMemory in some situations related to XSLT, but the majority of our parsing is done via the push parser. I haven’t looked to see whether there’s an equivalent to xmlReadMemory that allows us to override the encoding.
Reopening as I can't see any open question.
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.