GNOME Bugzilla – Bug 746126
Support not substituting entities in HTML parser
Last modified: 2021-07-05 13:27:07 UTC
Created attachment 299257: Patch to add support for retaining entities in HTML parser

The HTML parser currently converts entities to UTF-8 text, e.g. &nbsp; is converted to the bytes 0xC2 0xA0 in a text node. When programming HTML transformations, this can be undesirable. I have also found it unhelpful when working with incorrectly encoded documents. In these cases, keeping the entity references is preferable. The attached patch introduces an option to do this; with it enabled, the DOM will include entity reference nodes.

I would be pleased if the patch could be included in libxml2. Happy to assist however I can.
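For readers unfamiliar with the byte values quoted in the report: 0xC2A0 is the two-byte UTF-8 encoding of U+00A0 (no-break space). The following is a minimal, self-contained sketch of that encoding step. The function name utf8_encode is hypothetical and is not part of libxml2; it only illustrates what "converted to UTF-8 text" means for a single code point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not libxml2 API): encode one Unicode code point
 * as UTF-8. Writes up to 4 bytes into out and returns the number of
 * bytes written, or 0 if cp is not a valid scalar value.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)  /* surrogates are not encodable */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Calling utf8_encode(0xA0, buf) yields two bytes, 0xC2 followed by 0xA0, which is the sequence a text node ends up holding once the entity has been substituted.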
That sounds like a reasonable feature addition indeed. The patch looks mostly good, but:

1/ Where is the name parsing code in htmlParseRefName coming from? If we parse a name, we already have a routine for that; I would prefer reusing it to reimplementing another segment.
2/ The code after success = htmlParseRefName(ctxt, &name); is unclear. What are you attempting to do there?

Thanks for the patch. I need to understand it a bit better and possibly change those parts before adding it.

Daniel
Thanks for your prompt attention to this, Daniel. It is quite a while since I authored the patch, but I believe this is the situation:

1. htmlParseRefName: existing routines weren't really appropriate for this.
   a. For character references (e.g. &#38; &#x26;) the existing routine (htmlParseCharRef) returns the numeric value (e.g. 38).
   b. For named references (e.g. &amp;) the existing routine (htmlParseEntityRef) again fully interprets the reference and returns the associated htmlEntityDescPtr.

   A routine was needed that parsed only syntactically (not semantically), returning just the name as a string, so that:
   a. Character references aren't 'simplified'; for example, &#38; stays that way and doesn't turn into &amp; (or vice versa, nor do hex references turn into decimal, etc.).
   b. Named references can be represented even if the entity is technically undefined; indeed, a use case for this new functionality is finding entities in the DOM tree and validating them in client code.

2. The code following the line success = htmlParseRefName(ctxt, &name); is modelled on the code directly below it (the existing entity parsing code). If htmlParseRefName fails, the text removed from the input stream prior to the failure is returned in 'name'. The existing htmlParseEntityRef function behaves the same way: if we encounter &foo&bar; then it will consume &foo and then fail, so 'name' will contain 'foo' and the parser will continue from &bar;. In that case the '&' as well as 'name' need to be added to the DOM as text (hence the call to the SAX method 'characters'). On the other hand, if htmlParseRefName succeeds, the entity reference is added to the DOM (via the SAX method 'reference').

Happy to add some code comments to clarify this if desired, but I'm not sure it's warranted, as the existing code is similarly sparse. Let me know if you'd like any further information or action from me.

Best regards,
Ben
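The purely syntactic parsing described above can be sketched roughly as follows. This is not the code from the attached patch and not libxml2 API: the function name parse_ref_name, its signature, and the exact character rules (ASCII letters and digits for names, "#" or "#x" prefixes for character references) are assumptions made for illustration. It copies the reference name verbatim without interpreting it, and on failure leaves the consumed text in the output buffer so the caller can emit it as plain characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not libxml2 API, not the patch's code).
 * `in` points just past the '&'. The reference name ("amp", "#38",
 * "#x26", ...) is copied into `name` exactly as written.
 * Returns 1 if a terminating ';' follows the name, 0 otherwise.
 * On failure `name` still holds the text consumed so far.
 * `*consumed` receives the number of input bytes consumed
 * (including the ';' on success). Scanning stops early if `name`
 * is full, which is then reported as a failure.
 */
int parse_ref_name(const char *in, char *name, size_t cap, size_t *consumed)
{
    size_t i = 0;     /* bytes consumed from the input */
    size_t n = 0;     /* bytes written to name */
    size_t body = 0;  /* characters after any "#" / "#x" prefix */
    int numeric = 0, hex = 0;

    assert(in != NULL && name != NULL && consumed != NULL && cap > 0);

    if (in[i] == '#' && n + 1 < cap) {
        numeric = 1;
        name[n++] = in[i++];
        if ((in[i] == 'x' || in[i] == 'X') && n + 1 < cap) {
            hex = 1;
            name[n++] = in[i++];
        }
    }

    while (n + 1 < cap) {
        unsigned char c = (unsigned char)in[i];
        int ok;

        if (numeric)
            ok = hex ? isxdigit(c) : isdigit(c);
        else
            ok = (body == 0) ? isalpha(c) : isalnum(c);
        if (!ok)
            break;
        name[n++] = (char)c;
        i++;
        body++;
    }
    name[n] = '\0';

    if (body > 0 && in[i] == ';') {
        *consumed = i + 1;   /* success: also swallow the ';' */
        return 1;
    }
    *consumed = i;           /* failure: caller re-emits '&' + name as text */
    return 0;
}
```

With the input foo&bar; (the text after the first '&'), the sketch returns 0 with 'foo' in name and three bytes consumed, so a caller would emit "&foo" as characters and resume parsing at &bar;. With #38; or #x26; it returns 1 and keeps the name in its original decimal or hex spelling.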
Would there be any chance in getting this patch included? Any way I can help?
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.