Bug 746126 - Support not substituting entities in HTML parser
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: htmlparser
Version: git master
Hardware: Other All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2015-03-13 01:54 UTC by Ben Schmidt
Modified: 2021-07-05 13:27 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to add support for retaining entities in HTML parser (7.75 KB, text/plain)
2015-03-13 01:54 UTC, Ben Schmidt

Description Ben Schmidt 2015-03-13 01:54:30 UTC
Created attachment 299257 [details]
Patch to add support for retaining entities in HTML parser

The HTML parser currently converts entities to UTF-8 text, e.g. &nbsp; is converted to the bytes 0xC2 0xA0 in a text node. When programming HTML transformations, this can be undesirable. I have also found it to be unhelpful behaviour when working with incorrectly encoded documents. In these cases, keeping the entity references is preferable. The attached patch introduces an option to do this; the DOM will then include entity reference nodes. I would be pleased if the patch could be included in libxml2. Happy to assist however I can.
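
For readers wondering where the bytes 0xC2 0xA0 come from: &nbsp; names the code point U+00A0, and those two bytes are its UTF-8 encoding. The following is a minimal sketch of that encoding step; `encode_utf8` is a hypothetical helper written for illustration, not a libxml2 function and not part of the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point.
 * Hypothetical helper for illustration; not a libxml2 API. */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

Calling `encode_utf8(0xA0, buf)` writes two bytes, 0xC2 followed by 0xA0, which is the text-node content described above. Once that substitution has happened, the DOM no longer records that the source document used an entity.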
Comment 1 Daniel Veillard 2015-03-15 01:29:43 UTC
That sounds like a reasonable feature addition indeed. The patch looks mostly
good, but:
  1/ where is the name parsing code in htmlParseRefName coming from?
     If we parse a name, we already have a routine for that; I would prefer
     reusing it to reimplementing another segment.
  2/ the code after
     success = htmlParseRefName(ctxt, &name);
     is unclear; what are you attempting to do there?

  Thanks for the patch. I need to understand it a bit better and
possibly change those parts before adding it.

Daniel
Comment 2 Ben Schmidt 2015-03-17 03:47:55 UTC
Thanks for your prompt attention to this, Daniel.

It is quite a while since I authored the patch, but I believe this
is the situation:

1. htmlParseRefName: existing routines weren't really appropriate
   for this.
    a. For character references (e.g. &#38; &#x26;) the existing
       routine (htmlParseCharRef) returns the numeric value (e.g.
       38).
    b. For named references (e.g. &amp;) the existing routine
       (htmlParseEntityRef) again fully interprets the reference and
       returns the associated htmlEntityDescPtr.
   A routine was needed that parsed only syntactically (not
   semantically), merely returning the name as a string, so that:
    a. Character references aren't 'simplified' so, for example,
       &#38; stays that way and doesn't turn into &amp; (or
       vice versa, nor do hex references turn into decimal, etc.)
    b. Named references could be represented even if the entity is
       technically undefined; indeed, a use case for this new
       functionality is finding entities in the DOM tree and
       validating them in client code.

2. The code following the line success = htmlParseRefName(ctxt, &name);
   is modelled on the code directly below it (the existing entity
   parsing code). If htmlParseRefName fails the text removed from
   the input stream prior to failure is returned in 'name' (the
   existing htmlParseEntityRef function behaves the same way; so if
   we encounter &foo&bar; then it will consume &foo and then fail, so
   'name' will contain 'foo' and the parser will continue from
   &bar;); the '&' as well as 'name' need to be added to the DOM as
   text (hence the call to SAX method 'characters'). On the other
   hand, if htmlParseRefName succeeds, the entity reference is added
   to the DOM (via the SAX method 'reference'). Happy to add some
   code comments to clarify this if desired, but not sure if it's
   warranted, as the existing code is similarly sparse.
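
The two points above can be sketched together in a few lines of plain C. This is an illustration under stated assumptions, not the code from the attached patch: `scan_ref_name`, `dispatch_refs`, `emit` and `EventLog` are hypothetical names, and the event log merely stands in for the SAX methods 'characters' and 'reference'. The scanner is purely syntactic, so it keeps "#38", "#x26" or an undefined name exactly as written.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a SAX handler: events are appended to a log. */
typedef struct {
    char log[256];
} EventLog;

static void emit(EventLog *ev, const char *kind, const char *s, size_t len)
{
    size_t used = strlen(ev->log);
    snprintf(ev->log + used, sizeof ev->log - used,
             "%s(%.*s) ", kind, (int)len, s);
}

/* Syntactic scan only. `in` points just past the '&'. Returns the length of
 * the name as written and sets *ok to 1 when the name is non-empty and is
 * followed by ';'. Nothing is looked up or converted. */
static size_t scan_ref_name(const char *in, int *ok)
{
    size_t i = 0, start;
    int numeric = 0, hex = 0;

    if (in[i] == '#') {
        numeric = 1;
        i++;
        if (in[i] == 'x' || in[i] == 'X') {
            hex = 1;
            i++;
        }
    }
    start = i;
    for (;;) {
        unsigned char c = (unsigned char)in[i];
        int accepted = !numeric ? isalnum(c) : hex ? isxdigit(c) : isdigit(c);
        if (!accepted)
            break;
        i++;
    }
    *ok = (i > start && in[i] == ';');
    return i;
}

/* On success report a reference; on failure report '&' and the consumed
 * text as characters, then resume at the first unread byte. */
void dispatch_refs(const char *in, EventLog *ev)
{
    assert(in != NULL && ev != NULL);
    ev->log[0] = '\0';
    while (*in != '\0') {
        if (*in != '&') {
            size_t run = strcspn(in, "&");
            emit(ev, "chars", in, run);
            in += run;
            continue;
        }
        int ok;
        size_t len = scan_ref_name(in + 1, &ok);
        if (ok) {
            emit(ev, "ref", in + 1, len);
            in += 1 + len + 1;            /* '&', name, ';' */
        } else {
            emit(ev, "chars", "&", 1);
            if (len > 0)
                emit(ev, "chars", in + 1, len);
            in += 1 + len;                /* continue from the next '&' */
        }
    }
}
```

For the input &foo&bar; this sketch logs `chars(&) chars(foo) ref(bar)`, matching the behaviour described in point 2. Whether the real parser should emit the '&' and the consumed name as one text event or two is a design detail this sketch does not settle.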

Let me know if you'd like any further information or action from me.

Best regards,

Ben
Comment 3 Ben Schmidt 2017-06-08 09:14:02 UTC
Would there be any chance of getting this patch included? Is there any way I can help?
Comment 4 GNOME Infrastructure Team 2021-07-05 13:27:07 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.