Bug 746126 – Support not substituting entities in HTML parser


Summary:	Support not substituting entities in HTML parser


Status:	RESOLVED OBSOLETE

Product:	libxml2
Classification:	Platform
Component:	htmlparser
Version:	git master
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers


Reported:	2015-03-13 01:54 UTC by Ben Schmidt
Modified:	2021-07-05 13:27 UTC


Attachments
Patch to add support for retaining entities in HTML parser (7.75 KB, text/plain) 2015-03-13 01:54 UTC, Ben Schmidt

Description Ben Schmidt 2015-03-13 01:54:30 UTC

Created attachment 299257
Patch to add support for retaining entities in HTML parser

The HTML parser currently converts entities to UTF-8 text, e.g. &nbsp; is converted to the bytes 0xC2 0xA0 in a text node. When programming HTML transformations, this can be undesirable. I have also found this behaviour unhelpful when working with incorrectly encoded documents. In these cases, keeping the entity references is preferable. The attached patch introduces an option to do this; when it is enabled, the DOM will include entity reference nodes. I would be pleased if the patch could be included in libxml2. Happy to assist however I can.
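
To make the conversion concrete: &nbsp; denotes the code point U+00A0, and its UTF-8 encoding is the two bytes 0xC2 0xA0. The following is a minimal sketch of that encoding step only. The helper name encode_utf8 is hypothetical and is not part of libxml2 or of the attached patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not libxml2 API): encode one Unicode code point as
 * UTF-8 into out, which must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 for surrogates and for
 * values above U+10FFFF. */
size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling encode_utf8(0xA0, buf) writes 0xC2 0xA0 and returns 2. Once those bytes are in a text node, nothing records whether the source said &nbsp;, &#160; or &#xA0;, which is the information the request is about preserving.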

Comment 1 Daniel Veillard 2015-03-15 01:29:43 UTC

That sounds like a reasonable feature addition indeed. The patch looks mostly
good, but:
  1/ Where is the name parsing code in htmlParseRefName coming from?
     If we parse a name, we already have a routine for that, and I would
     prefer to reuse it rather than reimplement another segment.
  2/ The code after
     success = htmlParseRefName(ctxt, &name);
     is unclear. What are you attempting to do there?

  Thanks for the patch. I need to understand it a bit better and
possibly change those parts before adding it.

Daniel

Comment 2 Ben Schmidt 2015-03-17 03:47:55 UTC

Thanks for your prompt attention to this, Daniel.

It is quite a while since I authored the patch, but I believe this
is the situation:

1. htmlParseRefName: existing routines weren't really appropriate
   for this.
    a. For character references (e.g. &#x26; &#38;) the existing
       routine (htmlParseCharRef) returns the numeric value (e.g.
       38).
    b. For named references (e.g. &amp;) the existing routine
       (htmlParseEntityRef) again fully interprets the reference and
       returns the associated htmlEntityDescPtr.
   A routine was needed that parsed only syntactically (not
   semantically), merely returning the name as a string, so that:
    a. Character references aren't 'simplified' so, for example,
       &#x00000026; stays that way and doesn't turn into &#x26; (or
       vice versa, nor hex references turn into decimal, etc.)
    b. Named references could be represented even if the entity is
       technically undefined; indeed, a use case for this new
       functionality is finding entities in the DOM tree and
       validating them in client code.
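
As an illustration of what "syntactic only" parsing means here, the following is a rough sketch and not the code from the attached patch. The function parse_ref_name, its signature, and its exact acceptance rules (ASCII alphanumerics for names, decimal or hex digits after '#') are assumptions made for this example. It copies the reference name verbatim, so &#x00000026; is not simplified and undefined entity names are still returned.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch (not the patch's htmlParseRefName).
 * `in` points just past the '&'. The raw reference name is copied into
 * `name` (capacity `cap`, always NUL-terminated, silently truncated if
 * too long). *consumed receives the number of input bytes consumed,
 * including the ';' on success.
 * Returns 1 if a non-empty name is followed by ';', otherwise 0. On
 * failure, `name` still holds the text consumed before the failure. */
int parse_ref_name(const char *in, char *name, size_t cap, size_t *consumed)
{
    size_t i = 0;    /* bytes read from in */
    size_t n = 0;    /* bytes written to name */
    size_t body = 0; /* digits or name characters seen */
    int numeric, hex = 0;

    assert(in != NULL && name != NULL && consumed != NULL && cap > 0);

    numeric = (in[i] == '#');
    if (numeric) {
        if (n + 1 < cap) name[n++] = in[i];
        i++;
        hex = (in[i] == 'x' || in[i] == 'X');
        if (hex) {
            if (n + 1 < cap) name[n++] = in[i];
            i++;
        }
    }
    for (;;) {
        unsigned char c = (unsigned char)in[i];
        int ok = numeric ? (hex ? isxdigit(c) : isdigit(c)) : isalnum(c);
        if (!ok)
            break;
        if (n + 1 < cap) name[n++] = in[i];
        i++;
        body++;
    }
    name[n] = '\0';

    if (body > 0 && in[i] == ';') {
        *consumed = i + 1; /* include the terminating ';' */
        return 1;
    }
    *consumed = i;
    return 0;
}
```

Given the input "#x00000026;" the sketch returns 1 with name "#x00000026", leaving the leading zeros and the hex form untouched. Given "foo&bar;" it returns 0 with name "foo" and 3 bytes consumed, so the caller resumes at "&bar;".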

2. The code following the line success = htmlParseRefName(ctxt, &name);
   is modelled on the code directly below it (the existing entity
   parsing code). If htmlParseRefName fails, the text removed from
   the input stream prior to the failure is returned in 'name'. The
   existing htmlParseEntityRef function behaves the same way: if we
   encounter &foo&bar; it consumes &foo and then fails, so 'name'
   contains 'foo' and the parser continues from &bar;. In that case
   the '&' as well as 'name' need to be added to the DOM as text
   (hence the call to the SAX method 'characters'). On the other
   hand, if htmlParseRefName succeeds, the entity reference is added
   to the DOM (via the SAX method 'reference'). Happy to add some
   code comments to clarify this if desired, but not sure if it's
   warranted, as the existing code is similarly sparse.
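
The success and failure paths described in point 2 can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: struct sax_sketch is a simplified stand-in for the SAX handler (plain char instead of xmlChar), and emit_ref_result, struct event_log and the two log_* callbacks are hypothetical names used only to make the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the SAX 'characters' and 'reference' callbacks. */
struct sax_sketch {
    void (*characters)(void *ctx, const char *ch, int len);
    void (*reference)(void *ctx, const char *name);
};

/* Hypothetical dispatch: `success` and `name` are the results of the
 * reference-name parse. On success the reference itself is reported;
 * on failure the '&' and the consumed text are reported as plain text. */
void emit_ref_result(const struct sax_sketch *sax, void *ctx,
                     int success, const char *name)
{
    assert(sax != NULL && name != NULL);

    if (success) {
        if (sax->reference != NULL)
            sax->reference(ctx, name);
    } else if (sax->characters != NULL) {
        sax->characters(ctx, "&", 1);
        sax->characters(ctx, name, (int)strlen(name));
    }
}

/* Recording callbacks so the event sequence can be inspected. */
struct event_log {
    char buf[128];
};

void log_characters(void *ctx, const char *ch, int len)
{
    struct event_log *log = ctx;
    size_t used = strlen(log->buf);
    snprintf(log->buf + used, sizeof log->buf - used, "T(%.*s)", len, ch);
}

void log_reference(void *ctx, const char *name)
{
    struct event_log *log = ctx;
    size_t used = strlen(log->buf);
    snprintf(log->buf + used, sizeof log->buf - used, "R(%s)", name);
}
```

With the recording callbacks installed, a failed parse that consumed 'foo' logs T(&)T(foo), and a successful parse of 'amp' logs R(amp). In the &foo&bar; case, the parser would then continue from &bar; and go through the same dispatch again.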

Let me know if you'd like any further information or action from me.

Best regards,

Ben

Comment 3 Ben Schmidt 2017-06-08 09:14:02 UTC

Would there be any chance of getting this patch included? Is there any way I can help?

Comment 4 GNOME Infrastructure Team 2021-07-05 13:27:07 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a long time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.