Bug 611655 – missing html attribute value inconsistently represented in SAX callback



Summary:	missing html attribute value inconsistently represented in SAX callback


Status:	RESOLVED OBSOLETE

Product:	libxml2
Classification:	Platform
Component:	htmlparser
Version:	git master
Hardware:	Other Mac OS

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-03-02 23:57 UTC by Joshua Marantz
Modified:	2021-07-05 13:25 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Testcase to reproduce with unspecified attribute values (1.11 KB, application/x-gzip) 2010-03-02 23:58 UTC, Joshua Marantz	Details

Description Joshua Marantz 2010-03-02 23:57:40 UTC

For some reason Bugzilla is not letting me set the version, but it's 2.7.6.

This bug shows up when parsing HTML in SAX mode, in the presence of
an element containing two attributes that lack values.  The first
attribute value comes out as NULL, which seems fine.  The second
attribute value comes out as a copy of the attribute name, which
seems broken and hard to work around.

A test case is attached; run 'make test' to reproduce.

The test case is this HTML file:

<html> 
  <head> 
    <title>Test case for selected bug</title> 
  </head> 
  <body> 
    <select> 
      <option value="&cat=244">Other option</option> 
      <option value selected style="color: #ccc;">Default option</option> 
    </select> 
  </body> 
</html> 


SAX callback should print:

element html
element head
element title
element body
element select
element option
  value=&cat=244
element option
  value=(null)
  selected=(null)
  style=color: #ccc;

but it prints

element html
element head
element title
element body
element select
element option
  value=&cat=244
element option
  value=(null)
  selected=selected
  style=color: #ccc;


The difference is in the 'selected' attribute for the last option.  It should be "(null)", but is "selected".
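For readers without the attachment, here is a minimal sketch of how a SAX-style callback could produce the attribute lines shown above. It is not the attached test case: format_attrs is a hypothetical helper, and plain char strings stand in for libxml2's xmlChar. It assumes the attribute array layout used by libxml2's startElement callback, that is, name/value pairs terminated by a NULL name, where any value may itself be NULL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): formats a SAX-style attribute
 * array into `out`, one "  name=value" line per attribute.
 * `atts` holds name/value pairs and ends with a NULL name. A NULL value
 * means the attribute had no value in the source and prints as "(null)". */
static void format_attrs(const char **atts, char *out, size_t cap)
{
    size_t used = 0;

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    if (atts == NULL)
        return; /* element without attributes */

    /* Step by two: the terminator is a NULL *name*, never a NULL value. */
    for (size_t i = 0; atts[i] != NULL; i += 2) {
        const char *value = atts[i + 1];
        int n = snprintf(out + used, cap - used, "  %s=%s\n",
                         atts[i], value ? value : "(null)");
        if (n < 0 || (size_t)n >= cap - used)
            break; /* output truncated; stop here */
        used += (size_t)n;
    }
}
```

The loop tests only the name slot for the terminator, because a NULL in a value slot is legitimate data rather than the end of the array.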

Comment 1 Joshua Marantz 2010-03-02 23:58:54 UTC

Created attachment 155089
Testcase to reproduce with unspecified attribute values

Comment 2 Joshua Marantz 2010-03-03 00:11:40 UTC

I dug a little deeper in the debugger.  It appears that htmlParseAttribute classifies "selected" as a "boolean" attribute via the function htmlIsBooleanAttr, and therefore strdups the attr name to return as the value.

This makes it hard to reproduce the original HTML from the SAX parser.  I could not tell whether the original source had   <tag xxx="xxx"> or <tag xxx>

Why is this done?  Would it be acceptable to add an option to turn it off?

Comment 3 Daniel Veillard 2010-03-03 08:02:25 UTC

"This makes it hard to reproduce the original HTML from the SAX parser."

Honestly, I think that goal is an impossible one. Even with XML it's
hard, but with HTML's complete fuzziness on parsing I think you can forget
about the idea; we just can't implement that *and* parse "real HTML",
i.e. the random crap that people generate as web pages. These are completely
conflicting goals, one requiring no interpretation and the other
requiring guessing every time there is a non-conformance. Since parsing
the actual web is what makes the HTML parser useful, it's obvious that
the second one is the one which matters.

Daniel

Comment 4 Joshua Marantz 2010-03-03 12:31:37 UTC

OK I will back off from that goal.

Before I go on I will note that your system parses this construct just
fine.  It then destroys information when it special-cases that boolean
attribute and mutates the attr value.   You could instead expose the
predicate for boolean values and let the caller decide whether to make
that transformation.

My real goal is to parse HTML on arbitrary web sites, mutate it, and write
out HTML that will work in the same browsers as the original HTML.

This is not an easy goal but it is certainly not impossible.

For example, unbalanced tags are corrected by libxml2 and that is consistent
with the goal of preserving functionality.

But I still don't understand why you are strdup-ing the attr names for
booleans, or whether browsers or JavaScript DOM inspection will see
    selected
and
    selected="selected"
as equivalent.
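To make the concern concrete, here is a minimal sketch of an attribute serializer that can write out the minimized form only if the parser reports a missing value as NULL. The function write_attr is hypothetical and not part of libxml2, and it skips the escaping of quotes and ampersands that a real serializer would need.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical serializer for one attribute (no escaping, for brevity).
 * A NULL value is taken to mean "the source had no value", so the
 * minimized form is written; otherwise name="value" is written.
 * Returns the snprintf result: the length the output needs. */
static int write_attr(char *out, size_t cap, const char *name,
                      const char *value)
{
    assert(out != NULL && name != NULL);
    if (value == NULL)
        return snprintf(out, cap, " %s", name);
    return snprintf(out, cap, " %s=\"%s\"", name, value);
}
```

If the parser has already replaced NULL with a copy of the name, this function can only ever produce selected="selected", whatever the original markup was.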

Comment 5 Daniel Veillard 2010-03-03 13:26:08 UTC

--------------------
http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.4.2

Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").
--------------------

In a sense SAX expects a value; passing NULL is a bit of an exception.
The behaviour of libxml2 on boolean attributes has been that way for
a very long time; it's the first time someone has questioned that behaviour.
It's also extremely useful to be able to use XPath and the whole set of
XML tools on HTML parsed trees from libxml2, and this model expects
attributes with a value, the only one possible being the attribute name
itself. I think for non-boolean attributes when building a tree an empty
attribute is provided, but that's wrong from the HTML4 spec point of view,
which is why we complete this at the parser level.
I think overall that behaviour makes sense from a libxml2 toolbox
standpoint, even if it's not optimal for your goal.

Daniel

Comment 6 Joshua Marantz 2010-03-03 13:40:23 UTC

From the w3.org doc you referenced:

  Authors should be aware that many user agents only recognize the
  minimized form of boolean attributes and not the full form.


Making this transformation may cause me to break web pages for "many user agents".  Why does the *parser* need to make this transformation?


You are already providing NULL values for non-boolean attributes, so I don't understand what it means to be a "bit of an exception."

I don't have a good handle on the libxml2 toolbox.  But I can say for sure that if the parser generates NULL for unspecified attribute values, it's easy to have a second pass that adds a value for boolean attributes.  But it's impossible to go the other direction.

Would it be acceptable to add a new option to libxml2 to avoid making this transformation?  I'd be happy to do it and submit a patch if you thought it could be approved.
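The second pass described in this comment could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names is_boolean_attr and expand_boolean_attrs are hypothetical, the table is only a subset of the HTML 4 boolean attributes rather than libxml2's own list, and attribute names are assumed to be lowercase already.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of HTML 4 boolean attributes (not libxml2's table). */
static const char *const kBooleanAttrs[] = {
    "checked", "disabled", "multiple", "readonly", "selected"
};

/* Returns 1 if `name` is in the illustrative table, 0 otherwise.
 * Assumes the name is already lowercase. */
static int is_boolean_attr(const char *name)
{
    size_t count = sizeof kBooleanAttrs / sizeof kBooleanAttrs[0];
    for (size_t i = 0; i < count; i++)
        if (strcmp(name, kBooleanAttrs[i]) == 0)
            return 1;
    return 0;
}

/* Second pass over a SAX-style array of name/value pairs (NULL-name
 * terminated): a boolean attribute with a NULL value gets its own name
 * as the value. The value slot aliases the name, so nothing is allocated. */
static void expand_boolean_attrs(const char **atts)
{
    if (atts == NULL)
        return;
    for (size_t i = 0; atts[i] != NULL; i += 2)
        if (atts[i + 1] == NULL && is_boolean_attr(atts[i]))
            atts[i + 1] = atts[i];
}
```

The reverse pass cannot be written: once the value equals the name, nothing in the array records whether the source said selected or selected="selected".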

Comment 7 Daniel Veillard 2010-03-03 13:58:34 UTC

Don't make a patch, because one more option is not a proper way to handle
this, but please raise the issue on the mailing list. We can probably
defer that transformation to the SAX2.c module building the tree, but
this needs to be raised publicly on-list, and it's always crappy to delegate
HTML knowledge to the tree-building routines, so I would not do that without
allowing people to object.

Daniel

Comment 8 Joshua Marantz 2010-03-03 14:32:01 UTC

OK I sent the mail to the mailing list.  Your proposal seems like an acceptable solution.

Comment 9 Joshua Marantz 2010-03-10 13:04:58 UTC

Hi Daniel,

I sent the query to the mailing list a week ago.  How long is a reasonable period of time to wait after the query before we decide to go ahead and make the change?


This was the email I sent:

Joshua Marantz to xml, Mar 3 (7 days ago)
In https://bugzilla.gnome.org/show_bug.cgi?id=611655 I reported what looked like a bug where a tag like:

  <option selected>

would be transformed, in the parser, to

  <option selected="selected">

This is consistent with http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.4.2 but that spec also warns:

   Authors should be aware that many user agents only recognize the minimized form of boolean attributes and not the full form.

By making this transformation in the parser, it is not possible to use libxml2 to process HTML without potentially breaking behavior in some browsers.


Currently this transformation is implemented in HTMLparser.c in the static function htmlParseAttribute, based on htmlIsBooleanAttr(name).  Daniel Veillard explains that some downstream tools expect this transformation to be done. However, I would like to propose that this transformation be moved out of the parser and done in a later phase.  Daniel suggested "the SAX2.c module building the tree". This would be OK for my purposes, as I am using my own SAX bindings and not relying on the tree-building code.

So I'm proposing this change to see if there are objections.

Thanks!

Comment 10 GNOME Infrastructure Team 2021-07-05 13:25:38 UTC

GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.