GNOME Bugzilla – Bug 611655
missing html attribute value inconsistently represented in SAX callback
Last modified: 2021-07-05 13:25:38 UTC
For some reason Bugzilla is not letting me set the version, but it's 2.7.6.

This bug shows up when parsing HTML in SAX mode, in the presence of an element containing two attributes that lack values. The first attribute value comes out as NULL, which seems fine. The second attribute value comes out as a copy of the attribute name, which seems broken and hard to work around.

A testcase is attached. Type 'make test'. The testcase is this HTML file:

    <html>
    <head>
    <title>Test case for selected bug</title>
    </head>
    <body>
    <select>
    <option value="&cat=244">Other option</option>
    <option value selected style="color: #ccc;">Default option</option>
    </select>
    </body>
    </html>

The SAX callback should print:

    element html
    element head
    element title
    element body
    element select
    element option value=&cat=244
    element option value=(null) selected=(null) style=color: #ccc;

but it prints:

    element html
    element head
    element title
    element body
    element select
    element option value=&cat=244
    element option value=(null) selected=selected style=color: #ccc;

The difference is in the 'selected' attribute for the last option. It should be "(null)", but is "selected".
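To make the expected output above concrete, here is a minimal sketch of how a start-element callback could format its arguments. It assumes the SAX convention of a NULL-terminated array of name/value pairs in which a value may itself be NULL. The helpers `append` and `format_element` are hypothetical names for illustration; they are not part of libxml2 and are not the code in the attached testcase.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append `s` to `out` (capacity `cap`, current length `*used`).
 * Returns -1 if the result would not fit, 0 otherwise. */
static int append(char *out, size_t cap, size_t *used, const char *s)
{
    size_t len = strlen(s);
    if (len >= cap - *used)
        return -1;
    memcpy(out + *used, s, len + 1);   /* copies the terminating NUL too */
    *used += len;
    return 0;
}

/* Hypothetical helper: writes "element NAME attr=value ..." into `out`.
 * `atts` is a NULL-terminated list of name/value pairs; a NULL value is
 * rendered as "(null)", matching the output quoted in this report. */
static int format_element(char *out, size_t cap,
                          const char *name, const char **atts)
{
    size_t used = 0;

    assert(out != NULL && name != NULL && cap > 0);
    out[0] = '\0';
    if (append(out, cap, &used, "element ") || append(out, cap, &used, name))
        return -1;

    /* Step by two: the list ends at a NULL *name*, not at a NULL value. */
    for (size_t i = 0; atts != NULL && atts[i] != NULL; i += 2) {
        const char *value = atts[i + 1] != NULL ? atts[i + 1] : "(null)";
        if (append(out, cap, &used, " ") || append(out, cap, &used, atts[i]) ||
            append(out, cap, &used, "=") || append(out, cap, &used, value))
            return -1;
    }
    return 0;
}
```

With this formatting, the two outputs in the report differ only in whether the callback receives NULL or the string "selected" as the value paired with the name "selected".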
Created attachment 155089 [details] Testcase to reproduce with unspecified attribute values
I dug a little deeper in the debugger. It appears that htmlParseAttribute classifies "selected" as a "boolean" attribute via the function htmlIsBooleanAttr, and therefore strdups the attr name to return as the value. This makes it hard to reproduce the original HTML from the SAX parser: I cannot tell whether the original source had <tag xxx="xxx"> or <tag xxx>. Why is this done? Would it be acceptable to add an option to turn it off?
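The information loss can be shown with a small standalone sketch. This is not libxml2's code: `resolve_attr_value`, `is_boolean_attr`, `dup_string`, and the short attribute list are hypothetical stand-ins that only mimic the behaviour described in this comment, and the sketch assumes attribute names have already been lowercased.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative subset only; the real parser keeps its own list. */
static const char *const boolean_attrs[] = {
    "checked", "disabled", "multiple", "readonly", "selected"
};

static int is_boolean_attr(const char *name)
{
    size_t count = sizeof boolean_attrs / sizeof boolean_attrs[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(name, boolean_attrs[i]) == 0)
            return 1;
    }
    return 0;
}

/* Portable replacement for strdup(); returns NULL on allocation failure. */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Stand-in for the parser step: `raw_value` is NULL when the source had
 * no "=value" part. The caller owns (and must free) the returned string. */
static char *resolve_attr_value(const char *name, const char *raw_value)
{
    assert(name != NULL);
    if (raw_value != NULL)
        return dup_string(raw_value);   /* value was written out in the source */
    if (is_boolean_attr(name))
        return dup_string(name);        /* minimized boolean: name becomes value */
    return NULL;                        /* non-boolean attribute without a value */
}
```

Under these assumptions, `resolve_attr_value("selected", NULL)` and `resolve_attr_value("selected", "selected")` return equal strings, so a SAX client downstream has no way to tell the two source forms apart.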
"This makes it hard to reproduce the original HTML from the SAX parser." Honnestly I think that goal is an impossible one. Even with XML it's hard, but with HTML complete fuziness on parsing I think you can forgot about the idea, we just can't implement that *and* parse "real HTML" i.e. the random crap that people generate as web pages. Completely conflicting goal, one requiring no interpretation and the second one requiring guessing everytime there is a non-conformance. Since parsing the actual web is what makes the HTML parser useful, it's obvious that the second one is the one which matters. Daniel
OK, I will back off from that goal. Before I go on I will note that your system parses this construct just fine. It then destroys information when it special-cases that boolean attribute and mutates the attr value. You could instead expose the predicate for boolean values and let the caller decide whether to make that transformation. My real goal is to parse HTML on arbitrary web sites, mutate it, and write out HTML that will work in the same browsers as the original HTML. This is not an easy goal but it is certainly not impossible. For example, unbalanced tags are corrected by libhtml and that is consistent with the goal of preserving functionality. But I still don't understand why you are strdup-ing the attr names for booleans, or whether browsers or JavaScript DOM inspection will see selected and selected="selected" as equivalent.
--------------------
http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.4.2
Boolean attributes may legally take a single value: the name of the attribute itself (e.g., selected="selected").
--------------------
In a sense SAX expects a value; passing NULL is a bit of an exception. The behaviour of libxml2 for boolean attributes has been that way for a very long time, and this is the first time someone has questioned it. It's also extremely useful to be able to use XPath and the whole set of XML tools on HTML trees parsed by libxml2, and this model expects attributes to have a value, the only possible one being the attribute name itself. I think for non-boolean attributes an empty attribute is provided when building a tree, but that's wrong from the HTML4 spec point of view, which is why we complete this at the parser level. I think overall that behaviour makes sense from a libxml2 toolbox standpoint, even if it's not optimal for your goal. Daniel
From the w3.org doc you referenced: Authors should be aware that many user agents only recognize the minimized form of boolean attributes and not the full form. Making this transformation may cause me to break web pages for "many user agents". Why does the *parser* need to make this transformation? You are already providing NULL values for non-boolean attributes, so I don't understand what it means to be a "bit of an exception." I don't have a good handle on the libxml2 toolbox. But I can say for sure that if the parser generates NULL for unspecified attribute values, it's easy to have a second pass that adds a value for boolean attributes. But it's impossible to go the other direction. Would it be acceptable to add a new option to libxml2 to avoid making this transformation? I'd be happy to do it and submit a patch if you thought it could be approved.
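The "second pass" mentioned here could look roughly like the sketch below. It assumes the parser delivers NULL for every attribute without a value, and that the caller supplies the boolean-attribute predicate. The names `fill_boolean_values` and `sample_is_boolean` are hypothetical and not part of libxml2, and the attribute list is an illustrative subset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical predicate; a real client would use whatever the library
 * exposes, or its own list. Assumes lowercased attribute names. */
static int sample_is_boolean(const char *name)
{
    static const char *const names[] = {
        "checked", "disabled", "multiple", "readonly", "selected"
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (strcmp(name, names[i]) == 0)
            return 1;
    }
    return 0;
}

/* Post-processing pass over a NULL-terminated name/value list: every
 * boolean attribute whose value is NULL gets its own name as its value.
 * Non-boolean attributes keep their NULL. Returns the number of values
 * filled in. The array is modified in place; no strings are copied. */
static size_t fill_boolean_values(const char **atts,
                                  int (*is_boolean)(const char *))
{
    size_t filled = 0;

    assert(atts != NULL && is_boolean != NULL);
    for (size_t i = 0; atts[i] != NULL; i += 2) {
        if (atts[i + 1] == NULL && is_boolean(atts[i])) {
            atts[i + 1] = atts[i];
            filled++;
        }
    }
    return filled;
}
```

A client that wants the expanded form calls this from its own start-element handler, and a client that wants to preserve the source form simply skips it. The inverse pass cannot be written, because after expansion an attribute that was minimized in the source looks identical to one written out in full.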
Don't make a patch, because one more option is not the proper way to handle this, but please raise the issue on the mailing list. We can probably defer that transformation to the SAX2.c module that builds the tree, but this needs to be raised publicly on-list, and it's always crappy to delegate HTML knowledge to the tree-building routines, so I would not do that without allowing people to object. Daniel
OK I sent the mail to the mailing list. Your proposal seems like an acceptable solution.
Hi Daniel, I sent the query to the mailing list a week ago. How long is a reasonable period of time to wait after the query before we decide to go ahead and make the change? This was the email I sent:

Joshua Marantz to xml, Mar 3 (7 days ago)

In https://bugzilla.gnome.org/show_bug.cgi?id=611655 I reported what looked like a bug where a tag like:
    <option selected>
would be transformed, in the parser, to
    <option selected="selected">
This is consistent with http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.3.4.2 but that spec also warns:
    Authors should be aware that many user agents only recognize the minimized form of boolean attributes and not the full form.
By making this transformation in the parser, it is not possible to use libxml2 to process HTML without potentially breaking behavior in some browsers.

Currently this transformation is implemented in HTMLparser.c in the static function htmlParseAttribute, based on htmlIsBooleanAttr(name). Daniel Veillard explains that some downstream tools expect this transformation to be done. However, I would like to propose that this transformation be moved out of the parser and done in a later phase. Daniel suggested "the SAX2.c module building the tree". This would be OK for my purposes, as I am using my own SAX bindings and not relying on the tree-building code.

So I'm proposing this change to see if there are objections. Thanks!
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.