Bug 761534 - HTML5 tags nav, section, article etc bring warnings: Tag invalid in entity
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: htmlparser
Version: git master
Hardware: Other Linux
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2016-02-04 07:45 UTC by Christian Weiske
Modified: 2021-07-05 13:23 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Christian Weiske 2016-02-04 07:45:25 UTC
HTML5[1] brings several new tags[2] in addition to the HTML4 ones.

Among them are article, aside, audio, canvas, figure, header, footer, meter, nav, section, time and video.

When using libxml2 to load a file with such tags, I get the following warning:

> Tag nav invalid in Entity, line: 40

Apart from that, the content of these tags is not included in the DOM element's textContent property (in PHP; I have no idea how this behaves in C).

Please add support for the additional tags to libxml2's HTML parser.


[1] https://www.w3.org/TR/html5/ 
[2] http://www.tutorialspoint.com/html5/html5_new_tags.htm
Comment 1 onemanbanddan 2018-11-05 10:08:01 UTC
Nearly the end of 2018 and this hasn't been sorted.
There is a bug posted for PHP DOMDocument which is getting old but the problem is here in the libxml2 library.

A quick look at the code shows it comes down to the html40ElementTable array which begins at line 770 in HTMLParser.c

There are 6 references to the array from within that file.

I would propose updating the array to include HTML5 tags and renaming it to something like html45ElementTable (we don't want to break anything).
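
To make the proposal concrete, here is a minimal sketch in C of a table-driven tag lookup. It is not libxml2's code: the two tiny tables, every `sketch_` name and the `allow_html5` switch are assumptions made up for illustration. Only the tag names and the wording of the warning come from this report.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily reduced stand-ins for the real element table. */
static const char *const sketch_html4_tags[] = {
    "a", "body", "div", "head", "html", "p", "span", "title"
};
/* The HTML5 additions listed in the description of this bug. */
static const char *const sketch_html5_tags[] = {
    "article", "aside", "audio", "canvas", "figure", "footer",
    "header", "meter", "nav", "section", "time", "video"
};

#define SKETCH_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* HTML tag names are case-insensitive, so compare them that way. */
static int sketch_same_tag(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

static int sketch_in_table(const char *const *table, size_t n, const char *tag)
{
    for (size_t i = 0; i < n; i++)
        if (sketch_same_tag(table[i], tag))
            return 1;
    return 0;
}

/* allow_html5 == 0 models an HTML4-only table; 1 models the merged table. */
int sketch_tag_is_known(const char *tag, int allow_html5)
{
    if (sketch_in_table(sketch_html4_tags, SKETCH_COUNT(sketch_html4_tags), tag))
        return 1;
    return allow_html5 &&
           sketch_in_table(sketch_html5_tags, SKETCH_COUNT(sketch_html5_tags), tag);
}

/* Returns 1 for a known tag; otherwise writes a message modelled on the
 * warning quoted in the report into msg and returns 0. */
int sketch_check_tag(const char *tag, int allow_html5, int line,
                     char *msg, size_t msg_size)
{
    if (sketch_tag_is_known(tag, allow_html5)) {
        if (msg_size > 0)
            msg[0] = '\0';
        return 1;
    }
    snprintf(msg, msg_size, "Tag %s invalid in Entity, line: %d", tag, line);
    return 0;
}
```

With `allow_html5` set to 0, `sketch_check_tag("nav", 0, 40, ...)` produces the message from the description; with it set to 1, the same tag passes silently. Whether the real parser should merge the tables or keep them separate is exactly the open question here.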

There are fields in the array entries which I don't understand, e.g. 7 integers which all seem to be 0, 1 or 2.
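
As far as I can tell, those seven integers are the flag fields of the `htmlElemDesc` struct declared in libxml2's public header `libxml/HTMLparser.h` (startTag, endTag, saveEndTag, empty, depr, dtd, isinline). The sketch below is a reduced mirror of that layout, written from memory: the `sketch_` names are hypothetical, the real struct has further fields (sub-elements, attribute lists), and the two entries use illustrative values rather than anything copied from the table. Please check it against the header of the libxml2 version you are building.

```c
#include <stddef.h>
#include <stdio.h>

/* Reduced, hypothetical mirror of an element-table entry. */
struct sketch_elem_desc {
    const char *name;   /* tag name */
    char start_tag;     /* whether the start tag can be implied */
    char end_tag;       /* whether the end tag can be implied */
    char save_end_tag;  /* whether the end tag should be saved */
    char empty;         /* is this an empty element? */
    char depr;          /* is this a deprecated element? */
    char dtd;           /* 1: only in the Loose DTD, 2: only in the Frameset one */
    char is_inline;     /* 0: block element, 1: inline element */
    const char *desc;   /* human-readable description */
};

/* Illustrative entries only; the flag values are guesses, not libxml2 data. */
const struct sketch_elem_desc sketch_nav = {
    "nav", 0, 0, 0, 0, 0, 0, 0, "navigation section"
};
const struct sketch_elem_desc sketch_wbr = {
    "wbr", 0, 2, 2, 1, 0, 0, 1, "line break opportunity"
};

/* Formats a short summary of an entry; returns what snprintf returns. */
int sketch_summarize(const struct sketch_elem_desc *e, char *out, size_t n)
{
    return snprintf(out, n, "%s: %s, %s%s",
                    e->name,
                    e->is_inline ? "inline" : "block",
                    e->empty ? "empty" : "has content",
                    e->depr ? ", deprecated" : "");
}
```

Read this way, adding an HTML5 tag means choosing seven flag values per element, plus the sub-element and attribute lists that this sketch leaves out.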

If someone can point me to documentation which explains the structure of the array, I can update, test and do a pull request.

I'm not really a C programmer, but seriously, this seems pretty simple and HTML5 has been around for a while now. I reckon libxml2 should be able to parse HTML5 without throwing the toys out of the pram.

Hope someone replies, Thanks, dan.
Comment 2 hxtree 2020-01-31 14:08:45 UTC
HTML5 has been the standard for years. This is negatively impacting PHP. It is a huge issue for keeping PHP relevant and modern. It seems like such an achievable fix: a few entities, a different doc tag.
Comment 3 philippe-leon 2020-06-04 15:26:34 UTC
Hi,
As a reply to “Comment 1” from Dan(iel?): I don't think what he proposes is a good solution. I think the best way to deal with it is, at the least, to use DTD detection to apply the appropriate set of HTML elements.
Something like html45ElementTable would be a mix of different sets of HTML elements, with some elements that did not exist in HTML 4 and some that are obsolete in the other cases. How could we rely on the errors reported if the table mixes elements that are valid in one context with elements that are not valid in that context but are valid in another?
Moreover, there is not just HTML4 and HTML5: leaving aside the .x HTML versions, there is XHTML 1, and HTML5 can be SGML or XML. That is why DTD detection is needed (and with HTML5 it would certainly not be enough to distinguish between SGML and XML, so that might have to be detected another way).
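
A rough sketch of what such detection could look like, in C. Everything here is an assumption for illustration: the `sketch_` names, the enum, and the substring matching on the public identifier are not taken from libxml2, and a real implementation would have to parse the declaration properly. As noted above, the doctype alone cannot tell the XML serialization of HTML5 apart from the HTML one.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum sketch_html_flavor {
    SKETCH_FLAVOR_UNKNOWN,
    SKETCH_FLAVOR_HTML4,
    SKETCH_FLAVOR_XHTML1,
    SKETCH_FLAVOR_HTML5
};

/* Case-insensitive substring search; returns NULL when needle is absent. */
static const char *sketch_find_ci(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    for (; *hay; hay++) {
        size_t i = 0;
        while (i < n && hay[i] &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return hay;
    }
    return NULL;
}

/* Guesses the HTML flavor from the first doctype declaration in doc. */
enum sketch_html_flavor sketch_detect_flavor(const char *doc)
{
    const char *decl = sketch_find_ci(doc, "<!doctype");
    if (decl == NULL)
        return SKETCH_FLAVOR_UNKNOWN;

    const char *end = strchr(decl, '>');
    if (end == NULL)
        return SKETCH_FLAVOR_UNKNOWN;

    /* Copy the declaration into a bounded buffer so that the searches
     * below cannot match text in the document body. */
    char buf[256];
    size_t len = (size_t)(end - decl);
    if (len >= sizeof buf)
        len = sizeof buf - 1;
    memcpy(buf, decl, len);
    buf[len] = '\0';

    if (sketch_find_ci(buf, "//DTD XHTML 1."))
        return SKETCH_FLAVOR_XHTML1;
    if (sketch_find_ci(buf, "//DTD HTML 4."))
        return SKETCH_FLAVOR_HTML4;
    /* HTML5 uses a doctype without a public identifier. */
    if (sketch_find_ci(buf, "public") == NULL &&
        sketch_find_ci(buf, "<!doctype html"))
        return SKETCH_FLAVOR_HTML5;
    return SKETCH_FLAVOR_UNKNOWN;
}
```

The returned flavor could then select which element table the parser consults, so that `nav` is accepted under an HTML5 doctype and still reported under HTML 4.01. What to do for documents with no doctype at all is left open.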
In case it is useful: in the PHP bug, someone mentions html5lib, but it is beyond my comprehension how it would help libxml2's HTML parser (maybe it would not; maybe it is only intended to be used instead of libxml2's HTML parser. I don't really understand what I'm talking about in this sentence).
Bye.
Comment 4 GNOME Infrastructure Team 2021-07-05 13:23:05 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.