GNOME Bugzilla – Bug 761534
HTML5 tags nav, section, article etc bring warnings: Tag invalid in entity
Last modified: 2021-07-05 13:23:05 UTC
HTML5 [1] brings several new tags [2] in addition to the HTML4 ones. Among them are article, aside, audio, canvas, figure, header, footer, meter, nav, section, time and video.

When using libxml2 to load a file with such tags, I get the following warning:

> Tag nav invalid in Entity, line: 40

Apart from that, the content of the tags is not included in the DOM element textContent property (in PHP, no idea how this is in C).

Please add support for the additional tags to libxml2's HTML parser.

[1] https://www.w3.org/TR/html5/
[2] http://www.tutorialspoint.com/html5/html5_new_tags.htm
Nearly the end of 2018 and this hasn't been sorted. There is a bug posted for PHP DOMDocument which is getting old, but the problem is here in the libxml2 library.

A quick look at the code shows it comes down to the html40ElementTable array, which begins at line 770 in HTMLparser.c. There are 6 references to the array from within that file. I would propose updating the array to include HTML5 tags, and renaming it to something like html45ElementTable (we don't want to break anything).

There are fields in the array entries which I don't understand, e.g. 7 integers which all seem to be 0, 1 or 2. If someone can point me to documentation which explains the structure of the array, I can update, test and do a pull request.

I'm not really a C programmer, but seriously, this seems pretty simple and HTML5 has been around for a while now. I reckon libxml2 should be able to parse HTML5 without throwing the toys out of the pram.

Hope someone replies,
Thanks,
dan.
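For readers wondering about the seven small integers mentioned in the comment above, here is a minimal, self-contained sketch of an element table and lookup. It is not libxml2 code. The field names follow my reading of the htmlElemDesc struct in libxml2's HTMLparser.h and should be checked against the header; the type SketchElemDesc, the function sketch_tag_lookup and all flag values in the table are hypothetical and purely illustrative.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified mirror of one element-table entry.
 * Assumption: the seven flags correspond to the fields below;
 * verify against htmlElemDesc in HTMLparser.h before relying on it. */
typedef struct {
    const char *name;   /* tag name */
    char startTag;      /* whether the start tag can be implied */
    char endTag;        /* whether the end tag can be implied */
    char saveEndTag;    /* whether the end tag should be saved */
    char empty;         /* 1 if the element never has content */
    char depr;          /* 1 if the element is deprecated */
    char dtd;           /* 0: any DTD, 1: loose DTD only, 2: frameset DTD only */
    char isinline;      /* 0: block element, 1: inline element */
    const char *desc;   /* human-readable description */
} SketchElemDesc;

/* Illustrative entries only; the flag values are not copied from libxml2. */
static const SketchElemDesc sketch_table[] = {
    { "br",      0, 2, 2, 1, 0, 0, 1, "forced line break" },
    { "div",     0, 0, 0, 0, 0, 0, 0, "generic container" },
    { "p",       0, 1, 0, 0, 0, 0, 0, "paragraph" },
    /* HTML5 additions of the kind this bug asks for */
    { "nav",     0, 0, 0, 0, 0, 0, 0, "navigation section" },
    { "section", 0, 0, 0, 0, 0, 0, 0, "generic section" },
    { "article", 0, 0, 0, 0, 0, 0, 0, "self-contained article" },
};

/* ASCII case-insensitive string equality; HTML tag names are case-insensitive. */
int ascii_ieq(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Linear search through the table. Returns NULL for an unknown tag, which is
 * the situation in which a parser would report "Tag ... invalid". */
const SketchElemDesc *sketch_tag_lookup(const char *tag)
{
    size_t i;
    for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
        if (ascii_ieq(tag, sketch_table[i].name))
            return &sketch_table[i];
    }
    return NULL;
}
```

With the three HTML5 rows present, sketch_tag_lookup("nav") returns an entry; if they are removed it returns NULL, which corresponds to the warning reported in this bug. A real patch would also have to fill in the allowed sub-elements and attributes that the actual table carries, which this sketch leaves out.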
HTML5 has been the standard for years. This is negatively impacting PHP. It is a huge issue for keeping PHP relevant and modern. It seems like such an achievable fix: a few entities, a different doc tag.
Hi,

As a reply to “Comment 1” from Dan(iel?): I don't think what he proposes is a good solution. I think the best way to deal with this is, at the least, to use DTD detection to decide which set of HTML elements to apply. Something like html45ElementTable would be a mix of different sets of HTML elements: it would contain elements that did not exist in HTML 4 alongside elements that are obsolete in the other versions. How could we rely on the reported errors if the table mixes elements that are valid in one context with elements that are invalid there but valid in another?

Moreover, there is not just HTML4 and HTML5. Even leaving aside the .x HTML versions, there is XHTML 1, and HTML5 can be SGML or XML. That is why DTD detection is needed (and with HTML5 it would certainly not be enough to distinguish between SGML and XML, so that might have to be detected another way).

In case it is useful: in the PHP bug, someone mentions html5lib, but I don't know enough to say how it would help libxml2's HTML parser (maybe it wouldn't; maybe it is only intended to be used instead of libxml2's HTML parser. I don't really understand what I'm talking about in this sentence).

Bye.
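The DTD detection suggested in this comment could be sketched roughly as follows. This is a hedged illustration, not libxml2 code: the enum ElemSet and the functions ascii_istrstr and pick_element_set are hypothetical names, and the matching rules (substring checks on the doctype declaration) are a simplifying assumption. As the comment notes, a doctype alone cannot tell the HTML and XML serializations of HTML5 apart, and this sketch does not attempt to.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for the element set a parser would apply. */
typedef enum {
    ELEMSET_UNKNOWN,
    ELEMSET_HTML4,
    ELEMSET_XHTML1,
    ELEMSET_HTML5
} ElemSet;

/* ASCII case-insensitive substring search; returns NULL if not found. */
const char *ascii_istrstr(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0)
        return hay;
    for (; *hay; hay++) {
        size_t i = 0;
        while (i < n && hay[i] != '\0' &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return hay;
    }
    return NULL;
}

/* Pick an element set from a doctype declaration string.
 * Assumed rules, for illustration only:
 *   - public identifier mentioning "XHTML 1"     -> XHTML 1
 *   - public identifier mentioning "DTD HTML 4"  -> HTML 4
 *   - "<!DOCTYPE html" with no PUBLIC identifier -> HTML5
 *   - anything else                              -> unknown */
ElemSet pick_element_set(const char *doctype)
{
    if (doctype == NULL || ascii_istrstr(doctype, "<!doctype") == NULL)
        return ELEMSET_UNKNOWN;
    if (ascii_istrstr(doctype, "XHTML 1") != NULL)
        return ELEMSET_XHTML1;
    if (ascii_istrstr(doctype, "DTD HTML 4") != NULL)
        return ELEMSET_HTML4;
    if (ascii_istrstr(doctype, "<!doctype html") != NULL &&
        ascii_istrstr(doctype, "public") == NULL)
        return ELEMSET_HTML5;
    return ELEMSET_UNKNOWN;
}
```

A caller would pass the declaration found at the start of the document, for example pick_element_set("<!DOCTYPE html>"), and then validate tags against the table that matches the returned set. What to do for ELEMSET_UNKNOWN (for instance, falling back to the most permissive set) is a policy decision this sketch leaves open.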
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.