GNOME Bugzilla – Bug 654146
HTML parser strips pseudo-namespaces (fb:like, g:plusone etc)
Last modified: 2021-07-05 13:20:59 UTC
When parsing HTML documents that contain XFBML (Facebook markup), the parser strips the namespace prefix that is required as part of that markup, so <fb:like> turns into <like>. (The same thing happens to <g:plusone></g:plusone>.) This makes processing pages that incorporate XFBML with lxml much harder. I'm not 100% positive, but it looks like it's impossible to preserve that information when parsing with HTMLParser. XMLParser is unfortunately not an option, because those tags are used in all kinds of real-world HTML documents. For that reason I hope you can add an ability to preserve those pseudo-namespaces (not even necessarily by default). Thank you.

See also: http://stackoverflow.com/questions/6597271/how-to-preserve-namespace-information-when-parsing-html-with-lxml

@ubuntu:~$ echo '<fb:like/>'|xmllint --html -
-:1: namespace warning : Namespace prefix fb is not defined
<fb:like/>
^
-:1: HTML parser error : Tag fb:like invalid
<fb:like/>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><like></like></body></html>

@ubuntu:~$ echo '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"><fb:like/></html>'|xmllint --html -
-:1: namespace warning : Namespace prefix fb is not defined
p://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"><fb:like
^
-:1: HTML parser error : Tag fb:like invalid
p://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"><fb:like
^
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml"><body><like></like></body></html>
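Until the parser can keep the prefix, one possible stopgap is to detect the prefixed tags with a different tokenizer before handing the page to lxml. The following is a minimal sketch only, assuming Python's standard-library html.parser is acceptable for a first pass; the class name PrefixedTagCollector and the function find_prefixed_tags are hypothetical names made up for this illustration, and nothing here changes how libxml2 or lxml behave.

```python
from html.parser import HTMLParser


class PrefixedTagCollector(HTMLParser):
    """Collect start-tag names that contain a colon, e.g. 'fb:like'."""

    def __init__(self):
        super().__init__()
        self.prefixed_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name as written (lowercased),
        # so a pseudo-namespace prefix is still visible here.
        if ":" in tag:
            self.prefixed_tags.append(tag)


def find_prefixed_tags(markup):
    """Return the prefixed tag names found in markup, in document order."""
    collector = PrefixedTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.prefixed_tags
```

This only reports which pseudo-namespaced tags a page uses; it does not build a tree, so it complements rather than replaces an lxml parse.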
I also came across this undesirable behaviour, in version 2.9.2. The attached patch fixes it. Since attributes starting with xmlns are not parsed in HTML (cf. SAX2.c:1699 and SAX2.c:1740), it makes sense to include the full name, not just the local part, as the element name in the DOM tree.
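To make the "full name" versus "local part" distinction concrete, here is a toy model in Python. It is not the libxml2 code and not the attached patch; split_qname, element_name and the keep_prefix flag are hypothetical names used purely for illustration.

```python
def split_qname(name):
    """Split 'prefix:local' into (prefix, local); prefix is None if absent."""
    prefix, sep, local = name.partition(":")
    if not sep or not prefix or not local:
        # No colon, or an empty side: treat the whole string as the local part.
        return None, name
    return prefix, local


def element_name(name, keep_prefix):
    """Name stored for an element: the full name if keep_prefix, else the local part."""
    if keep_prefix:
        return name
    return split_qname(name)[1]
```

With keep_prefix set to False, element_name("fb:like", False) yields "like", which matches the behaviour reported above; with it set to True the name stays "fb:like", which is the outcome the comment argues for.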
Created attachment 301506: Keep 'prefixes' in HTML
Would there be any chance of getting this patch included? Is there any way I can help?
*** Bug 711670 has been marked as a duplicate of this bug. ***
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.