GNOME Bugzilla – Bug 600410
Script node with embedded tag text
Last modified: 2009-11-02 15:45:40 UTC
I first reported this to the PHP bug system (http://bugs.php.net/bug.php?id=49984), but they say it is a libxml2 issue, so I am reporting it here.

I have read your forum and looked at the page http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

I am trying to parse pages downloaded from the Internet. Unfortunately, most pages do not abide by this specification. Could you make your engine more robust so that it works on pages that do not comply 100% with the official specification? It should be fairly easy, using a regex for instance.

Bug report:
The DOM node returns only part of the contents of the script node, as if the node were mistakenly truncated on reaching the '</div>' text.

Reproduce code:
---------------
<?php
$html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"><title>Title</title></head><body><div><script type="text/javascript" id="script1">function dummy { object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello"); } </script> </div> </body> </html>';
$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$script_node = $dom->getElementById('script1');
echo "<![CDATA[$script_node->nodeValue]]>";
?>

Expected result:
----------------
function dummy { object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello"); }

I expect to see the whole content of the script node.

Actual result:
--------------
function dummy { object.innerHTML="<div>text

The script node has been truncated.
The input is broken and doesn't follow HTML 4 syntax. You recognize it, fine. You may try adding the HTML_PARSE_RECOVER option when creating the parsing context; it is likely to work as you expect. I have no idea how to do this from PHP.

Daniel
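
A possible workaround on the PHP side, for anyone who hits the same truncation: the W3C note linked in the report recommends writing "</" as "<\/" inside script data, because an HTML 4 parser treats the first "</" as the end of the element's content. The sketch below applies that escape with a regex before the markup is handed to loadHTML(). The helper name escape_script_end_tags is hypothetical (it is not part of PHP or libxml2), and the regex is an assumption that only handles simple cases: it will misfire on a script whose own text contains a literal "</script>" or whose opening tag has ">" inside an attribute value.

```php
<?php
// Hypothetical helper, not part of PHP or libxml2: rewrite "</" as "<\/"
// inside <script> bodies so an HTML 4 parser does not end the element early.
function escape_script_end_tags(string $html): string
{
    $result = preg_replace_callback(
        // Group 1: opening tag, group 2: script body, group 3: closing tag.
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        function (array $m): string {
            // Inside a JavaScript string literal, "<\/" means the same as "</".
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = '<div><script type="text/javascript" id="script1">'
      . 'object.innerHTML="<div>text</div>";</script></div>';

echo escape_script_end_tags($html), "\n";
```

The escaped string can then be passed to $dom->loadHTML() in place of the raw page. Note that the script node's nodeValue will contain "<\/div>" rather than "</div>", so the text is equivalent as JavaScript but not byte-for-byte identical to the downloaded page.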