Bug 600410 – Script node with embedded tag text



Summary:	Script node with embedded tag text


Status:	RESOLVED NOTGNOME

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Windows

Importance:	Normal blocker
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-11-02 12:40 UTC by ppass
Modified:	2009-11-02 15:45 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description ppass 2009-11-02 12:40:47 UTC

I first reported this to the PHP bug system:

http://bugs.php.net/bug.php?id=49984
They say it is a libxml2 issue, so I am reporting it here.

I have read your forum and looked at this page:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

I am trying to parse pages downloaded from the Internet. Unfortunately, most pages do not abide by this specification. Could you make your engine more robust, so that it also works on pages that do not fully comply with the official specification? It should be fairly easy, for instance by using a regex.


Bug report:

The DOM node returns only part of the script node's contents, as if the
node had been mistakenly truncated on reaching the '</div>' text.

Reproduce code:
---------------
<?php

    $html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//EN"><html><head><meta http-equiv="content-type"
content="text/html;
charset=utf-8"><title>Title</title></head><body><div><script
type="text/javascript" id="script1">function dummy {
object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello");
} </script> </div> </body> </html>';
 
    $dom = new DOMDocument('1.0', 'UTF-8');
    @$dom->loadHTML($html);

    $script_node = $dom->getElementById('script1');
    echo "<![CDATA[{$script_node->nodeValue}]]>";

?>

Expected result:
----------------
function dummy { object.innerHTML="<div>text</div>"; } function dummy2 {
alert("hello"); } 

I expect to see the whole content of the script node.

Actual result:
--------------
function dummy { object.innerHTML="<div>text

The script node has been truncated.

Comment 1 Daniel Veillard 2009-11-02 15:45:40 UTC

The input is broken and doesn't follow HTML 4 syntax. You recognize that,
fine. You may try adding the HTML_PARSE_RECOVER option when creating the
parsing context; it is likely to work as you expect. I have no idea how to do
this from PHP.

Daniel
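
The HTML 4 note linked in the description says that the first "</" inside a script element ends the element's content, and recommends writing such sequences as "<\/" in the script. A caller who cannot fix the pages themselves could apply that escaping before handing the markup to the parser. The following is only a minimal sketch under stated assumptions: escape_script_end_tags() is a hypothetical helper (it is not part of PHP or libxml2), it assumes PHP 7 or later, and it assumes that script bodies end at the first literal "</script>", which a regex cannot guarantee on arbitrary pages.

```php
<?php

/**
 * Hypothetical helper (not part of PHP or libxml2): rewrite every "</"
 * found inside a <script> body as "<\/", so that an HTML 4 parser does
 * not treat it as the start of an end tag.
 */
function escape_script_end_tags(string $html): string
{
    $result = preg_replace_callback(
        // Group 1: opening tag, group 2: script body, group 3: closing tag.
        '#(<script\b[^>]*>)(.*?)(</script\s*>)#is',
        function (array $m): string {
            // Only the body is touched; both tags are copied unchanged.
            return $m[1] . str_replace('</', '<\\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex engine error;
    // fall back to the unmodified input in that case.
    return $result ?? $html;
}

// Shortened variant of the markup from the report.
$html = '<html><body><div><script type="text/javascript" id="script1">'
      . 'function dummy() { object.innerHTML="<div>text</div>"; }'
      . '</script></div></body></html>';

echo escape_script_end_tags($html), "\n";
```

The escaped markup can then be passed to DOMDocument::loadHTML() as in the reproduce code. Note that nodeValue would then contain the "<\/" form rather than the original text; inside a JavaScript string literal the two are equivalent.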