Bug 600410 - Script node with embedded tag text
Status: RESOLVED NOTGNOME
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other Windows
Importance: Normal blocker
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2009-11-02 12:40 UTC by ppass
Modified: 2009-11-02 15:45 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description ppass 2009-11-02 12:40:47 UTC
I first reported this to the PHP bug system:

http://bugs.php.net/bug.php?id=49984
They say it is a libxml2 issue, so I am reporting it here.

I have read your forum and looked at this page:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

I am trying to parse pages downloaded from the Internet. Unfortunately, most pages do not abide by this specification. Could you make your engine more robust so that it works on pages that do not comply 100% with the official specification? It should be fairly easy, using a regex for instance.


Bug report:

The DOM node returns only partial contents of the script node, as if the
node was mistakenly truncated when reaching the '</div>' text.

Reproduce code:
---------------
<?php

    $html = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01
Transitional//EN"><html><head><meta http-equiv="content-type"
content="text/html;
charset=utf-8"><title>Title</title></head><body><div><script
type="text/javascript" id="script1">function dummy {
object.innerHTML="<div>text</div>"; } function dummy2 { alert("hello");
} </script> </div> </body> </html>';
 
    $dom = new DOMDocument('1.0', 'UTF-8');
    @$dom->loadHTML($html);

    $script_node = $dom->getElementById('script1');
    echo "<![CDATA[$script_node->nodeValue]]>";

?>

Expected result:
----------------
function dummy { object.innerHTML="<div>text</div>"; } function dummy2 {
alert("hello"); } 

I expect to see the whole content of the script node.

Actual result:
--------------
function dummy { object.innerHTML="<div>text

The script node has been truncated.
Comment 1 Daniel Veillard 2009-11-02 15:45:40 UTC
The input is broken and doesn't follow HTML 4 syntax. You recognize that,
fine. You may try adding the HTML_PARSE_RECOVER option when creating the
parsing context; it is likely to work as you expect. I have no idea how to do
this from PHP.

Daniel
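
The W3C note linked in the description says that, in HTML 4, script data ends at the first "</" followed by a letter, and it recommends writing such sequences as "<\/" inside scripts. As a possible workaround on the PHP side, the page could be pre-processed before it is handed to loadHTML(). The following is only a minimal sketch: escape_script_end_tags() is a hypothetical helper, not part of PHP or libxml2, and its regex assumes that script blocks are not nested and that each one is closed by a literal </script>.

```php
<?php
// Hypothetical helper (not part of PHP or libxml2): rewrite "</" as "<\/"
// inside every <script> body, so that the body no longer contains anything
// that an HTML 4 parser can take for an end tag.
function escape_script_end_tags(string $html): string
{
    $result = preg_replace_callback(
        // Group 1: opening <script ...> tag, group 2: body (lazy, so it stops
        // at the first closing tag), group 3: closing </script> tag.
        '~(<script\b[^>]*>)(.*?)(</script\s*>)~is',
        static function (array $m): string {
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

$html = '<div><script type="text/javascript" id="script1">'
      . 'object.innerHTML="<div>text</div>";'
      . '</script></div>';

echo escape_script_end_tags($html), "\n";
// prints: <div><script type="text/javascript" id="script1">object.innerHTML="<div>text<\/div>";</script></div>
```

The escaped string could then be passed to $dom->loadHTML() in place of the raw page. The expectation, not verified here, is that nodeValue would then hold the whole script with "<\/div>" in place of "</div>", which JavaScript reads as the same string literal. A "</" that appears outside a string or regex literal would also be rewritten, so this is not a general fix.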