Bug 444994 - HTML chunked parsing failure when attribute contains <>
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: git master
Hardware: Other All
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2007-06-07 05:19 UTC by James Bursa
Modified: 2009-08-25 12:44 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Source that tests htmlParseChunk with a given file and chunk size (993 bytes, text/plain)
2007-06-07 05:21 UTC, James Bursa
HTML file that triggers the parse error (55 bytes, text/plain)
2007-06-07 05:23 UTC, James Bursa
C++ program to show the bug. (1.75 KB, text/plain)
2008-07-02 13:22 UTC, T. Manske

Description James Bursa 2007-06-07 05:19:45 UTC
The input

<td onmouseover="ChangeText('<b>Trouble at sea</b>')">

causes HTML parser errors with htmlParseChunk when a chunk boundary falls at certain positions inside the attribute value.

The attached source and html file reproduce the problem.
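
For readers without the attachment, the chunking itself can be sketched as below. This is only an illustration and not the attached source: feed_in_chunks, chunk_sink, size_log and record_size are hypothetical names. In a real harness the sink would presumably forward each piece to htmlParseChunk(ctxt, chunk, size, terminate); here it only records the sizes, so the sketch does not need libxml2.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLS 64

/* Hypothetical sink: receives one piece of input. `terminate` is 1 only
 * on the final, empty call that signals end of input. */
typedef void (*chunk_sink)(const char *chunk, int size, int terminate,
                           void *user);

/* Feed `len` bytes of `data` to `sink` in pieces of at most `chunk_size`
 * bytes, then make one empty terminating call.
 * Returns the total number of sink calls. */
static int feed_in_chunks(const char *data, size_t len, size_t chunk_size,
                          chunk_sink sink, void *user)
{
    int calls = 0;
    size_t off = 0;

    assert(data != NULL && sink != NULL && chunk_size > 0);

    while (off < len) {
        size_t n = len - off;
        if (n > chunk_size)
            n = chunk_size;
        sink(data + off, (int)n, 0, user);
        off += n;
        calls++;
    }
    sink(NULL, 0, 1, user);  /* corresponds to the trailing "htmlParseChunk 0" */
    return calls + 1;
}

/* Example sink that only records the size of every call. */
struct size_log {
    int sizes[MAX_CALLS];
    int count;
};

static void record_size(const char *chunk, int size, int terminate, void *user)
{
    struct size_log *rec = (struct size_log *)user;
    (void)chunk;
    (void)terminate;
    if (rec->count < MAX_CALLS)
        rec->sizes[rec->count++] = size;
}
```

For the 55-byte test file, a chunk size of 10 gives five 10-byte calls, one 5-byte call and the empty terminating call, while a chunk size of 100 gives a single 55-byte call followed by the terminating call.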

Parsing in 100 byte chunks succeeds:

$ ./html_chunk_test 100 test3.html
htmlParseChunk 55
htmlParseChunk 0
HTML DOCUMENT
standalone=true
  DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
  ELEMENT html
    ELEMENT body
      ELEMENT td
        ATTRIBUTE onmouseover
          TEXT
            content=ChangeText('<b>Trouble at sea</b>')

Parsing in 10 byte chunks fails:

$ ./html_chunk_test 10 test3.html
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
HTML parser error : AttValue: " expected
<td onmouseover="ChangeText('<b>Trouble
                                        ^
HTML parser error : Couldn't find end of Start Tag td
<td onmouseover="ChangeText('<b>Trouble
                                        ^
htmlParseChunk 10
HTML parser error : Unexpected end tag : b
<td onmouseover="ChangeText('<b>Trouble at sea</b>
                                                  ^
htmlParseChunk 5
htmlParseChunk 0
HTML DOCUMENT
standalone=true
  DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
  ELEMENT html
    ELEMENT body
      ELEMENT td
        ATTRIBUTE onmouseover
          TEXT
            content=ChangeText('<b>Trouble
      TEXT
        content=at sea')">
Comment 1 James Bursa 2007-06-07 05:21:56 UTC
Created attachment 89529
Source that tests htmlParseChunk with a given file and chunk size
Comment 2 James Bursa 2007-06-07 05:23:03 UTC
Created attachment 89530
HTML file that triggers the parse error
Comment 3 T. Manske 2008-07-02 13:16:19 UTC
I can confirm this; it is still present in libxml2 2.6.32. It happens when a chunk boundary falls at or after a '>' inside an attribute value and before the tag's closing '>'. I suspect the cause is that htmlParseLookupSequence(ctxt, '>', 0, 0, 0), which htmlParseTryOrFinish() uses to scan forward to the end of the tag, ignores quotation marks.
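
The difference between a scan that ignores quotation marks and one that respects them can be illustrated with a small standalone sketch. This is not libxml2 code: find_gt_naive and find_gt_quoted are hypothetical helpers written only to show the idea, and they assume that attribute values are delimited by matching single or double quotes.

```c
#include <assert.h>
#include <stddef.h>

/* Naive scan: index of the first '>' at or after `from`, ignoring
 * quotes entirely. Returns -1 if there is none. */
static long find_gt_naive(const char *buf, size_t len, size_t from)
{
    size_t i;

    assert(buf != NULL);
    for (i = from; i < len; i++)
        if (buf[i] == '>')
            return (long)i;
    return -1;
}

/* Quote-aware scan: a '>' inside '...' or "..." does not count.
 * Returns the index of the first unquoted '>' or -1 if there is none. */
static long find_gt_quoted(const char *buf, size_t len, size_t from)
{
    char quote = 0;  /* currently open quote character, 0 if none */
    size_t i;

    assert(buf != NULL);
    for (i = from; i < len; i++) {
        char c = buf[i];
        if (quote) {
            if (c == quote)
                quote = 0;       /* closing quote of the value */
        } else if (c == '"' || c == '\'') {
            quote = c;           /* opening quote of a value */
        } else if (c == '>') {
            return (long)i;      /* real end of the tag */
        }
    }
    return -1;
}
```

On the test input, the naive scan stops at offset 31, the '>' of the <b> inside the attribute value, whereas the quote-aware scan reaches offset 53, the '>' that actually closes the td tag.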
Comment 4 T. Manske 2008-07-02 13:22:44 UTC
Created attachment 113856
C++ program to show the bug.

This C++ program uses the same simple HTML document and parses it multiple times in two chunks, moving the split point across the critical region.
Comment 5 Steve Madsen 2008-11-11 22:14:00 UTC
I am experiencing this behavior as well. One additional side effect is that the SAX startElement callback is invoked with an incomplete attribute list in this case.
Comment 6 Daniel Veillard 2009-08-25 12:44:18 UTC
Okay, found it: the problem was in htmlParseLookupSequence(). Basically, if the chunk
ended inside an attribute value, ctxt->checkIndex would still be saved,
but without any record of being within the attribute, and on the next call
we would restart the scan from inside the attribute value without
that knowledge.
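
The effect of losing that state can be reproduced outside libxml2 with a toy resumable scanner. The sketch below is only an illustration under assumptions: struct scan_state, scan_tag_end and first_tag_end are hypothetical names, the buffer is assumed to accumulate all input seen so far, and the keep_quote flag switches between forgetting and remembering the open quote across calls. It is not the actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resumable-scan state (not libxml2's parser context). */
struct scan_state {
    size_t check_index;  /* offset where the next scan resumes */
    char quote;          /* open quote character, 0 when outside a value */
};

/* Scan buf[0..len) for an unquoted '>', resuming at st->check_index.
 * With keep_quote == 0 the saved quote state is ignored on entry,
 * which mimics resuming inside an attribute value without knowing it.
 * Returns the index of the '>' found, or -1 if more input is needed. */
static long scan_tag_end(struct scan_state *st, const char *buf, size_t len,
                         int keep_quote)
{
    char quote = keep_quote ? st->quote : 0;
    size_t i;

    assert(st != NULL && buf != NULL);
    for (i = st->check_index; i < len; i++) {
        char c = buf[i];
        if (quote) {
            if (c == quote)
                quote = 0;
        } else if (c == '"' || c == '\'') {
            quote = c;
        } else if (c == '>') {
            return (long)i;
        }
    }
    st->check_index = len;  /* resume here when more input arrives */
    st->quote = quote;      /* only honored when keep_quote != 0 */
    return -1;
}

/* Reveal `doc` in steps of `chunk` bytes and return the first offset
 * that the scanner reports as the end of the tag, or -1 if none. */
static long first_tag_end(const char *doc, size_t len, size_t chunk,
                          int keep_quote)
{
    struct scan_state st = {0, 0};
    size_t avail = 0;

    if (chunk == 0)
        return -1;
    while (avail < len) {
        long end;
        avail = (len - avail > chunk) ? avail + chunk : len;
        end = scan_tag_end(&st, doc, avail, keep_quote);
        if (end >= 0)
            return end;
    }
    return -1;
}
```

With the test input revealed 10 bytes at a time, forgetting the quote state makes the scanner report offset 31 (the '>' of the inner <b>) once the fourth chunk arrives, which is the same point at which the errors appear in the original report. Remembering the quote state, or seeing the whole input in a single chunk, gives offset 53, the real end of the tag.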

paphio:~/XML -> ./tst 100 test3.html
htmlParseChunk 55
htmlParseChunk 0
HTML DOCUMENT
standalone=true
  DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
  ELEMENT html
    ELEMENT body
      ELEMENT td
        ATTRIBUTE onmouseover
          TEXT
            content=ChangeText('<b>Trouble at sea</b>')
paphio:~/XML -> ./tst 10 test3.html
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 10
htmlParseChunk 5
htmlParseChunk 0
HTML DOCUMENT
standalone=true
  DTD(html), PUBLIC -//W3C//DTD HTML 4.0 Transitional//EN, SYSTEM http://www.w3.org/TR/REC-html40/loose.dtd
  ELEMENT html
    ELEMENT body
      ELEMENT td
        ATTRIBUTE onmouseover
          TEXT
            content=ChangeText('<b>Trouble at sea</b>')
paphio:~/XML -> 

  Thanks for html_chunk.c and the reproducer!

Daniel