Bug 169834 – [PATCH] Add option to HTML parser to behave more like web browsers




Summary:	[PATCH] Add option to HTML parser to behave more like web browsers


Status:	VERIFIED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.17
Hardware:	Other FreeBSD

Importance:	High enhancement
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-03-10 13:57 UTC by Paul Loberg
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch for more relaxed parsing of script blocks (3.63 KB, patch) 2005-04-13 14:21 UTC, Paul Loberg
Patch to allow attributes on end-tags (943 bytes, patch) 2005-05-30 14:35 UTC, Paul Loberg

Description Paul Loberg 2005-03-10 13:57:14 UTC

When the libxml2 HTML parser encounters a SCRIPT or STYLE tag, it continues
parsing the content of the tag and only handles it as CDATA if the tag is
directly followed by a comment (i.e. "<script><!-- "). As far as I can see, this
is in accordance with the HTML4 recommendation.

However, many web browsers seem to implicitly add comments to script and style
tags, and treat the data between <script> and </script> as CDATA without parsing it.

It would be nice if the HTML parser in libxml2 had an option to behave that way too.

Comment 1 Paul Loberg 2005-04-13 14:21:33 UTC

Created attachment 45216
Patch for more relaxed parsing of script blocks

Suggested patch that adds an "HTML_PARSE_RELAXED" option flag to the HTML parser.
When enabled, other end tags are ignored inside a script/style block.
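
To illustrate the behaviour being asked for, here is a minimal, self-contained sketch in plain C. It is not the attached patch and does not use libxml2; the helper names starts_with_nocase and find_raw_text_end are hypothetical. It assumes that a browser-like parser treats everything after <script> as character data until it sees the matching </script end tag, so an embedded "</foo>" is skipped over.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check that `s` begins with `prefix` (ASCII only). */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;   /* also covers `s` ending before `prefix` does */
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper, not part of libxml2.
 * `content` points just past the opening tag; `name` is e.g. "script".
 * Returns a pointer to the "</name" that closes the element, or NULL if
 * there is none. Any other end tag is treated as plain character data.
 */
static const char *find_raw_text_end(const char *content, const char *name)
{
    size_t len = strlen(name);
    const char *p;

    for (p = content; (p = strstr(p, "</")) != NULL; p += 2) {
        if (starts_with_nocase(p + 2, name)) {
            /* The name must end here, so "</scripted>" does not match. */
            char next = p[2 + len];
            if (next == '>' || next == '/' || isspace((unsigned char)next))
                return p;
        }
    }
    return NULL;
}
```

A caller would hand everything between `content` and the returned pointer to the tree as a single text node, and resume normal parsing at the returned end tag.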

Comment 2 Paul Loberg 2005-05-30 14:35:41 UTC

Created attachment 47036
Patch to allow attributes on end-tags

Some web sites also put attributes on the end tags in their HTML. This patch
will, if the HTML_PARSE_RELAXED option is set, ignore these and skip to the '>'
instead of including the attributes and the '>' of the end tag as a text node.
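
The skip-to-'>' idea can be sketched in a few lines of plain C. This is an illustration only, not the attached patch; skip_end_tag_tail is a hypothetical helper and is not part of libxml2. It assumes the caller has already consumed "</" and the tag name, and it records whether anything other than blanks was skipped so that an error can still be reported.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of libxml2.
 * `p` points just past an end tag's name (e.g. right after "</p").
 * Returns a pointer just past the closing '>', or NULL if the input ends
 * before a '>' is found. Sets *had_junk to 1 when non-blank characters
 * (such as attributes) were skipped, so the caller can still emit an error.
 */
static const char *skip_end_tag_tail(const char *p, int *had_junk)
{
    *had_junk = 0;
    while (*p != '\0' && *p != '>') {
        if (!isspace((unsigned char)*p))
            *had_junk = 1;
        p++;
    }
    return (*p == '>') ? p + 1 : NULL;
}
```

One consequence of scanning this way: if the end tag is missing its own '>', as in "</p <hr />", the scan stops at the '>' of the next tag, so that tag is swallowed along with the junk.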

Comment 3 Daniel Veillard 2005-08-23 16:06:51 UTC

Okay, I looked at both patches. They are not acceptable as-is because they
dismiss the errors instead of reporting them. The first patch also adds a field
in the middle of a public structure, which breaks the ABI and is unacceptable as is.
I also did not want to "invent" a new option when there is already a RECOVER one
in the XML parser. I reused your patches in the following way:
   - use the recovery ctxt field
   - create an HTML_PARSE_RECOVER flag using the same value as its XML counterpart
   - rewrite the patches to use those and always emit an error message

Note that your second patch may be worse than the current behaviour, as you may lose
the following tag. Example:

paphio:~/XML -> cat tst.html
<html>
<head>
<script>
  "</foo>"
</script>
</head>
<body>
  <p> this is really </p <hr />
</body>
</html>
paphio:~/XML ->

  When parsed with recovery:

paphio:~/XML -> xmllint --recover --html tst.html
tst.html:4: HTML parser error : Element script embbeds close tag
  "</foo>"
   ^
tst.html:8: HTML parser error : End tag : expected '>'
  <p> this is really </p <hr />
                         ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><script>
  "</foo>"
</script></head>
<body><p> this is really </p></body>
</html>
paphio:~/XML ->

  At least the errors are signalled. The default behaviour of the parser 
remains the same:

paphio:~/XML -> xmllint --html tst.html
tst.html:4: HTML parser error : Unexpected end tag : foo
  "</foo>"
         ^
tst.html:8: HTML parser error : End tag : expected '>'
  <p> this is really </p <hr />
                         ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><script>
  ""
</script></head>
<body>
<p> this is really </p>
<hr>
</body>
</html>
paphio:~/XML ->

  The changes are in CVS,

Daniel

Comment 4 Daniel Veillard 2005-09-05 08:59:20 UTC

This should be closed by the release of libxml2-2.6.21,

  thanks,

Daniel