After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 304637 - Unnecessary stripping of blank CDATA in HTML parser
Unnecessary stripping of blank CDATA in HTML parser
Status: VERIFIED FIXED
Product: libxml2
Classification: Platform
Component: general
2.6.19
Other Linux
: Normal normal
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-05-18 10:43 UTC by Gary Coady
Modified: 2009-08-15 18:40 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Do not remove blank CDATA nodes from tree (1.05 KB, patch)
2005-05-18 10:45 UTC, Gary Coady
none Details | Review
Do not remove blank CDATA children of the body tag (424 bytes, patch)
2005-05-18 10:56 UTC, Gary Coady
needs-work Details | Review
Allow CDATA as child of html element, except for strict DTDs (1.53 KB, patch)
2005-08-23 23:56 UTC, Gary Coady
none Details | Review

Description Gary Coady 2005-05-18 10:43:24 UTC
Hi there,
looking at the implementation of htmlParseCharData(), it will strip the CDATA
node if it contains only whitespace. If you take the following snippet of HTML,
the resulting DOM tree will be missing several blank spaces:

<b>This</b><i> string</i> <u>has</u> <i><b>several </b></i><u><b>formatting</b></u>
<u><i><b>options</b></i></u> <u><i>applied</i></u>

Locally, I've removed the check for areBlanks() from this function, but I'm not
clear on whether there are any other unintentional (bad) side-effects.
Comment 1 Gary Coady 2005-05-18 10:45:18 UTC
Created attachment 46594 [details] [review]
Do not remove blank CDATA nodes from tree

Not sure if this has any unintentional (bad) side-effects in the general case,
but it's what I'm using locally.
Comment 2 Gary Coady 2005-05-18 10:56:14 UTC
Created attachment 46595 [details] [review]
Do not remove blank CDATA children of the body tag

I looked at areBlanks() and it had a slightly different definition to what I'd
expected. So here's a different patch that just allows spaces in CDATA subnodes
of the <body> tag.

(for reference, the snippet I sent in the original bug report was a direct
child of the body node).
Comment 3 Daniel Veillard 2005-05-18 12:26:20 UTC
1/ when posting a bug report provide *complete* informations
2/ provide reproductible informations

Provide the HTML example and the output of xmllint --html and
explain why you think it's in errror.
I believe that the HTML DTD makes blanks nodes ignorable under 
<html> <body> and <head>. I may be wrong, in which case please
point to  the informations in the spec.

I am not thrilled about changing the default behaviour for something
that massive without even a start of a debate on the mailing-list
so I won't apply those patches without justification and minimal debate.

Adding an HTML parser option which can optionally change that behaviour
might be easier to accept, but in any case this need to be argumented,
and on the archived list.

Daniel
Comment 4 Daniel Veillard 2005-08-23 16:36:22 UTC
not getting the requested informations,

Daniel
Comment 5 Gary Coady 2005-08-23 16:55:12 UTC
Sorry, I forgot to update the bugzilla entry here (after forwarding the relevant
details to the libxml2 list):

If I can refer to your mail <20050518132044.GY20018@redhat.com> to the libxml2
list, dated Wed, 18 May 2005 09:20:45 -0400, at that point you felt that this
change seemed okay, but wanted to see if there were more comments.

If you want to leave the behaviour as-is, please feel free to re-close the bug;
and I apologise for not updating bugzilla.

The contents of your mail were:

On Wed, May 18, 2005 at 02:09:20PM +0100, Gary Coady wrote:
>> Hi there,
>> I've found that blank content nodes are stripped underneath the BODY
>> level. Given the attached HTML file message.html, the result of running
>> "xmllint --html" is in the file message_xmllint.html (xmllint reported
>> no errors).
>> 
>> Any blank nodes which are children of the html element are stripped,
>> causing the removal of spaces from the message.
>> 
>> According to the HTML 4.01 loose DTD, the body element can contain
>> PCDATA; relevant lines from http://www.w3.org/TR/REC-html40/loose.dtd are:
>> 
>> <!ELEMENT BODY O O (%flow;)* +(INS|DEL) -- document body -->
>> <!ENTITY % flow "%block; | %inline;">
>> <!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; |
>> %formctrl;">
>> 
>> This is not true of the strict DTD; the body element in that case cannot
>> contain PCDATA.
>> 
>> The bug I filed is at http://bugzilla.gnome.org/show_bug.cgi?id=304637
>> and an initial suggested patch at
>> http://bugzilla.gnome.org/attachment.cgi?id=46595&action=view
>> 
>> Daniel did say in the bug that it might be better if this change was
>> added as a HTML parser option.

  okay, thanks for coming back with complete informations.
Based on the loose DTD fragment I think it would be fine to remove the
blanks text node stripping of body elements, even by default. But others
may care and disagree.
  We will just have to make sure that parsing/serializing repeatedly 
such an instance with libxml2 does not break (should be fine I think).

Daniel
Comment 6 Daniel Veillard 2005-08-23 22:31:06 UTC
I just lost track of the issue too.
So let's reopen it, if there is technical reasons to change the behaviour
we should do it. I try to made the fix from the second patch removing <html> tag.
But it looks wrong, basically all text elements are then replaced with
paragraphed tests:

paphio:~/XML -> xmllint  --html ./test/HTML/Down.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>This service is temporary down</title></head>
<body bgcolor="#FFFFFF">
<p>
</p>
<h1 align="center">Sorry, this service is temporary down</h1>
<p>
We are doing our best to get it back on-line,

</p>
<p>The W3C system administrators</p>
<p>
</p>
</body>
</html>

instead of the current output:

/usr/bin/xmllint  --html ./test/HTML/Down.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>This service is temporary down</title></head>
<body bgcolor="#FFFFFF">
<h1 align="center">Sorry, this service is temporary down</h1>
<p>
We are doing our best to get it back on-line,

</p>
<p>The W3C system administrators</p>
</body>
</html>

  which look quite better to me. Maybe empty CDATA should not be removed
but the patch as-is replaces them with <p> elements and that doesn't sound
right at all, so reopened but pending,

Daniel
Comment 7 Gary Coady 2005-08-23 23:56:24 UTC
Created attachment 51223 [details] [review]
Allow CDATA as child of html element, except for strict DTDs

I found that editing the array htmlNoContentElements will probably fix this - 

However, when looking up the origin of that array, I came across this thread:
http://mail.gnome.org/archives/xml/2004-September/msg00133.html
And indeed, as the thread summarises (and as my earlier mail also mentioned),
this change is fine for the loose DTD, but not for the strict DTD.

So one way is to just remove "body" from the list of htmlNoContentElements
strings.

Alternatively, libxml could check the document DTD.

I'm attaching a patch which checks for the HTML 4 or HTML 4.01 strict DTD. This
assumes that no other DTDs need to behave like these particular strict DTDs. It
also only checks the external ID, but not the system identifier.

It behaves well for me in local testing. With no DTD or the loose DTD, it adds
CDATA as children of the html element. With the strict DTD, they are not
present.
Comment 8 Daniel Veillard 2005-09-01 09:53:58 UTC
Okay I applied that last patch. It still required to change the
output of a number of tests but the changes are marginally 
restrained to handling of formatting spaces in body and head.
If someone complains about this I will balme you :-)

  thanks,

Daniel
Comment 9 Daniel Veillard 2005-09-05 08:58:56 UTC
This should be closed by release of libxml2-2.6.21,

  thanks,

Daniel