Bug 87235 – html parser removes relevant whitespace




Summary:	html parser removes relevant whitespace


Status:	VERIFIED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.4.22
Hardware:	Other Linux

Importance:	Normal minor
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	Daniel Veillard

Reported:	2002-07-03 12:56 UTC by Morus Walter
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch (2.29 KB, patch) 2002-07-04 14:41 UTC, Morus Walter

Description Morus Walter 2002-07-03 12:56:55 UTC

libxml's HTML parser seems to strip a bit too much whitespace:

Given an HTML file like
<html><body bgcolor="white">
<p>ab<b> </b>cd ab<i> </i>cd ab<em> </em>cd</p>
</body></html>

xmllint --html outputs
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body bgcolor="white"><p>ab<b/>cd ab<i/>cd ab<em/>cd</p></body></html>

so all whitespace within <b>, <i>, <em> (and so on) is removed if the
elements contain only whitespace (one or more characters; the count does
not matter).

Whitespace is handled correctly if the elements also contain
non-whitespace characters:
<html><body bgcolor="white">
<p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p>
</body></html>
gives
<html><body bgcolor="white"><p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p></body></html>

Of course it does not make much sense to have bold or italic whitespace,
and in my case of an HTML to XML conversion, I simply chose to remove these
elements before parsing. I don't know if there are constructs where it
might matter.
Anyway, it's a pitfall for HTML to XML conversions.

xmllint --version
xmllint: using libxml version 20422
on linux

Comment 1 Morus Walter 2002-07-04 09:04:42 UTC

It's even worse:
the html parser removes whitespace in situations like
<html><body bgcolor="white">
<p><a href="1">text</a> <a href="2">other text</a></p>
</body></html>
or
<html><body bgcolor="white">
<p><span>text</span> <a href="2">other text</a></p>
</body></html>
also :-(
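
Both reports above concern text that consists only of whitespace, either
inside an inline element or between two of them. A minimal sketch of such a
check follows, assuming the whitespace set is space, tab, LF and CR.
is_blank_text is a hypothetical helper written for illustration; it is not a
libxml2 function and not the parser's actual test.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Returns 1 if text is NULL, empty, or made up only of
 * space, tab, LF, or CR characters; returns 0 otherwise. */
int is_blank_text(const char *text)
{
    if (text == NULL)
        return 1;
    for (; *text != '\0'; text++) {
        if (*text != ' ' && *text != '\t' &&
            *text != '\n' && *text != '\r')
            return 0;
    }
    return 1;
}
```

With this, the content of <b> </b> counts as blank, while the content of
<b> b </b> does not, which matches the two cases in the description.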

Comment 2 Morus Walter 2002-07-04 14:41:48 UTC

Created attachment 9637
patch

Comment 3 Morus Walter 2002-07-04 14:43:19 UTC

The attached patch fixes the bug (IMHO).

Since it might be a good idea to see if anyone has a problem
with the changes, I sent it to the mailing list also.

Comment 4 Daniel Veillard 2002-07-05 18:14:09 UTC

The patch seems wrong:

------------------
    if (lastChild == NULL) {
        if ((ctxt->node->type != XML_ELEMENT_NODE) &&
            (ctxt->node->content != NULL)) return(0);
        /* keep ws in constructs like ...<b> </b>... 
           for all tags "b" allowing PCDATA */
        if ( xmlStrEqual(ctxt->name, BAD_CAST allowPCData[i]) ) {
            return(0);
        }
    }
-------------------

   Can you tell me what "i" is supposed to be initialized to in that
line? It seems to be missing a for loop like the one a few lines below.
I added that, as well as some other cleanup. The whole thing may need
a bit of profiling too, to see if it's not too expensive, but I don't
care too much about the HTML parser speed. I also hope nobody will
complain about it; I will ship this and wait for the complaints :-)

   Thanks,


Daniel
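
For illustration, the missing loop from comment 4 could look roughly like
the sketch below. It is standalone C rather than the committed fix: it uses
strcmp instead of xmlStrEqual, and allow_pcdata here is a hypothetical,
shortened stand-in for the allowPCData table in HTMLparser.c.

```c
#include <string.h>

/* Illustrative subset only; an assumption, not the real table. */
static const char *const allow_pcdata[] = {
    "em", "b", "i", "a", "span", "p"
};

/* Returns 1 if name is listed in allow_pcdata, 0 otherwise.
 * The index i is initialized by the for loop, which is the
 * part the quoted patch left out. */
int allows_pcdata(const char *name)
{
    size_t i;
    size_t count = sizeof(allow_pcdata) / sizeof(allow_pcdata[0]);

    if (name == NULL)
        return 0;
    for (i = 0; i < count; i++) {
        if (strcmp(name, allow_pcdata[i]) == 0)
            return 1;
    }
    return 0;
}
```

A linear scan like this runs once per whitespace-only text node, which is
why profiling is mentioned; for a table of this size the cost is likely
small.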

Comment 5 Daniel Veillard 2002-07-05 18:17:36 UTC

Hum, small problem, it seems to break the round-trip from/to HTML:

From make HTMLtests:

Testing test2.html
12a13
> </p>
Testing test3.html
6a7
> </p>
19a21
> </p>
Testing wired.html

Can you have a look at it?

Daniel

Comment 6 Morus Walter 2002-07-08 08:50:36 UTC

Oops. Yes, that should have been a for loop as well.
I did a test, but it must have been very sloppy (I can only assume
that I tested with an a element, in which case it might have worked
by chance).
Sorry for that, and thanks for being more accurate.

The problem with the test cases is a bit more complicated:
A small html fragment that shows the problem is 
<TD><font></font><p></TD>
As far as I can see the problem is the following:

When parsing <p></TD>, the parser sees an empty <p>.
Now an empty <p> is written as <p> without an end tag and, since format
is 1 in testHTML, a newline is added.

So we get - apart from header/footer
<td>
<font></font><p>
</td>

When this is parsed, my proposed changes see a <p> followed by
whitespace. Since the <p> element isn't empty any longer,
in this case a </p> is added.

My suggestion to fix this is to always write end tags for <p> (and
consequently for <li>), by changing the
saveEndTag flag from 1 to 0 (HTMLparser.c lines 413 and 424).

In this case 
<TD><font></font><p></TD>
will get
<td>
<font></font><p></p>
</td>
which is reproduced by testHTML.

Of course this changes the html serializer in general.

Actually I already suggested this change for other reasons.
See http://mail.gnome.org/archives/xslt/2001-November/msg00023.html

I didn't insist on that issue then, since it wasn't that important
to me.

Morus
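
The effect of the saveEndTag change in comment 6 can be shown with a toy
model. This is not libxml2 code: serialize_empty is a hypothetical function,
and it assumes, as the comment describes, that a non-zero saveEndTag lets the
serializer drop the end tag of an element without children, and that
formatted output appends a newline.

```c
#include <stdio.h>

/* Toy model, not the libxml2 serializer.
 * Writes an element that has no children into out.
 * save_end_tag != 0: the end tag is omitted ("<p>\n").
 * save_end_tag == 0: the end tag is written ("<p></p>\n").
 * Returns the value of snprintf. */
int serialize_empty(const char *name, int save_end_tag,
                    char *out, size_t size)
{
    if (save_end_tag)
        return snprintf(out, size, "<%s>\n", name);
    return snprintf(out, size, "<%s></%s>\n", name, name);
}
```

In the first form, a parser that keeps whitespace reads the newline as
content of <p> on the next pass, so the second serialization differs from
the first. In the second form, the newline falls after </p> and the output
stays stable across round trips.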

Comment 7 Daniel Veillard 2002-11-22 13:50:17 UTC

Okay, I finally made that last step and committed the change
to make the HTML serializer output </p>

http://cvs.gnome.org/bonsai/cvsquery.cgi?module=gnome-xml&branch=HEAD&branchtype=match&dir=gnome-xml&file=&filetype=match&who=veillard&whotype=match&sortby=Date&hours=&date=explicit&mindate=11%2F22%2F02+08%3A17&maxdate=11%2F22%2F02+08%3A19&cvsroot=%2Fcvs%2Fgnome

  thanks, I think we will be able to close this issue now

Daniel

Comment 8 Daniel Veillard 2002-11-22 17:52:42 UTC

Okay, I think we can close it with release 2.4.28.

Daniel