Bug 87235 - html parser removes relevant whitespace
Status: VERIFIED FIXED
Product: libxml2
Classification: Platform
Component: general
Version: 2.4.22
Hardware: Other Linux
Importance: Normal minor
Target Milestone: ---
Assigned To: Daniel Veillard
Daniel Veillard
Reported: 2002-07-03 12:56 UTC by Morus Walter
Modified: 2009-08-15 18:40 UTC
Attachments
patch (2.29 KB, patch)
2002-07-04 14:41 UTC, Morus Walter

Description Morus Walter 2002-07-03 12:56:55 UTC
libxml's HTML parser seems to strip a bit too much whitespace:

Given an HTML file like
<html><body bgcolor="white">
<p>ab<b> </b>cd ab<i> </i>cd ab<em> </em>cd</p>
</body></html>

xmllint --html outputs
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body bgcolor="white"><p>ab<b/>cd ab<i/>cd ab<em/>cd</p></body></html>

so all whitespace within <b>, <i>, <em> (and so on) is removed if the
elements contain only whitespace (one or more characters; the amount does not matter).

Whitespace is handled correctly if the elements also contain
non-whitespace characters:
<html><body bgcolor="white">
<p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p>
</body></html>
gives
<html><body bgcolor="white"><p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p></body></html>

Of course it does not make much sense to have bold or italic whitespace, and in
my case of an HTML-to-XML conversion I simply chose to remove these
elements before parsing. I don't know if there are constructs where it
might matter.
Anyway, it's a pitfall for HTML-to-XML conversions.

xmllint --version
xmllint: using libxml version 20422
on linux
Comment 1 Morus Walter 2002-07-04 09:04:42 UTC
It's even worse:
the HTML parser removes whitespace in situations like
<html><body bgcolor="white">
<p><a href="1">text</a> <a href="2">other text</a></p>
</body></html>
or
<html><body bgcolor="white">
<p><span>text</span> <a href="2">other text</a></p>
</body></html>
also :-(
Comment 2 Morus Walter 2002-07-04 14:41:48 UTC
Created attachment 9637
patch
Comment 3 Morus Walter 2002-07-04 14:43:19 UTC
The attached patch fixes the bug (IMHO).

Since it might be a good idea to see whether anyone has a problem
with the changes, I also sent it to the mailing list.
Comment 4 Daniel Veillard 2002-07-05 18:14:09 UTC
The patch seems wrong:

------------------
    if (lastChild == NULL) {
        if ((ctxt->node->type != XML_ELEMENT_NODE) &&
            (ctxt->node->content != NULL)) return(0);
        /* keep ws in constructs like ...<b> </b>... 
           for all tags "b" allowing PCDATA */
        if ( xmlStrEqual(ctxt->name, BAD_CAST allowPCData[i]) ) {
            return(0);
        }
    }
-------------------

   Can you tell me what "i" is supposed to be initialized to in that
line? It seems to be missing a for loop like the one a few lines below.
I added that, along with some other cleanup. The whole thing may need
a bit of profiling too, to see whether it's too expensive, but I don't
care too much about the HTML parser speed. I also hope nobody will
complain about it; I will ship this and wait for the complaints :-)

   Thanks,


Daniel
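
For readers following along, the missing loop can be sketched in standalone C as below. This is only an illustration under stated assumptions: the table contents and the names `allow_pcdata` and `keeps_blank_text` are hypothetical and are not taken from HTMLparser.c; the real code compares against ctxt->name with xmlStrEqual as in the excerpt above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parser's table of tags that allow PCDATA.
 * The entries here are examples only, not the actual libxml2 table. */
static const char *const allow_pcdata[] = {
    "em", "strong", "b", "i", "a", "span", "p"
};
#define ALLOW_PCDATA_COUNT (sizeof(allow_pcdata) / sizeof(allow_pcdata[0]))

/* Returns 1 if whitespace-only text directly inside element `name`
 * should be kept, 0 otherwise. The index i is initialized by the loop,
 * which is the step the quoted excerpt left out. */
static int keeps_blank_text(const char *name)
{
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; i < ALLOW_PCDATA_COUNT; i++) {
        if (strcmp(name, allow_pcdata[i]) == 0)
            return 1;
    }
    return 0;
}
```

With this table, keeps_blank_text("b") returns 1 while keeps_blank_text("table") returns 0, so a construct like <b> </b> would keep its blank text node.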
Comment 5 Daniel Veillard 2002-07-05 18:17:36 UTC
Hum, small problem, it seems to break the round-trip from/to HTML:

From make HTMLtests:

Testing test2.html
12a13
> </p>
Testing test3.html
6a7
> </p>
19a21
> </p>
Testing wired.html

Can you have a look at it?

Daniel
Comment 6 Morus Walter 2002-07-08 08:50:36 UTC
Oops. Yes, that should have been a for loop as well.
I did a test, but it must have been very sloppy (I can only assume
that I tested with an a element, in which case it might have worked
by chance).
Sorry for that, and thanks for being more accurate.

The problem with the test cases is a bit more complicated.
A small HTML fragment that shows the problem is
<TD><font></font><p></TD>
As far as I can see, the problem is the following:

When parsing <p></TD>, the parser sees an empty <p>.
An empty <p> is written as <p> without an end tag and, since format
is 1 in testHTML, a newline is added.

So we get, apart from header/footer:
<td>
<font></font><p>
</td>

When this is parsed again, my proposed changes see a <p> followed by
whitespace. Since the <p> element is no longer empty,
a </p> is added in this case.

My suggestion to fix this is to always add end tags for <p> (and
consequently for <li>) by changing the
saveEndTag flag from 1 to 0 (HTMLparser.c lines 413 and 424).

In this case
<TD><font></font><p></TD>
will become
<td>
<font></font><p></p>
</td>
which is reproduced by testHTML.

Of course this changes the HTML serializer in general.

Actually, I already suggested this change for other reasons.
See http://mail.gnome.org/archives/xslt/2001-November/msg00023.html

I didn't insist on that issue then, since it wasn't that important
to me.

Morus
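
The effect of the flag change described in this comment can be sketched in standalone C. This is a simplified illustration, not the libxml2 serializer: the struct `elem_desc`, its field `save_end_tag`, and the function `write_empty_element` are hypothetical names, and the meaning of the flag (1 leaves the end tag off an empty element, 0 always writes it) is assumed from the description above.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, reduced element description holding only the flag
 * discussed in this comment. */
struct elem_desc {
    const char *name;
    int save_end_tag; /* assumed: 1 = end tag left off when element is empty */
};

/* Serializes an empty element into buf and returns snprintf's result.
 * With save_end_tag set, "<p>" is written and a later re-parse may see
 * the following newline as content; with it cleared, "<p></p>" is
 * written and the element stays empty on the next pass. */
static int write_empty_element(char *buf, size_t size,
                               const struct elem_desc *d)
{
    if (d->save_end_tag)
        return snprintf(buf, size, "<%s>", d->name);
    return snprintf(buf, size, "<%s></%s>", d->name, d->name);
}
```

Calling it with { "p", 1 } yields <p>, and with { "p", 0 } yields <p></p>, which matches the before and after output shown in this comment.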
Comment 7 Daniel Veillard 2002-11-22 13:50:17 UTC
Okay, I finally made that last step and committed the change
to make the HTML serializer output </p>.

http://cvs.gnome.org/bonsai/cvsquery.cgi?module=gnome-xml&branch=HEAD&branchtype=match&dir=gnome-xml&file=&filetype=match&who=veillard&whotype=match&sortby=Date&hours=&date=explicit&mindate=11%2F22%2F02+08%3A17&maxdate=11%2F22%2F02+08%3A19&cvsroot=%2Fcvs%2Fgnome

  Thanks, I think we will be able to close this issue now.

Daniel
Comment 8 Daniel Veillard 2002-11-22 17:52:42 UTC
Okay, I think we can close it with release 2.4.28.

Daniel