GNOME Bugzilla – Bug 87235
html parser removes relevant whitespace
Last modified: 2009-08-15 18:40:50 UTC
libxml's HTML parser seems to strip a bit too much whitespace. Given an HTML file like

    <html><body bgcolor="white"> <p>ab<b> </b>cd ab<i> </i>cd ab<em> </em>cd</p> </body></html>

xmllint --html outputs

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body bgcolor="white"><p>ab<b/>cd ab<i/>cd ab<em/>cd</p></body></html>

so all whitespace within <b>, <i>, <em> (and so on) is removed if the element contains only whitespace (one character or more; that does not matter).

Whitespace is handled correctly if the elements also contain non-whitespace:

    <html><body bgcolor="white"> <p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p> </body></html>

gives

    <html><body bgcolor="white"><p>ab<b> b </b>cd ab<i> i </i>cd ab<em> e </em>cd</p></body></html>

Of course it does not make much sense to have bold or italic whitespace, and in my case of an HTML to XML conversion I simply choose to remove these elements before parsing. I don't know if there are constructs where it might matter. Anyway, it's a pitfall for HTML to XML conversions.

    xmllint --version
    xmllint: using libxml version 20422

on Linux
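To make the triggering condition concrete, here is a minimal sketch of the "contains only whitespace" test described in this report. It is not libxml code: the function name is_blank_only is hypothetical, and the set of characters treated as whitespace (space, tab, CR, LF) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml.
 * Returns 1 if s is non-empty and consists solely of space, tab,
 * carriage return or line feed; returns 0 otherwise. */
static int is_blank_only(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return 0;               /* an empty string holds no whitespace at all */
    for (; *s != '\0'; s++) {
        if (*s != ' ' && *s != '\t' && *s != '\n' && *s != '\r')
            return 0;           /* found a non-whitespace character */
    }
    return 1;
}
```

With this helper, the content of <b> </b> counts as blank-only (the case that gets dropped), while the content of <b> b </b> does not (the case that survives).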
It's even worse: the HTML parser also removes whitespace in situations like

    <html><body bgcolor="white"> <p><a href="1">text</a> <a href="2">other text</a></p> </body></html>

or

    <html><body bgcolor="white"> <p><span>text</span> <a href="2">other text</a></p> </body></html>

:-(
Created attachment 9637: patch
The attached patch fixes the bug (IMHO). Since it might be a good idea to see whether anyone has a problem with the changes, I also sent it to the mailing list.
The patch seems wrong:

------------------
    if (lastChild == NULL) {
        if ((ctxt->node->type != XML_ELEMENT_NODE) &&
            (ctxt->node->content != NULL)) return(0);
        /* keep ws in constructs like ...<b> </b>... for all tags "b"
           allowing PCDATA */
        if ( xmlStrEqual(ctxt->name, BAD_CAST allowPCData[i]) ) {
            return(0);
        }
    }
-------------------

Can you tell me what "i" is supposed to be initialized to in that line??? It seems to be missing a for loop like the one a few lines below.

I added that, along with some other cleanup. The whole thing may need a bit of profiling too, to see whether it's too expensive, but I don't care too much about the HTML parser speed. I also hope nobody will complain about it; I will ship this and wait for the complaints :-)

Thanks,
Daniel
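For illustration, the missing loop could look roughly like the sketch below. It is standalone and does not use libxml types. The table allow_pcdata is a shortened, hypothetical stand-in for the parser's allowPCData list, the function name keeps_blank_text is made up, and strcmp stands in for xmlStrEqual.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, shortened stand-in for the parser's allowPCData table. */
static const char *const allow_pcdata[] = { "b", "i", "em" };

/* Returns 1 if whitespace-only text inside an otherwise empty element
 * called `name` should be kept, 0 if it may be dropped. */
static int keeps_blank_text(const char *name)
{
    size_t i;

    assert(name != NULL);
    /* Walk the whole table instead of reading one uninitialized index. */
    for (i = 0; i < sizeof(allow_pcdata) / sizeof(allow_pcdata[0]); i++) {
        if (strcmp(name, allow_pcdata[i]) == 0)
            return 1;
    }
    return 0;
}
```

A linear scan per text node is the simplest fix. Whether it is too expensive is the profiling question raised in the comment.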
Hum, small problem: it seems to break the round-trip from/to HTML. From make HTMLtests:

    Testing test2.html
    12a13
    > </p>
    Testing test3.html
    6a7
    > </p>
    19a21
    > </p>
    Testing wired.html

Can you have a look at it?

Daniel
Oops. Yes, that should have been a for loop as well. I did a test, but it must have been very sloppy (I can only assume that I tested with an a element, in which case it might have worked by chance). Sorry for that, and thanks for being more accurate.

The problem with the test cases is a bit more complicated. A small HTML fragment that shows the problem is

    <TD><font></font><p></TD>

As far as I can see, the problem is the following: when parsing <p></TD> the parser sees an empty <p>. Now an empty <p> is written as <p> without an end tag and, since format is 1 in testHTML, a newline is added. So we get, apart from header and footer,

    <td>
    <font></font><p>
    </td>

When this is parsed, my proposed changes see a <p> followed by whitespace. Since the <p> element is no longer empty, a </p> is added in this case.

My suggestion to fix this is to always add end tags for <p> (and consequently for <li>), by changing the saveEndTag flag from 1 to 0 (HTMLparser.c line 413 and 424). In this case

    <TD><font></font><p></TD>

will become

    <td>
    <font></font><p></p>
    </td>

which is reproduced by testHTML.

Of course this changes the HTML serializer in general. Actually I already suggested this change for other reasons. See http://mail.gnome.org/archives/xslt/2001-November/msg00023.html
I didn't insist on that issue then, since it wasn't that important to me.

Morus
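The effect of the flag on an empty element can be modelled with the small sketch below. It is a simplified stand-in, not the real serializer: struct elem_desc and write_empty_element are hypothetical names. The flag semantics (nonzero leaves out the end tag of an empty element) are assumed from the comment above, where changing 1 to 0 makes </p> appear.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical stand-in for one entry of the element table. */
struct elem_desc {
    const char *name;
    int save_end_tag;   /* assumed: nonzero omits the end tag of an empty element */
};

/* Serializes an element without children into out.
 * Returns the number of characters written, or -1 if out is too small. */
static int write_empty_element(const struct elem_desc *d, char *out, size_t cap)
{
    int n;

    assert(d != NULL && d->name != NULL && out != NULL);
    if (d->save_end_tag)
        n = snprintf(out, cap, "<%s>", d->name);                 /* e.g. <p>     */
    else
        n = snprintf(out, cap, "<%s></%s>", d->name, d->name);   /* e.g. <p></p> */
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

With the flag set to 1, an empty p comes out as <p>, and reparsing the formatted output then sees <p> followed by a newline. With the flag set to 0 it comes out as <p></p>, so the element stays empty on the next parse and the round-trip is stable.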
Okay, I finally made that last step and committed the change to make the HTML serializer output </p>:

http://cvs.gnome.org/bonsai/cvsquery.cgi?module=gnome-xml&branch=HEAD&branchtype=match&dir=gnome-xml&file=&filetype=match&who=veillard&whotype=match&sortby=Date&hours=&date=explicit&mindate=11%2F22%2F02+08%3A17&maxdate=11%2F22%2F02+08%3A19&cvsroot=%2Fcvs%2Fgnome

Thanks, I think we will be able to close this issue now.

Daniel
Okay I think we can close it with release 2.4.28, Daniel