Bug 319716 - HTMLparser bug with space around <img> tags
Status: RESOLVED DUPLICATE of bug 681822
Product: libxml2
Classification: Platform
Component: htmlparser
Version: 2.6.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-10-25 09:57 UTC by Michael Day
Modified: 2017-06-17 10:50 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Michael Day 2005-10-25 09:57:52 UTC
HTMLparser removes space between elements in some situations like this:

    <p><img src="foo"> <img src="bar"></p>

so that the output will be like this:

    <p><img src="foo"><img src="bar"></p>

However, this does not seem to be correct; at least it is not the way that
web browsers parse this kind of HTML. (If <span> elements are used instead
of <img> elements, then HTMLparser correctly preserves the space.)
Comment 1 Daniel Veillard 2006-10-17 19:19:55 UTC
There is a heuristic in areBlanks(), which is called when blank strings
have been parsed, to check whether they need to be ignored or not:

    lastChild = xmlGetLastChild(ctxt->node);
    while ((lastChild) && (lastChild->type == XML_COMMENT_NODE))
        lastChild = lastChild->prev;
    if (lastChild == NULL) {
        if ((ctxt->node->type != XML_ELEMENT_NODE) &&
            (ctxt->node->content != NULL)) return(0);
        /* keep ws in constructs like ...<b> </b>...
           for all tags "b" allowing PCDATA */
        for ( i = 0; i < sizeof(allowPCData)/sizeof(allowPCData[0]); i++ ) {
            if ( xmlStrEqual(ctxt->name, BAD_CAST allowPCData[i]) ) {
                return(0);
            }
        }
    } else if (xmlNodeIsText(lastChild)) {
        return(0);
    } else {
        /* keep ws in constructs like <p><b>xy</b> <i>z</i><p>
           for all tags "p" allowing PCDATA */
        for ( i = 0; i < sizeof(allowPCData)/sizeof(allowPCData[0]); i++ ) {
            if ( xmlStrEqual(lastChild->name, BAD_CAST allowPCData[i]) ) {
                return(0);
            }
        }
    }
    return(1);

I have no idea where this comes from, or what the theoretical or practical
behaviour should be. It seems that the theoretical one should be just to check
for ctxt->name ("p" in this case) in allowPCData and, if it is there,
return 0. But since I don't know the origin of this heuristic, I don't want to
change it unilaterally; I guess that should be discussed on the mailing list.

Daniel
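
To make the difference between the two rules concrete, here is a minimal, self-contained sketch. It is not libxml2 code: the Node struct, the function names and the short allow_pcdata list are hypothetical stand-ins (the real allowPCData array is longer), and the parent is assumed to be an element node. are_blanks_quoted() models only the branch quoted above, while are_blanks_parent_only() models the simpler rule suggested in this comment, which looks only at the name of the enclosing element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node model; this is NOT libxml2's xmlNode. */
typedef enum { NODE_ELEMENT, NODE_TEXT, NODE_COMMENT } NodeType;

typedef struct Node {
    NodeType type;
    const char *name;     /* element name, NULL for text and comments */
    struct Node *prev;    /* previous sibling, NULL if none */
} Node;

/* Illustrative subset only (assumption); the real array has more entries. */
static const char *allow_pcdata[] = { "a", "b", "i", "p", "span" };

static int allows_pcdata(const char *name) {
    size_t i;
    if (name == NULL) return 0;
    for (i = 0; i < sizeof(allow_pcdata) / sizeof(allow_pcdata[0]); i++) {
        if (strcmp(name, allow_pcdata[i]) == 0) return 1;
    }
    return 0;
}

/* Model of the quoted branch: 1 = drop the blank text, 0 = keep it. */
static int are_blanks_quoted(const char *parent_name, Node *last_child) {
    /* Skip trailing comments, as the quoted loop does. */
    while (last_child != NULL && last_child->type == NODE_COMMENT)
        last_child = last_child->prev;
    if (last_child == NULL)                 /* blank is the first child */
        return allows_pcdata(parent_name) ? 0 : 1;
    if (last_child->type == NODE_TEXT)      /* blank follows text */
        return 0;
    /* Blank follows an element: the decision depends on the sibling. */
    return allows_pcdata(last_child->name) ? 0 : 1;
}

/* Model of the suggested rule: only the enclosing element matters. */
static int are_blanks_parent_only(const char *parent_name) {
    return allows_pcdata(parent_name) ? 0 : 1;
}

/* Usage: the <img> and <span> cases from the description. */
void check_examples(void) {
    Node img  = { NODE_ELEMENT, "img",  NULL };
    Node span = { NODE_ELEMENT, "span", NULL };

    /* <p><img> <img></p>: "img" is not in the list, so the blank is dropped. */
    assert(are_blanks_quoted("p", &img) == 1);
    /* <p><span> <span></p>: "span" is in the list, so the blank is kept. */
    assert(are_blanks_quoted("p", &span) == 0);
    /* Parent-only rule: inside <p> the blank is kept in both cases. */
    assert(are_blanks_parent_only("p") == 0);
}
```

Under these assumptions, the quoted branch decides based on the preceding sibling, which would explain why the space survives after <span> but not after <img>. The parent-only rule gives the same answer for both because it never looks at the sibling.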
Comment 2 Alejandro Lapeyre 2008-07-01 16:15:32 UTC
<body><a>a</a> <b>b</b></body>
results in (wrong):
<body><a>a</a><b>b</b></body>

while,

<p><a>a</a> <b>b</b></p>
results in (ok):
<p><a>a</a> <b>b</b></p>

It looks like an extra check is needed, or "body" should be added to the allowPCData array.

PHP, libxml Version 2.6.26 
Comment 3 Nick Wellnhofer 2017-06-17 10:50:08 UTC

*** This bug has been marked as a duplicate of bug 681822 ***