Bug 319716 – HTMLparser bug with space around <img> tags




Summary:	HTMLparser bug with space around <img> tags


Status:	RESOLVED DUPLICATE of bug 681822

Product:	libxml2
Classification:	Platform
Component:	htmlparser
Version:	2.6.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2005-10-25 09:57 UTC by Michael Day
Modified:	2017-06-17 10:50 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Michael Day 2005-10-25 09:57:52 UTC

HTMLparser removes space between elements in some situations like this:

    <p><img src="foo"> <img src="bar"></p>

so that the output will be like this:

    <p><img src="foo"><img src="bar"></p>

However, this does not seem to be correct; at least it is not the way that
web browsers parse this kind of HTML. (If <span> elements are used instead
of <img> elements then HTMLparser correctly preserves the space).

Comment 1 Daniel Veillard 2006-10-17 19:19:55 UTC

There is a heuristic in areBlanks(), which is called after a blank string
has been parsed to decide whether it should be ignored or not:

    lastChild = xmlGetLastChild(ctxt->node);
    while ((lastChild) && (lastChild->type == XML_COMMENT_NODE))
        lastChild = lastChild->prev;
    if (lastChild == NULL) {
        if ((ctxt->node->type != XML_ELEMENT_NODE) &&
            (ctxt->node->content != NULL)) return(0);
        /* keep ws in constructs like ...<b> </b>...
           for all tags "b" allowing PCDATA */
        for ( i = 0; i < sizeof(allowPCData)/sizeof(allowPCData[0]); i++ ) {
            if ( xmlStrEqual(ctxt->name, BAD_CAST allowPCData[i]) ) {
                return(0);
            }
        }
    } else if (xmlNodeIsText(lastChild)) {
        return(0);
    } else {
        /* keep ws in constructs like <p><b>xy</b> <i>z</i><p>
           for all tags "p" allowing PCDATA */
        for ( i = 0; i < sizeof(allowPCData)/sizeof(allowPCData[0]); i++ ) {
            if ( xmlStrEqual(lastChild->name, BAD_CAST allowPCData[i]) ) {
                return(0);
            }
        }
    }
    return(1);

  I have no idea where this comes from, or what the theoretical or practical
behaviour should be. It seems that the theoretical one should be just to check
for ctxt->name ("p" in this case) in allowPCData, and if it is there, to
return 0. But since I have no idea where this comes from, I don't want to change
it unilaterally; that should be discussed on the mailing list, I guess.

Daniel
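
To make the branches of the quoted heuristic easier to follow, here is a minimal standalone model of it. It is only a sketch: the names blank_is_dropped, in_allow_list and allow_pcdata_model are hypothetical, the table is a small illustrative subset rather than the real allowPCData array, and the check on non-element parent nodes from the excerpt is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset standing in for allowPCData; the real array in
   HTMLparser.c is longer. "img" is deliberately absent, which matches
   the behaviour described in this report. */
static const char *const allow_pcdata_model[] = { "a", "b", "i", "span" };

/* Returns 1 if name appears in the model table, 0 otherwise. */
static int in_allow_list(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(allow_pcdata_model) / sizeof(allow_pcdata_model[0]); i++) {
        if (strcmp(name, allow_pcdata_model[i]) == 0)
            return 1;
    }
    return 0;
}

/* Models the quoted branches: returns 1 if a whitespace-only run would be
   dropped, 0 if it would be kept.
   parent:             name of the open element (ctxt->name in the excerpt)
   last_child:         name of the last non-comment child, or NULL if none
   last_child_is_text: non-zero when that child is a text node */
int blank_is_dropped(const char *parent, const char *last_child,
                     int last_child_is_text)
{
    assert(parent != NULL);

    if (last_child == NULL)            /* e.g. ...<b> </b>... */
        return in_allow_list(parent) ? 0 : 1;
    if (last_child_is_text)            /* whitespace following text is kept */
        return 0;
    /* e.g. <p><b>xy</b> <i>z</i></p>: decided by the previous sibling */
    return in_allow_list(last_child) ? 0 : 1;
}
```

Under these assumptions, blank_is_dropped("p", "img", 0) yields 1 while blank_is_dropped("p", "span", 0) yields 0: the outcome depends on the previous sibling rather than on the parent, which is the asymmetry reported in the description.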

Comment 2 Alejandro Lapeyre 2008-07-01 16:15:32 UTC

<body><a>a</a> <b>b</b></body>
results in (wrong):
<body><a>a</a><b>b</b></body>

while,

<p><a>a</a> <b>b</b></p>
results in (ok):
<p><a>a</a> <b>b</b></p>

It looks like an extra check is needed, or "body" should be added to the allowPCData array.

PHP, libxml Version 2.6.26
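
For comparison, the alternative floated in Comment 1 (decide from the parent element only) combined with the suggestion above (treat "body" as allowing character data) could look roughly like the sketch below. This is not the fix that was eventually applied; keep_blank_in and pcdata_parents are hypothetical names, and the table is an illustrative subset, not the real allowPCData array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of elements whose content may hold character data.
   "body" is included as suggested in Comment 2. */
static const char *const pcdata_parents[] = { "p", "b", "span", "body" };

/* Returns 1 if whitespace directly inside parent would be kept,
   0 if it would be dropped. Only the parent is consulted; the
   previous sibling plays no role in this variant. */
int keep_blank_in(const char *parent)
{
    size_t i;

    assert(parent != NULL);
    for (i = 0; i < sizeof(pcdata_parents) / sizeof(pcdata_parents[0]); i++) {
        if (strcmp(parent, pcdata_parents[i]) == 0)
            return 1;
    }
    return 0;
}
```

With this variant the space in both <p><img src="foo"> <img src="bar"></p> and <body><a>a</a> <b>b</b></body> would be kept, because "p" and "body" are in the table, whereas whitespace directly inside an element such as "tr" would still be dropped.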

Comment 3 Nick Wellnhofer 2017-06-17 10:50:08 UTC


*** This bug has been marked as a duplicate of bug 681822 ***