Bug 728997 - Blank nodes are sometimes removed when parsing HTML
Status: RESOLVED DUPLICATE of bug 681822
Product: libxml2
Classification: Platform
Component: htmlparser
Version: git master
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2014-04-26 04:53 UTC by Krizalys
Modified: 2017-06-17 10:51 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Test HTML file + program (837 bytes, application/gzip)
2014-04-26 04:53 UTC, Krizalys

Description Krizalys 2014-04-26 04:53:58 UTC
Created attachment 275188
Test HTML file + program

libxml2 sometimes removes blank nodes (i.e. text nodes containing only whitespace characters) when parsing HTML.
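
To make the term "blank node" concrete, here is a minimal standalone sketch of the check, assuming that "blank" means the text consists only of space, tab, line feed, or carriage return characters. The helper name is_blank_text is hypothetical and is not taken from the attached program; libxml2 offers xmlIsBlankNode() for a similar test on parsed nodes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper for illustration only (not part of libxml2).
 * Returns true when `text` contains nothing but space, tab, line feed,
 * or carriage return characters. An empty string counts as blank;
 * a NULL pointer does not. */
bool is_blank_text(const char *text)
{
    if (text == NULL)
        return false;

    for (const char *p = text; *p != '\0'; ++p) {
        if (*p != ' ' && *p != '\t' && *p != '\n' && *p != '\r')
            return false;   /* found a non-whitespace character */
    }
    return true;            /* every character was whitespace */
}
```

Under this definition, a text node whose content is a newline followed by tabs is blank, while a node containing " a " is not.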

This can be a problem because blank nodes sometimes affect the rendered output (e.g. whitespace horizontally separates two inline elements in a web browser).

Attached is an example to demonstrate this. The archive contains:
- A test HTML file (valid HTML 5 according to the W3C validator)
- A C program that parses this file and dumps the elements and text nodes as they were parsed (every element is printed with a closing tag, e.g. <meta> -> <meta></meta>).
- A Makefile to compile it

In the HTML file, note the blank node (a newline followed by tab characters) between the two <input> elements.

The output of the program (compiled against libxml2-2.9.1, the latest stable version at the time of this submission) is:

$ ./main
<html><head><meta></meta><title>Test case</title></head><body>
		
	<form>
			<input></input><input></input></form>
	</body></html>$

As you can see, libxml2 has removed the blank node between the two <input> elements. A web browser would now render the two fields stuck together, whereas they are horizontally separated when rendered from the original file.
Comment 1 Nick Wellnhofer 2017-06-17 10:51:55 UTC

*** This bug has been marked as a duplicate of bug 681822 ***