Bug 728997 – Blank nodes are sometimes removed when parsing HTML


Summary:	Blank nodes are sometimes removed when parsing HTML


Status:	RESOLVED DUPLICATE of bug 681822

Product:	libxml2
Classification:	Platform
Component:	htmlparser
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-04-26 04:53 UTC by Krizalys
Modified:	2017-06-17 10:51 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Test HTML file + program (837 bytes, application/gzip) 2014-04-26 04:53 UTC, Krizalys

Description Krizalys 2014-04-26 04:53:58 UTC

Created attachment 275188
Test HTML file + program

libxml2 sometimes removes blank nodes (i.e. text nodes that contain only whitespace characters) when parsing HTML.
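
To make "blank" concrete, here is a minimal sketch of the check, assuming a blank text is one made up solely of space, tab, line feed and carriage return characters. The name is_blank_text is hypothetical and this is not libxml2 code; as far as I know, libxml2's own xmlIsBlankNode() tests text nodes against the same four characters.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2.
 * Returns 1 if `text` is non-NULL and contains only space, tab,
 * line feed or carriage return characters (an empty string counts
 * as blank here); returns 0 otherwise. */
static int is_blank_text(const char *text)
{
    if (text == NULL)
        return 0;
    for (; *text != '\0'; text++) {
        if (*text != ' ' && *text != '\t' && *text != '\n' && *text != '\r')
            return 0;
    }
    return 1;
}
```

With this definition, the content "\n\t\t" (a newline followed by two tabs) is blank, while " x " is not.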

This can be a problem because blank nodes sometimes affect the rendered output (e.g. whitespace horizontally separates two inline elements in a web browser).

Attached is an example to demonstrate this. The archive contains:
- A test HTML file (valid HTML 5 according to the W3C validator)
- A C program that parses this file and dumps the elements and text nodes as they were parsed (every element is printed with an explicit closing tag, e.g. <meta> -> <meta></meta>).
- A Makefile to compile it

In the HTML file, note the blank node (a newline followed by tab characters) between the two <input> elements.

The output of the program (compiled against libxml2-2.9.1, the latest stable version at the time of this submission) is:

$ ./main
<html><head><meta></meta><title>Test case</title></head><body>
		
	<form>
			<input></input><input></input></form>
	</body></html>$

As you can see, libxml2 has removed the blank node between the two <input> elements. A web browser rendering this output would place the two fields directly against each other, whereas they are horizontally separated when rendered from the original file.

Comment 1 Nick Wellnhofer 2017-06-17 10:51:55 UTC


*** This bug has been marked as a duplicate of bug 681822 ***