GNOME Bugzilla – Bug 728997
Blank nodes are sometimes removed when parsing HTML
Last modified: 2017-06-17 10:51:55 UTC
Created attachment 275188
Test HTML file + program

libxml2 sometimes removes blank nodes (i.e. text nodes containing only whitespace characters) when parsing HTML. This can be a problem because blank nodes sometimes affect the rendered output (e.g. they horizontally separate two inline elements in a web browser).

The attached archive demonstrates this. It contains:
- A test HTML file (valid HTML 5 according to the W3C validator)
- A C program that parses this file and dumps the elements and text nodes as they were parsed (all elements are closed explicitly, e.g. <meta> -> <meta></meta>)
- A Makefile to compile it

In the HTML file, note the blank node (a newline followed by tab characters) between the two <input> elements. The output of the program (compiled against libxml2-2.9.1, the latest stable version at the time of this submission) is:

$ ./main
<html><head><meta></meta><title>Test case</title></head><body> <form> <input></input><input></input></form> </body></html>$

As you can see, libxml2 has removed the blank node between the two <input> elements. A web browser rendering this output would place the two fields directly next to each other, whereas they are horizontally separated when rendered from the original file.
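For readers who want to check which text nodes count as "blank" in the sense used in this report, here is a minimal sketch in plain C. It is not the attached test program: is_blank_text is a hypothetical helper name, and treating space, tab, LF and CR as the blank characters is an assumption chosen to match the usual XML whitespace set.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libxml2 or of the attachment).
 * Returns 1 if s consists only of space, tab, LF or CR characters
 * (an empty string counts as blank), and 0 otherwise or if s is NULL. */
int is_blank_text(const char *s)
{
    if (s == NULL)
        return 0;
    for (; *s != '\0'; s++) {
        if (*s != ' ' && *s != '\t' && *s != '\n' && *s != '\r')
            return 0;   /* found a non-whitespace character */
    }
    return 1;
}
```

When walking a parsed tree, a check like this could be applied to (const char *)node->content for nodes whose type is XML_TEXT_NODE, to see which blank nodes survived parsing. libxml2 also provides xmlIsBlankNode() for tree nodes, which may be preferable in real code.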
*** This bug has been marked as a duplicate of bug 681822 ***