Bug 166777 – windows linefeeds sometimes ignored when noblanks specified



Summary:	windows linefeeds sometimes ignored when noblanks specified


Status:	VERIFIED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Duplicates:	169838 (view as bug list)
Depends on:
Blocks:

Reported:	2005-02-09 12:34 UTC by Rob Richards
Modified:	2009-08-15 18:40 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
xml file used for xmllint input (41 bytes, text/xml) 2005-02-09 13:18 UTC, Rob Richards

Description Rob Richards 2005-02-09 12:34:19 UTC

With the following input, the CR is not recognized as whitespace. Note that the
end of the <C/> line contains some additional spaces (even a single character
will do) and that the linefeeds are Windows linefeeds (also reproduced under
Mac OS when the file is saved in Windows format).
xmlfile.xml contains:
<a>
    <B>
        <C/> 
</B>
</a>

Running "xmllint --noblanks xmlfile.xml" produces:
<?xml version="1.0"?>
<a><B><C/>
</B></a>

Within the areBlanks function, it hits:
    /*
     * Otherwise, heuristic :-\
     */
    if (RAW != '<') return(0);

The 0xD char is not taken into account in xmlParseCharData. However, I got stuck
trying to find a fix, because if "in" gets incremented for the 0xD, the CR is
output as an entity when blanks are not stripped out.
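
For reference, the XML 1.0 white space production S covers exactly four characters: #x20, #x9, #xD and #xA. The following is a minimal standalone sketch of a blank test that includes 0xD. The function names is_xml_blank_char and is_blank_run are hypothetical and are not taken from libxml2; this is not the parser's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* XML 1.0 production S: space, tab, line feed, carriage return. */
int is_xml_blank_char(unsigned char c)
{
    return c == 0x20 || c == 0x9 || c == 0xA || c == 0xD;
}

/* Hypothetical helper (not a libxml2 function):
 * returns 1 when all len bytes of s are XML white space, else 0. */
int is_blank_run(const char *s, size_t len)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (!is_xml_blank_char((unsigned char)s[i]))
            return 0;
    }
    return 1;
}
```

With this definition, the run " \r\n" that follows <C/> in the test file counts as blank, while a run containing any other character does not.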

Comment 1 Daniel Veillard 2005-02-09 13:04:09 UTC

Please provide the input as an attachment

Daniel

Comment 2 Rob Richards 2005-02-09 13:18:47 UTC

Created attachment 37237
xml file used for xmllint input

Comment 3 Daniel Veillard 2005-02-09 13:35:41 UTC

Okay, I can reproduce this on Linux. The relevant part is:

  http://www.w3.org/TR/REC-xml/#NT-S
"The presence of #xD in the above production is maintained purely for backward
 compatibility with the First Edition. As explained in 2.11 End-of-Line Handling,
 all #xD characters literally present in an XML document are either removed or
 replaced by #xA characters before any other processing is done. The only way to
 get a #xD character to match this production is to use a character reference in
 an entity value literal."

So all \r\n sequences should be replaced by just \n. Maybe the speedup
algorithm confuses the --noblanks handling.

Daniel
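
The end-of-line rule quoted above (section 2.11 of the XML 1.0 recommendation) can be sketched as a standalone in-place pass over a buffer: each \r\n pair and each lone \r becomes a single \n. The name normalize_eol is hypothetical; this only illustrates the rule and is not how libxml2 implements it internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function).
 * Rewrites buf in place so that "\r\n" and a lone "\r" both become "\n".
 * Returns the new length; the buffer is not NUL-terminated by this call. */
size_t normalize_eol(char *buf, size_t len)
{
    size_t r = 0; /* read position */
    size_t w = 0; /* write position, never ahead of r */

    assert(buf != NULL);
    while (r < len) {
        if (buf[r] == '\r') {
            buf[w++] = '\n';
            r++;
            /* Swallow the LF of a CR LF pair. */
            if (r < len && buf[r] == '\n')
                r++;
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

After such a pass, no literal 0xD is left in the character data, so later white space checks only ever see \n.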

Comment 4 Rob Richards 2005-02-09 20:13:19 UTC

Did more testing on this. It looks like more than just Windows linefeeds
behave this way.
I tried changing the areBlanks function to test if (RAW != '<' && RAW != 0xD)
return(0); Things seemed to work okay after this, so I tried another test where,
instead of spaces after <C/>, I placed a single character. This failed with
the following output (note that all indenting is really tabs, not spaces):
<?xml version="1.0"?>
<a><b><C/>a
        </b></a>

It fails due to the xmlNodeIsText(lastChild) test in areBlanks. The lastChild is
the node with the contents "a". However, this is reproducible whether using
Windows or Unix linefeeds, as I had the same results on Linux. And no, I wasn't
using the file from the Windows machine :)

Comment 5 Daniel Veillard 2005-02-09 23:30:04 UTC

That is normal and expected. If there is a non-blank
character in the text node, then the content model is mixed and
no node should be dropped at all, since we are sure they are
significant in an XML sense.

Daniel
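
A rough sketch of the rule described above: once any text run among an element's children contains a non-blank character, the content is treated as mixed and nothing is stripped. The names text_is_blank and content_is_mixed are hypothetical, and the array of strings is a stand-in for the parser's real node tree.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the NUL-terminated string holds only XML white space. */
int text_is_blank(const char *text)
{
    assert(text != NULL);
    for (; *text != '\0'; text++) {
        if (*text != ' ' && *text != '\t' && *text != '\n' && *text != '\r')
            return 0;
    }
    return 1;
}

/* Hypothetical helper (not a libxml2 function): the content counts as
 * mixed as soon as one text run contains a non-blank character. */
int content_is_mixed(const char *const *runs, size_t count)
{
    size_t i;

    assert(runs != NULL);
    for (i = 0; i < count; i++) {
        if (!text_is_blank(runs[i]))
            return 1;
    }
    return 0;
}
```

In the second test case from Comment 4, the text run after <C/> starts with "a", so the content is mixed and the following white space is kept.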

Comment 6 Rob Richards 2005-02-10 00:14:48 UTC

Uhm, yeah, forget I said that :)
Is that change to areBlanks reasonable, then? After the areBlanks test in
xmlParseCharData, ctxt->input->cur would be positioned at the \r, and the code
around line 3294 would then skip over the \r:
	    if (*in == 0xD) {
		in++;
		if (*in == 0xA) {
		    ctxt->input->cur = in;
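
The change proposed in Comment 4 can be illustrated in isolation. The sketch below models only the last step of the heuristic: given the next raw character, decide whether the preceding run may still be treated as blank. The function heuristic_allows_blank and its accept_cr flag are hypothetical, and this is not necessarily the fix that was eventually committed.

```c
#include <assert.h>

/* Hypothetical stand-in for the final heuristic step (not libxml2 code).
 * Returns 1 if the run before next_raw may still be treated as blank.
 * accept_cr = 0 models the original test:  RAW != '<'  -> not blank.
 * accept_cr = 1 models the proposed test:  RAW != '<' && RAW != 0xD. */
int heuristic_allows_blank(unsigned char next_raw, int accept_cr)
{
    assert(accept_cr == 0 || accept_cr == 1);
    if (next_raw == '<')
        return 1;
    if (accept_cr && next_raw == 0xD)
        return 1;
    return 0;
}
```

With accept_cr set to 0, a run that stops at a 0xD is reported as not blank, which matches the behaviour in the report; with accept_cr set to 1, it passes.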

Comment 7 Daniel Veillard 2005-02-10 09:53:56 UTC

Sounds reasonable, but I need to think more about it. That code is core
and extremely sensitive, so I really need to check carefully before
making such a change.

Daniel

Comment 8 Daniel Veillard 2005-07-06 15:19:45 UTC

I found the problem: in areBlanks(), a heuristic failed in that specific case.
It's fixed in CVS, and I added the test file to the regression suite.

  thanks for the report !

Daniel

Comment 9 Daniel Veillard 2005-07-06 15:20:13 UTC

*** Bug 169838 has been marked as a duplicate of this bug. ***

Comment 10 Daniel Veillard 2005-09-05 08:59:23 UTC

This should be closed by the release of libxml2-2.6.21,

  thanks,

Daniel