GNOME Bugzilla – Bug 166777
windows linefeeds sometimes ignored when noblanks specified
Last modified: 2009-08-15 18:40:50 UTC
Using the following file, the CR is not recognized as whitespace. Note that the end of the <C/> line contains some additional spaces (or even a character will do), and that the linefeeds are Windows linefeeds (also reproduced under Mac OS when the file is saved in Windows format).

xmlfile.xml contains:

<a>
  <B>
    <C/>
  </B>
</a>

Running:

xmllint --noblanks xmlfile.xml

produces:

<?xml version="1.0"?>
<a><B><C/> </B></a>

Within the areBlanks function, it hits:

/*
 * Otherwise, heuristic :-\
 */
if (RAW != '<') return(0);

The 0xD char is not taken into account in xmlParseCharData. However, I got stuck trying to find a fix, because if "in" gets incremented for the 0xD, it outputs the CR as an entity when blanks are not stripped out.
Please provide the input as an attachment.

Daniel
Created attachment 37237: XML file used for xmllint input
Okay, I can reproduce this on Linux. The relevant part is:

http://www.w3.org/TR/REC-xml/#NT-S

"The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal."

So all \r\n sequences should be replaced by just \n. Maybe the speedup algorithm confuses the --noblanks handling.

Daniel
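The end-of-line rule quoted above can be sketched in a few lines of C. This is only an illustration of section 2.11 of the XML specification, not the code libxml2 uses; the function name normalize_eol is made up for this example.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml2): normalize line endings
 * in place as described in XML 1.0 section 2.11.
 *   "\r\n" becomes "\n"
 *   a lone "\r" becomes "\n"
 * Returns the new length of the buffer contents.
 */
size_t normalize_eol(char *buf, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        if (buf[in] == '\r') {
            buf[out++] = '\n';
            in++;
            /* Swallow the LF of a CR LF pair. */
            if (in < len && buf[in] == '\n')
                in++;
        } else {
            buf[out++] = buf[in++];
        }
    }
    return out;
}
```

With input normalized this way, a blank-stripping heuristic that runs afterwards never sees a literal 0xD, which is the behaviour the specification text implies.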
Did more testing on this. The issue looks like it's more than just Windows linefeeds that behave this way. I tried changing the areBlanks function to test:

if (RAW != '<' && RAW != 0xD) return(0);

Things seemed to work okay after this, so I tried another test where, instead of spaces after <C/>, I placed a single character. This failed, with the output being (note that all indentation is really tabs, not spaces):

<?xml version="1.0"?>
<a><b><C/>a </b></a>

It fails due to the xmlNodeIsText(lastChild) test in areBlanks. The lastChild is the node with the contents "a". This, however, is reproducible whether using Windows or Unix linefeeds, as I had the same results on Linux. And no, I wasn't using the file from the Windows machine :)
That is normal and expected. If there is a non-blank character in the text node, then the content model is mixed and no node should be dropped at all, since we are sure they are significant in an XML sense.

Daniel
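As a rough illustration of that rule, the check below treats a text run as droppable only if every character matches the XML whitespace production S (0x20, 0x9, 0xA, 0xD). It is a minimal sketch with made-up names (is_xml_blank, text_is_all_blank), and it leaves out the other heuristics that areBlanks applies.

```c
#include <stddef.h>

/* XML production S: space, tab, line feed, carriage return. */
static int is_xml_blank(unsigned char c)
{
    return c == 0x20 || c == 0x9 || c == 0xA || c == 0xD;
}

/*
 * Hypothetical helper (not part of libxml2): returns 1 if the text
 * consists only of whitespace, 0 as soon as one non-blank character
 * is found. A single non-blank character means the content is mixed
 * and the whole text node has to be kept.
 */
int text_is_all_blank(const char *text, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!is_xml_blank((unsigned char)text[i]))
            return 0;
    }
    return 1;
}
```

In the failing test from the previous comment, the text after <C/> starts with "a", so a check like this returns 0 and the node is kept, whichever linefeed style is used.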
Uhm, yeah, forget I said that :) Is that change to areBlanks reasonable, then? Within xmlParseCharData, after the areBlanks test, ctxt->input->cur would be positioned at the \r, and around line 3294 it would then skip over the \r:

if (*in == 0xD) {
    in++;
    if (*in == 0xA) {
        ctxt->input->cur = in;
Sounds reasonable, but I need to think more about it. That code is core and extremely sensitive, so I really need to check carefully before making such a change.

Daniel
I found the problem in areBlanks(): a heuristic failed in that specific case. It's fixed in CVS, and I added the test file to the regression suite. Thanks for the report!

Daniel
*** Bug 169838 has been marked as a duplicate of this bug. ***
This should be closed by the release of libxml2-2.6.21. Thanks,

Daniel