GNOME Bugzilla – Bug 109564
XML attribute normalization not done for SAX
Last modified: 2009-08-15 18:40:50 UTC
This is with 2.5.4 and with libxml2 in CVS. According to http://www.w3.org/TR/REC-xml#AVNormalize, all attribute values must be normalized before they are returned to the application. As far as I can tell, this isn't done for the SAX API, at least for the values passed to the startElement callback. To test, build testSAX and run:
    ./testSAX test/c14n/with-comments/example-4.xml
and you can see that you get:
    SAX.startElement(norm, attr=' ' ' ')
    SAX.endElement(norm)
where the spaces and newlines after the ' aren't normalized.
(Aside: these C14N tests don't match the ones in the C14N REC; they look older.)
Later ... this can't be fixed at the application level since the type of the attribute (CDATA, ID, ...) is unknown once it passes the SAX API. Since only certain types get this normalization, it can't be fixed above the library level. If this helps to encourage you: expat gets it right :)
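For reference, here is a minimal sketch of the rules in the REC section linked above, showing why the attribute type matters. It is not libxml2 or expat code: normalize_attr and predefined_entity are hypothetical names, only the five predefined entities and ASCII character references are handled, and validation is minimal. The is_cdata flag stands in for the declared attribute type, which the parser knows from the DTD.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: resolves only the five predefined entities. */
static int predefined_entity(const char *name, size_t len) {
    if (len == 2 && !strncmp(name, "lt", 2)) return '<';
    if (len == 2 && !strncmp(name, "gt", 2)) return '>';
    if (len == 3 && !strncmp(name, "amp", 3)) return '&';
    if (len == 4 && !strncmp(name, "apos", 4)) return '\'';
    if (len == 4 && !strncmp(name, "quot", 4)) return '"';
    return -1;
}

/* Sketch of XML 1.0 attribute-value normalization on a raw literal.
 * is_cdata != 0: CDATA attribute; is_cdata == 0: ID, NMTOKENS, etc.
 * Returns a malloc'd string, or NULL on input this sketch cannot handle. */
char *normalize_attr(const char *raw, int is_cdata) {
    size_t n = strlen(raw), o = 0;
    char *out = malloc(n + 1);          /* output is never longer than input */
    if (!out) return NULL;

    for (size_t i = 0; i < n; ) {
        char c = raw[i];
        if (c == '&') {
            const char *body = raw + i + 1;
            const char *semi = strchr(body, ';');
            long cp = -1;
            if (!semi) { free(out); return NULL; }
            size_t len = (size_t)(semi - body);
            if (len > 1 && body[0] == '#') {
                char *end;
                cp = (body[1] == 'x') ? strtol(body + 2, &end, 16)
                                      : strtol(body + 1, &end, 10);
                if (end != semi) cp = -1;
            } else {
                cp = predefined_entity(body, len);
            }
            if (cp <= 0 || cp > 127) { free(out); return NULL; }
            out[o++] = (char)cp;        /* referenced char is appended as-is */
            i = (size_t)(semi - raw) + 1;
        } else {
            /* A literal CRLF counts as a single line break. */
            if (c == '\r' && raw[i + 1] == '\n') i++;
            /* Literal whitespace characters become one space each. */
            out[o++] = (c == ' ' || c == '\t' || c == '\n' || c == '\r') ? ' ' : c;
            i++;
        }
    }
    out[o] = '\0';

    if (!is_cdata) {
        /* Non-CDATA only: trim and collapse runs of #x20 (not \r, \n, \t). */
        size_t r = 0, w = 0;
        while (out[r] == ' ') r++;
        while (out[r]) {
            if (out[r] == ' ') {
                while (out[r] == ' ') r++;
                if (out[r]) out[w++] = ' ';
            } else {
                out[w++] = out[r++];
            }
        }
        out[w] = '\0';
    }
    return out;
}
```

Assuming the attribute literal is written as " &apos;   &#x20;&#13;&#xa;&#9;   &apos; " (my reading of the norm and normId attributes in the test file), calling normalize_attr with is_cdata set to 1 keeps all the spaces around the \r\n\t produced by the character references, while is_cdata set to 0 trims the ends and collapses each run of spaces to a single one. The \r, \n and \t survive in both cases because they came from character references.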
Right, the SAX pseudo API makes a horrible mess because it doesn't allow preserving entity references in attributes, which is something I wanted to do for libxml, which is an editing toolkit. SAX doesn't exist as a reliable C API. It's provided for "compatibility only" within libxml2 because there is NO SAX API for C, so honestly I have no interest in fixing problems at that level. The XMLReader streaming interface will get those right. You can also try to enable entity substitution to get behaviour similar to expat's, but this also means the parser will fetch the external subset. The fact that "expat gets it right" also means that no toolkit based on expat can save back entities from attribute values, and honestly I don't consider this a feature. For me SAX is a horribly broken API; I don't claim full conformance to it because it's not possible. Daniel
Okay, looking at it anyway. The output from testSAX can't be trusted as is. Running
    xmllint --noent test/c14n/with-comments/example-4.xml
under GDB, the following is received:
    Breakpoint 1, startElement (ctx=0x8120628, fullname=0x8130420 "norm", atts=0x8130478) at SAX.c:1255
    1255 xmlParserCtxtPtr ctxt = (xmlParserCtxtPtr) ctx;
    (gdb) p atts[0]
    $1 = (const xmlChar *) 0x81306d0 "attr"
    (gdb) p atts[1]
    $2 = (const xmlChar *) 0x8130700 " ' \r\n\t ' "
    (gdb)
\r\n\t must not be "normalized" because they appeared as character references in the serialization, which is done explicitly to bypass that layer:
    <norm attr=' &apos;   &#x20;&#13;&#xa;&#9;   &apos; '/>
So what do you mean by "get it right"? Daniel
Okay, this should be fixed in CVS as I'm migrating to SAX2: the attribute type is now looked at, and normalization is done, before the callback. Aleksey, I'm Cc'ing you on this bug report because the change affects two tests:
    test/c14n/with-comments/example-4.xml
    test/c14n/without-comments/example-4.xml
It fixes a normalization problem with
    <normId id=' &apos;   &#x20;&#13;&#xa;&#9;   &apos; '/>
The value was wrongly normalized as "' \r\n\t '" instead of "' \r\n\t '", i.e. it was removing the space induced by &#x20; and that's just wrong. So the result of the C14N tests is now slightly different for those two tests. All this should be fixed in CVS now. Daniel
Thanks, Daniel! This sounds good to me. Dave, you said that the C14N tests in LibXML2 are not correct. But the W3C C14N/ExcC14N interop tests have not changed for quite a long time; I just checked the web page, and the file names seem to be the same as the ones I used. I had to slightly tweak the tests before putting them in LibXML2 because the original tests used signatures a lot. But the original signature tests are part of the xmlsec package anyway. Can you explain what you meant?
This should be fixed in release libxml2-2.6.0, thanks, Daniel