GNOME Bugzilla – Bug 709171
Streaming API schema validation behaves differently (for xs:anyURI)
Last modified: 2016-05-26 11:54:54 UTC
Steps to reproduce: test.xml: <test url="http://example.com/?a=b&c=d&e=f"/> test.xsd: <?xml version="1.0"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="test"> <xs:complexType> <xs:attribute name="url" type="xs:anyURI" use="required"> </xs:attribute> </xs:complexType> </xs:element> </xs:schema> $ xmllint --schema test.xsd test.xml --noout test.xml validates $ xmllint --schema test.xsd test.xml --noout --stream test.xml:1: Schemas validity error : Element 'test', attribute 'url': 'http://example.com/?a=b&c=d&e=f' is not a valid value of the atomic type 'xs:anyURI'. test.xml fails to validate Actual results: xmllint reports different validation errors, depending on whether the streaming API is used Expected result: The file should successfully validate with and without the streaming API, as the URI is acceptable.
I am having the same problem when using --stream. It seems to manifest itself when there are 3 or more name=value pairs I just tried compiling 2.9.2 and it is still a problem. Are there any plans to address this?
Somehow, the ampersands are being reescaped as & when --stream is specified. This does not happen without --stream. The validator then complains when it sees multiple # characters in the URI.
Created attachment 296815 [details] [review] [PATCH] xmllint: Fix validation of URIs with ampersands This patch fixes the problem. I couldn't figure out how to include the test case in the test suite, sorry. If it's important, someone else will have to do it.
Ran a test with the offending xml file and it worked. Thank you.
Thanks, the patch works for me too.
Created attachment 314864 [details] [review] [PATCH] Fix XSD validation of URIs with ampersands This patch is my second attempt at fixing the bug. It should work whether you invoke the schema validator from xmllint or by linking to libxml2 directly.
I am seeing the issue specified in this bug report when using libxml2 directly with the SAX2 parser and schema validation. After applying your second patch to Yocto's libxml2_2.9.2 recipe, the anyURI schema validation is working as expected, but we are seeing a sporadic SIGSEGV out of libxml2. The error out of libxml2 is from parser.c xmlStringLenDecodeEntities(...): XML_ERR_ENTITY_LOOP - "Detected an entity reference loop" In gdb I am seeing that ctxt->depth is often 48 in the case of failure, but I saw it once with the value 134516574. This smacks of stack corruption to me, but I am not sure if there is a more simple explanation. For context, we are building an embedded linux image using the yocto project. We are compiling with gcc 4.9.3, and are targeting libxml2_2.9.2 (with all the patches that poky's fido branch). Also, I saw this fail in two of our internal unrelated recipes which each use libxml2.
Created attachment 318307 [details] [review] [PATCH v2] Fix XSD validation of URIs with ampersands Good catch, Nathan! Valgrind revealed that I was using the wrong context pointer. A corrected version of the patch is attached.
Wow, thanks for the quick response! I just tested the new patch, and everything seems to be fine now. I am not sure if this is the right place for comments on the patch, but the only issue I see is this comment: ", changing & to a literal ampersand." This seems too restrictive to me; that is to say, even though the reason for the patch is the issue with ampersands, your patch fixes more cases than just that by decoding all XML entities. For example, before your change the following URLs would also fail to validate: "https://foo.org?q=X#fragment", or "https://foo.org#fragmentX" but now each validate properly. Something like "Duplicate the value, decoding any XML entities." rings a bit more accurate to me. Again, thanks a lot for fixing this issue.
(In reply to Nathan Turner from comment #9) > For example, before your change the following URLs would also fail to > validate: "https://foo.org?q=X#fragment", or > "https://foo.org#fragmentX" but now each validate properly. These both validate fine for me in the latest stable version (2.9.3). The streaming parser already decodes all entities, but then it reencodes ampersands as "&". So by the time an attribute string is passed to xmlSchemaSAXHandleStartElementNs, the only entity that it can have is "&". Your examples become "https://foo.org?q=X#fragment" and "https://foo.org#fragmentX". See https://git.gnome.org/browse/libxml2/tree/parser.c?id=6657afe83a38278f124ace71dc85f60420beb2d5#n3980
You are correct, sorry about that; this was a mistake on my part. Turns out my unit-tests were failing a local constraint on the scheme, which requires that urls are HTTPS only, and that the character references I added were immaterial to the failures to validate. My unit-tests had those two examples as "http://" and I transcribed them improperly when writing my post here as starting with "https://". When I changed them to "https://" for our unit-tests, they indeed validated against our schema with libxml2_2.9.2. Sorry for the confusion, and thanks for the education!
I am not sure the patch is right. My recollection is there is a specific processing for & &x38; at the SAX interface level thinkpad2:~/XML -> cat tst.xml <?xml version="1.0" encoding="utf-8"?> <root> <test url="http://example.com/?<"/> <test url="http://example.com/?a=b&c=d"/> </root> thinkpad2:~/XML -> ./testSAX tst.xml SAX.setDocumentLocator() SAX.startDocument() SAX.startElement(root) SAX.characters( , 3) SAX.startElement(test, url='http://example.com/?<') SAX.endElement(test) SAX.characters( , 3) SAX.startElement(test, url='http://example.com/?a=b&c=d') SAX.endElement(test) SAX.characters( , 1) SAX.endElement(root) SAX.endDocument() thinkpad2:~/XML -> only the '&' character is converted back to a charref before calling SAX I don;t remember why that was needed and that's what you hit when validating with SAX streaming I think. Now the issue is that if you call xmlStringLenDecodeEntities directly this may have side effects it might be safer to just replace the sequence '&' by '&' in the copy At least that doesn't open the pandora box of generic entities replacement this should be way safer.... need to think a little bit about this Daniel
Hey Daniel, I thought about that same question and concluded that calling xmlStringLenDecodeEntities is not a problem because the only entity that can be in the URL when xmlStringLenDecodeEntities is called is "&". For example, if we have <test url="http://example.com/?&lt;"/> it is converted to <test url="http://example.com/?<"/> and then to <test url="http://example.com/?&lt;"/> In other words, you need an ampersand in order to have a character entity, but all of the ampersands are themselves escaped as "&", so the only entities that xmlStringLenDecodeEntities decodes are the "&" entities. Nothing can be decoded to "<" by accident.
Yes I have been thinking about it since I commented here and came to the same conclusion. I think we need to amend the patch to give the explanation and push. BTW the reason that libxml2 differs from normal SAX here and doesn't deliver the raw converted string is that libxml2 was designed as an editing toolkit and SAX was added later not as a primary API. SAX API does not allow to pass entities so will loose the editing capability for attribute content, and doing that escaping and tweaking of the API was the only way to keep editing capacity for those on top of SAX. That escaping should not happen if entity substitution is done --noent and if doing schemas validation we should really ask for entity substitution. But now is too late to change the parser behaviour if that's not the case we can only work around bugs that may arise like this one. thanks Daniel
Created attachment 327193 [details] [review] [PATCH v3] Fix XSD validation of URIs with ampersands Thanks for the explanation. I have expanded the comment to better explain the reason for decoding entities here.
Excellent, so pushed and commited as https://git.gnome.org/browse/libxml2/commit/?id=f6599c5164800e9b6fe8a85e6e2bc54fdf2f952a thanks a lot ! Daniel
BTW Nathan any way you could try with what is in libxml2 git head to make sure we didn't introduced regressions ? thanks ! Daniel
I believe this introduced regression(s); see https://bugzilla.gnome.org/show_bug.cgi?id=766834#c5