After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 709171 - Streaming API schema validation behaves differently (for xs:anyURI)
Streaming API schema validation behaves differently (for xs:anyURI)
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
git master
Other Linux
: Normal normal
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2013-10-01 10:32 UTC by bjwebb67
Modified: 2016-05-26 11:54 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
[PATCH] xmllint: Fix validation of URIs with ampersands (920 bytes, patch)
2015-02-14 05:54 UTC, Alex Henrie
none Details | Review
[PATCH] Fix XSD validation of URIs with ampersands (1.12 KB, patch)
2015-11-05 01:17 UTC, Alex Henrie
none Details | Review
[PATCH v2] Fix XSD validation of URIs with ampersands (1.14 KB, patch)
2016-01-06 07:53 UTC, Alex Henrie
none Details | Review
[PATCH v3] Fix XSD validation of URIs with ampersands (1.61 KB, patch)
2016-05-03 04:58 UTC, Alex Henrie
none Details | Review

Description bjwebb67 2013-10-01 10:32:33 UTC
Steps to reproduce:

test.xml:
<test url="http://example.com/?a=b&amp;c=d&amp;e=f"/>

test.xsd:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="test">
    <xs:complexType>
      <xs:attribute name="url" type="xs:anyURI" use="required">
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>

$ xmllint --schema test.xsd test.xml --noout         
test.xml validates
$ xmllint --schema test.xsd test.xml --noout --stream
test.xml:1: Schemas validity error : Element 'test', attribute 'url': 'http://example.com/?a=b&#38;c=d&#38;e=f' is not a valid value of the atomic type 'xs:anyURI'.
test.xml fails to validate



Actual results:
xmllint reports different validation errors, depending on whether the streaming API is used

Expected result:
The file should successfully validate with and without the streaming API, as the URI is acceptable.
Comment 1 Doug 2015-02-04 16:37:30 UTC
I am having the same problem when using --stream.  It seems to manifest itself when there are 3 or more name=value pairs  I just tried compiling 2.9.2 and it is still a problem.  Are there any plans to address this?
Comment 2 Alex Henrie 2015-02-14 03:46:02 UTC
Somehow, the ampersands are being reescaped as &#38; when --stream is specified. This does not happen without --stream. The validator then complains when it sees multiple # characters in the URI.
Comment 3 Alex Henrie 2015-02-14 05:54:25 UTC
Created attachment 296815 [details] [review]
[PATCH] xmllint: Fix validation of URIs with ampersands

This patch fixes the problem.

I couldn't figure out how to include the test case in the test suite, sorry. If it's important, someone else will have to do it.
Comment 4 Doug 2015-02-17 14:52:01 UTC
Ran a test with the offending xml file and it worked.  
Thank you.
Comment 5 bjwebb67 2015-02-22 22:42:42 UTC
Thanks, the patch works for me too.
Comment 6 Alex Henrie 2015-11-05 01:17:46 UTC
Created attachment 314864 [details] [review]
[PATCH] Fix XSD validation of URIs with ampersands

This patch is my second attempt at fixing the bug. It should work whether you invoke the schema validator from xmllint or by linking to libxml2 directly.
Comment 7 Nathan Turner 2016-01-05 22:44:13 UTC
I am seeing the issue specified in this bug report when using libxml2 directly with the SAX2 parser and schema validation.

After applying your second patch to Yocto's libxml2_2.9.2 recipe, the anyURI schema validation is working as expected, but we are seeing a sporadic SIGSEGV out of libxml2. The error out of libxml2 is from parser.c xmlStringLenDecodeEntities(...): XML_ERR_ENTITY_LOOP - "Detected an entity reference loop"

In gdb I am seeing that ctxt->depth is often 48 in the case of failure, but I saw it once with the value 134516574. This smacks of stack corruption to me, but I am not sure if there is a more simple explanation.

For context, we are building an embedded linux image using the yocto project. We are compiling with gcc 4.9.3, and are targeting libxml2_2.9.2 (with all the patches that poky's fido branch). Also, I saw this fail in two of our internal unrelated recipes which each use libxml2.
Comment 8 Alex Henrie 2016-01-06 07:53:07 UTC
Created attachment 318307 [details] [review]
[PATCH v2] Fix XSD validation of URIs with ampersands

Good catch, Nathan! Valgrind revealed that I was using the wrong context pointer. A corrected version of the patch is attached.
Comment 9 Nathan Turner 2016-01-06 16:29:05 UTC
Wow, thanks for the quick response!

I just tested the new patch, and everything seems to be fine now.

I am not sure if this is the right place for comments on the patch, but the only issue I see is this comment:

", changing &#38; to a literal ampersand."

This seems too restrictive to me; that is to say, even though the reason for the patch is the issue with ampersands, your patch fixes more cases than just that by decoding all XML entities.

For example, before your change the following URLs would also fail to validate: "https://foo.org?q=&#88;#fragment", or "https://foo.org#fragment&#88;" but now each validate properly.

Something like "Duplicate the value, decoding any XML entities." rings a bit more accurate to me.

Again, thanks a lot for fixing this issue.
Comment 10 Alex Henrie 2016-01-06 18:21:43 UTC
(In reply to Nathan Turner from comment #9)
> For example, before your change the following URLs would also fail to
> validate: "https://foo.org?q=&#88;#fragment", or
> "https://foo.org#fragment&#88;" but now each validate properly.

These both validate fine for me in the latest stable version (2.9.3). The streaming parser already decodes all entities, but then it reencodes ampersands as "&#38;". So by the time an attribute string is passed to xmlSchemaSAXHandleStartElementNs, the only entity that it can have is "&#38;". Your examples become "https://foo.org?q=X#fragment" and "https://foo.org#fragmentX".

See https://git.gnome.org/browse/libxml2/tree/parser.c?id=6657afe83a38278f124ace71dc85f60420beb2d5#n3980
Comment 11 Nathan Turner 2016-01-06 19:12:19 UTC
You are correct, sorry about that; this was a mistake on my part.

Turns out my unit-tests were failing a local constraint on the scheme, which requires that urls are HTTPS only, and that the character references I added were immaterial to the failures to validate.

My unit-tests had those two examples as "http://" and I transcribed them improperly when writing my post here as starting with "https://". When I changed them to "https://" for our unit-tests, they indeed validated against our schema with libxml2_2.9.2. Sorry for the confusion, and thanks for the education!
Comment 12 Daniel Veillard 2016-04-28 08:27:01 UTC
I am not sure the patch is right. My recollection is there
is a specific processing for &amp; &x38; at the SAX interface level

thinkpad2:~/XML -> cat tst.xml
<?xml version="1.0" encoding="utf-8"?>
<root>
  <test url="http://example.com/?&lt;"/>
  <test url="http://example.com/?a=b&amp;c=d"/>
</root>

thinkpad2:~/XML -> ./testSAX tst.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(root)
SAX.characters(
  , 3)
SAX.startElement(test, url='http://example.com/?<')
SAX.endElement(test)
SAX.characters(
  , 3)
SAX.startElement(test, url='http://example.com/?a=b&#38;c=d')
SAX.endElement(test)
SAX.characters(
, 1)
SAX.endElement(root)
SAX.endDocument()
thinkpad2:~/XML -> 

  only the '&' character is converted back to a charref before calling SAX
I don;t remember why that was needed and that's what you hit when
validating with SAX streaming I think. Now the issue is that if you
call xmlStringLenDecodeEntities directly this may have side effects
it might be safer to just replace the sequence '&#38;'  by '&' in the copy

  At least that doesn't open the pandora box of generic entities replacement
this should be way safer....

    need to think a little bit about this

Daniel
Comment 13 Alex Henrie 2016-05-02 21:18:35 UTC
Hey Daniel, I thought about that same question and concluded that calling xmlStringLenDecodeEntities is not a problem because the only entity that can be in the URL when xmlStringLenDecodeEntities is called is "&#38;".

For example, if we have

<test url="http://example.com/?&amp;lt;"/>

it is converted to

<test url="http://example.com/?&lt;"/>

and then to

<test url="http://example.com/?&#38;lt;"/>

In other words, you need an ampersand in order to have a character entity, but all of the ampersands are themselves escaped as "&#38;", so the only entities that xmlStringLenDecodeEntities decodes are the "&#38;" entities. Nothing can be decoded to "<" by accident.
Comment 14 Daniel Veillard 2016-05-03 02:18:37 UTC
Yes I have been thinking about it since I commented here and came to the same conclusion. I think we need to amend the patch to give the explanation and push.

BTW the reason that libxml2 differs from normal SAX here and doesn't deliver
the raw converted string is that libxml2 was designed as an editing toolkit and
SAX was added later not as a primary API. SAX API does not allow to pass entities
so will loose the editing capability for attribute content, and doing that
escaping and tweaking of the API was the only way to keep editing capacity
for those on top of SAX.

That escaping should not happen if entity substitution is done --noent and
if doing schemas validation we should really ask for entity substitution.
But now is too late to change the parser behaviour if that's not the case
we can only work around bugs that may arise like this one.

   thanks

Daniel
Comment 15 Alex Henrie 2016-05-03 04:58:20 UTC
Created attachment 327193 [details] [review]
[PATCH v3] Fix XSD validation of URIs with ampersands

Thanks for the explanation. I have expanded the comment to better explain the reason for decoding entities here.
Comment 16 Daniel Veillard 2016-05-03 05:30:37 UTC
  Excellent, so pushed and commited as

 https://git.gnome.org/browse/libxml2/commit/?id=f6599c5164800e9b6fe8a85e6e2bc54fdf2f952a

  thanks a lot !

Daniel
Comment 17 Daniel Veillard 2016-05-03 07:08:05 UTC
BTW Nathan any way you could try with what is in libxml2 git head to make sure
we didn't introduced regressions ? 

   thanks !

Daniel
Comment 18 Vladimír Čunát 2016-05-26 11:54:54 UTC
I believe this introduced regression(s); see https://bugzilla.gnome.org/show_bug.cgi?id=766834#c5