GNOME Bugzilla – Bug 565747
libxml2 should not reject anyURI data having special characters (apostrophe, space...)
Last modified: 2009-08-07 14:45:17 UTC
Please describe the problem:
libxml2 (at least when doing Relax NG validation) rejects anyURI values that contain an apostrophe or a space character.

Steps to reproduce:
1. Consider the following XML file anyuri.xml

<?xml version="1.0"?>
<root>
  <uri>http://localhost/foobar</uri>
  <uri>http://localhost/foo bar</uri>
  <uri>http://localhost/foo'bar</uri>
</root>

2. And a file anyuri.rng

<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <zeroOrMore>
        <element name="uri">
          <data type="anyURI"/>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>

3. Type "xmllint --relaxng anyuri.rng anyuri.xml"

Actual results:
The last two URIs are not accepted:

anyuri.xml:4: element uri: Relax-NG validity error : Type anyURI doesn't allow value 'http://localhost/foo'bar'
anyuri.xml:4: element uri: Relax-NG validity error : Error validating datatype anyURI
anyuri.xml:4: element uri: Relax-NG validity error : Element uri failed to validate content
anyuri.xml:5: element uri: Relax-NG validity error : Type anyURI doesn't allow value 'http://localhost/foo bar'
anyuri.xml:5: element uri: Relax-NG validity error : Error validating datatype anyURI
anyuri.xml:5: element uri: Relax-NG validity error : Element uri failed to validate content
anyuri.xml fails to validate

Expected results:
According to http://www.w3.org/TR/xmlschema-2/#anyURI, the file is valid: spaces are explicitly accepted though discouraged (this is quite strange though, because I don't see where spaces are accepted in xlink or RFC2396) and the apostrophe is also accepted by RFC2396. See below.

BTW, note that the apostrophe is generated unescaped by Firefox (when copying a URL from the location bar).

Does this happen every time?
Yes.
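The reproduction steps above can also be scripted. The following is a minimal sketch, not part of the original report: the helper names `write_repro_files`, `xmllint_command` and `run_repro` are hypothetical, and `run_repro` assumes that an `xmllint` binary is available on the PATH.

```python
import pathlib
import shutil
import subprocess

# File contents copied from steps 1 and 2 of the report.
ANYURI_XML = """\
<?xml version="1.0"?>
<root>
  <uri>http://localhost/foobar</uri>
  <uri>http://localhost/foo bar</uri>
  <uri>http://localhost/foo'bar</uri>
</root>
"""

ANYURI_RNG = """\
<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <zeroOrMore>
        <element name="uri">
          <data type="anyURI"/>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
"""


def write_repro_files(directory):
    """Write anyuri.xml and anyuri.rng into directory and return both paths."""
    directory = pathlib.Path(directory)
    xml_path = directory / "anyuri.xml"
    rng_path = directory / "anyuri.rng"
    xml_path.write_text(ANYURI_XML, encoding="utf-8")
    rng_path.write_text(ANYURI_RNG, encoding="utf-8")
    return xml_path, rng_path


def xmllint_command(xml_path, rng_path):
    """Build the command line from step 3 of the report."""
    return ["xmllint", "--relaxng", str(rng_path), str(xml_path)]


def run_repro(directory):
    """Run xmllint on the two files; return None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    xml_path, rng_path = write_repro_files(directory)
    return subprocess.run(xmllint_command(xml_path, rng_path),
                          capture_output=True, text=True)
```

Calling `run_repro(".")` writes both files into the current directory and runs the same command as step 3; the validator's messages can then be read from the `stdout` and `stderr` attributes of the returned object.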
Other information:
http://www.w3.org/TR/xmlschema-2/#anyURI says:

  3.2.17.1 Lexical representation

  The ·lexical space· of anyURI is finite-length character sequences which, when the algorithm defined in Section 5.4 of [XML Linking Language] is applied to them, result in strings which are legal URIs according to [RFC 2396], as amended by [RFC 2732].

  Note: Spaces are, in principle, allowed in the ·lexical space· of anyURI, however, their use is highly discouraged (unless they are encoded by %20).

Section 5.4 of http://www.w3.org/TR/xlink/#link-locators says:

  Some characters are disallowed in URI references, even if they are allowed in XML; the disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters must be escaped as follows: [...]

Sections 2.3 and 2.4 of http://www.ietf.org/rfc/rfc2396.txt say:

  2.3. Unreserved Characters

  Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.

    unreserved = alphanum | mark
    mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

  Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.

  2.4. Escape Sequences

  Data must be escaped if it does not have a representation using an unreserved character; [...]

Section 2.4.3 lists the excluded US-ASCII characters, but this is consistent with 2.3.
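The escaping step referred to by these specifications can be sketched as follows. This is an illustration only, not libxml2 code: the function name `xlink_escape` is hypothetical, the excluded set is transcribed from RFC 2396 Section 2.4.3 (control characters, space, delims and unwise characters) with "#", "%", "[" and "]" left out as XLink Section 5.4 describes, and the use of upper-case hex digits is an assumption.

```python
# Excluded printable US-ASCII characters from RFC 2396 section 2.4.3,
# minus "#", "%", "[" and "]", which XLink section 5.4 leaves untouched.
EXCLUDED_ASCII = frozenset(' <>"{}|\\^`')


def xlink_escape(value: str) -> str:
    """Escape disallowed characters as %HH sequences of their UTF-8 bytes."""
    out = []
    for ch in value:
        code = ord(ch)
        disallowed = (
            code > 0x7F              # any non-ASCII character
            or code <= 0x1F          # US-ASCII control characters
            or code == 0x7F          # DEL
            or ch in EXCLUDED_ASCII  # space, delims, unwise
        )
        if disallowed:
            # Convert the character to UTF-8 and escape each byte.
            out.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(xlink_escape("http://localhost/foo bar"))   # http://localhost/foo%20bar
print(xlink_escape("http://localhost/foo'bar"))   # http://localhost/foo'bar
```

Under this reading the space becomes %20, while the apostrophe is passed through unchanged because it is an unreserved "mark" character in RFC 2396.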
I said: "this is quite strange though, because I don't see where spaces are accepted in xlink or RFC2396".

I think I misread the specs. The point is that "the algorithm defined in Section 5.4 of [XML Linking Language]" encodes/escapes these disallowed characters to form a valid URI. I suppose that the spaces are discouraged only because they are somewhat fragile (they can easily be munged by some software).

The apostrophe and the space are not the only affected characters: non-ASCII characters such as é are also rejected by libxml2.
Created attachment 129918
patch

A quickly written patch. The idea is to apply the algorithm defined in Section 5.4 of [XML Linking Language] just before the URI is parsed. Instead of replacing each byte of a disallowed character by a %HH sequence, I've replaced it by a single byte '_' (but one has to check whether this is equivalent).
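To make the substitution idea concrete, here is a rough sketch written in Python rather than C. It is not taken from the attached patch: the name `mask_disallowed` is hypothetical, and the set of disallowed bytes is an assumption based on XLink Section 5.4 and RFC 2396 Section 2.4.3.

```python
# Byte values of the excluded printable US-ASCII characters (RFC 2396
# section 2.4.3), without "#", "%", "[" and "]" (kept by XLink section 5.4).
EXCLUDED_BYTES = frozenset(b' <>"{}|\\^`')


def mask_disallowed(value: str, placeholder: bytes = b"_") -> bytes:
    """Replace every byte of a disallowed character by a placeholder byte."""
    out = bytearray()
    for byte in value.encode("utf-8"):
        disallowed = (
            byte >= 0x80               # part of a non-ASCII character
            or byte <= 0x1F            # US-ASCII control characters
            or byte == 0x7F            # DEL
            or byte in EXCLUDED_BYTES  # space, delims, unwise
        )
        if disallowed:
            out += placeholder
        else:
            out.append(byte)
    return bytes(out)


print(mask_disallowed("http://localhost/foo bar"))  # b'http://localhost/foo_bar'
```

The masked string would then be handed to the URI parser in place of the original value. Each disallowed byte yields one '_' where the real escaping would yield three characters, so the two strings differ in length; whether a parser always accepts or rejects both together is the equivalence that still has to be checked.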
I'm not sure I agree ... A URI in an XML document should be URI-escaped; what's the point of validation if not to raise problems in advance? Which is clearly the point here: if you use a non-URI-escaped URI in an XML document, for example as an href or xinclude, things are likely to break down the pipe.

Libxml2 has switched to 3986 since 2396 is deprecated and the new one is the proper URI description now. 3986 might be more stringent on character checking than 2396; that's not surprising considering the proper I18N work being done to clean things up.

W.r.t. non-ASCII characters, that's even more of a danger: the IRI specification is there (and somehow parts have been integrated in various revisions of XLink) to define how to properly embed non-ASCII in URIs, and the proper way is to URI-escape the UTF-8 byte encoding, so clearly "characters such as é are also rejected by libxml2" sounds like the proper behaviour considering how specifications are evolving in that domain.

I find this patch hazardous, somehow more a regression than an improvement, as this just decreases the quality of checking. I tried to get a more factual viewpoint and ran the ./runsuite test against

  NIST test suite for Schemas version NIST2004-01-14
  Sun test suite for Schemas version Sun2002-01-16
  Microsoft test suite for Schemas version MS2002-01-16

This didn't change the output, so apparently none of the 30,000 tests or so try to test things like anyURI with spaces ...

One thing I really object to in the patch is *cur >= 127; that is just wrong, guaranteed to lead to failures.

I'm still not 100% decided on the proper way though,

Daniel
(In reply to comment #3)
> I'm not sure I agree ... A URI in an XML document should be URI-escaped

Where did you see that? Everything I've read implies that URI-escaping is not needed in an XML document. Something else in http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators I haven't cited above:

  The value of the href attribute must be a URI reference as defined in
  [IETF RFC 2396], or must result in a URI reference
  after the escaping procedure described below is applied.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above spec requires that the escaping procedure be applied to what's in an XML document in order to resolve a URI (unless the href value is already a URI reference). This makes it clearer that a URI in an XML document need not be escaped.

> Libxml2 has switched to 3986 since 2396 is deprecated and the new one is
> the proper URI description now.

The W3C specs still point at 2396 (IMHO, the W3C specs should have been self-contained to avoid such problems). They probably need to be updated, but this doesn't mean that they would require URIs to be escaped.

> 3986 might be more stringent on character checking than 2396; that's not
> surprising considering the proper I18N work being done to clean things up.
>
> W.r.t. non-ASCII characters, that's even more of a danger: the IRI specification
> is there (and somehow parts have been integrated in various revisions of
> XLink) to define how to properly embed non-ASCII in URIs, and the proper way
> is to URI-escape the UTF-8 byte encoding,

This is part of the URI-escaping procedure described by XLink:

  Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.

So, I don't see any problem with non-ASCII characters in locator attributes, as long as the escaping procedure is performed correctly by the XLink processor.

> I find this patch hazardous, somehow more a regression than an improvement,
> as this just decreases the quality of checking. I tried to get a more factual
> viewpoint and ran the ./runsuite test against
> NIST test suite for Schemas version NIST2004-01-14
> Sun test suite for Schemas version Sun2002-01-16
> Microsoft test suite for Schemas version MS2002-01-16
> This didn't change the output, so apparently none of the 30,000 tests or so
> try to test things like anyURI with spaces ...

I'd say that these tests are not complete (perhaps unescaped URIs are not generated by typical applications -- this does not include copy-paste from the Firefox address bar, in particular).
After a search with Google, it seems that the W3C confirms that libxml2 is buggy:

http://markmail.org/message/76qjt4myckr4dfw4

Here's the excerpt containing the answer from the W3C:

From: Martin Duerst [mailto:due...@w3.org]
Sent: Wednesday, April 21, 2004 3:19 AM
To: Von Riegen, Claus
Subject: Re: FW: UDDI: Interop issues relating to the XML Schema datatype anyURI

Hello Claus,

I think the answer to your question is quite clear: XML Schema allows a very wide variety of characters as lexical values in attributes/elements of type anyURI. XML Schema Part 2: Datatypes, 3.2.17, anyURI (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI), is quite clear about this. If you see anything in this section that would indicate something different, I'd be interested to know.

It would be rather useless to specify the transformation to escaped characters if anyURI was restricted in such a way that no such escaping would actually be needed.

That non-ASCII (Unicode) characters are allowed is also clear from 3.2.17.1, Lexical representation, which says:

"The .lexical space. of anyURI is finite-length character sequences which, when the algorithm defined in Section 5.4 of [XML Linking Language] is applied to them, result in strings which are legal URIs according to [RFC 2396], as amended by [RFC 2732]."

Obviously, in XML Schema, "character" means any Unicode character. Also, for example in the path component of a http: scheme anyURI, you can start with any Unicode character and apply the conversion procedure and get a legal URI.

[please note that the current URI spec wouldn't allow this for the host part, but this is being fixed in an update to the URI spec, but minimally conformant XML Schema processors are not required to check this]

So with respect to the tools mentioned below by Luc, .Net is correct, and xerxes is wrong.

Regards, Martin.
Okay, if even Martin Duerst is dropping the ball ... bah, after all XSD is all about validating while being unsure of what it means. A bit more, a bit less ... that won't change much about it!

So applied and committed, thanks!

Daniel