Bug 565747 – libxml2 should not reject anyURI data having special characters (apostrophe, space...)



Summary:	libxml2 should not reject anyURI data having special characters (apostrophe, ...


Status:	RESOLVED FIXED

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.7.1
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-12-27 04:11 UTC by Vincent Lefevre
Modified:	2009-08-07 14:45 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
patch (996 bytes, patch) 2009-03-03 04:36 UTC, Vincent Lefevre

Description Vincent Lefevre 2008-12-27 04:11:16 UTC

Please describe the problem:
libxml2 (at least when doing Relax NG validation) rejects anyURI values that contain an apostrophe or a space character.

Steps to reproduce:
1. Consider the following XML file anyuri.xml
<?xml version="1.0"?>
<root>
  <uri>http://localhost/foobar</uri>
  <uri>http://localhost/foo bar</uri>
  <uri>http://localhost/foo'bar</uri>
</root>
2. And a file anyuri.rng
<?xml version="1.0" encoding="UTF-8"?>
<grammar ns="" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <element name="root">
      <zeroOrMore>
        <element name="uri">
          <data type="anyURI"/>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
3. Type "xmllint --relaxng anyuri.rng anyuri.xml"

Actual results:
The last two URIs are not accepted:

anyuri.xml:4: element uri: Relax-NG validity error : Type anyURI doesn't allow value 'http://localhost/foo'bar'
anyuri.xml:4: element uri: Relax-NG validity error : Error validating datatype anyURI
anyuri.xml:4: element uri: Relax-NG validity error : Element uri failed to validate content
anyuri.xml:5: element uri: Relax-NG validity error : Type anyURI doesn't allow value 'http://localhost/foo bar'
anyuri.xml:5: element uri: Relax-NG validity error : Error validating datatype anyURI
anyuri.xml:5: element uri: Relax-NG validity error : Element uri failed to validate content
anyuri.xml fails to validate

Expected results:
According to http://www.w3.org/TR/xmlschema-2/#anyURI, the file is valid: spaces are explicitly accepted though discouraged (this is quite strange though, because I don't see where spaces are accepted in xlink or RFC2396) and the apostrophe is also accepted by RFC2396. See below.

BTW, note that the apostrophe is generated unescaped by Firefox (when copying a URL from the location bar).

Does this happen every time?
Yes.

Other information:
http://www.w3.org/TR/xmlschema-2/#anyURI says:

  3.2.17.1 Lexical representation

  The ·lexical space· of anyURI is finite-length character sequences which,
  when the algorithm defined in Section 5.4 of [XML Linking Language] is
  applied to them, result in strings which are legal URIs according to
  [RFC 2396], as amended by [RFC 2732].

    Note: Spaces are, in principle, allowed in the ·lexical space· of
    anyURI, however, their use is highly discouraged (unless they are
    encoded by %20).

Section 5.4 of http://www.w3.org/TR/xlink/#link-locators says:

  Some characters are disallowed in URI references, even if they are
  allowed in XML; the disallowed characters include all non-ASCII
  characters, plus the excluded characters listed in Section 2.4 of
  [IETF RFC 2396], except for the number sign (#) and percent sign (%)
  and the square bracket characters re-allowed in [IETF RFC 2732].
  Disallowed characters must be escaped as follows: [...]

Sections 2.3 and 2.4 of http://www.ietf.org/rfc/rfc2396.txt say:

  2.3. Unreserved Characters

    Data characters that are allowed in a URI but do not have a reserved
    purpose are called unreserved.  These include upper and lower case
    letters, decimal digits, and a limited set of punctuation marks and
    symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

    Unreserved characters can be escaped without changing the semantics
    of the URI, but this should not be done unless the URI is being used
    in a context that does not allow the unescaped character to appear.

  2.4. Escape Sequences

    Data must be escaped if it does not have a representation using an
    unreserved character; [...]

Section 2.4.3 lists the excluded US-ASCII characters; the apostrophe is not among them, which is consistent with 2.3.
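
As a quick illustration of the grammar quoted above, the `unreserved` set from Section 2.3 can be transcribed directly. This is only a Python sketch for checking individual characters; the names `MARK`, `UNRESERVED` and `is_unreserved` are hypothetical and have nothing to do with libxml2's own code.

```python
import string

# Direct transcription of RFC 2396 section 2.3 as quoted above:
#   unreserved = alphanum | mark
#   mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
MARK = set("-_.!~*'()")
ALPHANUM = set(string.ascii_letters + string.digits)
UNRESERVED = ALPHANUM | MARK


def is_unreserved(ch: str) -> bool:
    """Return True if the single character ch is 'unreserved' per RFC 2396."""
    return ch in UNRESERVED


if __name__ == "__main__":
    # Characters discussed in this report: apostrophe, space, and 'é'.
    for ch in ("'", " ", "é"):
        print(repr(ch), is_unreserved(ch))
```

Under this reading the apostrophe is unreserved and needs no escaping, whereas the space and non-ASCII characters are not unreserved.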

Comment 1 Vincent Lefevre 2008-12-28 01:54:33 UTC

I said: "this is quite strange though, because I don't see where spaces are accepted in xlink or RFC2396". I think I misread the specs. The point is that "the algorithm defined in Section 5.4 of [XML Linking Language]" encodes/escapes these disallowed characters to form a valid URI.

I suppose that the spaces are discouraged only because they are somewhat fragile (they can easily be munged by some software).

The apostrophe and the space are not the only characters affected: non-ASCII characters such as é are also rejected by libxml2.
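
A minimal Python sketch of that escaping step, under stated assumptions: the disallowed set is built from one reading of RFC 2396 section 2.4.3 (control characters, space, and the "delims" and "unwise" characters) minus `#`, `%`, `[` and `]`, plus all non-ASCII characters, and each UTF-8 byte of a disallowed character becomes a `%HH` sequence. The names `EXCLUDED_ASCII` and `xlink_escape` are hypothetical; this is not how libxml2 implements it.

```python
# Excluded US-ASCII characters per RFC 2396 section 2.4.3, without
# '#', '%', '[' and ']' (kept as-is per the XLink text quoted earlier).
# Assumption: this set reflects one reading of the specs, not libxml2.
EXCLUDED_ASCII = set(' <>"{}|\\^`')


def is_disallowed(ch: str) -> bool:
    """True for control characters, excluded ASCII and all non-ASCII."""
    code = ord(ch)
    return code <= 0x1F or code >= 0x7F or ch in EXCLUDED_ASCII


def xlink_escape(value: str) -> str:
    """Escape disallowed characters as %HH over their UTF-8 bytes."""
    out = []
    for ch in value:
        if is_disallowed(ch):
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for uri in ("http://localhost/foo bar",
                "http://localhost/foo'bar",
                "http://localhost/\u00e9"):
        print(uri, "->", xlink_escape(uri))
```

With this sketch the space becomes `%20`, é becomes `%C3%A9`, and the apostrophe is left untouched because it is an unreserved character.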

Comment 2 Vincent Lefevre 2009-03-03 04:36:58 UTC

Created attachment 129918
patch

A quickly written patch. The idea is to apply the algorithm defined in Section 5.4 of [XML Linking Language] just before the URI is parsed. Instead of replacing each byte of a disallowed character by a %HH sequence, I've replaced it by a single byte '_' (but one has to check whether this is equivalent).
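
The difference between the two substitutions can be illustrated with a small Python sketch. It does not reproduce the attached patch (which is C code not shown here); the function names and the `EXCLUDED_ASCII` set are assumptions made for illustration, based on RFC 2396 section 2.4.3 minus `#`, `%`, `[` and `]`.

```python
# Assumption: disallowed = control chars, non-ASCII, and the excluded
# US-ASCII characters of RFC 2396 section 2.4.3 minus '#', '%', '[', ']'.
EXCLUDED_ASCII = set(' <>"{}|\\^`')


def is_disallowed(ch: str) -> bool:
    code = ord(ch)
    return code <= 0x1F or code >= 0x7F or ch in EXCLUDED_ASCII


def escape_percent(value: str) -> str:
    """Full escaping: every UTF-8 byte of a disallowed char becomes %HH."""
    return "".join(
        "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))
        if is_disallowed(ch) else ch
        for ch in value
    )


def escape_placeholder(value: str) -> str:
    """Validation-only variant: one '_' per UTF-8 byte of a disallowed char."""
    return "".join(
        "_" * len(ch.encode("utf-8")) if is_disallowed(ch) else ch
        for ch in value
    )


if __name__ == "__main__":
    uri = "http://localhost/foo bar/\u00e9"
    print(escape_percent(uri))
    print(escape_placeholder(uri))
```

The placeholder form loses information, so it is only suitable for a validity check and not for resolving the URI. Whether a given URI parser accepts `_` in exactly the same positions as a `%HH` escape is the equivalence question raised above, and this sketch does not settle it.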

Comment 3 Daniel Veillard 2009-07-29 10:50:56 UTC

I'm not sure I agree ... A URI in an XML document should be URI-escaped;
what's the point of validation if not to raise problems in advance? Which
is clearly the point here: if you use a non-URI-escaped URI in an XML document,
for example as an href or xinclude, things are likely to break down the pipe.

Libxml2 has switched to 3986 since 2396 is deprecated and the new one is
the proper URI description now.
3986 might be more stringent on character checking than 2396, that's not surprising considering the proper I18N work being done to clean things up.

W.r.t. non-ASCII characters, that's even more of a danger: the IRI specification
is there (and somehow parts have been integrated in various revisions of
XLink) to define how to properly embed non-ASCII in URIs, and the proper way
is to URI-escape the UTF-8 byte encoding, so clearly "characters such as é
are also rejected by libxml2" sounds like the proper behaviour considering
how specifications are evolving in that domain.

I find this patch hazardous, somehow more a regression than an improvement,
as this just decreases the quality of checking. I tried to get a more factual
viewpoint and ran the ./runsuite test against
NIST test suite for Schemas version NIST2004-01-14
Sun test suite for Schemas version Sun2002-01-16
Microsoft test suite for Schemas version MS2002-01-16
this didn't change the output, so apparently none of the 30,000 tests or so
try to test things like anyURI with spaces ....

One thing I really object to in the patch is *cur >= 127; that is
just wrong, guaranteed to lead to failures.

I'm still not 100% decided on the proper way though,

Daniel

Comment 4 Vincent Lefevre 2009-07-29 11:27:30 UTC

(In reply to comment #3)
> I'm not sure I agree ... A URI in an XML document should be URI-escaped

Where did you see that? Everything I've read implies that URI-escaping is not needed in an XML document. Something else in http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators I haven't cited above:

    The value of the href attribute must be a URI reference as defined in
    [IETF RFC 2396], or must result in a URI reference after the escaping
                                                       ^^^^^^^^^^^^^^^^^^
    procedure described below is applied. 
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above spec requires that the escaping procedure be applied to what's in an XML document in order to resolve a URI (unless the href value is already a URI reference). This makes it clearer that a URI in an XML document need not be escaped.

> Libxml2 has switched to 3986 since 2396 is deprecated and the new one is
> the proper URI description now.

The W3C specs still point at 2396 (IMHO, the W3C specs should have been self-contained to avoid such problems). They probably need to be updated, but this doesn't mean that they would require URIs to be escaped.

> 3986 might be more stringent on character checking than 2396, that's not
> surprising considering the proper I18N work being done to clean things up.
> 
> W.r.t. non-ASCII characters, that's even more of a danger: the IRI specification
> is there (and somehow parts have been integrated in various revisions of
> XLink) to define how to properly embed non-ASCII in URIs, and the proper way
> is to URI-escape the UTF-8 byte encoding,

This is part of the URI-escaping procedure described by XLink:

    Each disallowed character is converted to UTF-8 [IETF RFC 2279] as
    one or more bytes.

So, I don't see any problem with non-ASCII characters in locator attributes, as long as the escaping procedure is performed correctly by the XLink processor.

> I find this patch hazardous, somehow more a regression than an improvement,
> as this just decreases the quality of checking. I tried to get a more factual
> viewpoint and ran the ./runsuite test against
>   NIST test suite for Schemas version NIST2004-01-14
>   Sun test suite for Schemas version Sun2002-01-16
>   Microsoft test suite for Schemas version MS2002-01-16
> this didn't change the output, so apparently none of the 30,000 tests or so
> try to test things like anyURI with spaces ....

I'd say that these tests are not complete (perhaps unescaped URIs are not generated by typical applications; copy-paste from the Firefox address bar, in particular, is an exception).

Comment 5 Vincent Lefevre 2009-07-29 11:59:29 UTC

After a search with Google, it seems that the W3C confirms that libxml2 is buggy:

  http://markmail.org/message/76qjt4myckr4dfw4

Here's the excerpt containing the answer from the W3C:

From: Martin Duerst [mailto:due...@w3.org]
Sent: Wednesday, April 21, 2004 3:19 AM
To: Von Riegen, Claus
Subject: Re: FW: UDDI: Interop issues relating to the XML Schema datatype anyURI

Hello Claus,

I think the answer to your question is quite clear: XML Schema allows a very wide variety of characters as lexical values in attributes/elements of type anyURI.

XML Schema Part 2: Datatypes, 3.2.17, anyURI (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI), is quite clear about this. If you see anything in this section that would indicate something different, I'd be interested to know. It would be rather useless to specify the transformation to escaped characters if anyURI was restricted in such a way that no such escaping would actually be needed.

That non-ASCII (Unicode) characters are allowed is also clear from 3.2.17.1, Lexical representation, which says: "The ·lexical space· of anyURI is finite-length character sequences which, when the algorithm defined in Section 5.4 of [XML Linking Language] is applied to them, result in strings which are legal URIs according to [RFC 2396], as amended by [RFC 2732]." Obviously, in XML Schema, "character" means any Unicode character. Also, for example in the path component of a http: scheme anyURI, you can start with any Unicode character and apply the conversion procedure and get a legal URI. [please note that the current URI spec wouldn't allow this for the host part, but this is being fixed in an update to the URI spec, but minimally conformant XML Schema processors are not required to check this]

So with respect to the tools mentioned below by Luc, .Net is correct, and Xerces is wrong.

Regards, Martin.

Comment 6 Daniel Veillard 2009-08-07 14:45:03 UTC

Okay, if even Martin Duerst is dropping the ball ... bahh, after all
XSD is all about validating but being unsure of what it meant. A bit more
a bit less ... that won't change much about it !
So applied and committed,

 thanks !

Daniel