GNOME Bugzilla – Bug 784894
URI Escaping does not follow RFC 3986
Last modified: 2021-07-05 13:25:51 UTC
Created attachment 355487
Add RFC3986 compatible URI Escape function

URI escaping still follows the obsolete RFC 2396. The main issue is that it does not escape characters that have since been moved into the reserved character set, which causes interoperability issues in some dependent software, as evidenced here: https://bugzilla.redhat.com/show_bug.cgi?id=1458237

Attached is a draft patch (it compiles but is not yet tested) that adds support for RFC 3986 conformant escaping. I added it as a separate function to avoid breaking applications that may depend on the old escaping for interoperability.
I'd suggest simply changing xmlURIEscapeStr to use ISA_UNRESERVED instead of IS_UNRESERVED. It seems that the ISA_* macros are for RFC 3986 and the IS_* macros are for RFC 2396. This would make xmlURIEscapeStr escape the characters !*'() unless overridden by the 'list' argument.
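To make the effect of that macro swap concrete, here is a minimal Python model, not libxml2 code. It assumes the escaping routine leaves unreserved characters and anything in the exception list untouched and percent-encodes every other byte; the real function may have additional special cases. The names `escape_model`, `UNRESERVED_2396` and `UNRESERVED_3986` are made up for illustration.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
UNRESERVED_2396 = ALNUM | set("-_.!~*'()")  # RFC 2396: alphanum | mark
UNRESERVED_3986 = ALNUM | set("-._~")       # RFC 3986, Section 2.3


def escape_model(text, unreserved, keep=""):
    """Percent-encode every byte that is not in `unreserved` or `keep`.

    Hypothetical model of the behaviour discussed in this thread.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x80 and (ch in unreserved or ch in keep):
            out.append(ch)                 # left as-is
        else:
            out.append("%%%02X" % byte)    # e.g. "!" becomes "%21"
    return "".join(out)


# Same input, old versus new unreserved set.
old_result = escape_model("a!b(c)", UNRESERVED_2396)
new_result = escape_model("a!b(c)", UNRESERVED_3986)
# The exception list can restore the old behaviour per character.
kept_result = escape_model("a!b(c)", UNRESERVED_3986, keep="!")
```

Under this model the only characters whose treatment changes are the four 2396 "mark" characters plus the parentheses pair, i.e. !*'(), and a caller that still wants them literal can pass them in the exception list.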
The original bug report was erroneous. RFC 3986 defines the reserved character set in Section 2.2, but that does not tell you which characters must be escaped, because what needs to be escaped depends on the URI component. The only way to know the escaping rules for a specific part of a URI is to read the "Collected ABNF for URI" in Appendix A. However, the current libxml2 API does not provide a public entry point that lets you specify which URI component you are escaping. I think the best you can do with the existing API is to call xmlURIEscapeStr() with a non-NULL second parameter listing the characters that should not be escaped, in addition to the characters it already leaves alone. But which characters are those for a specific component of a URI? That is hard to figure out without looking at the source, and even then it is not easy. Maybe the bug report really needs to be "libxml2 does not provide an API to escape component-specific parts of a URI according to RFC 3986".
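As a rough sketch of how such per-component exception lists could be derived, the following Python builds the literal character sets from the Appendix A productions (pchar, query, fragment, userinfo) and subtracts what the existing function is assumed to leave unescaped by default (the RFC 2396 unreserved set). The names `ALLOWED` and `exception_list` are hypothetical, and pct-encoded sequences are outside the scope of this sketch.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)
UNRESERVED_3986 = ALNUM | set("-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED_3986 | SUB_DELIMS | set(":@")

# Characters that may appear literally in each component (RFC 3986, Appendix A).
ALLOWED = {
    "segment": PCHAR,
    "query": PCHAR | set("/?"),
    "fragment": PCHAR | set("/?"),
    "userinfo": UNRESERVED_3986 | SUB_DELIMS | set(":"),
}

# Assumption: the default no-escape set is the RFC 2396 unreserved set.
UNRESERVED_2396 = ALNUM | set("-_.!~*'()")


def exception_list(component):
    """Characters to add to the exception list for `component`."""
    return "".join(sorted(ALLOWED[component] - UNRESERVED_2396))


segment_list = exception_list("segment")
query_list = exception_list("query")
userinfo_list = exception_list("userinfo")
```

These lists describe what the grammar permits in a component, not what a particular application wants to keep literal. A caller escaping a single query value, for example, would normally still want "&" and "=" escaped and so would leave them out of the list.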
Created attachment 358149
Small Python script illustrating the character classes

I found it difficult to evaluate exactly which characters are subject to escaping in the various RFCs, what libxml2 implements, and what the differences are. At least I found it difficult to do without the inevitable human error that occurs when reading specs and code. The little Python script builds sets of characters and lets you perform set operations on them (e.g. union, intersection, difference). It's also the only way I was confident I could come up with the right set of exceptions to pass as the second parameter of xmlURIEscapeStr().
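For readers without access to the attachment, the general idea of comparing character classes with set operations might look like the following. This is an independent illustration, not the attached script; the set names are chosen here, and the contents are taken from RFC 2396 Section 2 and RFC 3986 Section 2.2.

```python
# Reserved and mark characters as defined by the two RFCs.
RESERVED_2396 = set(";/?:@&=+$,")
MARK_2396 = set("-_.!~*'()")        # part of "unreserved" in RFC 2396

GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED_3986 = GEN_DELIMS | SUB_DELIMS          # union

# Difference: reserved in RFC 3986 but not in RFC 2396.
newly_reserved = RESERVED_3986 - RESERVED_2396

# Intersection: former "mark" characters that are now reserved.
moved_to_reserved = MARK_2396 & RESERVED_3986

# Former "mark" characters that stay unreserved in RFC 3986.
still_unreserved = MARK_2396 - RESERVED_3986

for name, chars in [
    ("newly reserved", newly_reserved),
    ("moved to reserved", moved_to_reserved),
    ("still unreserved", still_unreserved),
]:
    print(name, "".join(sorted(chars)))
```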
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.