GNOME Bugzilla – Bug 649244
Empty branch not handled correctly in regular expression
Last modified: 2019-09-25 13:53:23 UTC
Created attachment 187091 [details] Schema document with empty branches in the pattern facets of several datatypes

When a choice in a regular expression ends with an empty branch, the empty branch is not handled correctly. Leading empty branches also appear to have issues, but different ones.

Example: "(a|)"

This is a choice between the expression on the left ("a") and the one on the right (""), and should match any string that matches either of them. But when I validate strings against a type defined with this regex, a single "a" is accepted but the empty string "" is not. Strings containing characters other than "a" and strings with length greater than 1 are correctly rejected.

Example: "(|a)"

This is the same choice as the previous example and defines the same language. Validating against a type with this pattern, however, libxml 20703 correctly accepts "" and "a", and correctly rejects "b" and "c", but also accepts "aa".

The regular expressions "", "|", and "(|)" correctly accept the empty string as input and correctly reject "a" in the input.

Rewriting "(a|)" to "(a)?" is of course possible, but "(a|)" is in fact a legal regex, libxml2 correctly accepts it, and you probably want to interpret it correctly. (Perhaps few people would write (a|) by hand, but a simple translation of the ABNF for URIs and IRIs in RFC 3986 and RFC 3987 does produce an expression of this form for the non-terminal 'relative-part'.)

I attach a schema document and an XML document with sample test strings.
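For reference, the expected verdicts described above can be written down as a small table and checked against an independent regex engine. The following is only a sketch: it uses Python's `re` module, which is not an XSD regex implementation, and relies on the assumption that for these very simple patterns `re.fullmatch` agrees with XSD pattern-facet semantics (the pattern must match the whole value). The names `CASES`, `accepts`, and `mismatches` are made up for this illustration and do not come from libxml2 or the attachments.

```python
import re

# Expected verdicts for each pattern, following the report:
# "(a|)", "(|a)" and "(a)?" all define the language {"", "a"};
# "", "|" and "(|)" define the language {""}.
CASES = {
    "(a|)": {"": True, "a": True, "aa": False, "b": False, "c": False},
    "(|a)": {"": True, "a": True, "aa": False, "b": False, "c": False},
    "(a)?": {"": True, "a": True, "aa": False, "b": False, "c": False},
    "":     {"": True, "a": False},
    "|":    {"": True, "a": False},
    "(|)":  {"": True, "a": False},
}


def accepts(pattern, value):
    """Whole-string match, the closest analogue of an XSD pattern facet."""
    return re.fullmatch(pattern, value) is not None


def mismatches(cases=CASES):
    """Return (pattern, value, expected, actual) for every disagreement."""
    found = []
    for pattern, verdicts in cases.items():
        for value, expected in verdicts.items():
            actual = accepts(pattern, value)
            if actual != expected:
                found.append((pattern, value, expected, actual))
    return found


if __name__ == "__main__":
    for pattern, value, expected, actual in mismatches():
        print(f"pattern {pattern!r}, value {value!r}: "
              f"expected {expected}, got {actual}")
```

A conforming engine should produce an empty mismatch list for this table. Against the behavior reported here, libxml 20703 would differ in two cells: the empty string under "(a|)" and "aa" under "(|a)".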
Created attachment 187092 [details] XML instance document using the types defined in empty-branch.xsd
The schema document and XML are also available at http://www.blackmesatech.com/2011/05/regex-examples/ The documents on the Black Mesa web site may change as I continue to try to understand the bug. I should also apologize for not trying to reproduce this bug with a later version of libxml2; I encountered it in the version of libxml2 that ships with the Oxygen XML editor, and that reports itself as 20703. If I get a chance to reproduce it (or not) with a more recent release, I'll report back here.
Created attachment 187094 [details] Log from running xmllint 2.7.8

OK, I've downloaded the git snapshot of 2.7.8 and built xmllint. The attached log file shows that the behavior described is present in 2.7.8.

Note in particular, for line 27 of the input:

    The value '' is not accepted by the pattern '(a|)'.

And there is no error on line 40 of the input; 'aa' is accepted as a match against (|a), although on line 31 it is correctly rejected against (a|).
*** Bug 705087 has been marked as a duplicate of this bug. ***
Fixed here: https://gitlab.gnome.org/GNOME/libxml2/commit/c2b0a184a9e052d445bedda817b233c05424062e