Bug 649244 - Empty branch not handled correctly in regular expression
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: regexp
Version: 2.7.3
Hardware: Other Mac OS
Importance: Normal normal
Target Milestone: ---
Assigned To: Daniel Veillard
QA Contact: libxml QA maintainers
Duplicates: 705087
Depends on:
Blocks:
 
 
Reported: 2011-05-03 00:42 UTC by C. M. Sperberg-McQueen
Modified: 2019-09-25 13:53 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Schema document with empty branches in the pattern facets of several datatypes (1.59 KB, application/octet-stream)
2011-05-03 00:42 UTC, C. M. Sperberg-McQueen
XML instance document using the types defined in empty-branch.xsd (1.45 KB, application/xml)
2011-05-03 00:44 UTC, C. M. Sperberg-McQueen
Log from running xmllint 2.7.8 (6.54 KB, text/plain)
2011-05-03 01:25 UTC, C. M. Sperberg-McQueen

Description C. M. Sperberg-McQueen 2011-05-03 00:42:47 UTC
Created attachment 187091
Schema document with empty branches in the pattern facets of several datatypes

When a choice in a regular expression ends with an empty branch, the empty branch is not handled correctly.  Leading empty branches also appear to have problems, though different ones.

Example:  "(a|)"

This is a choice between the expression on the left ("a") and the one on the right (""), and should match any string that matches either of them.  But when I validate strings against a type defined with this regex, a single "a" is accepted but the empty string "" is not.  Strings containing characters other than "a" and strings with length greater than 1 are correctly rejected.

Example:  "(|a)"

This is the same choice as the previous example and defines the same language.  Validating against a type with this pattern, however, libxml 20703 correctly accepts "" and "a", and correctly rejects "b" and "c", but also accepts "aa".  

The regular expressions "", "|", and "(|)" correctly accept the empty string as input and correctly reject "a" in the input.
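For reference, the expected results for the patterns above can be tabulated with a small script.  This is only a sketch: it uses Python's re module as a stand-in oracle, not libxml2, and assumes that re.fullmatch agrees with XSD pattern semantics for these very simple expressions (XSD patterns are implicitly anchored at both ends).  The names PATTERNS, INPUTS, accepts, and expected_table are made up for this illustration.

```python
import re

# Patterns and sample strings taken from the report above.
PATTERNS = ["(a|)", "(|a)", "", "|", "(|)"]
INPUTS = ["", "a", "aa", "b"]


def accepts(pattern, value):
    """Return True if the whole of value matches pattern.

    XSD pattern facets must match the entire string, so
    re.fullmatch is used instead of re.match or re.search.
    """
    return re.fullmatch(pattern, value) is not None


def expected_table():
    """Map each pattern to the sample inputs it should accept."""
    return {p: [v for v in INPUTS if accepts(p, v)] for p in PATTERNS}


if __name__ == "__main__":
    for pattern, accepted in expected_table().items():
        print(repr(pattern), "should accept", accepted)
```

Under these assumptions, both "(a|)" and "(|a)" should accept exactly "" and "a", and the other three patterns should accept only "".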

Rewriting "(a|)" to "(a)?" is of course possible, but "(a|)" is in fact a legal regex, libxml2 correctly accepts it as a pattern, and you probably want to interpret it correctly.  (Perhaps few people would write "(a|)" by hand, but a simple translation of the ABNF for URIs and IRIs in RFC 3986 and RFC 3987 does produce an expression of this form for the non-terminal 'relative-part'.)
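To make the claimed equivalence concrete, here is a minimal sketch that enumerates short strings over a small alphabet and compares the languages of the three spellings.  It again relies on Python's re module as an assumed reference, not on libxml2; the helper name language and the choice of alphabet "ab" and maximum length 3 are arbitrary choices for this illustration.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """Return the strings over alphabet, up to max_len long,
    that pattern matches in full."""
    words = (
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )
    return {w for w in words if re.fullmatch(pattern, w) is not None}


if __name__ == "__main__":
    # The trailing-empty, leading-empty, and optional forms.
    forms = ["(a|)", "(|a)", "(a)?"]
    languages = [language(f) for f in forms]
    # All three should define the same language on this sample.
    print(all(lang == languages[0] for lang in languages))
    print(sorted(languages[0]))
```

If a conforming implementation is substituted for re.fullmatch, the three sets should still be identical; any difference between them points at the branch handling described in this report.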

I attach a schema document and an XML document with sample test strings.
Comment 1 C. M. Sperberg-McQueen 2011-05-03 00:44:12 UTC
Created attachment 187092
XML instance document using the types defined in empty-branch.xsd
Comment 2 C. M. Sperberg-McQueen 2011-05-03 00:47:32 UTC
The schema document and XML are also available at

  http://www.blackmesatech.com/2011/05/regex-examples/

The documents on the Black Mesa web site may change as I continue to try to understand the bug.

I should also apologize for not trying to reproduce this bug with a later version of libxml2; I encountered it in the version of libxml2 that ships with the Oxygen XML editor, and that reports itself as 20703.  If I get a chance to reproduce it (or not) with a more recent release, I'll report back here.
Comment 3 C. M. Sperberg-McQueen 2011-05-03 01:25:40 UTC
Created attachment 187094
Log from running xmllint 2.7.8

OK, I've downloaded the git snapshot of 2.7.8 and built xmllint.  The attached log file shows that the behavior described is present in 2.7.8.  Note in particular:

  line 27 of input:  The value '' is not accepted by the pattern '(a|)'.
  
And there is no error on line 40 of the input; 'aa' is accepted as a match against (|a) (although on line 31 it's correctly rejected against (a|)).
Comment 4 Nick Wellnhofer 2019-09-25 12:00:12 UTC
*** Bug 705087 has been marked as a duplicate of this bug. ***