After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 747567 - Plus quantifier (+) in patterns (regular expressions) sometimes matches 0 times
Plus quantifier (+) in patterns (regular expressions) sometimes matches 0 times
Status: RESOLVED OBSOLETE
Product: libxml2
Classification: Platform
Component: regexp
git master
Other Linux
: Normal normal
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2015-04-09 14:48 UTC by Sandor Oroszi
Modified: 2021-07-05 13:24 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch to forbid epsilon-reduction of final states (1.19 KB, patch)
2015-04-27 15:01 UTC, Arne Becker
none Details | Review

Description Sandor Oroszi 2015-04-09 14:48:33 UTC
Overview:
In patterns, the + quantifier matches 0 or more times instead of 1 or more times when it follows a closing parenthesis. Example: "(a)+"

Steps to Reproduce: 
Use the following xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:simpleType name="fooType">
        <xs:restriction base="xs:string">
            <xs:pattern value="(a)+"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:element name="foo" type="fooType"/>
</xs:schema>

Validate the following xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo/>

Actual Results:
libxml validation accepts this xml

Expected Results:
libxml validation should reject the xml with the error:

test.xml:1.7: Element 'foo': [facet 'pattern'] The value '' is not accepted by the pattern '(a)+'.
test.xml:1.7: Element 'foo': '' is not a valid value of the atomic type 'fooType'.


Build Date & Hardware:
2.9.1+dfsg1-3ubuntu4.4

Additional Information:
Replacing the restriction pattern with any of the following works as expected, and rejects the xml:
"a+"
"(a){1,}"

This is a highly simplified example of course, but it looks like if + follows a closing parenthesis, the + works as if it was a *, matching 0 or more times.

I performed the same validation using MSXML, it properly recognizes that the empty <foo/> is invalid.
Comment 1 Arne Becker 2015-04-27 15:01:03 UTC
Created attachment 302457 [details] [review]
Patch to forbid epsilon-reduction of final states

When building the internal representation of a regexp, it is possible that a lot of empty transitions are created. Therefore there is a step to reduce them in the function xmlFAEliminateSimpleEpsilonTransitions.

There is an error there for this case:

* State 1 has a transition with an atom (in this case "a") to state 2.
* State 2 is final and has an epsilon transition to state 1.

After reduction it looked like:
* State 1 has a transition with an atom (in this case "a") to itself and is final.

In other words, the empty string is accepted when it shouldn't be.

The attached patch skips the reduction step for final states.
An alternative would be to insert or increment counters when reducing a final state, but this seemed error prone and unnecessary, since there aren't that many final states.

I'm not affiliated with the libxml2 team, so use this at your own risk :-)
Comment 2 GNOME Infrastructure Team 2021-07-05 13:24:31 UTC
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org.
As part of that, we are mass-closing older open tickets in bugzilla.gnome.org
which have not seen updates for a longer time (resources are unfortunately
quite limited so not every ticket can get handled).

If you can still reproduce the situation described in this ticket in a recent
and supported software version, then please follow
  https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines
and create a new ticket at
  https://gitlab.gnome.org/GNOME/libxml2/-/issues/

Thank you for your understanding and your help.