GNOME Bugzilla – Bug 690846
regex issue ([a-z]+ interpreted correctly, ([a-z])+ incorrectly)
Last modified: 2021-07-05 13:26:17 UTC
Created attachment 232360: a sample schema defining types with patterns [a-z]+, ([a-z])+, and ([a-z]+)

See http://stackoverflow.com/questions/14060308/xmllint-validation-succeeds-on-invalid-input for a full account.

An XSD simple type using a pattern of [a-z]+ correctly rejects the empty string; a pattern of ([a-z])+ accepts the empty string. I attach a schema document in case it's helpful.

A simple test of the following form suggests that the problem is visible only on the empty string, not on the other test strings.

    for string in "" "test" "test2012" "2012"; do
      for gi in bare parens parens2; do
        echo "................................................................";
        echo "<$gi>$string</$gi>";
        echo "<$gi>$string</$gi>" | xmllint --schema user5112.xsd -;
        echo ;
      done;
    done
The function xmlFAEliminateSimpleEpsilonTransitions (line 1865 of xmlregexp.c) contains a step that reduces the internal representation of a regexp. This step can go wrong in the following case: state X has a transition on an atom to state Y, and state Y is a final state with an epsilon transition back to state X. After the reduction, state X has a transition on an atom to itself and is final. The pattern then accepts the empty string, which it should not.

The solution is to fix the reduction step. At line 1875, the condition

    if (state->type == XML_REGEXP_UNREACH_STATE )

is modified as follows:

    if (state->type == XML_REGEXP_UNREACH_STATE || state->type == XML_REGEXP_FINAL_STATE)

The test then gives the following result:

    root@oss-0017:~/libxml2-fix/libxml2-v2.9.9# ./testRegexp "([a-z])+" ""
    Testing ([a-z])+:
    : Fail

The result shows that ([a-z])+ now correctly rejects the empty string. Combined with issue 57, the test from https://gitlab.gnome.org/GNOME/libxml2/issues/57 would also be OK:
    root@oss-0017:~/libxml2-fix/libxml2-v2.9.9# ./testRegexp --debug "(([a-zA-Z0-9_]+)(;[a-zA-Z0-9_]+))|" "a1;a2"
    Testing (([a-zA-Z0-9_]+)(;[a-zA-Z0-9_]+))|:
     regexp: '(([a-zA-Z0-9_]+)(;[a-zA-Z0-9_]+))|'
    6 atoms:
     00 atom: ranges once 4 entries
       range: charval a - z
       range: charval A - Z
       range: charval 0 - 9
       range: charval _ - _
     01 atom: subexpr once start 4 end 5
     02 atom: charval once char ;
     03 atom: ranges once 4 entries
       range: charval a - z
       range: charval A - Z
       range: charval 0 - 9
       range: charval _ - _
     04 atom: subexpr once start 0 end 10
     05 atom: subexpr once start 2 end 10
    12 states:
     state: FINAL 0, 5 transitions:
      trans: removed
      trans: removed
      trans: removed
      trans: removed
      trans: atom 0, to 6
     state: NULL
     state: NULL
     state: NULL
     state: NULL
     state: NULL
     state: 6, 5 transitions:
      trans: removed
      trans: atom 0, to 6
      trans: removed
      trans: removed
      trans: char ; atom 2, to 9
     state: NULL
     state: NULL
     state: 9, 1 transitions:
      trans: atom 3, to 11
     state: NULL
     state: FINAL 11, 2 transitions:
      trans: removed
      trans: atom 3, to 11
    0 counters:
    a1;a2: Ok
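For readers who want to see the X/Y case from the analysis above in isolation, here is a minimal sketch of it as a toy automaton. It is not libxml2 code: the types and helpers (Nfa, add_trans, nfa_accepts, make_original, make_faulty_reduction) are hypothetical names made up for this illustration, and the only atom modelled is a single character in [a-z]. It assumes the reduction behaves as described, i.e. Y is folded into X so that X loops on itself and becomes final.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only; none of these names exist in libxml2. */
#define MAX_STATES 4
#define MAX_TRANS  4

typedef struct {
    bool epsilon;  /* true: consumes no input; false: consumes one [a-z] char */
    int  to;       /* index of the target state */
} Trans;

typedef struct {
    bool  final;
    int   ntrans;
    Trans trans[MAX_TRANS];
} State;

typedef struct {
    int   nstates;
    State states[MAX_STATES];  /* state 0 is the start state */
} Nfa;

static void add_trans(Nfa *n, int from, bool epsilon, int to) {
    State *s = &n->states[from];
    assert(s->ntrans < MAX_TRANS);
    s->trans[s->ntrans].epsilon = epsilon;
    s->trans[s->ntrans].to = to;
    s->ntrans++;
}

/* Add every state reachable through epsilon transitions to the set. */
static void eps_closure(const Nfa *n, bool set[MAX_STATES]) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < n->nstates; i++) {
            if (!set[i]) continue;
            for (int t = 0; t < n->states[i].ntrans; t++) {
                const Trans *tr = &n->states[i].trans[t];
                if (tr->epsilon && !set[tr->to]) {
                    set[tr->to] = true;
                    changed = true;
                }
            }
        }
    }
}

/* Standard state-set simulation of the automaton on the input string. */
static bool nfa_accepts(const Nfa *n, const char *input) {
    bool cur[MAX_STATES] = { false };
    cur[0] = true;
    eps_closure(n, cur);
    for (const char *p = input; *p; p++) {
        bool next[MAX_STATES] = { false };
        bool is_atom = (*p >= 'a' && *p <= 'z');
        for (int i = 0; i < n->nstates; i++) {
            if (!cur[i]) continue;
            for (int t = 0; t < n->states[i].ntrans; t++) {
                const Trans *tr = &n->states[i].trans[t];
                if (!tr->epsilon && is_atom) next[tr->to] = true;
            }
        }
        eps_closure(n, next);
        memcpy(cur, next, sizeof cur);
    }
    for (int i = 0; i < n->nstates; i++)
        if (cur[i] && n->states[i].final) return true;
    return false;
}

/* Before reduction: X(0) --[a-z]--> Y(1, final), Y --epsilon--> X. */
static Nfa make_original(void) {
    Nfa n = { 0 };
    n.nstates = 2;
    n.states[1].final = true;
    add_trans(&n, 0, false, 1);
    add_trans(&n, 1, true, 0);
    return n;
}

/* After the faulty reduction: Y is folded into X, so X loops and is final. */
static Nfa make_faulty_reduction(void) {
    Nfa n = { 0 };
    n.nstates = 1;
    n.states[0].final = true;
    add_trans(&n, 0, false, 0);
    return n;
}
```

With these definitions, nfa_accepts on the automaton from make_original() rejects "" and accepts "test", while the one from make_faulty_reduction() accepts both. That is the same symptom reported for ([a-z])+ above: only the empty string is affected, because the start state has wrongly become final.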
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/libxml2/-/issues/ Thank you for your understanding and your help.