After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 316338 - testRegexp hangs on correct regular expression and data
testRegexp hangs on correct regular expression and data
Status: RESOLVED FIXED
Product: libxml2
Classification: Platform
Component: general
2.6.22
Other All
: Normal blocker
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-09-14 20:42 UTC by abolgar
Modified: 2017-05-23 21:05 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
A Windows batch file to reproduce the problem. (179 bytes, application/octet-stream)
2005-09-14 20:45 UTC, abolgar
  Details
A7.BAT test file (545 bytes, application/octet-stream)
2005-09-15 13:51 UTC, abolgar
  Details
A7B.BAT test file (528 bytes, application/octet-stream)
2005-09-15 13:52 UTC, abolgar
  Details
A7.SH test file (replacement of A7.BAT one) (547 bytes, application/octet-stream)
2005-09-15 14:27 UTC, abolgar
  Details
A7B.SH test file (replacement of A7B.BAT one) (530 bytes, application/octet-stream)
2005-09-15 14:28 UTC, abolgar
  Details
A7.PL test Perl script with the regexp (works fine!) (603 bytes, application/octet-stream)
2005-09-15 14:28 UTC, abolgar
  Details
This is rhe Microsoft XML schema I mentioned with the complicated regexp (34.55 KB, application/octet-stream)
2005-09-15 14:48 UTC, abolgar
  Details
This is the XML file for the schema (66.79 KB, application/octet-stream)
2005-09-15 14:49 UTC, abolgar
  Details
Patch for Bug 316338 for Libxml 2.6.23 (3.93 KB, patch)
2006-03-20 21:50 UTC, abolgar
none Details | Review

Description abolgar 2005-09-14 20:42:54 UTC
Please describe the problem:
xmlRegexpExec function hangs on correct regular expression and data. Regular
expression is as follows:
(((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+

String value (data) is as follows:

C 433.12 433.12 433.12 724.48 428.96 723.68  433.12 433.12 433.12 724.48 428.96
723.68

It looks like an infinite loop somewhere inside the xmlRegexpExec function. But
if we reduce the data being matched, then testRegexp works about 2 minutes and
then exists with correct result:

C 1 2 3 444.44 555.55 776.111


Steps to reproduce:
1. Run the following command from the libxml2 'bin' sub-directory:

testRegexp.exe
"(((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433.12
433.12 433.12 724.48 428.96 723.68  433.12 433.12 433.12 724.48 428.96 723.68"
--debug



Actual results:
testRegexp hangs. I was running it during several hours and then interrupted
manually.

Expected results:
It should return and print result.

Does this happen every time?
Yes.

Other information:
My system is P4 2.5GHz, RAM 640Mb, Windows XP SP2, VC6 SP6
Comment 1 abolgar 2005-09-14 20:45:13 UTC
Created attachment 52244 [details]
A Windows batch file to reproduce the problem.
Comment 2 Daniel Veillard 2005-09-14 21:26:36 UTC
Where did you get that regexp from ? 
It is *highly* indeterminist, because you use optional separators
[\s]* instead of mandatory separator [\s]+, so basically in a lot of places where
a digit is encountered it could be part of number N or number N+1, and as 
you pile up digits the combinatory possibilities explodes.
I'm not sure this can be considered as a bug.
Replacing the first [\s]* to an [\s]+, i.e. forcing at least one space as 
separator breaks the indeterminism and validation is instantaneous

paphio:~/XML -> time ./testRegexp
"(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433.12
433.12 433.12 724.48 428.96 723.68  433.12 433.12 433.12 724.48 428.96 723.68"
Testing (((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+:
C 433.12 433.12 433.12 724.48 428.96 723.68  433.12 433.12 433.12 724.48 428.96
723.68: Ok

real    0m0.006s
user    0m0.003s
sys     0m0.003s
paphio:~/XML ->

Daniel
Comment 3 abolgar 2005-09-14 23:56:19 UTC
Hi Daniel,

Thanks for a VERY quick response, as usual :)
This regular expression is used in the XML schema of one of our customer as a
pattern restriction for the string type. He said, this schema is from Microsoft.
I will try to figure out exactly, what is the schema and how did he get it. I
will also forward him your kind reply. I'll let you know shortly.

Sincerely, 
   Artyom Bolgar.
Comment 4 Daniel Veillard 2005-09-15 09:49:42 UTC
The best I can do in the short term is make this fail with an internal error
to avoid potential DoS, I have a patch ready to this intend but not yet commited.
Maybe the way the regexp is compiled or interpreted could be enhanced too,
but it's IMHO broken anyway, the regexp is not testing what you/they expect !

paphio:~/XML -> ./testRegexp
"(((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433"
Testing (((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+:
C 433: Ok
paphio:~/XML ->

  there is 3 digit, the spaces being optional, the cardinality {3} is
considered okay, same for "C 433 12" or nearly any combination (and hence the
explosion). Someone didn't do regression tests or Microsoft has a bug.

  I won't close it as not a bug since I want to at least detect combinatory
explosion at runtime and return an internal error if found.

Daniel
Comment 5 abolgar 2005-09-15 13:50:47 UTC
Hi Daniel,

Actually, the original regexp is much more complicated, I just tried to reduce
it for easier understanding of the problem. Here is the regexp which is more
close to original one:

(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+

I have attached A7.BAT file with this regexp. As you can see, there is a comma
in the repeating part, so, actually three pairs of numbers are expected. Also,
the hang goes away if I modify the regexp by the following way:

(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+

See A7B.BAT file attached.

The original regexp is very complicated:

(((M|m)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)[\s]*)|((L|l)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((H|h)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((V|v)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*)|((Q|q)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){2})+[\s]*)|((S|s)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){2})+[\s]*)|((A|a)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]+[0-1][\s]+[0-1][\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((Z|z)[\s]*))*
Comment 6 abolgar 2005-09-15 13:51:43 UTC
Created attachment 52274 [details]
A7.BAT test file
Comment 7 abolgar 2005-09-15 13:52:27 UTC
Created attachment 52275 [details]
A7B.BAT test file
Comment 8 Daniel Veillard 2005-09-15 14:08:32 UTC
Windows batch files are useless, I don't even have a windows box around.
I'm lost with what you are reporting at this point, I will have to focuse on
this at some point.
The version in CVS fails with a -1 internal error due to the combinatory 
explosion since I commited my patch against it. Whether there is a bug in 
the regexp engine resulting in this is unclear.

Daniel
Comment 9 abolgar 2005-09-15 14:26:38 UTC
Hi Daniel,

Sorry about Windows batch files, I will submit shell scripts instead (A7.SH and
A7B.SH).
I have synchronized Libxml2 with the CVS. Currently, A7.SH reports -1 (as you
said), and A7B.SH works just fine. A7.SH looks like this:

./testRegexp.exe
"(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+"
"C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12
724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12" --debug

A7B.SH:
./testRegexp.exe
"(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+"
"C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12
724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12" --debug

I also made a Perl script with the regexp and data from the A7.SH (which causes
error -1 in current version of Libxml2):

$data = "C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12
724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12";
if ($data =~
/^(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+$/)

{
   print "OK\n";
}
else {
   print "NOK\n";
}

It works just fine.
Comment 10 abolgar 2005-09-15 14:27:30 UTC
Created attachment 52279 [details]
A7.SH test file (replacement of A7.BAT one)
Comment 11 abolgar 2005-09-15 14:28:07 UTC
Created attachment 52280 [details]
A7B.SH test file (replacement of A7B.BAT one)
Comment 12 abolgar 2005-09-15 14:28:58 UTC
Created attachment 52281 [details]
A7.PL test Perl script with the regexp (works fine!)
Comment 13 abolgar 2005-09-15 14:48:28 UTC
Created attachment 52285 [details]
This is rhe Microsoft XML schema I mentioned with the complicated regexp

See ST_AbbrGeom type.
Comment 14 abolgar 2005-09-15 14:49:27 UTC
Created attachment 52286 [details]
This is the XML file for the schema
Comment 15 Daniel Veillard 2005-09-15 14:53:51 UTC
Okay, thanks I will look at this when I have some time,

Daniel
Comment 16 kbuchcik 2005-10-05 13:04:18 UTC
I observed that the test "A7B.sh" takes 0.025s (real) with the current
revision of xmlregexp.c.

Daniel fixed a bug related to the following mail-thread in the xmlregexp
module:
http://mail.gnome.org/archives/xml/2005-September/msg00115.html

So it appears that this also fixed your problem.
Could you give us feedback?

kbu@librax:/data/home/kbuchcik/gnomecvs/libxml2$ time A7B.sh
Testing
(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+:
C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12
724.64,433.12718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12
718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12
724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12
711.52,433.12: Ok

real    0m0.025s
user    0m0.011s
sys     0m0.011s
Comment 17 abolgar 2005-10-05 14:35:41 UTC
Thanks for the reply! Though, the main issue is "A7.sh" script. "A7B.sh" worked
fine.
"A7.sh" does not hang anymore; though, it still returns "-1" instead of real
result (I just have sync-ed with CVS). PERL script with the same regular
expression ("A7.pl") works just fine and very fast. 

Artyom.
Comment 18 kbuchcik 2005-10-06 16:16:30 UTC
Ah, so a part of the problem still exists. Unfortunately I'm not into
the regex module, thus I'm only able to make strange observations :-)
AFAIK Daniel is currently away, so being patient until he returns seems
to be the only option currently.
Comment 19 abolgar 2005-12-20 15:49:47 UTC
Hi Daniel!

Did you have a chance to look on this problem?
Comment 20 Daniel Veillard 2005-12-20 16:09:34 UTC
I tried to fix problems in various ways but could not solve it. CVS
should be a bit better but there is something broken which escapes me
and I hadn't had nuch time lately to work on it.
So yes and no ...

Daniel
Comment 21 Jan Oelschlägel 2006-01-23 22:28:41 UTC
Maybe this bug is related to the following discovery. If using the testRegExp program in 2.6.23 one will get:

Testing .*test.*:
xxxtestxxx: Fail

Testing .*.test.*:
xxxtestxxx: Ok

This behaviour is not as expected I would say. Seems that expressions where a character follows optional chars always fail.
Comment 22 kbuchcik 2006-02-03 09:48:45 UTC
Jan, I don't know if this bug is related to your discovery, but your discovery is related to bug #327167.
Comment 23 abolgar 2006-03-20 21:50:49 UTC
Created attachment 61646 [details] [review]
Patch for Bug 316338 for Libxml 2.6.23

Patch for Bug 316338 for Libxml 2.6.23 made by Objective Systems.
Comment 24 Daniel Veillard 2006-03-20 22:37:23 UTC
Thanks, but could you explain what it does ?
It is relatively hard to evalutate the changes just looking at the diff
even with the inner comments.

Daniel
Comment 25 abolgar 2006-03-21 21:15:42 UTC
Hi Daniel,

Here are some explanation. Note, the author of this fix is Youri Golovanov <yourigt@gmail.com>, I am just translating from Russian to English:

1. xmlFAGenerateTransitions
For example, there is a regexp containing the following fragment: ((e|E)\\-?[0-9]+)?){3}. For the second '?' there is the state with two transitions to the end and to the begining of this sub-expression. The transition to the end goes to [0-9] check. During the compilation, after function xmlFAEliminateEpsilonTransitions is called, we have a state with transitions toward and backward by the same condition: [0-9]. If the string being tested is not matched because of repeatings number (the number of such blocks is not divisible by 3) then the ininite number of rollbacks occurs and the regexp execution is interrupted by the limit of rollbacks.

I (YG) added a new state after the end of the block and made transitions to it from the beginning and from the last state of the block. 

2. xmlFAGenerateTransitions
I (YG) have changed the order of transitions-by-the-counter insertions. Initially, the counter was set and after that the transition back occured. Then, there was a counter's test for min/max and the transition toward occured. Usually, after xmlFAEliminateEpsilonTransitions call, for simple expressions these transitions were swapped and everything worked. For more complex expressions the order of transitions was not swapped and the infinite loop occured.

3. xmlFARegExec
I (YG) added the counter's reset to zero when transition-by-the-condition occurs. Without the counter's reset doesn't allow correct test of the following expressions:
((abc){2})*
After transition back by "*", the counter was not reset and the regexp matching worked incorrectly.

4. xmlFAParseBranch & xmlFAParseRegExp
This is just a simple optimization in order to decrease the number of transitions in automata. For expressions like (a|b)(c|d) the branches after (a|b) was not joined, and for each path we had separated branch (c|d). It is not a bug, but increases the size of automata. 

Regards,
Youri.
Comment 26 Daniel Veillard 2006-03-21 23:22:03 UTC
Okay I went though the patch, makes sense, but far from trivial, very
good job ! Applied, I just changed the comment to integrate them better
and redocument the changed function. I also added tests coming from the 
bug report in the regressions. Could you just provide a non-trivial
expression which passes the full regexp provided in comment #5, you can
probably find one from an XML test case.
Commited this in CVS, this will be in the next release,

   Thanks a lot !

Daniel