GNOME Bugzilla – Bug 316338
testRegexp hangs on correct regular expression and data
Last modified: 2017-05-23 21:05:21 UTC
Please describe the problem: xmlRegexpExec function hangs on correct regular expression and data. Regular expression is as follows: (((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+ String value (data) is as follows: C 433.12 433.12 433.12 724.48 428.96 723.68 433.12 433.12 433.12 724.48 428.96 723.68 It looks like an infinite loop somewhere inside the xmlRegexpExec function. But if we reduce the data being matched, then testRegexp works about 2 minutes and then exists with correct result: C 1 2 3 444.44 555.55 776.111 Steps to reproduce: 1. Run the following command from the libxml2 'bin' sub-directory: testRegexp.exe "(((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433.12 433.12 433.12 724.48 428.96 723.68 433.12 433.12 433.12 724.48 428.96 723.68" --debug Actual results: testRegexp hangs. I was running it during several hours and then interrupted manually. Expected results: It should return and print result. Does this happen every time? Yes. Other information: My system is P4 2.5GHz, RAM 640Mb, Windows XP SP2, VC6 SP6
Created attachment 52244 [details] A Windows batch file to reproduce the problem.
Where did you get that regexp from ? It is *highly* indeterminist, because you use optional separators [\s]* instead of mandatory separator [\s]+, so basically in a lot of places where a digit is encountered it could be part of number N or number N+1, and as you pile up digits the combinatory possibilities explodes. I'm not sure this can be considered as a bug. Replacing the first [\s]* to an [\s]+, i.e. forcing at least one space as separator breaks the indeterminism and validation is instantaneous paphio:~/XML -> time ./testRegexp "(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433.12 433.12 433.12 724.48 428.96 723.68 433.12 433.12 433.12 724.48 428.96 723.68" Testing (((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+: C 433.12 433.12 433.12 724.48 428.96 723.68 433.12 433.12 433.12 724.48 428.96 723.68: Ok real 0m0.006s user 0m0.003s sys 0m0.003s paphio:~/XML -> Daniel
Hi Daniel, Thanks for a VERY quick response, as usual :) This regular expression is used in the XML schema of one of our customer as a pattern restriction for the string type. He said, this schema is from Microsoft. I will try to figure out exactly, what is the schema and how did he get it. I will also forward him your kind reply. I'll let you know shortly. Sincerely, Artyom Bolgar.
The best I can do in the short term is make this fail with an internal error to avoid potential DoS, I have a patch ready to this intend but not yet commited. Maybe the way the regexp is compiled or interpreted could be enhanced too, but it's IMHO broken anyway, the regexp is not testing what you/they expect ! paphio:~/XML -> ./testRegexp "(((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 433" Testing (((C|c)(([\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+: C 433: Ok paphio:~/XML -> there is 3 digit, the spaces being optional, the cardinality {3} is considered okay, same for "C 433 12" or nearly any combination (and hence the explosion). Someone didn't do regression tests or Microsoft has a bug. I won't close it as not a bug since I want to at least detect combinatory explosion at runtime and return an internal error if found. Daniel
Hi Daniel, Actually, the original regexp is much more complicated, I just tried to reduce it for easier understanding of the problem. Here is the regexp which is more close to original one: (((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+ I have attached A7.BAT file with this regexp. As you can see, there is a comma in the repeating part, so, actually three pairs of numbers are expected. Also, the hang goes away if I modify the regexp by the following way: (((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+ See A7B.BAT file attached. The original regexp is very complicated: (((M|m)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)[\s]*)|((L|l)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((H|h)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((V|v)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*)|((Q|q)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){2})+[\s]*)|((S|s)(([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){2})+[\s]*)|((A|a)([\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]+[0-1][\s]+[0-1][\s]+\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?)+[\s]*)|((Z|z)[\s]*))*
Created attachment 52274 [details] A7.BAT test file
Created attachment 52275 [details] A7B.BAT test file
Windows batch files are useless, I don't even have a windows box around. I'm lost with what you are reporting at this point, I will have to focuse on this at some point. The version in CVS fails with a -1 internal error due to the combinatory explosion since I commited my patch against it. Whether there is a bug in the regexp engine resulting in this is unclear. Daniel
Hi Daniel, Sorry about Windows batch files, I will submit shell scripts instead (A7.SH and A7B.SH). I have synchronized Libxml2 with the CVS. Currently, A7.SH reports -1 (as you said), and A7B.SH works just fine. A7.SH looks like this: ./testRegexp.exe "(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+" "C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12" --debug A7B.SH: ./testRegexp.exe "(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+" "C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12" --debug I also made a Perl script with the regexp and data from the A7.SH (which causes error -1 in current version of Libxml2): $data = "C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12"; if ($data =~ /^(((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?((e|E)\-?[0-9]+)?){3})+[\s]*))+$/) { print "OK\n"; } else { print "NOK\n"; } It works just fine.
Created attachment 52279 [details] A7.SH test file (replacement of A7.BAT one)
Created attachment 52280 [details] A7B.SH test file (replacement of A7B.BAT one)
Created attachment 52281 [details] A7.PL test Perl script with the regexp (works fine!)
Created attachment 52285 [details] This is rhe Microsoft XML schema I mentioned with the complicated regexp See ST_AbbrGeom type.
Created attachment 52286 [details] This is the XML file for the schema
Okay, thanks I will look at this when I have some time, Daniel
I observed that the test "A7B.sh" takes 0.025s (real) with the current revision of xmlregexp.c. Daniel fixed a bug related to the following mail-thread in the xmlregexp module: http://mail.gnome.org/archives/xml/2005-September/msg00115.html So it appears that this also fixed your problem. Could you give us feedback? kbu@librax:/data/home/kbuchcik/gnomecvs/libxml2$ time A7B.sh Testing (((C|c)(([\s]+\-?[0-9]+(\.[0-9]+)?[\s]*,[\s]*\-?[0-9]+(\.[0-9]+)?){3})+[\s]*))+: C 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12 724.64,433.12 718.08,433.12 711.52,433.12: Ok real 0m0.025s user 0m0.011s sys 0m0.011s
Thanks for the reply! Though, the main issue is "A7.sh" script. "A7B.sh" worked fine. "A7.sh" does not hang anymore; though, it still returns "-1" instead of real result (I just have sync-ed with CVS). PERL script with the same regular expression ("A7.pl") works just fine and very fast. Artyom.
Ah, so a part of the problem still exists. Unfortunately I'm not into the regex module, thus I'm only able to make strange observations :-) AFAIK Daniel is currently away, so being patient until he returns seems to be the only option currently.
Hi Daniel! Did you have a chance to look on this problem?
I tried to fix problems in various ways but could not solve it. CVS should be a bit better but there is something broken which escapes me and I hadn't had nuch time lately to work on it. So yes and no ... Daniel
Maybe this bug is related to the following discovery. If using the testRegExp program in 2.6.23 one will get: Testing .*test.*: xxxtestxxx: Fail Testing .*.test.*: xxxtestxxx: Ok This behaviour is not as expected I would say. Seems that expressions where a character follows optional chars always fail.
Jan, I don't know if this bug is related to your discovery, but your discovery is related to bug #327167.
Created attachment 61646 [details] [review] Patch for Bug 316338 for Libxml 2.6.23 Patch for Bug 316338 for Libxml 2.6.23 made by Objective Systems.
Thanks, but could you explain what it does ? It is relatively hard to evalutate the changes just looking at the diff even with the inner comments. Daniel
Hi Daniel, Here are some explanation. Note, the author of this fix is Youri Golovanov <yourigt@gmail.com>, I am just translating from Russian to English: 1. xmlFAGenerateTransitions For example, there is a regexp containing the following fragment: ((e|E)\\-?[0-9]+)?){3}. For the second '?' there is the state with two transitions to the end and to the begining of this sub-expression. The transition to the end goes to [0-9] check. During the compilation, after function xmlFAEliminateEpsilonTransitions is called, we have a state with transitions toward and backward by the same condition: [0-9]. If the string being tested is not matched because of repeatings number (the number of such blocks is not divisible by 3) then the ininite number of rollbacks occurs and the regexp execution is interrupted by the limit of rollbacks. I (YG) added a new state after the end of the block and made transitions to it from the beginning and from the last state of the block. 2. xmlFAGenerateTransitions I (YG) have changed the order of transitions-by-the-counter insertions. Initially, the counter was set and after that the transition back occured. Then, there was a counter's test for min/max and the transition toward occured. Usually, after xmlFAEliminateEpsilonTransitions call, for simple expressions these transitions were swapped and everything worked. For more complex expressions the order of transitions was not swapped and the infinite loop occured. 3. xmlFARegExec I (YG) added the counter's reset to zero when transition-by-the-condition occurs. Without the counter's reset doesn't allow correct test of the following expressions: ((abc){2})* After transition back by "*", the counter was not reset and the regexp matching worked incorrectly. 4. xmlFAParseBranch & xmlFAParseRegExp This is just a simple optimization in order to decrease the number of transitions in automata. For expressions like (a|b)(c|d) the branches after (a|b) was not joined, and for each path we had separated branch (c|d). It is not a bug, but increases the size of automata. Regards, Youri.
Okay I went though the patch, makes sense, but far from trivial, very good job ! Applied, I just changed the comment to integrate them better and redocument the changed function. I also added tests coming from the bug report in the regressions. Could you just provide a non-trivial expression which passes the full regexp provided in comment #5, you can probably find one from an XML test case. Commited this in CVS, this will be in the next release, Thanks a lot ! Daniel