GNOME Bugzilla – Bug 781569
After Python conversion, gtkdoc-scan runs for ages
Last modified: 2017-05-03 20:14:36 UTC
Since gtkdoc-scan was converted to Python, gtkdoc-scan runs forever gobbling 100% CPU usage when building WebKit. I bisected it to: e6289d068101fa46c62f81602055ef3505b8c852 is the first bad commit commit e6289d068101fa46c62f81602055ef3505b8c852 Author: Jussi Pakkanen <jpakkane@gmail.com> Date: Fri Mar 31 18:59:34 2017 +0300 scan: Convert to Python. https://bugzilla.gnome.org/show_bug.cgi?id=780789 :100755 100755 672545e11d62a2e2dc4f12bb68647a8867f67169 688929358b96f75102bf639ec3c78acb7306b12a M gtkdoc-scan.in :040000 040000 7ddbabc4e6e6a494c059d5342e2c625a899d5491 ebca9c1a704985c73965be3376d0b29da86bc458 M tests Note that you'll need to configure WebKit with either -DENABLE_GTKDOC=ON or -DENABLE_DEVELOPER_MODE=ON to reproduce this, as otherwise gtkdoc-scan will never be run.
This issue can also be reproduced by building GLib with --enable-gtk-doc. It takes about 100 minutes to build and install GLib in a JHBuild environment. Before the python conversion, it usually takes less than 10 minutes to complete the build. After adding a few lines of logging code to gtkdoc/scan.py, I found most time were spent on re.search calls. Most files were processed in an acceptable speed, but a few other files, such as glib/gvariant.h and glib/gvarianttype.h, took unusually long time to scan. Scanning glib/gvariant.h and glib/gvarianttype.h took 77 minutes and 9 minutes respectively.
The problem seems to be regex 17. Here's the output of pprofile: 328| 147| 0.00106382| 7.2369e-06| 0.00%| m17 = re.search( 329| 147| 0.00313592| 2.13328e-05| 0.00%| r'^\s*(?:\b(?:extern|G_INLINE_FUNC|%s)\s*)*((?:const\s+|G_CONST_RETURN\s+|signed\s+|unsigned\s+|long\s+|short\s+|struct\s+|union\s+|enum\s+)*\w+)((?:\s+|\*)+(?:\s*(?:\*+|\bconst\b|\bG_CONST_RETURN\b))*)\s*(_[A-Za-z]\w*)\s*\(' % options.ignore_decorators, line) (call)| 147| 330.959| 2.25142| 99.63%|# /usr/lib/python2.7/re.py:143 search This shows that running the regex takes two seconds per line. I tried precompiling the regex but it did not make any difference. Regex 18 (which is the same except for one underscore) is very fast. Could it be that this regex invokes some sort of backtracking issue in the Python regex implementation?
Minimum reproducer: import re, sys ignores = '' regex = re.compile(r'^\s*(?:\b(?:extern|G_INLINE_FUNC|%s)\s*)*((?:const\s+|G_CONST_RETURN\s+|signed\s+|unsigned\s+|long\s+|short\s+|struct\s+|union\s+|enum\s+)*\w+)((?:\s+|\*)+(?:\s*(?:\*+|\bconst\b|\bG_CONST_RETURN\b))*)\s*(_[A-Za-z]\w*)\s*\(' % ignores) input = 'gboolean g_variant_type_string_is_valid (const gchar *type_string);' l = re.search(regex, input) This is abysmally slow. Removing the underscore from the last group makes it fast. Interestingly so does changing the input to this: input = 'gboolean g_variant_type_string_is_valid (const gchar *type_string);' The only difference is removing consecutive spaces with one space. A quick test shows that adding this at the beginning of the loop (line 227ish) fixes the issue: line = line.replace(' ', ' ')
Jussi, thanks for investigating. Can you make a patch with the: line = line.replace(' ', ' ') adding a comment pointing to this bug? I need to find a structure to replace the current parsing loop with a pythonic way of only doing the regexp as needed. Probably putting each if-branch into a local function.
Created attachment 350269 [details] [review] Patch that prevents pathological regex behaviour
Hmm, we can't do this unfortunately, since it will break the outputs, e.g.: diff -u /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml.ref/tester.xml /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml/tester.xml --- /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml.ref/tester.xml 2016-09-19 16:04:09.210593647 +0200 +++ /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml/tester.xml 2017-04-23 21:14:25.639656039 +0200 @@ -530,7 +530,7 @@ <title>struct GtkdocAnnotation</title> <indexterm zone="GtkdocAnnotation"><primary>GtkdocAnnotation</primary></indexterm> <programlisting language="C">struct GtkdocAnnotation { - gpointer that; + gpointer that; }; </programlisting> <para>small struct</para> I'll post an alternative patch. WDYT?
Created attachment 350272 [details] [review] Alternative version of the patch
LGTM. I reported the regex issue upstream: https://bugs.python.org/issue30148
I've pushed this change with additional handling of the ignore-decorators. Michael and Ting-Wei, could you please retry and report? commit a03aab7519d612c2851052b7e1e9a9e7a0650600 (HEAD -> master, origin/master, origin/HEAD) Author: Jussi Pakkanen <jpakkane@gmail.com> Date: Sun Apr 23 21:48:21 2017 +0300 scan: Workaround for pathological regex parsing Replace consecutive spaces with one to avoid bad performance for a particullar regex. As pointed out in upstream python issue discusssion avoid matching empty strings. This happens when no --ignore-decorators are given. This halfs the time for the regex already. https://bugzilla.gnome.org/show_bug.cgi?id=781569
Unfortunately this does not fix the issue for WebKit.
It fixes the issue for GLib. Building GLib with --enable-gtk-doc takes less than 5 minutes now.
Thanks for retrying! We must have another bad case there. I'll submit a similar workaround for regexp 18 later tonight and I'll try to collect a log of line that match there and try tweaking (aka fixing) the regexp. Will leave the ticket open for this.
I pushed another fix: https://git.gnome.org/browse/gtk-doc/commit/?id=6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2 commit 6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2 Author: Stefan Sauer <ensonic@users.sf.net> Date: Tue Apr 25 22:51:35 2017 +0200 scan: replace workaround with proper fix Replace the '(?:\s+|\*)+' with '[ *]+' since the former is slow and overly complicated. https://bugzilla.gnome.org/show_bug.cgi?id=781569 I hope that this fixes it also for webkit.
(In reply to Stefan Sauer (gstreamer, gtkdoc dev) from comment #13) > I pushed another fix: > https://git.gnome.org/browse/gtk-doc/commit/ > ?id=6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2 > > commit 6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2 > Author: Stefan Sauer <ensonic@users.sf.net> > Date: Tue Apr 25 22:51:35 2017 +0200 > > scan: replace workaround with proper fix > > Replace the '(?:\s+|\*)+' with '[ *]+' since the former is slow and > overly > complicated. > > https://bugzilla.gnome.org/show_bug.cgi?id=781569 > > I hope that this fixes it also for webkit. Whoops, I missed this comment. I tried master now and it seems to work fine, so I bet this is fixed. Thanks!