After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 781569 - After Python conversion, gtkdoc-scan runs for ages
After Python conversion, gtkdoc-scan runs for ages
Status: RESOLVED FIXED
Product: gtk-doc
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: gtk-doc maintainers
gtk-doc maintainers
Depends on:
Blocks:
 
 
Reported: 2017-04-21 04:43 UTC by Michael Catanzaro
Modified: 2017-05-03 20:14 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Patch that prevents pathological regex behaviour (1.10 KB, patch)
2017-04-23 18:49 UTC, Jussi Pakkanen
none Details | Review
Alternative version of the patch (2.83 KB, patch)
2017-04-23 19:23 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
committed Details | Review

Description Michael Catanzaro 2017-04-21 04:43:20 UTC
Since gtkdoc-scan was converted to Python, gtkdoc-scan runs forever gobbling 100% CPU usage when building WebKit. I bisected it to:

e6289d068101fa46c62f81602055ef3505b8c852 is the first bad commit
commit e6289d068101fa46c62f81602055ef3505b8c852
Author: Jussi Pakkanen <jpakkane@gmail.com>
Date:   Fri Mar 31 18:59:34 2017 +0300

    scan: Convert to Python.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=780789

:100755 100755 672545e11d62a2e2dc4f12bb68647a8867f67169 688929358b96f75102bf639ec3c78acb7306b12a M	gtkdoc-scan.in
:040000 040000 7ddbabc4e6e6a494c059d5342e2c625a899d5491 ebca9c1a704985c73965be3376d0b29da86bc458 M	tests

Note that you'll need to configure WebKit with either -DENABLE_GTKDOC=ON or -DENABLE_DEVELOPER_MODE=ON to reproduce this, as otherwise gtkdoc-scan will never be run.
Comment 1 Ting-Wei Lan 2017-04-22 12:33:24 UTC
This issue can also be reproduced by building GLib with --enable-gtk-doc. It takes about 100 minutes to build and install GLib in a JHBuild environment. Before the python conversion, it usually takes less than 10 minutes to complete the build.

After adding a few lines of logging code to gtkdoc/scan.py, I found most time were spent on re.search calls. Most files were processed in an acceptable speed, but a few other files, such as glib/gvariant.h and glib/gvarianttype.h, took unusually long time to scan. Scanning glib/gvariant.h and glib/gvarianttype.h took 77 minutes and 9 minutes respectively.
Comment 2 Jussi Pakkanen 2017-04-22 18:02:44 UTC
The problem seems to be regex 17. Here's the output of pprofile:

   328|       147|   0.00106382|   7.2369e-06|  0.00%|            m17 = re.search(
   329|       147|   0.00313592|  2.13328e-05|  0.00%|                r'^\s*(?:\b(?:extern|G_INLINE_FUNC|%s)\s*)*((?:const\s+|G_CONST_RETURN\s+|signed\s+|unsigned\s+|long\s+|short\s+|struct\s+|union\s+|enum\s+)*\w+)((?:\s+|\*)+(?:\s*(?:\*+|\bconst\b|\bG_CONST_RETURN\b))*)\s*(_[A-Za-z]\w*)\s*\(' % options.ignore_decorators, line)
(call)|       147|      330.959|      2.25142| 99.63%|# /usr/lib/python2.7/re.py:143 search

This shows that running the regex takes two seconds per line. I tried precompiling the regex but it did not make any difference. Regex 18 (which is the same except for one underscore) is very fast. Could it be that this regex invokes some sort of backtracking issue in the Python regex implementation?
Comment 3 Jussi Pakkanen 2017-04-22 18:44:38 UTC
Minimum reproducer:

import re, sys

ignores = ''

regex = re.compile(r'^\s*(?:\b(?:extern|G_INLINE_FUNC|%s)\s*)*((?:const\s+|G_CONST_RETURN\s+|signed\s+|unsigned\s+|long\s+|short\s+|struct\s+|union\s+|enum\s+)*\w+)((?:\s+|\*)+(?:\s*(?:\*+|\bconst\b|\bG_CONST_RETURN\b))*)\s*(_[A-Za-z]\w*)\s*\(' % ignores)

input = 'gboolean                        g_variant_type_string_is_valid          (const gchar         *type_string);'

l = re.search(regex, input)

This is abysmally slow. Removing the underscore from the last group makes it fast. Interestingly so does changing the input to this:

input = 'gboolean g_variant_type_string_is_valid (const gchar *type_string);'

The only difference is removing consecutive spaces with one space. A quick test shows that adding this at the beginning of the loop (line 227ish) fixes the issue:

line = line.replace('  ', ' ')
Comment 4 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-23 16:18:10 UTC
Jussi, thanks for investigating. Can you make a patch with the:
line = line.replace('  ', ' ')
adding a comment pointing to this bug? I need to find a structure to replace the current parsing loop with a pythonic way of only doing the regexp as needed. Probably putting each if-branch into a local function.
Comment 5 Jussi Pakkanen 2017-04-23 18:49:31 UTC
Created attachment 350269 [details] [review]
Patch that prevents pathological regex behaviour
Comment 6 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-23 19:22:05 UTC
Hmm, we can't do this unfortunately, since it will break the outputs, e.g.:


diff -u /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml.ref/tester.xml /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml/tester.xml
--- /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml.ref/tester.xml	2016-09-19 16:04:09.210593647 +0200
+++ /home/ensonic/projects/gnome/gtk-doc/tests/annotations/docs/xml/tester.xml	2017-04-23 21:14:25.639656039 +0200
@@ -530,7 +530,7 @@
 <title>struct GtkdocAnnotation</title>
 <indexterm zone="GtkdocAnnotation"><primary>GtkdocAnnotation</primary></indexterm>
 <programlisting language="C">struct GtkdocAnnotation {
-  gpointer that;
+ gpointer that;
 };
 </programlisting>
 <para>small struct</para>

I'll post an alternative patch. WDYT?
Comment 7 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-23 19:23:04 UTC
Created attachment 350272 [details] [review]
Alternative version of the patch
Comment 8 Jussi Pakkanen 2017-04-23 19:28:49 UTC
LGTM. I reported the regex issue upstream: https://bugs.python.org/issue30148
Comment 9 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-24 17:38:40 UTC
I've pushed this change with additional handling of the ignore-decorators. Michael and Ting-Wei, could you please retry and report?

commit a03aab7519d612c2851052b7e1e9a9e7a0650600 (HEAD -> master, origin/master, origin/HEAD)
Author: Jussi Pakkanen <jpakkane@gmail.com>
Date:   Sun Apr 23 21:48:21 2017 +0300

    scan:  Workaround for pathological regex parsing
    
    Replace consecutive spaces with one to avoid bad performance for a particullar regex.
    
    As pointed out in upstream python issue discusssion avoid matching empty strings.
    This happens when no --ignore-decorators are given. This halfs the time for the regex
    already.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=781569
Comment 10 Michael Catanzaro 2017-04-25 00:21:34 UTC
Unfortunately this does not fix the issue for WebKit.
Comment 11 Ting-Wei Lan 2017-04-25 11:08:14 UTC
It fixes the issue for GLib. Building GLib with --enable-gtk-doc takes less than 5 minutes now.
Comment 12 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-25 17:11:58 UTC
Thanks for retrying!

We must have another bad case there. I'll submit a similar workaround for regexp 18 later tonight and I'll try to collect a log of line that match there and try tweaking (aka fixing) the regexp.

Will leave the ticket open for this.
Comment 13 Stefan Sauer (gstreamer, gtkdoc dev) 2017-04-25 20:56:38 UTC
I pushed another fix:
https://git.gnome.org/browse/gtk-doc/commit/?id=6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2

commit 6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2
Author: Stefan Sauer <ensonic@users.sf.net>
Date:   Tue Apr 25 22:51:35 2017 +0200

    scan: replace workaround with proper fix
    
    Replace the '(?:\s+|\*)+' with '[ *]+' since the former is slow and overly
    complicated.
    
    https://bugzilla.gnome.org/show_bug.cgi?id=781569

I hope that this fixes it also for webkit.
Comment 14 Michael Catanzaro 2017-05-03 20:14:36 UTC
(In reply to Stefan Sauer (gstreamer, gtkdoc dev) from comment #13)
> I pushed another fix:
> https://git.gnome.org/browse/gtk-doc/commit/
> ?id=6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2
> 
> commit 6f3a0b74712c4ebb556ad0d884ff8cce02a8d3b2
> Author: Stefan Sauer <ensonic@users.sf.net>
> Date:   Tue Apr 25 22:51:35 2017 +0200
> 
>     scan: replace workaround with proper fix
>     
>     Replace the '(?:\s+|\*)+' with '[ *]+' since the former is slow and
> overly
>     complicated.
>     
>     https://bugzilla.gnome.org/show_bug.cgi?id=781569
> 
> I hope that this fixes it also for webkit.

Whoops, I missed this comment. I tried master now and it seems to work fine, so I bet this is fixed. Thanks!