Bug 311857 - xsltproc very slow generating index for gtk-docs.sgml
Status: RESOLVED FIXED
Product: gtk-doc
Classification: Platform
Component: general
Version: 0.7
Hardware: Other Linux
Importance: Normal minor
Target Milestone: 1.11
Assigned To: gtk-doc maintainers
QA Contact: gtk-doc maintainers
Depends on:
Blocks:
 
 
Reported: 2005-07-28 12:38 UTC by Ed Catmur
Modified: 2008-10-03 13:21 UTC
See Also:
GNOME target: ---
GNOME version: 2.11/2.12


Attachments
oprofile report (12.81 KB, text/plain)
2005-08-24 16:39 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
oprofile report (4.80 KB, text/plain)
2005-08-24 16:41 UTC, Stefan Sauer (gstreamer, gtkdoc dev)
generated api and deprecated index (4.49 KB, patch)
2006-03-29 08:38 UTC, Stefan Sauer (gstreamer, gtkdoc dev)

Description Ed Catmur 2005-07-28 12:38:04 UTC
In gtk+-2.7.3/docs/reference:

$ make
...
*** Building HTML ***
rm -rf ./html
mkdir ./html
cd ./html && gtkdoc-mkhtml gtk ../gtk-docs.sgml
Computing chunks...
...
Writing glossary.html for glossary(glossary)

The docs make process then hangs on 100% CPU:
$ ps -C xsltproc -f
UID        PID  PPID  C STIME TTY          TIME CMD
root     11324 11304 53 12:32 pts/1    00:31:53 /usr/bin/xsltproc --nonet --xinc

I can leave it for hours without returning.

glossary.html has been generated (and looks OK) so I think it is hanging trying
to generate the index:

=== snip gtk-docs.sgml ===
  <part>
    <title>GTK+ Tools</title>

     &gtk-query-immodules;
     &gtk-update-icon-cache;
  </part>

  &gtk-glossary;

  <index>
    <title>Index</title>
  </index>
  <index role="deprecated">
    <title>Index of deprecated symbols</title>
  </index>
=== snip gtk-docs.sgml ===

If I attach gdb and repeatedly issue "finish" it seems it is failing to return
from this stack frame:

  • #0 xsltApplyTemplates
    from /usr/lib/libxslt.so.1
  • #1 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #2 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #3 xsltEvalVariable
    from /usr/lib/libxslt.so.1
  • #4 xsltRegisterVariable
    from /usr/lib/libxslt.so.1
  • #5 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #6 xsltCallTemplate
    from /usr/lib/libxslt.so.1
  • #7 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #8 xsltChoose
    from /usr/lib/libxslt.so.1
  • #9 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #10 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #11 xsltProcessOneNode
    from /usr/lib/libxslt.so.1
  • #12 xsltApplyTemplates
    from /usr/lib/libxslt.so.1
  • #13 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #14 xsltProcessOneNode
    from /usr/lib/libxslt.so.1
  • #15 xsltProcessOneNode
    from /usr/lib/libxslt.so.1
  • #16 xsltApplyTemplates
    from /usr/lib/libxslt.so.1
  • #17 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #18 xsltIf
    from /usr/lib/libxslt.so.1
  • #19 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #20 xsltChoose
    from /usr/lib/libxslt.so.1
  • #21 xsltApplyOneTemplateInt
    from /usr/lib/libxslt.so.1
  • #22 xsltProcessOneNode
    from /usr/lib/libxslt.so.1
  • #23 xsltApplyStylesheetInternal
    from /usr/lib/libxslt.so.1
  • #24 xsltProcess
  • #25 main

I can recompile libxslt with debugging support if that will help.
Comment 1 Ed Catmur 2005-07-28 12:57:03 UTC
Oops.

After cumulative user time reaches ~40 mins it starts outputting index
files, at a rate of about one every 10 minutes. (CPU is rated at 2916.35 bogomips)

It looks like it is going to finish if left long enough - but this is still
desperately slow and would make many people think it was in an infinite loop.
Comment 2 Daniel Veillard 2005-07-28 13:13:49 UTC
Maybe the XSLT stylesheets are too complex or not optimized at all.
Did you try to profile them on a smaller example?

  See xsltproc --profile and xsltproc -v options.

It is unlikely I will have time to debug the stylesheets used by gtk-docs.
If this can be pointed at a specific problem in libxslt then reassign the 
bug back to it, but it's a bit like stating that a C compiler has a bug
because running a program with deeply nested loops is slow at runtime.
The first analysis really should be done by gtk-doc maintainers.

Daniel
Comment 3 Damon Chaplin 2005-07-28 15:04:08 UTC
I've just built the GTK+ reference docs from cvs and it took ~17 minutes.
(This is on an Athlon 2500+.)
But it did seem to use an enormous amount of memory. top was reporting ~140MB
resident. I don't know if that is correct.

Maybe your machine is low on memory and possibly thrashing the swap space.
(I have 512MB so it doesn't cause problems for me.)
Comment 4 Stefan Sauer (gstreamer, gtkdoc dev) 2005-08-24 16:34:40 UTC
The glossary and index generation is very slow here too. I've run the
doc generation for gstreamer under oprofile (see attachments).
For gtk-doc it takes about one hour on a Pentium III 1.1 GHz with 512MB memory.
Comment 5 Stefan Sauer (gstreamer, gtkdoc dev) 2005-08-24 16:39:29 UTC
Created attachment 51271 [details]
oprofile report

xmlXPathNodeCollectAndTest
  was at 25% during the run
xmlXPathCompOpEval
  climbed from about 6% to 10% during index generation
Comment 6 Stefan Sauer (gstreamer, gtkdoc dev) 2005-08-24 16:41:14 UTC
Created attachment 51272 [details]
oprofile report

xsltNumberFormatGetAnyLevel
  was 25% at maximum
xsltXPathVariableLookup, xsltStackLookup
  were used more often during index generation ( 8% -> 12 %)
Comment 7 Stefan Sauer (gstreamer, gtkdoc dev) 2006-01-17 07:49:48 UTC
Glynn did some profiling using DTrace:
http://www.gnome.org/~gman/blog/16012006-2
http://www.gnome.org/~gman/output.txt
http://www.gnome.org/~gman/output2.txt

@daniel: is that of any help to you?
Comment 8 Glynn Foster 2006-01-17 09:24:47 UTC
The profiling I did was pretty rough - really only just investigating whether it was still doing something useful, or just going around in a loop, or just hanging. I can probably dip in a little more once I know what to look for. Right now, I don't.
Comment 9 Stefan Sauer (gstreamer, gtkdoc dev) 2006-03-28 13:13:09 UTC
The slow index generation is due to the fact that gtk-doc is not producing an index. doc-writers usually put in an empty <index> tag. This causes the docbook style sheets to generate the index, which is what takes so long.

The idea here is to add index generation to gtk-doc. The doc-writer then can include the generated indexentry list(s).

As an advantage we could also add trimming of prefixes ('g_' for glib and 'gtk_' for gtk+). This would make the index more useful :)

@damon: would you put that into gtkdoc-mkdb or into a separate gtkdoc-mkindex?
Comment 10 Matthias Clasen 2006-03-28 15:18:09 UTC
I would favour speedups which don't force me to change the documents.
Being able to add a <index> and have formatting system generate it
based on the contents of the document is good, since it makes my document
less redundant.
Comment 11 Stefan Sauer (gstreamer, gtkdoc dev) 2006-03-28 16:19:55 UTC
Matthias, but it's pure laziness that we let docbook.xsl do it. IMHO docbook should generate the formatting. I've got a first implementation locally, and the gstreamer core docs build more than twice as fast with the change.

I don't see a practical way to speed it up on the xsl level (other than a custom exslt function to replace the 'generate-index' template).

All that changes for developers is that

<index>
  <title>API Index</title>
</index>

becomes

<index>
  <title>API Index</title>
  &mylib_api_index;
</index>
Comment 12 Matthias Clasen 2006-03-28 16:40:15 UTC
It is still conceptually wrong, in my opinion. Or at least, it's different.
XSL is a transformation language. Not just a style language. It is very
appropriate to let it transform the implicit index into an explicit one. 

Now, you may be right that it is hard to optimize this. But then, twice
as fast is not very impressive, imo.
Comment 13 Stefan Sauer (gstreamer, gtkdoc dev) 2006-03-29 06:10:26 UTC
Twice as fast for the whole doc generation; the index generation as such drops from 4 minutes to a couple of seconds (5 or so). I'll post a first patch later on.

Please remember that we generate the xml, so we are free to generate as much as we can. Besides, we can generate a more useful index (by stripping the library prefix) and also e.g. add the short information to the index.

Finally, no one is forced to use this kind of index generation. Just leave it as it is and you're set.

Let's see what others think.
Comment 14 Stefan Sauer (gstreamer, gtkdoc dev) 2006-03-29 08:38:22 UTC
Created attachment 62273 [details] [review]
generated api and deprecated index

To use the generated indexentry list, XXX-docs.sgml needs the following change:

  <index id="api-index">
    <title>Index</title>
    <xi:include href="xml/index.sgml"/>
  </index>
  <index id="deprecated-index">
    <title>Index of deprecated symbols</title>
    <xi:include href="xml/deprecated_index.sgml"/>
  </index>

For gstreamer generating the docs after make clean drops from 9:15 minutes to 3:22 here.
Comment 15 Damon Chaplin 2006-03-29 11:29:55 UTC
I think generating the Deprecated and Since indexes in gtk-doc is probably OK.

I'm a bit hesitant to generate the main index ourselves though. The developers
may be adding index terms in the external documentation, which would be lost. Though maybe we could make this optional.

I'd still like to find out why it seems to be much slower for some people.
Stefan: what happened to make it speed up from 1 hour to 9 minutes for you?

I don't know much about the xsl. Is it our xsl code that is slow, or the
DocBook stylesheets? Has anyone pinpointed which piece of code is the problem?
Comment 16 Stefan Sauer (gstreamer, gtkdoc dev) 2006-03-29 13:14:11 UTC
Good point, Damon: if one manually adds <indexterm>s in handwritten xml parts, they are not processed by gtk-doc.

Regarding the slowness, it's not our code. The code in question is the 'generate-index' template in
/usr/share/sgml/docbook/xsl-stylesheets-1.69.1/html/autoidx.xsl

I've run the xsltproc call from gtkdoc-mkhtml standalone and added --profile. This shows statistics at the end. From these stats one sees that the generate-index template eats most of the processing time. I also spent a whole morning profiling the respective counterparts in libxml2/libxslt using oprofile and sysprof. It boils down to two functions. These functions don't have obvious optimization potential. It's just the sheer number of calls that adds up (for the gstreamer docs it's about 12000 calls).
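
For anyone who wants to repeat the standalone profiling run described above, here is a minimal Python sketch. It only assumes the xsltproc options already mentioned in this report (--profile, -v, --nonet, --xinclude). The helper names and the stylesheet and document paths are hypothetical; take the real ones from your gtkdoc-mkhtml invocation. As far as I know xsltproc prints the profile table on stderr, so that is what the sketch captures.

```python
import subprocess


def build_profile_command(stylesheet, document, verbose=False):
    """Return an xsltproc argv that asks for per-template profiling."""
    cmd = ["xsltproc", "--profile", "--nonet", "--xinclude"]
    if verbose:
        # -v traces each processing step; the output gets very large.
        cmd.append("-v")
    cmd += [stylesheet, document]
    return cmd


def run_profile(stylesheet, document):
    """Run xsltproc and return whatever it printed on stderr.

    Assumption: the profiling statistics are written to stderr.
    Requires xsltproc to be installed and on PATH.
    """
    result = subprocess.run(
        build_profile_command(stylesheet, document),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    return result.stderr
```

Calling run_profile("gtk-doc.xsl", "gtk-docs.sgml") from the html directory (both file names are placeholders) should return the statistics in which generate-index can be looked up.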

My suggestion is to generate all three kinds of index files. Whenever we first generate a XXX-docs.sgml file we add a comment at the bottom that shows how to activate the index. For existing docs it's up to the maintainer to activate them. If desired, I can also add switches to gtkdoc-mkdb to suppress index generation (--disable-api-index, --disable-deprecated-index, --disable-since-index).

Finally, I'd like to add an option to gtkdoc-mkdb, --index-prefix="gtk", which would cause the index generation to drop the "gtk_" prefix from symbol names. This makes the section divisions a lot more useful. What do you think about that?
Comment 17 Matthias Clasen 2006-03-29 14:13:22 UTC
a) I don't want to appear obstructionist. I just think that it is a failure of
the xsl tools and stylesheets if they can't handle this, while perl seems to do
fine. It's not as if perl would have to do less work to generate the index.

b) Regarding the prefix stripping, I think the symbols should appear with their
full name in the index, but we can probably ignore the prefix when sorting them,
to distribute the entries over the alphabet, rather than only populating the letter G...

c) I think handling manual indexterms is necessary, and maybe not too hard.
Before the docbook stylesheets had xsl templates for index generation, there
was a perl script to do it...
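
Point b) can be sketched roughly as follows. This is an illustration in Python, not gtk-doc code (gtk-doc itself is perl), and the names index_key and group_by_letter are made up for the example. The prefix is ignored only when sorting and when choosing the letter, while the full symbol name is what ends up in the index.

```python
from collections import defaultdict


def index_key(symbol, prefix):
    """Sort key: the symbol without its library prefix, lower-cased."""
    stripped = symbol[len(prefix):] if symbol.startswith(prefix) else symbol
    return stripped.lower()


def group_by_letter(symbols, prefix):
    """Map each index letter to the full symbol names filed under it."""
    groups = defaultdict(list)
    for symbol in sorted(symbols, key=lambda s: index_key(s, prefix)):
        letter = index_key(symbol, prefix)[:1].upper()
        groups[letter].append(symbol)
    return dict(groups)


symbols = ["gtk_widget_show", "gtk_button_new", "gtk_window_new", "gtk_box_pack_start"]
groups = group_by_letter(symbols, "gtk_")
for letter, names in groups.items():
    print(letter, names)
```

With the four sample symbols the entries land under B and W instead of all under G. Type names such as GtkWidget would need a second prefix ("Gtk"), which this sketch does not handle.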
Comment 18 Matthias Clasen 2006-03-29 14:19:18 UTC
Oh, another thing: I believe the filename index.sgml is already used by gtk-doc for something else.
Comment 19 Damon Chaplin 2006-03-30 14:39:04 UTC
I've been looking at the DocBook stylesheets and this seems to be one of the
bottlenecks (in autoidx.xsl):

  <xsl:variable name="terms"
      select="//indexterm[count(.|key('letter',
                                      translate(substring(&primary;, 1, 1),
                                                &lowercase;,
                                                &uppercase;))[&scope;][1]) = 1
                          and not(@class = 'endofrange')]"/>

It is very complicated, and with no comments. My guess is that the stylesheets
are pretty inefficient at present, and maybe we should do our own index in xsl.
Comment 20 Matthias Clasen 2006-03-30 14:52:53 UTC
I think the problem with optimizing things like this is that the xsl processor would have to be smart enough to keep the results of expressions like 

   count(.|key('letter',
                                      translate(substring(&primary;, 1, 1),
                                                &lowercase;,
                                                &uppercase;))[&scope;][1]

(and possibly subexpressions thereof) around after computing them once, instead
of recomputing them over and over.
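
For readers not used to this idiom: count(.|key(...)[1]) = 1 is true only when the current indexterm is itself the first node in its letter group, so the variable selects one representative indexterm per letter. The caching argument can be illustrated with a rough Python analogy. This is not a model of how libxslt evaluates the expression. Both function names are hypothetical, and plain strings stand in for indexterm nodes.

```python
def first_per_letter_rescan(terms):
    """For every term, search the whole list again for the first term
    that shares its letter, and keep the term only if it is that one.
    Nothing is remembered between iterations, so the work grows
    quadratically with the number of terms."""
    firsts = []
    for i, term in enumerate(terms):
        letter = term[:1].upper()
        first_index = next(
            j for j, other in enumerate(terms) if other[:1].upper() == letter
        )
        if first_index == i:
            firsts.append(term)
    return firsts


def first_per_letter_cached(terms):
    """Same result in a single pass: remember the first term per letter."""
    first_by_letter = {}
    for term in terms:
        first_by_letter.setdefault(term[:1].upper(), term)
    return list(first_by_letter.values())


terms = ["alpha", "beta", "Apple", "gamma", "bravo"]
print(first_per_letter_rescan(terms))
print(first_per_letter_cached(terms))
```

Both functions return the same list; only the second keeps intermediate results around, which is the kind of behaviour asked of the xsl processor above.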
Comment 21 Stefan Sauer (gstreamer, gtkdoc dev) 2006-05-09 18:36:12 UTC
Tried to get some feedback from the xml and docbook xsl side:

http://sourceforge.net/tracker/index.php?func=detail&aid=1484912&group_id=21935&atid=373747
http://mail.gnome.org/archives/xml/2006-May/msg00040.html
Comment 22 Stefan Sauer (gstreamer, gtkdoc dev) 2007-07-11 07:57:55 UTC
find . -name "*.sgml" -exec grep -l "indexterm" {} \;

yielded nothing for gtk+, gstreamer. In glib I just found:

./reference/glib/running.sgml:223:<indexterm><primary>g_trap_free_size</primary></indexterm>
./reference/glib/running.sgml:224:<indexterm><primary>g_trap_realloc_size</primary></indexterm>
./reference/glib/running.sgml:225:<indexterm><primary>g_trap_malloc_size</primary></indexterm>

which is used to add this section to the index.

On the xml side it does not look like it can easily be accelerated.
Comment 23 Stefan Sauer (gstreamer, gtkdoc dev) 2007-07-15 13:49:07 UTC
Forget about comment #22. What the attached patch is doing is to create:
<indexentry><primaryie>$xref</primaryie></indexentry>
lines. Linking to them works. What needs to be done if the current patch gets applied is to add manually defined <indexterm>s there.

I guess that is fair. I found no custom ones so far.
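
To make the shape of the generated lines concrete, here is a small Python sketch. It is not the attached patch. The function names are hypothetical, and it assumes that $xref expands to a DocBook <link> whose linkend is the symbol name with underscores replaced by dashes.

```python
from xml.sax.saxutils import escape, quoteattr


def make_index_entry(symbol, link_id):
    """Build one indexentry line that links to the symbol's documentation."""
    xref = "<link linkend=%s>%s</link>" % (quoteattr(link_id), escape(symbol))
    return "<indexentry><primaryie>%s</primaryie></indexentry>" % xref


def make_index(symbols):
    """Return the indexentry lines for all symbols, sorted by name."""
    # Assumption: the link target id is the symbol with '_' turned into '-'.
    lines = [make_index_entry(s, s.replace("_", "-")) for s in sorted(symbols)]
    return "\n".join(lines) + "\n"


print(make_index(["gtk_widget_show", "gtk_button_new"]), end="")
```

Manually written <indexterm>s from handwritten xml parts are not picked up by a generator like this, which is the limitation noted above.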
Comment 24 Stefan Sauer (gstreamer, gtkdoc dev) 2008-03-22 09:37:11 UTC
New plan :) I'll add this to 1.11. But it will just generate the index files (api-index, since api-index, deprecated api-index). It's up to the doc-maintainer to use the files.
If one wishes, one can still use the normal generic index (no role), which will contain everything but is slow.
Comment 25 Stefan Sauer (gstreamer, gtkdoc dev) 2008-10-03 13:21:44 UTC
This is now in svn. Please test widely. Use something like this in your master document:

  <index id="api-index-full">
    <title>API Index</title>
    <xi:include href="xml/api-index-full.xml"><xi:fallback /></xi:include>
  </index>
  <index id="api-index-deprecated" role="deprecated">
    <title>Index of deprecated API</title>
    <xi:include href="xml/api-index-deprecated.xml"><xi:fallback /></xi:include>
  </index>
  <index id="api-index-0-1" role="0.1">
    <title>Index of new API in 0.1</title>
    <xi:include href="xml/api-index-0.1.xml"><xi:fallback /></xi:include>
  </index>
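
The three stanzas follow one pattern: the id replaces the dots of the version with dashes, the file name keeps them, and the full index has no role. The empty <xi:fallback /> means a missing index file is silently skipped instead of breaking the build. For a project with many "since" versions the stanzas could be produced by a small helper. The Python sketch below is only an illustration of that pattern; index_stanza is a hypothetical name, not part of gtk-doc.

```python
from xml.sax.saxutils import escape


def index_stanza(kind, title):
    """Return an <index> element that pulls in xml/api-index-<kind>.xml.

    kind is "full", "deprecated" or a version string such as "0.1".
    """
    index_id = "api-index-" + kind.replace(".", "-")
    role = "" if kind == "full" else ' role="%s"' % kind
    return (
        '<index id="%s"%s>\n'
        '  <title>%s</title>\n'
        '  <xi:include href="xml/api-index-%s.xml"><xi:fallback /></xi:include>\n'
        '</index>\n'
    ) % (index_id, role, escape(title), kind)


print(index_stanza("0.1", "Index of new API in 0.1"), end="")
```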