GNOME Bugzilla – Bug 311857
xsltproc very slow generating index for gtk-docs.sgml
Last modified: 2008-10-03 13:21:44 UTC
In gtk+-2.7.3/docs/reference:

  $ make
  ...
  *** Building HTML ***
  rm -rf ./html
  mkdir ./html
  cd ./html && gtkdoc-mkhtml gtk ../gtk-docs.sgml
  Computing chunks...
  ...
  Writing glossary.html for glossary(glossary)

The docs make process then hangs on 100% CPU:

  $ ps -C xsltproc -f
  UID PID PPID C STIME TTY TIME CMD
  root 11324 11304 53 12:32 pts/1 00:31:53 /usr/bin/xsltproc --nonet --xinc

I can leave it for hours without returning. glossary.html has been generated (and looks OK) so I think it is hanging trying to generate the index:

  === snip gtk-docs.sgml ===
  <part>
    <title>GTK+ Tools</title>
    &gtk-query-immodules;
    &gtk-update-icon-cache;
  </part>
  &gtk-glossary;
  <index>
    <title>Index</title>
  </index>
  <index role="deprecated">
    <title>Index of deprecated symbols</title>
  </index>
  === snip gtk-docs.sgml ===

If I attach gdb and repeatedly issue "finish" it seems it is failing to return from this stack frame:
+ Trace 62046
I can recompile libxslt with debugging support if that will help.
Oops. After cumulative user time reaches about 40 minutes it starts outputting index files, at a rate of one every 10 or so minutes. (CPU is rated at 2916.35 bogomips.) It looks like it is going to finish if left long enough, but this is still desperately slow and would make many people think it was in an infinite loop.
Maybe the XSLT stylesheets are too complex or not optimized at all. Did you try to profile them on a smaller example? See the xsltproc --profile and xsltproc -v options. It is unlikely I will have time to debug the stylesheets used by gtk-doc. If this can be pointed at a specific problem in libxslt then reassign the bug back to it, but it's a bit like stating that a C compiler has a bug because running a program with deeply nested loops is slow at runtime. The first analysis really should be done by the gtk-doc maintainers. Daniel
I've just built the GTK+ reference docs from cvs and it took ~17 minutes. (This is on an Athlon 2500+.) But it did seem to use an enormous amount of memory. top was reporting ~140MB resident. I don't know if that is correct. Maybe your machine is low on memory and possibly thrashing the swap space. (I have 512MB so it doesn't cause problems for me.)
The glossary and index generation is very slow here too. I've run the doc generation for gstreamer under oprofile (see attachments). For gtk-doc it takes about one hour on a Pentium III 1.1 GHz with 512MB memory.
Created attachment 51271 [details] oprofile report xmlXPathNodeCollectAndTest was at 25% throughout the run; xmlXPathCompOpEval climbed from about 6% to 10% during index generation
Created attachment 51272 [details] oprofile report xsltNumberFormatGetAnyLevel was 25% at maximum xsltXPathVariableLookup, xsltStackLookup were used more often during index generation ( 8% -> 12 %)
Glynn did some profiling using DTrace: http://www.gnome.org/~gman/blog/16012006-2 http://www.gnome.org/~gman/output.txt http://www.gnome.org/~gman/output2.txt @daniel: is that of any help to you?
The profiling I did was pretty rough - really only just investigating whether it was still doing something useful, or just going around in a loop, or just hanging. I can probably dip in a little more once I know what to look for. Right now, I don't.
The slow index generation is due to the fact that gtk-doc is not producing an index. doc-writers usually put in an empty <index> tag. This causes the docbook style sheets to generate the index, which is what takes so long. The idea here is to add index generation to gtk-doc. The doc-writer then can include the generated indexentry list(s). As an advantage we could also add trimming of prefixes ('g_' for glib and 'gtk_' for gtk+). This would make the index more useful :) @damon: would you put that into gtkdoc-mkdb or into a separate gtkdoc-mkindex?
I would favour speedups which don't force me to change the documents. Being able to add an <index> and have the formatting system generate it based on the contents of the document is good, since it makes my document less redundant.
Matthias, but it's pure laziness that we let docbook.xsl do it. IMHO docbook should generate the formatting. I've got a first implementation locally, and the gstreamer core docs build more than twice as fast with the change. I don't see a practical way to speed it up on the xsl level (other than a custom exslt function to replace the 'generate-index' template). All that changes for developers is that <index> <title>API Index</title> </index> becomes <index> <title>API Index</title> &mylib_api_index; </index>
It is still conceptually wrong, in my opinion. Or at least, it's different. XSL is a transformation language, not just a style language. It is very appropriate to let it transform the implicit index into an explicit one. Now, you may be right that it is hard to optimize this. But then, twice as fast is not very impressive, imo.
Twice as fast for the whole doc generation; the index generation as such drops from 4 minutes to a couple of seconds (5 or so). I'll post a first patch later on. Please remember that we generate the xml, so we are free to generate as much as we can. Besides, we can generate a more useful index (by stripping the library prefix) and also e.g. add the short information to the index. Finally, no one is forced to use this kind of index generation. Just leave it as it is and you're set. Let's see what others think.
Created attachment 62273 [details] [review] generated api and deprecated index

To use the generated indexentry list, the XXX-docs.sgml needs the following change:

  <index id="api-index">
    <title>Index</title>
    <xi:include href="xml/index.sgml"/>
  </index>
  <index id="deprecated-index">
    <title>Index of deprecated symbols</title>
    <xi:include href="xml/deprecated_index.sgml"/>
  </index>

For gstreamer, generating the docs after make clean drops from 9:15 minutes to 3:22 here.
I think generating the Deprecated and Since indexes in gtk-doc is probably OK. I'm a bit hesitant to generate the main index ourselves though. The developers may be adding index terms in the external documentation, which would be lost. Though maybe we could make this optional. I'd still like to find out why it seems to be much slower for some people. Stefan: what happened to make it speed up from 1 hour to 9 minutes for you? I don't know much about the xsl. Is it our xsl code that is slow, or the DocBook stylesheets? Has anyone pinpointed which piece of code is the problem?
Good point damon, if one manually adds <indexterms> in handwritten xml parts, they are not processed by gtk-doc. Regarding the slowness, it's not our code. The code in question is the 'generate-index' template in /usr/share/sgml/docbook/xsl-stylesheets-1.69.1/html/autoidx.xsl. I've run the xsltproc call from gtkdoc-mkhtml standalone and added --profile. This shows statistics at the end. From these stats one sees that the generate-index template eats most of the whole processing time. I've also spent a whole morning profiling the respective counterparts in libxml2/libxslt using oprofile and sysprof. It boils down to two functions. These functions don't have obvious optimization potential. It's just the sheer number of calls that adds up (for the gstreamer docs it's about 12000 calls). My suggestion is to generate all three kinds of index files. Whenever we first generate a XXX-docs.sgml file we add a comment at the bottom that shows how to activate the index. For existing docs it's up to the maintainer to activate them. If desired I can also add switches to gtkdoc-mkdb to suppress index generation (--disable-api-index, --disable-deprecated-index, --disable-since-index). Finally, I'd like to add an option to gtkdoc-mkdb, --index-prefix="gtk", which would cause the index generation to drop the "gtk_" prefix from symbol names. This makes the section divisions a lot more useful. What do you think about that?
a) I don't want to appear obstructionist. I just think that it is a failure of the xsl tools and stylesheets if they can't handle this, while perl seems to do fine. It's not as if perl would have to do less work to generate the index. b) Regarding the prefix stripping, I think the symbols should appear with their full name in the index, but we can probably ignore the prefix when sorting them, to distribute the entries over the alphabet, rather than only populating the letter G... c) I think handling manual indexterms is necessary, and maybe not too hard. Before the docbook stylesheets had xsl templates for index generation, there was a perl script to do it...
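To make point b) concrete, here is a minimal sketch of sorting and grouping by the prefix-less name while still displaying the full symbol. It is written in Python purely for illustration (gtk-doc itself is Perl), and the helper names sort_key and group_by_letter are hypothetical, not anything in gtkdoc-mkdb.

```python
def sort_key(symbol, prefix):
    """Return a case-insensitive sort key that skips a leading prefix."""
    lowered = symbol.lower()
    if prefix and lowered.startswith(prefix.lower()):
        lowered = lowered[len(prefix):]
    return lowered


def group_by_letter(symbols, prefix):
    """Group full symbol names under the first letter of their prefix-less name."""
    groups = {}
    for symbol in sorted(symbols, key=lambda s: sort_key(s, prefix)):
        key = sort_key(symbol, prefix)
        # Fall back to the symbol itself if stripping the prefix leaves nothing.
        letter = (key[:1] or symbol[:1]).upper()
        groups.setdefault(letter, []).append(symbol)
    return groups


symbols = ["gtk_widget_show", "gtk_button_new", "gtk_window_new", "gtk_box_pack_start"]
index = group_by_letter(symbols, "gtk_")
for letter, names in index.items():
    print(letter, names)
```

The entries keep their full names, but they land under B and W rather than all under G.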
Oh, another thing: I believe the filename index.sgml is already used by gtk-doc for something else.
I've been looking at the DocBook stylesheets and this seems to be one of the bottlenecks (in autoidx.xsl): <xsl:variable name="terms" select="//indexterm[count(.|key('letter', translate(substring(&primary;, 1, 1), &lowercase;, &uppercase;))[&scope;][1]) = 1 and not(@class = 'endofrange')]"/> It is very complicated, and with no comments. My guess is that the stylesheets are pretty inefficient at present, and maybe we should do our own index in xsl.
I think the problem with optimizing things like this is that the xsl processor would have to be smart enough to keep the results of expressions like count(.|key('letter', translate(substring(&primary;, 1, 1), &lowercase;, &uppercase;))[&scope;][1]) (and possibly subexpressions thereof) around after computing them once, instead of recomputing them over and over.
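For anyone puzzling over that expression: the count(.|key(...)[1]) = 1 test is the usual XSLT 1.0 grouping idiom, which keeps an indexterm only if it is the first node in its letter group. The sketch below models that in Python to show the difference between recomputing the group for every term and computing it once. It is only an illustration of the caching idea, assuming a plain list of strings in place of indexterm nodes; it says nothing about how libxslt actually evaluates the expression, and the function names are made up.

```python
def first_letter(term):
    """Uppercased first character, standing in for translate(substring(...))."""
    return term[:1].upper()


def representatives_naive(terms):
    """Rescan the whole list for every term: quadratic in the number of terms."""
    result = []
    for i, term in enumerate(terms):
        letter = first_letter(term)
        # Position of the first term in the same letter group.
        first_index = next(
            j for j, other in enumerate(terms) if first_letter(other) == letter
        )
        # Keep the term only if it *is* that first one (node identity check).
        if first_index == i:
            result.append(term)
    return result


def representatives_cached(terms):
    """Remember the first term seen per letter: one pass over the list."""
    first_seen = {}
    for term in terms:
        first_seen.setdefault(first_letter(term), term)
    return list(first_seen.values())


terms = ["g_trap_free_size", "gtk_init", "atk_object", "g_malloc", "Alpha"]
naive = representatives_naive(terms)
cached = representatives_cached(terms)
print(naive)
print(cached)
```

Both functions pick the same representatives; only the amount of repeated work differs.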
Tried to get some feedback from the xml and docbook xsl side: http://sourceforge.net/tracker/index.php?func=detail&aid=1484912&group_id=21935&atid=373747 http://mail.gnome.org/archives/xml/2006-May/msg00040.html
find . -name "*.sgml" -exec grep -l "indexterm" {} \; yielded nothing for gtk+, gstreamer. In glib I just found: ./reference/glib/running.sgml:223:<indexterm><primary>g_trap_free_size</primary></indexterm> ./reference/glib/running.sgml:224:<indexterm><primary>g_trap_realloc_size</primary></indexterm> ./reference/glib/running.sgml:225:<indexterm><primary>g_trap_malloc_size</primary></indexterm> which is used to add this section to the index. On the xml side it does not look like it can easily be accelerated.
Forget about comment #22. What the attached patch is doing is to create: <indexentry><primaryie>$xref</primaryie></indexentry> lines. Linking to the works. What needs to be done if the current patch gets applied is to add manually defined <indexterm>s there. I guess that is fair. I found no custom ones so far.
New plan :) I'll add this to 1.11. But it will just generate the index files (api-index, since api-index, deprecated api-index). It's up to the doc maintainer to use the files. If one wishes, one can still use the normal generic index (no role), which will contain everything but is slow.
This is now in svn. Please test widely. Use something like this in your master document:

  <index id="api-index-full">
    <title>API Index</title>
    <xi:include href="xml/api-index-full.xml"><xi:fallback /></xi:include>
  </index>
  <index id="api-index-deprecated" role="deprecated">
    <title>Index of deprecated API</title>
    <xi:include href="xml/api-index-deprecated.xml"><xi:fallback /></xi:include>
  </index>
  <index id="api-index-0-1" role="0.1">
    <title>Index of new API in 0.1</title>
    <xi:include href="xml/api-index-0.1.xml"><xi:fallback /></xi:include>
  </index>
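For readers who have not opened the patch, a rough sketch of what emitting such lines could look like is below. It is Python for illustration only (the patch itself is Perl in gtkdoc-mkdb), and two things are assumptions of mine rather than facts from the patch: that $xref expands to a <link> element, and that the link target is the symbol name with underscores turned into hyphens. The helper names are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def symbol_id(symbol):
    """Hypothetical ID convention: underscores become hyphens."""
    return symbol.replace("_", "-")


def index_entry(symbol):
    """Build one <indexentry> line whose text links to the symbol's section."""
    return "<indexentry><primaryie><link linkend=%s>%s</link></primaryie></indexentry>" % (
        quoteattr(symbol_id(symbol)),
        escape(symbol),
    )


def build_index(symbols):
    """Emit one entry per symbol, sorted case-insensitively."""
    return "\n".join(index_entry(s) for s in sorted(symbols, key=str.lower))


print(build_index(["g_malloc", "g_free", "G_IS_OBJECT"]))
```

Manually written <indexterm>s would still have to be merged into this list by hand, as noted above.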