GNOME Bugzilla – Bug 583368
Effective way to reduce memory usage: pruning
Last modified: 2009-06-02 09:27:55 UTC
The following passage from the documentation:

"Libxslt is not very specialized. It is built under the assumption that all nodes from the source and output document can fit in the virtual memory of the system. There is a big trade-off there. It is fine for reasonably sized documents but may not be suitable for large sets of data. The gain is that it can be used in a relatively versatile way. The input or output may never be serialized, but the size of documents it can handle are limited by the size of the memory available."

shows the philosophical error that causes this bug, and this bug report explains a fairly simple correction.

The assumption is that this implementation makes the program more general. In fact, it makes it less general, as it can't handle large files and can't handle a file of any reasonable size without wasting memory. Yet a simple addition can greatly reduce memory usage: you simply allow the library user to suggest places to prune the tree.

Usually, you would prune a subtree after processing, i.e. as soon as you have finished processing a record in data-oriented XML you delete that record's subtree, or as soon as you have finished processing a DocBook <section> you prune that section. Sometimes, you would prune before processing. You might, for example, prune all inline SVG if you aren't going to be using it. Or you might prune the contents (other than <title>) of all DocBook sections before processing if you are only going to build a table of contents in a separate file.

There is no reason to keep data around after it is no longer needed, or to wait until all data is loaded to begin processing. Yet you can keep your in-memory tree processing model. This suggestion doesn't even require that the library be smart enough to know where it can prune; it just lets the library user make suggestions to the library. You may want to have an option to keep some vestiges of the tree around if indexing elements by number is being used.
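To make the prune-after idea concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree.iterparse rather than libxml2 or libxslt. It is not the proposed library change, only an illustration of the pattern. The element names (data, record, amount), the function name sum_amounts and the keep_vestige flag are all hypothetical, and records are assumed to be direct children of the root.

```python
import io
import xml.etree.ElementTree as ET


def sum_amounts(source, record_tag="record", keep_vestige=False):
    """Sum the <amount> of every record, pruning each record once processed.

    Assumes (for this sketch) that records are direct children of the root.
    Returns (total, number of child elements still attached to the root).
    """
    total = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # The record's subtree is complete here: process it, then prune.
            total += int(elem.findtext("amount", default="0"))
            if keep_vestige:
                elem.clear()       # drop the contents, keep an empty placeholder
            else:
                root.remove(elem)  # drop the whole subtree
    return total, len(root)


data = (b"<data><record><amount>5</amount></record>"
        b"<record><amount>7</amount></record></data>")
print(sum_amounts(io.BytesIO(data)))
print(sum_amounts(io.BytesIO(data), keep_vestige=True))
```

With keep_vestige=True the emptied record elements stay in the tree, which corresponds to keeping vestiges around so that positional indexing still sees the same number of siblings.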
A web browser that displays DocBook by using XSLT to translate it into XHTML does not need two copies of the document in memory at once (one DocBook, one XHTML). It can prune each <section> of DocBook after processing. It might be able to prune even more aggressively. This can be done with some XPath expressions indicating where to prune. As an option, an <xslt:prune> element could be added to allow stylesheet authors to indicate when a section of the tree is no longer needed.

This affects xsltproc, xmlstarlet, and many other programs using libxml2 and libxslt. Obviously, you don't prune the input tree if the application already had it in memory and didn't indicate it was prunable.

While loading a document, as soon as you encounter a prune-after node, you can immediately:

1. Process the templates that match above the prune-after node and set a checkpoint.
2. Apply all the other templates that match at the prune-after node, or below, to that subtree.
3. Prune that subtree and suspend XSLT processing until you hit the next prune-after node or end of file.
4. At end of file, resume from your checkpoint the processing of the nodes above the prune-after level - i.e. emitting the closing tags for the elements above the prune level if serializing.

You tailor this to your specific implementation, but that is the basic idea. For example, instead of setting a checkpoint, a stack-based tree walker or template walker would pause at certain levels and call the routine to prune the tree and load more input.

This is explained in detail in another bug report: https://sourceforge.net/tracker/index.php?func=detail&aid=2794533&group_id=66612&atid=515109
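The checkpoint scheme described above can be sketched roughly as follows, again with Python's standard xml.etree.ElementTree.iterparse instead of libxslt, and with a hard-coded transformation standing in for real templates. Everything here is an assumption made for illustration: the function name docbook_to_xhtml, the tiny DocBook-like input, and the XHTML it emits. The stack of open elements plays the role of the checkpoint, and the root is assumed not to be a section itself.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def docbook_to_xhtml(source, out, prune_tag="section"):
    """Stream sections to `out`, pruning each one right after it is emitted."""
    stack = []   # open elements above the prune level (the "checkpoint")
    inside = 0   # nesting depth inside the current prune-after subtree
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if inside or elem.tag == prune_tag:
                inside += 1
            else:
                if not stack:
                    # "Templates" above the prune level: emit the opening tags.
                    out.write("<html><body>")
                stack.append(elem)
        elif inside:
            inside -= 1
            if inside == 0:
                # The whole section is loaded: transform it, then prune it.
                title = escape(elem.findtext("title", default=""))
                out.write(f"<div><h2>{title}</h2>")
                for para in elem.iter("para"):
                    out.write(f"<p>{escape(''.join(para.itertext()))}</p>")
                out.write("</div>")
                if stack:
                    stack[-1].remove(elem)  # top of the stack is the parent
        else:
            stack.pop()
            if not stack:
                # End of input: resume above the prune level, close the tags.
                out.write("</body></html>")


doc = (b"<book><title>T</title>"
       b"<section><title>A</title><para>one</para></section>"
       b"<section><title>B</title><para>two &amp; three</para></section>"
       b"</book>")
result = io.StringIO()
docbook_to_xhtml(io.BytesIO(doc), result)
print(result.getvalue())
```

In this sketch, content outside sections (such as the book title) is simply ignored, and nested sections are handled as part of their outermost section. A real processor would have to decide both points from the stylesheet.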
XSLT-1.0 and XPath were designed with the assumption that the tree is basically available. There has been a lot of work to make streamable XSLT engines; ask IBM to open-source them.

<xsl:prune> sounds completely useless to me because it:
- is non-standard
- doesn't explain how the input tree can and should be fetched

The XSLT and XPath processing models assume full infoset availability, not a restricted one. Just drop XSLT, and/or use it to transform only subtrees one at a time, but you can't expect the stylesheet itself to express how a document should be loaded.

Daniel