Bug 583368 – Effective way to reduce memory usage: pruning




Summary:	Effective way to reduce memory usage: pruning


Status:	RESOLVED INVALID

Product:	libxslt
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other All

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2009-05-20 20:03 UTC by whitis
Modified:	2009-06-02 09:27 UTC

See Also:
GNOME target:	---
GNOME version:	Unversioned Enhancement

Description whitis 2009-05-20 20:03:11 UTC

The following passage from the documentation:
Libxslt is not very specialized. It is built under the assumption that all nodes from the source and output document can fit in the virtual memory of the system. There is a big trade-off there. It is fine for reasonably sized documents but may not be suitable for large sets of data. The gain is that it can be used in a relatively versatile way. The input or output may never be serialized, but the size of documents it can handle are limited by the size of the memory available.

That passage shows the philosophical error that causes this bug, and this bug report explains a fairly simple correction. The assumption is that this implementation makes the program more general. In fact, it makes it less general: it can't handle large files, and it can't handle even a reasonably sized file without wasting memory. Yet a simple addition can greatly reduce memory usage: allow the library user to suggest places to prune the tree.
Usually, you would prune a subtree after processing. That is, as soon as you have finished processing a record in data-oriented XML, you delete that record's subtree, or as soon as you have finished processing a DocBook <section>, you prune that section. Sometimes, you would prune before processing. You might, for example, prune all inline SVG if you aren't going to use it. Or you might prune the contents (other than <title>) of all DocBook sections before processing if you are only going to build a table of contents in a separate file. There is no reason to keep data around after it is no longer needed, or to wait until all data is loaded before beginning to process it. Yet you can keep your in-memory tree processing model. This suggestion doesn't even require that the library be smart enough to know where it can prune; it only needs to let the library user make suggestions.
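The "prune after processing" idea can be sketched outside of libxslt. The following is a minimal illustration in Python using the standard library's incremental parser, not a libxslt API. The `record` and `amount` element names and the `sum_amounts` function are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical data-oriented document; element names are assumptions.
SAMPLE = (
    b"<ledger>"
    b"<record><amount>3</amount></record>"
    b"<record><amount>4</amount></record>"
    b"</ledger>"
)


def sum_amounts(source, record_tag="record"):
    """Process each record as soon as it is fully parsed, then prune it."""
    total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            total += int(elem.findtext("amount", default="0"))
            # Prune: drop the record's children, text and attributes.
            # The emptied element itself stays in its parent.
            elem.clear()
    return total


print(sum_amounts(io.BytesIO(SAMPLE)))  # prints 7
```

Because `elem.clear()` leaves an empty element behind in the parent, sibling positions are preserved even though the record's contents are gone.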

You may want to have an option to keep some vestiges of the tree around if indexing elements by number is being used.

A web browser that displays DocBook by using XSLT to translate it into XHTML does not need two copies of the document in memory at once (one DocBook, one XHTML). It can prune each <section> of DocBook after processing. It might be able to prune even more aggressively.

This can be done with some XPath expressions indicating where to prune. As an option, an <xslt:prune> element could be added to allow stylesheet authors to indicate when a section of the tree is no longer needed.
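As a rough illustration of user-supplied prune points, the sketch below prunes an in-memory tree before processing, using the table-of-contents case as the example. It is written in Python with `xml.etree.ElementTree`, which supports only a subset of XPath. The `prune_children` helper and its `keep` parameter are hypothetical and not part of libxml2 or libxslt.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-like input with a nested section.
DOC = (
    "<book><section><title>A</title><para>x</para>"
    "<section><title>A1</title><para>y</para></section>"
    "</section></book>"
)


def prune_children(root, path, keep=()):
    """Remove each child of the elements matching `path` unless its tag is in `keep`.

    Returns the number of subtrees removed.
    """
    removed = 0
    for elem in root.findall(path):
        for child in list(elem):  # copy, since we mutate elem while looping
            if child.tag not in keep:
                elem.remove(child)
                removed += 1
    return removed


root = ET.fromstring(DOC)
# Prune point supplied by the user: keep only titles and nested sections.
removed = prune_children(root, ".//section", keep=("title", "section"))
titles = [t.text for t in root.iter("title")]
print(removed, titles)  # prints 2 ['A', 'A1']
```

Only the two <para> subtrees are dropped; everything a table of contents needs survives.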

This affects xsltproc, xmlstarlet, and many other programs using libxml2 and libxslt.

Obviously, you don't prune the input tree if the application already had it in memory and didn't indicate it was prunable.

While loading a document, as soon as you encounter a prune-after node, you can immediately process the templates that match above the prune-after node and set a checkpoint. Then apply all the other templates that match at the prune-after node, or below, to that subtree. Finally, prune that subtree and suspend XSLT processing until you hit the next prune-after node or end of file.
At end of file, you resume from your checkpoint the processing of the nodes above the prune-after level, i.e. emitting the closing tags for the elements above the pruned subtrees if serializing. Tailor this to your specific implementation, but that is the basic idea. For example, instead of setting a checkpoint, a stack-based tree walker or template walker would pause at certain levels and call the routine to prune the tree and load more input.
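The control flow described above can be sketched with a stack standing in for the checkpoint. This is only an illustration in Python: a plain callable (`transform`, hypothetical) stands in for the templates applied to each subtree, text and attributes of the elements above the prune-after level are ignored, and nothing here reflects how libxslt is implemented.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input; <section> plays the role of the prune-after node.
DOC = (
    b"<book>"
    b"<section><title>A</title></section>"
    b"<section><title>B</title></section>"
    b"</book>"
)


def transform_streaming(source, prune_tag, transform):
    """Yield output fragments while the input is still being parsed."""
    open_tags = []  # elements above the prune-after level (the "checkpoint")
    inside = 0      # nesting depth within the current prune-after subtree
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if inside or elem.tag == prune_tag:
                inside += 1
            else:
                open_tags.append(elem.tag)
                yield f"<{elem.tag}>"
        elif inside:
            inside -= 1
            if inside == 0:
                yield transform(elem)  # process the complete subtree
                elem.clear()           # then prune it
        else:
            # Resume above the prune-after level: emit the closing tag.
            yield f"</{open_tags.pop()}>"


def section_to_heading(section):
    return f"<h1>{section.findtext('title')}</h1>"


result = "".join(transform_streaming(io.BytesIO(DOC), "section", section_to_heading))
print(result)  # prints <book><h1>A</h1><h1>B</h1></book>
```

At any moment only one <section> subtree is populated; the closing tag of <book> is emitted once the parser reaches end of file.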

This is explained in detail in another bug report:
https://sourceforge.net/tracker/index.php?func=detail&aid=2794533&group_id=66612&atid=515109

Comment 1 Daniel Veillard 2009-06-02 09:27:55 UTC

XSLT-1.0 and XPath were designed with the assumption that the tree is
basically available. There has been a lot of work to make streamable
XSLT engines; ask IBM to open-source them.
<xsl:prune> sounds completely useless to me because it:
   - is non-standard
   - doesn't explain how the input tree can and should be fetched
The XSLT and XPath processing models assume full infoset availability,
not a restricted one.

Just drop XSLT, or use it to transform only subtrees one at a time.
But you can't expect the stylesheet itself to express how a document
should be loaded.

Daniel