Bug 624408 – Processing a 1.8GB XML file with a simple transformation requires 16+GB of RAM on the target platform




Summary:	Processing a 1.8GB XML file with a simple transformation requires 16+GB of RAM on the target platform


Status:	RESOLVED NOTABUG

Product:	libxslt
Classification:	Platform
Component:	general
Version:	1.1.17
Hardware:	Other Linux

Importance:	Normal blocker
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-07-15 00:25 UTC by dgotwisner
Modified:	2014-10-05 15:22 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Transformation file (2.54 KB, text/xml) 2010-07-15 00:25 UTC, dgotwisner	Details

Description dgotwisner 2010-07-15 00:25:55 UTC

Created attachment 165935 [details]
Transformation file

I just tried running xsltproc
    xsltproc was compiled against libxml 20626, libxslt 10117 and libexslt 813
    libxslt 10117 was compiled against libxml 20626
    libexslt 813 was compiled against libxml 20626

on a "Red Hat Enterprise Linux Server release 5.3 (Tikanga)" (64 bit X86) machine.

The XML is 1,850,828,515 bytes big, and consists of a 24 hour run with 2 blocks of XML every second and three blocks every hour.

The transformation file produces method=text output. An outer template on the root applies a template to one of the per-second blocks. The inner template consists of a series of value-of select="field" items, with no XPath functions or anything else of any complexity. The output is a basic flat file that will be fed to gnuplot.

When the command "xsltproc gpsNavData.xsl" is typed, the system freezes. Monitoring memory showed that, at the point I looked, xsltproc was consuming 16GB of RAM (meaning everything else was being swapped out, and xsltproc was probably thrashing). I understand that xsltproc MUST hold the entire XML file in memory, and I could even accept a doubling of that for working storage, but 4x or more the size of the source file seems unreasonable and makes the tool unusable for huge files. Running the software on a 32 bit X86:
    xsltproc was compiled against libxml 20632, libxslt 10124 and libexslt 813
    libxslt 10124 was compiled against libxml 20632
    libexslt 813 was compiled against libxml 20632
fails silently with kernel out of memory errors.

Note that I had planned on using this same transformation on 16 files generated every day for statistical purposes.  Writing a C program using tinyxml appears to not have this problem.

Running with a smaller XML file works perfectly.

Comment 1 dgotwisner 2010-07-15 00:34:51 UTC

Because of the size of the file, I am unable to attach it to this bug.  If I zip it, it is still 94,396,468 bytes, which is also way too large.  If someone wants, I can split the zip file into 95 chunks and post them, but I will hold off for the time being.

Comment 2 Nick Wellnhofer 2014-10-04 10:01:50 UTC

It depends on the structure of the XML file. The size of the xmlNode struct is 120 bytes on a 64-bit system, for example. So a 4-8x memory overhead is completely reasonable if you have small text nodes.
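
To make the arithmetic in the comment above concrete, here is a rough back-of-the-envelope sketch in Python. The 120-byte xmlNode size is taken from the comment; the 30-byte field and the function name are hypothetical and chosen only for illustration. The estimate counts node structs only, so text content, attributes, and allocator overhead would come on top of it.

```python
XML_NODE_BYTES = 120  # sizeof(xmlNode) on a 64-bit system, per the comment above


def node_overhead_ratio(source_bytes, node_count, node_bytes=XML_NODE_BYTES):
    """Return memory used by node structs alone, as a multiple of the source size."""
    return (node_count * node_bytes) / source_bytes


# Hypothetical field: 30 bytes of markup and text on disk, parsed into
# one element node plus one child text node.
ratio = node_overhead_ratio(source_bytes=30, node_count=2)
print(f"node structs alone: {ratio:.1f}x the source size")
```

Under these assumptions, short fields alone are enough to push the in-memory tree to several times the size of the file, which matches the order of magnitude reported in the description.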

Comment 3 Daniel Veillard 2014-10-05 15:22:41 UTC

Yup, XSLT is not designed to work on a stream, but on a fully built tree
(there are implementations that try to do streaming, but it won't work in the
general case; libxslt doesn't even try).
Depending on what your transformation is doing, you may have a better chance doing
that processing from a general-purpose language like Python using the
reader API; see the doc at:

  http://xmlsoft.org/xmlreader.html

You should have the tools available without problem in the libxml2-python
package, which is usually installed by default on RHEL-5.
Libxslt is also available from Python, and you could try to mix the streaming
input with output via the existing XSLT templates by trying the approach suggested
in

  http://xmlsoft.org/xmlreader.html#Mixing

and applying the template to each record subtree, but unless your XSLT is really
complex, it's probably simpler to rewrite it in Python.

Daniel
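
A minimal sketch of the "rewrite it in Python" suggestion follows. It does not use the libxml2 reader API named above; it swaps in the standard library's xml.etree.ElementTree.iterparse, which follows the same stream-and-discard idea and needs no extra package. The element names (log, nav, hourly, lat, lon) are hypothetical, since the real file's structure is not shown in this report, and the sketch assumes each record is a direct child of the root element.

```python
import io
import xml.etree.ElementTree as ET


def stream_records(source, record_tag, fields):
    """Yield one tab-separated line per record element, discarding each
    processed subtree so memory use stays roughly constant."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield "\t".join(elem.findtext(name, default="") for name in fields)
            root.clear()  # drop children that were already handled


# Hypothetical input: two per-second blocks and one hourly block.
sample = (
    b"<log>"
    b"<nav><lat>1.5</lat><lon>2.5</lon></nav>"
    b"<hourly><x>9</x></hourly>"
    b"<nav><lat>3.5</lat><lon>4.5</lon></nav>"
    b"</log>"
)

for line in stream_records(io.BytesIO(sample), "nav", ["lat", "lon"]):
    print(line)
```

For a real file, the first argument would be a path or an open binary file instead of the in-memory sample. Records nested more deeply than one level under the root would need a different cleanup strategy than root.clear().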