GNOME Bugzilla – Bug 624408
Processing a 1.8GB XML file with a simple transformation requires 16+GB of RAM on the target platform
Last modified: 2014-10-05 15:22:41 UTC
Created attachment 165935 [details]
Transformation file

I just tried running xsltproc. xsltproc was compiled against libxml 20626, libxslt 10117 and libexslt 813; libxslt 10117 was compiled against libxml 20626; libexslt 813 was compiled against libxml 20626, on a "Red Hat Enterprise Linux Server release 5.3 (Tikanga)" (64-bit x86) machine.

The XML file is 1,850,828,515 bytes and consists of a 24-hour run with 2 blocks of XML every second and three blocks every hour. The transformation file generates method=text output, with an outer template on the root that applies a template to one of the every-second blocks. The inner template consists of a series of value-of select="field" items, with no XPath functions or anything else of any complexity. The output is a basic flat file that will be fed to gnuplot.

Anyway, what happens is that when the command "xsltproc gpsNavData.xsl" is typed, the system freezes, and monitoring of memory indicates that at the point I looked, xsltproc was consuming 16GB of RAM (meaning everything else was being swapped out, and xsltproc was probably thrashing). I understand that xsltproc MUST cache the entire XML file in memory, and I can even accept a doubling of that for computation storage, but 4x or more of storage relative to the source file seems unreasonable and makes the tool unusable for huge files.

Running the software on a 32-bit x86 (xsltproc compiled against libxml 20632, libxslt 10124 and libexslt 813; libxslt 10124 compiled against libxml 20632; libexslt 813 compiled against libxml 20632) fails silently with kernel out-of-memory errors.

Note that I had planned on using this same transformation on 16 files generated every day for statistical purposes. Writing a C program using tinyxml appears not to have this problem. Running with a smaller XML file works perfectly.
Because of the size of the file, I am unable to attach it to this bug. If I zip it, it is still 94,396,468 bytes, which is also way too large. If someone wants, I can split the zip file into 95 chunks and post them, but I will hold off for the time being.
It depends on the structure of the XML file. The size of the xmlNode struct is 120 bytes on a 64-bit system, for example. So a 4-8x memory overhead is completely reasonable if you have small text nodes.
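A back-of-the-envelope estimate, in Python, of why the overhead ratio depends on node granularity. The 120-byte xmlNode figure is from the comment above; the 40-byte average node size is a hypothetical assumption chosen only to illustrate how small nodes drive the ratio up.

```python
# Rough estimate of libxml2 tree memory overhead versus on-disk size.
# NODE_STRUCT comes from the comment above (120-byte xmlNode on 64-bit);
# the average per-node on-disk size below is a hypothetical assumption.

NODE_STRUCT = 120  # bytes per xmlNode struct on a 64-bit system

def tree_overhead(file_bytes, avg_node_bytes):
    """Estimated ratio of in-memory tree size to on-disk file size,
    assuming the file is dominated by nodes averaging avg_node_bytes
    of markup/text on disk."""
    nodes = file_bytes / avg_node_bytes
    # each node pays the fixed struct cost plus (roughly) its payload
    in_memory = nodes * NODE_STRUCT + file_bytes
    return in_memory / file_bytes

# The 1.85 GB file from the report, if its nodes average ~40 bytes each:
print(round(tree_overhead(1_850_828_515, 40), 1))  # prints 4.0 (a ~4x blow-up)
```

With many tiny text nodes the fixed 120-byte struct cost dominates, which is consistent with the 16GB+ footprint reported for the 1.8GB file.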
Yup, XSLT is not designed to work on streaming, but on building a tree. (There are implementations trying to do streaming, but it won't work in the general case; libxslt doesn't even try.)

Depending on what your transformation is doing, you may have more luck doing that processing from a general-purpose language like Python using the reader API; see the doc at http://xmlsoft.org/xmlreader.html. You should have the tools available without problem in the libxml2-python package, which is usually installed by default on RHEL-5.

Libxslt is also available from Python, and you could try to mix the streaming and the output via the existing XSLT templates by following the approach suggested in http://xmlsoft.org/xmlreader.html#Mixing and applying the template to each record subtree, but unless your XSLT is really complex it's probably simpler to rewrite it in Python.

Daniel
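The record-at-a-time idea suggested above can be sketched in Python. This uses the standard library's xml.etree.ElementTree.iterparse in place of libxml2's xmlReader (the streaming principle is the same); the `<record>` tag and the `lat`/`lon` field names are hypothetical stand-ins for the real per-second blocks and value-of fields in the report.

```python
# Streaming extraction of per-record fields without building the whole tree.
# iterparse stands in for libxml2's xmlReader here; <record>, lat and lon
# are hypothetical names, not taken from the actual gpsNavData files.
import xml.etree.ElementTree as ET
from io import BytesIO

def extract_fields(source, record_tag, fields):
    """Yield one tuple of field texts per record, clearing each record
    as soon as it has been processed so its subtree can be freed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield tuple(elem.findtext(f, default="") for f in fields)
            elem.clear()  # drop the record's contents once emitted

# Tiny usage example with fake data:
doc = (b"<log><record><lat>1.0</lat><lon>2.0</lon></record>"
       b"<record><lat>3.0</lat><lon>4.0</lon></record></log>")
for lat, lon in extract_fields(BytesIO(doc), "record", ["lat", "lon"]):
    print(lat, lon)
```

Writing each tuple straight to the flat file for gnuplot keeps memory roughly constant regardless of input size, since only one record's subtree is live at a time.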