GNOME Bugzilla – Bug 325533
xmlNode member 'line' is 16-bit integer, many XML files are longer than 65535 lines
Last modified: 2008-06-20 10:29:50 UTC
Please describe the problem:
The 'line' member of 'xmlNode' and similar structures/classes is a 16-bit short
integer and cannot represent line numbers greater than 65535. We are working
with 500,000 line and greater XML files, so diagnostic messages referencing line
numbers are inaccurate. 'line' should be a 32-bit integer.
Can't fix, there is not enough space in the node structure to extend it
without breaking ABI compatibility.
In practice it should not be a problem. When processing large documents,
streaming is usually what is wanted, which means nodes are processed in sequence:
just keep track of the last line number in a 32-bit unsigned counter in the
program and increase it by 2^16 whenever the line number sequence jumps backwards.
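The counter described above could be sketched roughly as follows. This is only an illustration, not code from libxml2: 'line_tracker', 'line_tracker_init' and 'line_tracker_update' are hypothetical names, and the sketch assumes that the stored 16-bit value wraps around modulo 2^16, that nodes are visited in document order, and that fewer than 65536 lines separate two consecutive nodes. If the library clamps the value at 65535 instead of wrapping, this approach cannot recover the real line number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state kept by the application (not part of libxml2). */
typedef struct {
    uint32_t wraps;     /* number of 16-bit wrap-arounds seen so far */
    uint16_t last_raw;  /* last raw 16-bit line number seen */
} line_tracker;

static void line_tracker_init(line_tracker *t)
{
    assert(t != NULL);
    t->wraps = 0;
    t->last_raw = 0;
}

/*
 * Feed the raw 16-bit line number of each node, in document order.
 * Returns the reconstructed 32-bit line number.
 * Assumption: a raw value smaller than the previous one means the
 * 16-bit counter wrapped exactly once since the previous node.
 */
static uint32_t line_tracker_update(line_tracker *t, uint16_t raw)
{
    assert(t != NULL);
    if (raw < t->last_raw)
        t->wraps++;                 /* sequence jumped back: add 2^16 */
    t->last_raw = raw;
    return (t->wraps << 16) | raw;
}
```

In an application, 'line_tracker_update' would be called once per node with that node's 'line' value, and the return value used in diagnostics in place of the raw member.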
I don't quite follow the workaround. I see that
'xmlTextReaderGetParserLineNumber' appears to return a 32-bit
representation line number. However it's not synchronized with
the node where the cursor is pointing. Does a call-back exist
somewhere that can be used to count lines up to the cursor node?
The workaround suggested does not come from the library; it must be
implemented in your code. You fetch one node at a time, then check
how the line numbers evolve there.
How does one get 'xmlTextReader' to fetch exactly one node at a
time? In my observation 'xmlTextReaderExpand' sucks in a
somewhat arbitrary number of lines past the end of the node
being acquired. The line numbers that result are equally
Please don't use bugzilla as a support channel.
Use the mailing-list !
You have admitted yourself that this is a bug by
marking it as WONTFIX. So the least you can do
is provide a clear, useable workaround in the bug
tracking system so that others may find the solution
here when they search for the bug that you have
Comment #4 indicates you don't seem to follow what Expand() actually does
nor how the xmlReader is supposed to be used to step on each node by a
succession of Read(). Hence this is no more a matter of giving a workaround but
a matter of explaining how the xmlReader is to be used. There is a free !
mailing-list for this kind of support as well as free! online documentation
on this precise subject, nothing can justify duplicating that information
in bugzilla ! You get free tools, advice and help, don't bite the hand which
is helping, I have zero reason to accept being harassed through bugzilla, and
I gave the explanations specific to this entry, so pretty please !
I am using 'xmlTextReaderExpand' correctly as documented, and it
However you yourself state in another e-mail conversation I
googled a while back that no guarantees are given as to exactly
how much extra data beyond the current node this function will
load each time it's called. The loading of extra data is
clearly exhibited when one walks the in-memory 'xmltree', and I
had to explicitly code to avoid referencing data in the next
node forward. I did this after reading your comments that
'Expand' is allowed to read as much as it wants so long as it
brings in all of the node referenced in the call parameter.
Therefore the line number returned by
'xmlTextReaderGetParserLineNumber' is quite arbitrary and of no
value for determining where a problem is located in the XML
stream. That is unless you will stop trying to baffle with BS
and answer the question, or FIX THE BUG.
I said, the bug CANNOT BE FIXED. Or it must be a release of libxml3
with a different soname, and I WON'T DO IT. There is only a WONTFIX
option and no CANTFIX option in bugzilla, so forget about your hope
to see that unsigned short grow to a different size, this won't happen,
because this can't happen !
I answered the question to my best understanding of your need. If you
need more detail this is outside of this specific bug. Now you can accuse me
of bullshitting, I don't appreciate this.
Bugzilla is *not* the proper place to explain internals and share information,
I REALLY don't understand why you don't want this to happen on the archived
and indexed mailing list. I don't want to answer to one person, I want the
time I spend explaining stuff to be as widely available as possible (as I have
only so much time). So I restate that this needs to be asked on the
Created attachment 56936
change 'line' member of 'xmlNode' from 16-bit to 32-bit integer
No sane and accurate way to work around this problem exists.
So for the benefit of others like myself who couldn't care less
about ABI compatibility and who don't object to "biting the hand
that helps them" (is that funny or what?), here is the patch
that fixes the bug.
Created attachment 56951
ABI safe change 'line' member of 'xmlNode' from 16-bit to 32-bit integer
It popped into my head that it's trivial to support a 32-bit
line number in 'xmlNode' ABI compatible with applications that
were linked against an earlier 'libxml2' shared library. The
revised patch adds the 32-bit 'line' member to the end of
'xmlNode' and renames the original to 'line16'. Old
applications will reference the 16-bit value and new
applications will reference the 32-bit value. Old, badly
behaved applications that muck with memory owned by 'libxml2'
might have problems though. Apps linked against the new library
won't work with old libraries.
A minor 'soname' version tweak can be used to prevent new apps
from running on old libraries, but I have better things to do
right now than make this change.
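The layout idea behind the revised patch can be illustrated with a small sketch. The structs below are simplified stand-ins, not the real 'xmlNode', and 'new_node_set_line' is a hypothetical helper. The sketch clamps 'line16' at 65535 for large values; that is a choice made here for illustration, and the actual patch may fill the 16-bit member differently.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the node layout old applications were built against. */
typedef struct old_node {
    void *private_data;
    int type;
    const char *name;
    unsigned short line;    /* 16-bit line number */
    unsigned short extra;
} old_node;

/* Same leading members; the 32-bit line number is appended at the end. */
typedef struct new_node {
    void *private_data;
    int type;
    const char *name;
    unsigned short line16;  /* old member, renamed, same offset as before */
    unsigned short extra;
    unsigned int line;      /* new full-width line number */
} new_node;

/* Returns 1 if every old member sits at the same offset in the new layout. */
static int layout_is_backward_compatible(void)
{
    return offsetof(old_node, private_data) == offsetof(new_node, private_data)
        && offsetof(old_node, type) == offsetof(new_node, type)
        && offsetof(old_node, name) == offsetof(new_node, name)
        && offsetof(old_node, line) == offsetof(new_node, line16)
        && offsetof(old_node, extra) == offsetof(new_node, extra)
        && sizeof(new_node) >= sizeof(old_node);
}

/* Hypothetical setter: keeps both members filled so old readers still see a value. */
static void new_node_set_line(new_node *n, unsigned int line)
{
    assert(n != NULL);
    n->line = line;
    n->line16 = (line > 65535u) ? 65535u : (unsigned short)line;
}
```

Code compiled against the old layout only ever touches the leading members, so it keeps working as long as the library allocates the nodes itself. Code compiled against the new layout reads past the end of a node allocated by an old library, which is why new applications cannot run on old libraries.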
This increases each node in a document by 4 bytes, which is precisely what
I didn't want when line was made 16 bits. You want 32-bit line numbers
that can usually be computed at the application level. Most people want their
in-memory tree to stay small. This is a trade-off. Such a change would be made
only after discussing it on the mailing list, where people who care about libxml2
are subscribed, not in a bugzilla entry.
Still no ! Use the mailing-list about this.
The patch is for people who, like myself, need accurate line
numbers and don't have time to wrangle with tetchy developers or
spend writing complex work-arounds. I couldn't care less if you
use it in the formal release or not. People will find it and
use it if they need it. It takes about two minutes to download
and apply with
patch -p0 -b -i libxml2_linenum.patch
In case you haven't noticed, memory now costs $150 per gigabyte
and anyone who wants to load 100MB+ XML files is more likely to
want good line numbers than care about the 32-bits per node it
costs. I can load the entire XML file up with 'emacs' and go
straight to the line causing a problem, and my $600 HP doesn't
break a sweat--takes about one second. The only reason I use
the streaming API is insurance against the XML file growing
100x or 1000x sometime in the future. It's easy to see people
not wanting to bother with 'xmlReader' even though they process
huge XML files.
I think the arguments for and against the change are clear. However, for the same reasons that 'starlight' gave for making it a 32 bit integer, I would recommend making it a "size_t" instead. On 64 bit systems, nodes are huge already, so adding 6 bytes won't kill anyone.
We are having the same problem. We plan to apply the patch as you suggested.
patch -p0 -b -i libxml2_linenum.patch
Would you please let us know where we can download the patch from?
It's the attachment in comment #11. Might need
some tweaking for the latest version--haven't
The suggested patch works fine on version 2.6.32 (I used a text editor to apply it, not the patch tool, and built for Windows).
And there is no excessive memory consumption!
For example, parsing a 140 MB XML file has a 457 MB memory footprint without the patch, and 459 MB with the patch (and good line numbers ;-))