Bug 325533 – xmlNode member 'line' is 16-bit integer, many XML files are longer than 65535 lines



Summary:	xmlNode member 'line' is 16-bit integer, many XML files are longer than 65535...


Status:	RESOLVED WONTFIX

Product:	libxml2
Classification:	Platform
Component:	general
Version:	2.6.22
Hardware:	Other All

Importance:	Normal minor
Target Milestone:	---
Assigned To:	Daniel Veillard
QA Contact:	libxml QA maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2006-01-02 17:38 UTC by Starlight
Modified:	2008-06-20 10:29 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
change 'line' member of 'xmlNode' from 16-bit to 32-bit integer (2.47 KB, patch) 2006-01-07 21:46 UTC, Starlight	rejected
ABI safe change 'line' member of 'xmlNode' from 16-bit to 32-bit integer (2.93 KB, patch) 2006-01-08 04:08 UTC, Starlight	none

Description Starlight 2006-01-02 17:38:51 UTC

Please describe the problem:
The 'line' member of 'xmlNode' and similar structures/classes is a 16-bit short
integer and cannot represent line numbers greater than 65535.  We are working
with 500,000 line and greater XML files, so diagnostic messages referencing line
numbers are inaccurate.  'line' should be a 32-bit integer.

Comment 1 Daniel Veillard 2006-01-04 16:19:10 UTC

Can't fix: there is not enough space in the node structure to extend it
without breaking ABI compatibility.
In practice it should not be a problem. When processing large documents,
streaming is usually required, which means nodes are processed in sequence.
Just keep track of the last line number in a 32-bit unsigned counter in the
program and increase it by 2^16 whenever the line number jumps backwards.

Daniel
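
The counter described above can be sketched in plain C as follows. This is only an illustration, not code from libxml2: 'LineTracker' and 'line_tracker_update' are hypothetical names, and the sketch assumes the 16-bit 'line' value wraps around modulo 65536 (as the comment implies) rather than saturating at 65535. It also assumes nodes are visited in document order and that no two consecutive nodes are 65536 or more lines apart, since such a gap cannot be detected from the 16-bit value alone.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libxml2: rebuilds a 32-bit line
 * number from the 16-bit value stored in each node, assuming the
 * 16-bit value wraps modulo 65536 and nodes arrive in document order. */
typedef struct {
    uint32_t wraps;  /* how many times the 16-bit value jumped backwards */
    uint16_t last;   /* last 16-bit line number seen */
} LineTracker;

static uint32_t line_tracker_update(LineTracker *t, uint16_t line16)
{
    /* A smaller value than the previous one means the 16-bit
     * counter wrapped, so add another 2^16 to the base. */
    if (line16 < t->last)
        t->wraps++;
    t->last = line16;
    return (t->wraps << 16) | line16;
}
```

A caller would zero-initialise one 'LineTracker' per document and pass it the 'line' value of each node as the node is visited.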

Comment 2 Starlight 2006-01-06 01:03:50 UTC

I don't quite follow the workaround.  I see that 
'xmlTextReaderGetParserLineNumber' appears to return a 32-bit 
line number.  However, it's not synchronized with 
the node where the cursor is pointing.  Does a callback exist 
somewhere that can be used to count lines up to the cursor node?

Comment 3 Daniel Veillard 2006-01-06 09:25:35 UTC

The suggested workaround does not come from the library; it must be
implemented in your code. You fetch one node at a time, then check
how the line numbers evolve there.

Daniel

Comment 4 Starlight 2006-01-06 13:49:54 UTC

How does one get 'xmlTextReader' to fetch exactly one node at a 
time?  In my observation 'xmlTextReaderExpand' sucks in a 
somewhat arbitrary number of lines past the end of the node 
being acquired.  The line numbers that result are equally 
random.

Comment 5 Daniel Veillard 2006-01-06 14:31:07 UTC

Please don't use bugzilla as a support channel.
Use the mailing-list !

Daniel

Comment 6 Starlight 2006-01-06 14:36:15 UTC

You have admitted yourself that this is a bug by
marking it as WONTFIX.  So the least you can do
is provide a clear, useable workaround in the bug
tracking system so that others may find the solution
here when they search for the bug that you have
acknowledged exists.

Comment 7 Daniel Veillard 2006-01-06 14:54:50 UTC

Comment #4 indicates you don't seem to follow what Expand() actually does,
nor how the xmlReader is supposed to be used to step on each node by a
succession of Read() calls. Hence this is no longer a matter of giving a
workaround but a matter of explaining how the xmlReader is to be used. There
is a free! mailing-list for this kind of support as well as free! online
documentation on this precise subject; nothing can justify duplicating that
information in bugzilla! You get free tools, advice and help, so don't bite
the hand which is helping. I have zero reason to accept being harassed through
bugzilla, and I gave the explanations specific to this entry, so pretty please!

Daniel

Comment 8 Starlight 2006-01-06 15:08:21 UTC

I am using 'xmlTextReaderExpand' correctly as documented, and it 
works fine.

However, you yourself state in another e-mail conversation I 
googled a while back that no guarantees are given as to exactly 
how much extra data beyond the current node this function will 
load each time it's called.  The loading of extra data is 
clearly exhibited when one walks the in-memory 'xmltree', and I 
had to explicitly code to avoid referencing data in the next 
node forward.  I did this after reading your comments that 
'Expand' is allowed to read as much as it wants so long as it 
brings in all of the node referenced in the call parameter.

Therefore the line number returned by 
'xmlTextReaderGetParserLineNumber' is quite arbitrary and of no 
value for determining where a problem is located in the XML 
stream.  That is unless you will stop trying to baffle with BS 
and answer the question, or FIX THE BUG.

Comment 9 Daniel Veillard 2006-01-06 15:21:18 UTC

I said the bug CANNOT BE FIXED. Fixing it would require a release of libxml3
with a different soname, and I WON'T DO IT. There is only a WONTFIX
option and no CANTFIX option in bugzilla, so forget about your hope
to see that unsigned short grow to a different size. This won't happen,
because this can't happen!

I answered the question to the best of my understanding of your need. If you
need more detail, this is outside the scope of this specific bug. Now you accuse
me of bullshitting, and I don't appreciate this.
Bugzilla is *not* the proper place to explain internals and share information.
I REALLY don't understand why you don't want this to happen on the archived
and indexed mailing list. I don't want to answer to one person; I want the
time I spend explaining stuff to be as widely available as possible (as I have
only so much time). So I restate that this needs to be asked on the
mailing-list.

Daniel

Comment 10 Starlight 2006-01-07 21:46:29 UTC

Created attachment 56936 [details] [review]
change 'line' member of 'xmlNode' from 16-bit to 32-bit integer

No sane and accurate way to work around this problem exists.
So for the benefit of others like myself who couldn't care less 
about ABI compatibility and who don't object to "biting the hand 
that helps them" (is that funny or what?), here is the patch 
that fixes the bug.

Comment 11 Starlight 2006-01-08 04:08:45 UTC

Created attachment 56951 [details] [review]
ABI safe change 'line' member of 'xmlNode' from 16-bit to 32-bit integer

It popped into my head that it's trivial to support a 32-bit 
line number in 'xmlNode' ABI compatible with applications that 
were linked against an earlier 'libxml2' shared library.  The 
revised patch adds the 32-bit 'line' member to the end of 
'xmlNode' and renames the original to 'line16'.  Old 
applications will reference the 16-bit value and new 
applications will reference the 32-bit value.  Old, badly 
behaved applications that muck with memory owned by 'libxml2' 
might have problems though.  Apps linked against the new library 
won't work with old libraries.

A minor 'soname' version tweak can be used to prevent new apps 
from running on old libraries, but I have better things to do 
right now than make this change.
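
The layout idea in this comment can be illustrated with a simplified sketch. The structs below are stand-ins, not the real 'xmlNode' definition or the attached patch: 'OldNode', 'NewNode', 'private_data' and 'new_node_set_line' are hypothetical names. The sketch also assumes the legacy 16-bit field is saturated at 65535 for large line numbers, which the comment does not specify.

```c
#include <stddef.h>

/* Simplified stand-in for the original layout (not the real xmlNode). */
typedef struct OldNode {
    void *private_data;
    int type;
    unsigned short line;    /* 16-bit line number */
    unsigned short extra;
} OldNode;

/* Same leading layout; the 32-bit field is appended at the end. */
typedef struct NewNode {
    void *private_data;
    int type;
    unsigned short line16;  /* old field, renamed, same offset as before */
    unsigned short extra;
    unsigned int line;      /* new full-width line number */
} NewNode;

/* Old binaries read the 16-bit field at its old offset, so it must not move. */
_Static_assert(offsetof(OldNode, line) == offsetof(NewNode, line16),
               "legacy field must keep its offset");

/* Hypothetical setter that keeps both fields populated.
 * Assumption: the legacy field saturates at 65535. */
static void new_node_set_line(NewNode *n, unsigned int line)
{
    n->line = line;
    n->line16 = (unsigned short)(line > 65535u ? 65535u : line);
}
```

Because the struct grows, this only works when the library itself allocates every node; code that embeds or allocates nodes with the old size would still break.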

Comment 12 Daniel Veillard 2006-01-08 20:07:29 UTC

This increases the size of every node in a document by 4 bytes, which is
precisely what I wanted to avoid when line was made 16 bits. You want 32-bit
line numbers, which can usually be computed at the application level. Most
people want their in-memory tree to stay small. This is a trade-off. Such a
change would be made only after discussing it on the mailing list, where people
who care about libxml2 are subscribed, not on a bugzilla entry.
Still no! Use the mailing-list for this.

Daniel

Comment 13 Starlight 2006-01-08 20:43:38 UTC

The patch is for people who, like myself, need accurate line 
numbers and don't have time to wrangle with tetchy developers or 
spend it writing complex work-arounds.  I couldn't care less if you 
use it in the formal release or not.  People will find it and 
use it if they need it.  It takes about two minutes to download 
and apply with

   patch -p0 -b -i libxml2_linenum.patch

In case you haven't noticed, memory now costs $150 per gigabyte 
and anyone who wants to load 100MB+ XML files is more likely to 
want good line numbers than care about the 32-bits per node it 
costs.  I can load the entire XML file up with 'emacs' and go 
straight to the line causing a problem, and my $600 HP doesn't 
break a sweat--takes about one second.  The only reason I use 
the streaming API is insurance against the XML file growing 
100x or 1000x sometime in the future.  It's easy to see people 
not wanting to bother with 'xmlReader' even though they process
huge XML files.

Comment 14 Stefan Behnel 2007-11-25 09:28:48 UTC

I think the arguments for and against the change are clear. However, for the same reasons that 'starlight' gave for making it a 32 bit integer, I would recommend making it a "size_t" instead. On 64 bit systems, nodes are huge already, so adding 6 bytes won't kill anyone.

Comment 15 Linjuan Gong 2008-05-23 15:35:00 UTC

Hi,

We are having the same problem. We plan to apply the following patch as you suggested.
patch -p0 -b -i libxml2_linenum.patch

Would you please let us know where we can download the patch from?

Thanks,

Comment 16 Starlight 2008-05-23 15:42:47 UTC

It's the attachment in comment #11.  Might need
some tweaking for the latest version--haven't
checked it.

Comment 17 Bruce BARDOU 2008-06-20 10:29:50 UTC

The suggested patch works fine on version 2.6.32 (I applied the changes by hand in a text editor rather than with the patch tool, and built for Windows).
And there is no excessive memory consumption!
For example, parsing a 140 MB XML file has a 457 MB memory footprint without the patch, and 459 MB with the patch (and good line numbers ;-))