GNOME Bugzilla – Bug 325533
xmlNode member 'line' is 16-bit integer, many XML files are longer than 65535 lines
Last modified: 2008-06-20 10:29:50 UTC
The 'line' member of 'xmlNode' and similar structures/classes is a 16-bit short integer and cannot represent line numbers greater than 65535. We are working with XML files of 500,000 lines and more, so diagnostic messages referencing line numbers are inaccurate. 'line' should be a 32-bit integer.
Can't fix, there is not enough space in the node structure to extend it without breaking ABI compatibility. In practice it should not be a problem. When processing large documents, streaming is usually required, which means nodes are processed in sequence; just keep track of the last line number in a 32-bit unsigned counter in the program and increase it by 2^16 whenever the line number sequence jumps back. Daniel
I don't quite follow the workaround. I see that 'xmlTextReaderGetParserLineNumber' appears to return a 32-bit line number. However, it's not synchronized with the node where the cursor is pointing. Does a callback exist somewhere that can be used to count lines up to the cursor node?
The workaround suggested is not coming from the library; it must be implemented in your code. You fetch one node at a time, then check how line numbers evolve there. Daniel
How does one get 'xmlTextReader' to fetch exactly one node at a time? In my observation 'xmlTextReaderExpand' sucks in a somewhat arbitrary number of lines past the end of the node being acquired. The line numbers that result are equally random.
Please don't use bugzilla as a support channel. Use the mailing-list ! Daniel
You have admitted yourself that this is a bug by marking it as WONTFIX. So the least you can do is provide a clear, useable workaround in the bug tracking system so that others may find the solution here when they search for the bug that you have acknowledged exists.
Comment #4 indicates you don't seem to follow what Expand() actually does nor how the xmlReader is supposed to be used to step on each node by a succession of Read(). Hence this is no longer a matter of giving a workaround but a matter of explaining how the xmlReader is to be used. There is a free ! mailing-list for this kind of support as well as free! online documentation on this precise subject, nothing can justify duplicating that information in bugzilla ! You get free tools, advice and help, don't bite the hand which is helping, I have zero reason to accept being harassed through bugzilla, and I gave the explanations specific to this entry, so pretty please ! Daniel
I am using 'xmlTextReaderExpand' correctly as documented, and it works fine. However, you yourself state in another e-mail conversation I googled a while back that no guarantees are given as to exactly how much extra data beyond the current node this function will load each time it's called. The loading of extra data is clearly exhibited when one walks the in-memory 'xmltree', and I had to explicitly code to avoid referencing data in the next node forward. I did this after reading your comments that 'Expand' is allowed to read as much as it wants so long as it brings in all of the node referenced in the call parameter. Therefore the line number returned by 'xmlTextReaderGetParserLineNumber' is quite arbitrary and of no value for determining where a problem is located in the XML stream. That is unless you will stop trying to baffle with BS and answer the question, or FIX THE BUG.
I said, the bug CANNOT BE FIXED. Or it must be a release of libxml3 with a different soname, and I WON'T DO IT. There is only a WONTFIX option and no CANTFIX option in bugzilla, so forget about your hope to see that unsigned short grow to a different size, this won't happen, because this can't happen ! I answered the question to my best understanding of your need. If you need more detail this is outside of this specific bug. Now you can accuse me of bullshitting, I don't appreciate this. Bugzilla is *not* the proper place to explain internals and share information, I REALLY don't understand why you don't want this to happen on the archived and indexed mailing list. I don't want to answer to one person, I want the time I spend explaining stuff to be as widely available as possible (as I have only so much time). So I restate that this needs to be asked on the mailing-list. Daniel
Created attachment 56936: change 'line' member of 'xmlNode' from 16-bit to 32-bit integer. No sane and accurate way to work around this problem exists. So for the benefit of others like myself who couldn't care less about ABI compatibility and who don't object to "biting the hand that helps them" (is that funny or what?), here is the patch that fixes the bug.
Created attachment 56951: ABI safe change 'line' member of 'xmlNode' from 16-bit to 32-bit integer. It popped into my head that it's trivial to support a 32-bit line number in 'xmlNode' ABI compatible with applications that were linked against an earlier 'libxml2' shared library. The revised patch adds the 32-bit 'line' member to the end of 'xmlNode' and renames the original to 'line16'. Old applications will reference the 16-bit value and new applications will reference the 32-bit value. Old, badly behaved applications that muck with memory owned by 'libxml2' might have problems though. Apps linked against the new library won't work with old libraries. A minor 'soname' version tweak can be used to prevent new apps from running on old libraries, but I have better things to do right now than make this change.
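The layout idea in the comment above can be illustrated with simplified stand-in structs. These are NOT the real 'xmlNode' definitions and this is not the attached patch: 'old_node', 'new_node', 'new_node_set_line' and 'read_line_as_old_app' are hypothetical names, and storing the low 16 bits in 'line16' is an assumption (clamping at 65535 would be an equally plausible choice). The sketch only shows that a member appended at the end leaves the offsets of the existing members untouched, so code compiled against the old layout still reads its 16-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the layout that old applications were compiled against. */
struct old_node {
    void *content;
    unsigned short line;    /* original 16-bit line number */
    unsigned short extra;
};

/* Stand-in for the patched layout: same prefix, new member appended. */
struct new_node {
    void *content;
    unsigned short line16;  /* original member, renamed, same offset */
    unsigned short extra;
    uint32_t line;          /* 32-bit line number added at the end */
};

/* Keep both members in sync (assumption: line16 holds the low 16 bits). */
static void new_node_set_line(struct new_node *n, uint32_t line)
{
    n->line = line;
    n->line16 = (unsigned short)(line & 0xFFFFu);
}

/* Simulate an old binary reading 'line' at the offset it was built with. */
static unsigned short read_line_as_old_app(const struct new_node *n)
{
    unsigned short v;
    memcpy(&v, (const char *)n + offsetof(struct old_node, line), sizeof v);
    return v;
}
```

As the comment notes, this only holds for applications that access nodes through pointers handed out by the library; code that allocates or copies nodes using the old structure size would not see the appended member.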
This increases each node in a document by 4 bytes, which is precisely what I didn't want when line was made to be 16 bits. You want 32-bit line numbers, which can usually be computed at the application level. Most people want their in-memory tree to stay small. This is a trade-off. Such a change would be made only after discussing it on the mailing list, where people who care about libxml2 are subscribed, not in a bugzilla entry. Still no ! Use the mailing-list about this. Daniel
The patch is for people who, like myself, need accurate line numbers and don't have time to wrangle with tetchy developers or spend time writing complex work-arounds. I couldn't care less if you use it in the formal release or not. People will find it and use it if they need it. It takes about two minutes to download and apply with
    patch -p0 -b -i libxml2_linenum.patch
In case you haven't noticed, memory now costs $150 per gigabyte and anyone who wants to load 100MB+ XML files is more likely to want good line numbers than care about the 32 bits per node it costs. I can load the entire XML file up with 'emacs' and go straight to the line causing a problem, and my $600 HP doesn't break a sweat--takes about one second. The only reason I use the streaming API is insurance against the XML file growing 100x or 1000x sometime in the future. It's easy to see people not wanting to bother with 'xmlReader' even though they process huge XML files.
I think the arguments for and against the change are clear. However, for the same reasons that 'starlight' gave for making it a 32 bit integer, I would recommend making it a "size_t" instead. On 64 bit systems, nodes are huge already, so adding 6 bytes won't kill anyone.
Hi, We are having the same problem. We plan to apply the patch as you suggested: patch -p0 -b -i libxml2_linenum.patch Would you please let us know where we can download the patch from? Thanks,
It's the attachment in comment #11. Might need some tweaking for the latest version--haven't checked it.
The suggested patch works fine on version 2.6.32 (I used a text editor to make the changes, not the patch tool, and built for Windows). And there is no excessive memory consumption ! For example, parsing a 140 MB XML file has a 457 MB memory footprint without the patch, and 459 MB with the patch (and good line numbers ;-))