GNOME Bugzilla – Bug 315636
Need a way to retreive progress with xmlTextReader
Last modified: 2009-01-19 20:10:00 UTC
The current line number is available with xmlTextReaderGetParserLineNumber, but the total number of lines is hard to get (the app would have to pre-read the file as text and count them up, which is a rather wasteful thing to do. And it may just stick at 1 if the file isn't pretty-printed in any way. The current byte offset is available via xmlTextReaderByteConsumed, but can be very expensive when an encoding conversion is happening, and in any case the max value may not be the one retreieved via stat() if libxml is reading compressed input. So it would be nice to have a function to get an indication of the reader's progress through the file, either current&max or perhaps a float going from 0 to 1.
Created attachment 55453 [details] [review] First try at implementing this libxml2-xmlTextReader-progress.patch adds an API for this capability to the xmlTextReader, and implements it for local files, fds (when fstat works), zlib-compressed local files, and HTTP. Building it on win32 requires a tweak to zlib.def, as gzDirect is public (in zlib.h) but not explorted in the dll form (it is, however, included in the static library). I think this is just an oversight and will be reporting that to the upstream... This is my first time in libxml; review comments would be welcome...
Created attachment 55454 [details] [review] Patch for the zlib tweak I'ev also attached the zlib tweak for anyone wanting to build this as-is.
Hum, this is far more intrusive than expected. It looks like it would work (except for the zlib change, no way we can rely on the other library being changed). One thing that suprises me is that you make all this work to try to get a precise result but it can't be precise, for the very simple reason that the buffering is in 4K usually and any progress informations extracted the way you did it is not gonna be more precise than that. So I wonder if all this work is really needed to achieve the rought estimate that is really needed in most cases. Also the getsize callback could be extremely expensive (lseek() on compressed data :-\) and I think it should be done only if needed. In general the patch looks interesting, but a bit more massive than what I expected. It should probably be discussed a bit on the mailing-list to get other people feedback. thanks, Daniel
The lseek is on the compressed (raw) data, not the uncompressed data (it's on the underlying fd, not the decompressed stream). I only want to find the 'footer' portion of the gzip frame, not any part of the end of the contained data. So it's an lseek on the underlying on the fd I was kept after gzdopen (hence the changes that introduce struct gzContext) precisely to avoid needing zlib's expensive emulated gzseek (which would have to decompress all the intervening data to find the end). I think the zlib problem is a simple bug on their part (gzdirect is declared exported, it's contained in the static library, and would be available right now in the shared object on *nix). It's just missing from the win32 .def file and so hidden in the dll form. As I said, I'm asking about that upstream. If they say that it was intentional (and I have no idea why this would be) the decision could be based on the same sort of heuristic you used to use to determine ret->compressed (which I also replaced with a call to gzdirect). But I was trying to write the *Size callbacks so that they were safe to call at any time, just in case you asked for the stat()/etc to only be done on-demand :-), so then it really would have to gztell/gzrewind/gzseek to get the z_stream back at the the original offset, because reading in the 4 'detection' bytes would remove them from the input. Or I suppose we could try to arrach for the size callback to have access to the IO context, and then it could just push them onto the input buffer. What work are you thinking could have been avoided for an even more 'imprecise' result (which would indeed be fine with me)? The other approach I considered would have been to base it on the amount of the compressed input that's been consumed (which is available, at z_stream.total_in) rather than on the fraction of the uncompressed data. That would make the gzip size callback a bit simpler (it wouldn't have to find the gzip footer, so it wouldn't need gzdirect or the seeking), but it would still need the fd so it could fstat to learn the compressed length. If you think you like that approach better I can certainly try it, but I originally discarded that approach because then the progress callback would also have to be directly aware that it was working with zlib; the measure of 'current' would have to be in compressed bytes dug out of the IO context instead of just using rawconsumed. That would probably meen we'd need an xmlFile*Tell callback that returned in the same units that xmlFile*Size did.
I tried to apply the patch, the gzopen fais to compile on Linux xmlIO.c: In function `xmlGzfileOpen_real': xmlIO.c:986: error: `O_BINARY' undeclared (first use in this function) xmlIO.c:986: error: (Each undeclared identifier is reported only once xmlIO.c:986: error: for each function it appears in.) xmlIO.c: In function `xmlGzfileSize': xmlIO.c:1146: warning: implicit declaration of function `gzdirect' xmlIO.c:1146: warning: nested extern declaration of `gzdirect' Apparently this wasn't tested there or something went wrong. One of the includes also had a stray { instead of ; As is this can't be applied. I will postpone applying the patch until we have a solution for compressed streams. thanks, but it's not yet ripe enough :-) Daniel
No, I hadn't actually tested on linux, but I will shortly. I was still waiting to hear back on gzdirect. I'm not really sure what happened with the stray {, since I did test that part; maybe I failed to refresh the patch before attaching or something. I'll check into that...
in libxml2-xmlTextReader-progress.patch around line 39: @@ -331,6 +350,10 @@ int len); XMLPUBFUN int XMLCALL xmlIOHTTPClose (void * context); + +XMLPUBFUN unsigned long XMLCALL + xmlIOHTTPSize (void * context) { + #endif /* LIBXML_HTTP_ENABLED */ I think it would be best to avoid gzdirect() if possible, it's not in zlib.h and not exported from the library on linux either: paphio:~/XML -> nm /usr/lib/libz.a | grep gzdirect paphio:~/XML -> Daniel
kevin, can you update the patch according to daniel's last comment? thanks in advance.
Closing this bug report as no further information has been provided. Please feel free to reopen this bug if you can provide the information asked for. Thanks!