After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 315636 - Need a way to retreive progress with xmlTextReader
Need a way to retreive progress with xmlTextReader
Status: RESOLVED INCOMPLETE
Product: libxml2
Classification: Platform
Component: general
2.6.x
Other All
: Normal enhancement
: ---
Assigned To: Daniel Veillard
libxml QA maintainers
Depends on:
Blocks:
 
 
Reported: 2005-09-09 13:52 UTC by Kevin Puetz
Modified: 2009-01-19 20:10 UTC
See Also:
GNOME target: ---
GNOME version: Unversioned Enhancement


Attachments
First try at implementing this (13.00 KB, patch)
2005-12-01 00:12 UTC, Kevin Puetz
needs-work Details | Review
Patch for the zlib tweak (352 bytes, patch)
2005-12-01 00:14 UTC, Kevin Puetz
reviewed Details | Review

Description Kevin Puetz 2005-09-09 13:52:08 UTC
The current line number is available with xmlTextReaderGetParserLineNumber, but
the total number of lines is hard to get (the app would have to pre-read the
file as text and count them up, which is a rather wasteful thing to do. And it
may just stick at 1 if the file isn't pretty-printed in any way.

The current byte offset is available via xmlTextReaderByteConsumed, but can be
very expensive when an encoding conversion is happening, and in any case the max
value may not be the one retreieved via stat() if libxml is reading compressed
input.

So it would be nice to have a function to get an indication of the reader's
progress through the file, either current&max or perhaps a float going from 0 to 1.
Comment 1 Kevin Puetz 2005-12-01 00:12:58 UTC
Created attachment 55453 [details] [review]
First try at implementing this

libxml2-xmlTextReader-progress.patch adds an API for this capability to the
xmlTextReader, and implements it for local files, fds (when fstat works),
zlib-compressed local files, and HTTP. Building it on win32 requires a tweak to
zlib.def, as gzDirect is public (in zlib.h) but not explorted in the dll form
(it is, however, included in the static library). I think this is just an
oversight and will be reporting that to the upstream...

This is my first time in libxml; review comments would be welcome...
Comment 2 Kevin Puetz 2005-12-01 00:14:46 UTC
Created attachment 55454 [details] [review]
Patch for the zlib tweak

I'ev also attached the zlib tweak for anyone wanting to build this as-is.
Comment 3 Daniel Veillard 2005-12-01 08:45:19 UTC
Hum, this is far more intrusive than expected.
It looks like it would work (except for the zlib change, no way we can rely on
the other library being changed). 
One thing that suprises me is that you make all this work to try to get a precise
result but it can't be precise, for the very simple reason that the buffering is
in 4K usually and any progress informations extracted the way you did it is
not gonna be more precise than that. So I wonder if all this work is really needed
to achieve the rought estimate that is really needed in most cases. Also the
getsize callback could be extremely expensive (lseek() on compressed data :-\)
and I think it should be done only if needed.

In general the patch looks interesting, but a bit more massive than what I
expected. It should probably be discussed a bit on the mailing-list to get
other people feedback.

  thanks,

Daniel
Comment 4 Kevin Puetz 2005-12-01 14:52:23 UTC
The lseek is on the compressed (raw) data, not the uncompressed data (it's on  
the underlying fd, not the decompressed stream). I only want to find the  
'footer' portion of the gzip frame, not any part of the end of the contained  
data. So it's an lseek on the underlying on the fd I was kept after gzdopen  
(hence the changes that introduce struct gzContext) precisely to avoid needing  
zlib's expensive emulated gzseek (which would have to decompress all the  
intervening data to find the end).   
     
I think the zlib problem is a simple bug on their part (gzdirect is  declared 
exported, it's contained in the static library, and would be available right 
now in the shared object on *nix). It's just missing from the win32 .def file 
and so hidden in the dll form. As I said, I'm asking about that upstream. If 
they say that it was intentional (and I have no idea why this would be) the 
decision could be based on the same sort of heuristic you used to use to 
determine ret->compressed (which I also replaced with a call to gzdirect). But 
I was trying to write the *Size callbacks so that they were safe to call at 
any time, just in case you asked for the stat()/etc to only be done 
on-demand :-), so then it really would have to gztell/gzrewind/gzseek to  
get the z_stream back at the the original offset, because reading in the 4  
'detection' bytes would remove them from the input. Or I suppose we could try  
to arrach for the size callback to have access to the IO context, and then it  
could just push them onto the input buffer.  
  
What work are you thinking could have been avoided for an even more  
'imprecise' result (which would indeed be fine with me)? The other approach I  
considered would have been to base it on the amount of the compressed input  
that's been consumed (which is available, at z_stream.total_in) rather than on  
the fraction of the uncompressed data. That would make the gzip size callback  
a bit simpler (it wouldn't have to find the gzip footer, so it wouldn't need  
gzdirect or the seeking), but it would still need the fd so it could fstat to  
learn the compressed length. If you think you like that approach better I can  
certainly try it, but I originally discarded that approach because then the  
progress callback would also have to be directly aware that it was working  
with zlib; the measure of 'current' would have to be in compressed bytes dug  
out of the IO context instead of just using rawconsumed. That would probably  
meen we'd need an xmlFile*Tell callback that returned in the same units that  
xmlFile*Size did.  
Comment 5 Daniel Veillard 2006-01-04 10:35:53 UTC
I tried to apply the patch, the gzopen fais to compile on Linux
xmlIO.c: In function `xmlGzfileOpen_real':
xmlIO.c:986: error: `O_BINARY' undeclared (first use in this function)
xmlIO.c:986: error: (Each undeclared identifier is reported only once
xmlIO.c:986: error: for each function it appears in.)
xmlIO.c: In function `xmlGzfileSize':
xmlIO.c:1146: warning: implicit declaration of function `gzdirect'
xmlIO.c:1146: warning: nested extern declaration of `gzdirect'

 Apparently this wasn't tested there or something went wrong. One of the
includes also had a stray { instead of ; 

As is this can't be applied. I will postpone applying the patch until we have
a solution for compressed streams.

  thanks, but it's not yet ripe enough :-)

Daniel
Comment 6 Kevin Puetz 2006-01-04 15:07:39 UTC
No, I hadn't actually tested on linux, but I will shortly. I was still waiting to hear back on gzdirect.

I'm not really sure what happened with the stray {, since I did test that part; maybe I failed to refresh the patch before attaching or something. I'll check into that...
Comment 7 Daniel Veillard 2006-01-04 16:14:30 UTC
in libxml2-xmlTextReader-progress.patch around line 39:

@@ -331,6 +350,10 @@
                                         int len);
 XMLPUBFUN int XMLCALL
        xmlIOHTTPClose                  (void * context);
+
+XMLPUBFUN unsigned long XMLCALL
+       xmlIOHTTPSize                   (void * context) {
+
 #endif /* LIBXML_HTTP_ENABLED */

  I think it would be best to avoid gzdirect() if possible, it's not in zlib.h
and not exported from the library on linux either:

paphio:~/XML -> nm /usr/lib/libz.a | grep gzdirect
paphio:~/XML ->

Daniel
Comment 8 André Klapper 2006-10-01 04:13:11 UTC
kevin, can you update the patch according to daniel's last comment? thanks in advance.
Comment 9 Christoph Wurm 2009-01-19 20:10:00 UTC
Closing this bug report as no further information has been provided. Please feel free to reopen this bug if you can provide the information asked for.
Thanks!