GNOME Bugzilla – Bug 732209
ZIP64 extensions not supported
Last modified: 2014-12-03 07:15:46 UTC
Neither the ZIP reader nor the writer supports ZIP64 extensions, and
neither of them fails properly in situations where it would need this
support:

- On write, libgsf will allow the "number of archive members" field and
  the file offset/length fields to wrap around, producing a corrupt ZIP.

- On read, libgsf will treat the special -1 markers literally
  (0xffffffff for file offsets/lengths, 0xffff for the number of archive
  members), causing truncated output or reads from incorrect file
  offsets.
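For context, those sentinels come straight from the format: when a
32-bit size/offset field holds 0xffffffff (or a 16-bit count holds
0xffff), the real value lives in the ZIP64 extra field, header ID
0x0001 (APPNOTE.TXT section 4.5.3). A sketch of the read-side
resolution a fixed reader would need; identifiers here are illustrative,
not libgsf's actual internals:

    #include <glib.h>
    #include <string.h>

    /* Illustrative sketch: resolve a central directory entry whose
     * 32-bit fields hold the ZIP64 sentinels, using the ZIP64 extra
     * field (header ID 0x0001). */
    typedef struct {
        guint32 usize32, csize32, offset32; /* raw header fields */
        guint64 usize, csize, offset;       /* resolved values   */
    } Entry;

    static guint64
    le64 (guint8 const *p)
    {
        guint64 v;
        memcpy (&v, p, 8);
        return GUINT64_FROM_LE (v);
    }

    static void
    resolve_entry (Entry *e, guint8 const *extra, gsize extra_len)
    {
        e->usize  = e->usize32;
        e->csize  = e->csize32;
        e->offset = e->offset32;

        while (extra_len >= 4) {
            guint16 id  = extra[0] | (extra[1] << 8);
            gsize   len = extra[2] | (extra[3] << 8);

            if (4 + len > extra_len)
                break;  /* malformed extra field */

            if (id == 0x0001) {
                /* The 64-bit values appear in a fixed order, each
                 * present only if the corresponding 32-bit field is
                 * the sentinel.  (Length checks inside the record
                 * are omitted here for brevity.) */
                guint8 const *p = extra + 4;
                if (e->usize32 == 0xffffffffu) {
                    e->usize = le64 (p);
                    p += 8;
                }
                if (e->csize32 == 0xffffffffu) {
                    e->csize = le64 (p);
                    p += 8;
                }
                if (e->offset32 == 0xffffffffu)
                    e->offset = le64 (p);
                return;
            }
            extra     += 4 + len;
            extra_len -= 4 + len;
        }
    }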
I would guess that zip64 didn't exist when this code was written. Do you have a pointer to documentation? The library as a whole doesn't have this problem, I think. We've been using 64-bit offsets from very early on.
Documentation seems to be at
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

I don't see why we cannot have at least minimal support for zip64. For
starters, three issues:

1. Members >4G.
2. #Files >64k.
3. Archive size >4G.

We should start at the read side, where we can test with files made by
some other tool. Can you provide small test cases for 1 and 2? (I.e.,
something that compresses really well.)
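In case it helps, inputs that compress extremely well are easy to
generate; here is a hypothetical generator for case 1 (the file name
big.dat and the program itself are illustrative, not part of any tool
mentioned here):

    /* Hypothetical test-data generator: write >4 GiB of zeros,
     * which deflate compresses extremely well. */
    #include <stdio.h>

    int main (void)
    {
        static const char buf[1 << 20]; /* 1 MiB of zeros */
        FILE *f = fopen ("big.dat", "wb");
        if (!f)
            return 1;
        for (int i = 0; i < 4097; i++)  /* 4097 MiB > 4 GiB */
            if (fwrite (buf, 1, sizeof buf, f) != sizeof buf)
                return 1;
        return fclose (f) != 0;
    }

Compressing big.dat then yields a small archive whose member still
needs ZIP64 for its uncompressed size; case 2 just needs more than
65535 tiny files.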
Created attachment 279284 [details]
Zip file with member > 4 GB

I've gzipped the Zip file (!) to get around Bugzilla file size limits.
Created attachment 279285 [details]
Zip file with > 64k members

Gzipped for Bugzilla file size limit.
Those were created with the zip command from Info-ZIP 3.0.
(2) [many files] is now handled on the read side. And things are
faster.

(3) [large archive] is partially handled on the read side.
(1) [large files] is now handled on the read side. A pile of 64-bit issues were in the way.
(2) [many files] is now handled on the write side too.

(3) [large archive] is handled.

The code for (1) [large files] is there, but not on by default. We
need some magic to turn it on only when needed.
This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
The ZIP64 output side has a few problems:

- The zip64 property is installed as PROP_DEFLATE_LEVEL, so the
  application cannot set it.

- There are some datatype issues with the zip64 flag: it's really an
  integer, but is declared gboolean in GsfOutfileZip and guint8 in
  zip_close_root. The former is probably harmless, but the latter
  causes the "zip64 == -1" test in zip_close_root to always evaluate
  to false, so the ZIP64 end-of-central-directory record is always
  written in auto mode. (The compiler doesn't warn about this because
  -Wtype-limits is not enabled.)

- In auto mode, for both small and large files, libgsf writes corrupt
  local file headers: the sizes are 0xFFFFFFFF but the extra field is
  of type ZIP_DIRENT_EXTRA_FIELD_IGNORE.

- In !zip64 mode, libgsf will disable all ZIP64 extensions *except*
  the ZIP64 end of central directory record. When the ZIP file
  requires ZIP64 extensions (e.g. a member is too large), libgsf will
  silently write corrupt metadata rather than failing.

- The code to write the data descriptor record is never executed
  because ZIP_DIRENT_FLAGS_HAS_DDESC is never set. As a result, if the
  output file is not seekable, libgsf takes the
  zip_header_patch_sizes() path and dies on a critical error:
  "g_io_channel_seek_position: assertion 'channel->is_seekable'
  failed". This did not occur in the previous release.

- With a non-seekable output, auto mode should force ZIP64 (or cause a
  fatal error). The field sizes in the data descriptor block depend on
  the existence of the ZIP64 extra field, and the size of the member
  isn't yet known when writing the extra field.

- The call to gsf_output_write() in zip_ddesc_write() skips the
  ZIP_DDESC_SIGNATURE; is this intentional?

- The ZIP spec seems to say that if a data descriptor is written, the
  crc/usize/csize fields should be zero even in ZIP64 mode.
> - The zip64 property is installed as PROP_DEFLATE_LEVEL, so the
>   application cannot set it.

Fixed.

> - There are some datatype issues with the zip64 flag: it's really an
>   integer, but is declared gboolean in GsfOutfileZip and guint8 in
>   zip_close_root.

gint8 was meant. Fixed. (A sketch of the pitfall follows below.)

> - In auto mode, for both small and large files, libgsf writes corrupt
>   local file headers: the sizes are 0xFFFFFFFF but the extra field is
>   of type ZIP_DIRENT_EXTRA_FIELD_IGNORE.

Fixed.

> - In !zip64 mode, libgsf will disable all ZIP64 extensions *except*
>   the ZIP64 end of central directory record. When the ZIP file
>   requires ZIP64 extensions (e.g. a member is too large), libgsf will
>   silently write corrupt metadata rather than failing.

I don't understand. Please elaborate.

> - The code to write the data descriptor record is never executed
>   because ZIP_DIRENT_FLAGS_HAS_DDESC is never set. As a result, if
>   the output file is not seekable, libgsf takes the
>   zip_header_patch_sizes() path and dies on a critical error:
>   "g_io_channel_seek_position: assertion 'channel->is_seekable'
>   failed". This did not occur in the previous release.

We were off spec and always used DDESC in the deflate case. (And only
then, so the seek problem cannot be new.) Do you have sample code that
causes that problem to exhibit itself?

> - The call to gsf_output_write() in zip_ddesc_write() skips the
>   ZIP_DDESC_SIGNATURE; is this intentional?

Ah, no -- leftover debug junk. I do note, however, that it is unclear
whether a signature should be written. Fixed, but as you note not
currently reachable.

Note that even with all these problems, the generated archives passed
"unzip -t". Clearly not an exhaustive test.
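For the record, a minimal standalone illustration of the signedness
pitfall behind the guint8/gint8 mix-up (not libgsf code; build with
pkg-config glib-2.0):

    #include <glib.h>
    #include <stdio.h>

    int main (void)
    {
        guint8 u = -1;  /* stores 255 */
        gint8  s = -1;  /* stores -1  */

        /* Integer promotion turns u into (int) 255, so comparing it
         * with -1 can never be true -- exactly the "zip64 == -1" bug.
         * gcc flags this only with -Wtype-limits (part of -Wextra). */
        printf ("%s\n", u == -1 ? "reached" : "unreachable for guint8");
        printf ("%s\n", s == -1 ? "reached for gint8" : "unreachable");
        return 0;
    }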
We now use a DDESC if (and only if) we determine that we cannot seek.
This hasn't seen a lot of testing, though.
> - In !zip64 mode, libgsf will disable all ZIP64 extensions *except*
>   the ZIP64 end of central directory record. When the ZIP file
>   requires ZIP64 extensions (e.g. a member is too large), libgsf will
>   silently write corrupt metadata rather than failing.
>
> I don't understand. Please elaborate.

If the zip64 property is FALSE, and the application attempts to write
data that will not fit in an old-style ZIP, libgsf should fail rather
than writing a corrupt file. (Otherwise, setting zip64 to FALSE
wouldn't be safe.) Currently, libgsf writes corrupt metadata just as it
used to, but with the addition of a ZIP64 end of central directory
record.

> We were off spec and always used DDESC in the deflate case. (And
> only then, so the seek problem cannot be new.) Do you have sample
> code that causes that problem to exhibit itself?

To test non-seekable sinks, I just changed gsf_create() in tools/gsf.c
to do

    GIOChannel *chan = g_io_channel_unix_new (1);
    g_io_channel_set_encoding (chan, NULL, NULL);
    dest = gsf_output_iochannel_new (chan);

and then ran the tool as

    gsf createzip out.zip dir | cat > out.zip

> - The call to gsf_output_write() in zip_ddesc_write() skips the
>   ZIP_DDESC_SIGNATURE; is this intentional?
>
> Ah, no -- leftover debug junk. I do note, however, that it is unclear
> whether a signature should be written.

See paragraph 4.3.9.4 of the spec.

> Note that even with all these problems, the generated archives passed
> "unzip -t". Clearly not an exhaustive test.

I'm not surprised. I've been testing with a hex editor.

As of e420fc52, the Zip writer has the following problems:

- The zip64 local variable in zip_close_root is still guint8, as is
  the zip64 field in GsfZipDirent. Both of these are causing
  conditionals to produce unintended results.

- In auto mode, when writing a small file, the central directory
  header always puts the usize/csize in the ZIP64 extra field. To
  maintain consistency with the local file header, the central
  directory should include the extra field only if the sink is
  non-seekable.

- In auto mode with a small file and a non-seekable sink, the ZIP64
  end of central directory record and locator are not written, even
  though ZIP64 extra fields are used.

- As described above, invalid ZIP files can be written when
  zip64 == FALSE.
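The non-seekable experiment above can also be done without patching the
tool; a self-contained sketch, assuming libgsf's public
GsfOutputIOChannel/GsfOutfileZip API (file name and contents are
arbitrary, error handling trimmed):

    /* Build: gcc t.c $(pkg-config --cflags --libs libgsf-1) */
    #include <gsf/gsf.h>

    int main (void)
    {
        gsf_init ();

        /* fd 1 (stdout), when piped through cat, is not seekable. */
        GIOChannel *chan = g_io_channel_unix_new (1);
        g_io_channel_set_encoding (chan, NULL, NULL);

        GsfOutput  *sink  = gsf_output_iochannel_new (chan);
        GsfOutfile *zip   = gsf_outfile_zip_new (sink, NULL);
        GsfOutput  *child = gsf_outfile_new_child (zip, "hello.txt",
                                                   FALSE);

        gsf_output_write (child, 6, (guint8 const *) "hello\n");
        gsf_output_close (child);
        g_object_unref (child);

        gsf_output_close (GSF_OUTPUT (zip));
        g_object_unref (zip);
        g_object_unref (sink);

        gsf_shutdown ();
        return 0;
    }

Running it as "./a.out | cat > out.zip" forces the non-seekable path,
since a pipe cannot seek.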
We will now not allow a zip member to be written beyond 4G if
zip64==FALSE. It's possible that the compressed size will blow the
limit, but it is not going to be easy to fix. Just don't do that.

We now also add the zip64 trailer and locator if a member uses zip64.
I am not entirely sure that is required, but it should not hurt.

I really hope the gint8 vs guint8 problem is gone now.

That leaves: "In auto mode, when writing a small file, the central
directory header always puts the usize/csize in the ZIP64 extra
field."
> We will now not allow a zip member to be written beyond 4G if
> zip64==FALSE. It's possible that the compressed size will blow the
> limit, but it is not going to be easy to fix. Just don't do that.

Hmm. Wouldn't it be possible to cause gsf_output_close to return FALSE
by returning FALSE from zip_header_write/zip_close_root? It'd be better
than nothing, and the archive is going to be corrupt either way.
(Otherwise, the application will have to do its own checks: is it
compressing a file that is almost 4 GB, whose csize might expand over
the 4 GB boundary? Is it compressing more than 65535 files? Returning
FALSE lets it catch these cases through the existing API.)
It might be possible somewhere, but anyone who explicitly sets zip64 to
"no" and proceeds to stuff really large files into it is asking for
trouble.

I can add a check to zip_output_block, which is the only place that
updates ->csize (except for "stored" files). That error will propagate
into zip_flush, then into zip_close_stream, and then into
gsf_outfile_zip_close.

gsf (the command-line tool) doesn't check the return value from
gsf_output_close. I suspect most callers fail to do that.
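For callers that do want to catch this, the check is cheap; a sketch of
hypothetical application code, relying on the propagation described
above (the helper name is made up):

    #include <gsf/gsf.h>

    /* Hypothetical caller-side check: a member that outgrew the
     * non-zip64 limits surfaces here as a failed close. */
    static gboolean
    close_zip_checked (GsfOutfile *zip)
    {
        if (!gsf_output_close (GSF_OUTPUT (zip))) {
            GError const *err = gsf_output_error (GSF_OUTPUT (zip));
            g_warning ("zip close failed: %s",
                       err ? err->message : "unknown error");
            return FALSE;  /* the archive must not be used */
        }
        return TRUE;
    }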
I tried creating a 4G file from /dev/urandom and zipping it. I get:

    compressed size:   4295663804 bytes
    uncompressed size: 4294967296 bytes

I.e., if I had taken 2 bytes less, this file would cause overflow in
the compressed size only.
csize check is in.

"In auto mode, when writing a small file, the central directory header
always puts the usize/csize in the ZIP64 extra field."

Fixed, I think.
They presumably want to ensure compatibility with old readers (including older libgsf) and don't necessarily know in advance what they'll be zipping. zip_close_root should also fail if e.g. there are too many files.
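Concretely, something of this shape would cover both limits; a sketch
only, with guessed identifiers (entries, dir_size, dir_offset and
output are stand-ins, not libgsf's actual fields):

    /* Sketch for zip_close_root: refuse to finish an archive that
     * cannot be represented without ZIP64.  gsf_output_set_error
     * records the error and returns FALSE. */
    if (!zip64 && (entries    > 0xFFFF ||
                   dir_size   > (guint64) G_MAXUINT32 ||
                   dir_offset > (guint64) G_MAXUINT32))
        return gsf_output_set_error (output, 0,
                "archive requires zip64, but zip64 was disabled");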
I carefully tested af8db4ad and found no problems other than comment 19.

4a4a8a8c introduces a regression: with a non-streaming sink in auto
mode, small files written beyond the 4 GB boundary won't have the
"relative offset of local header" ZIP64 extra field set properly in
the central directory. If you want to keep the use of the ZIP64 extra
field consistent between the local file header and the central
directory header, you'll need to reinstate the dirent->offset limit
check when setting real_zip64.
dirent's offset in central directory fixed.
I am unconcerned that we currently can create zip64 files even if
requested not to. It only happens in cases where a non-zip64 file
could not have been created.

I have added the beginning of a test suite. While not perfect, it at
least checks that what comes out is what we put in, and that unzip and
zipinfo are happy with the files we create. zip64 and non-seekable
sinks are tested.

Is there anything left here?
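(For reference, a round-trip check of that kind might look like the
following sketch; it assumes libgsf's public reader API and is not the
actual test-suite code.)

    #include <gsf/gsf.h>
    #include <string.h>

    /* Sketch: read member `name` back from the archive at `path` and
     * compare it against the bytes we originally wrote. */
    static gboolean
    member_matches (char const *path, char const *name,
                    guint8 const *expected, gsize len)
    {
        GsfInput  *src   = gsf_input_stdio_new (path, NULL);
        GsfInfile *zip   = src ? gsf_infile_zip_new (src, NULL) : NULL;
        GsfInput  *child = zip ? gsf_infile_child_by_name (zip, name)
                               : NULL;
        gboolean   ok    = FALSE;

        if (child && (gsize) gsf_input_size (child) == len) {
            guint8 const *data = gsf_input_read (child, len, NULL);
            ok = data && memcmp (data, expected, len) == 0;
        }

        if (child) g_object_unref (child);
        if (zip)   g_object_unref (zip);
        if (src)   g_object_unref (src);
        return ok;
    }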
Nope. Thanks for all your work.