Bug 732209 - ZIP64 extensions not supported
Status: RESOLVED FIXED
Product: libgsf
Classification: Core
Component: ZIP
Version: 1.14.x
Hardware: Other
OS: All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Jody Goldberg
Depends on:
Blocks:
Reported: 2014-06-25 09:01 UTC by Benjamin Gilbert
Modified: 2014-12-03 07:15 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Zip file with member > 4 GB (10.29 KB, application/x-gzip)
2014-06-26 03:42 UTC, Benjamin Gilbert
Zip file with > 64k members (1.07 MB, application/x-gzip)
2014-06-26 03:46 UTC, Benjamin Gilbert

Description Benjamin Gilbert 2014-06-25 09:01:27 UTC
Neither the ZIP reader nor the ZIP writer supports ZIP64 extensions, and neither of them properly fails in situations where this support would be needed:

- On write, libgsf will allow the "number of archive members" field and file offset/length fields to wrap around, producing a corrupt ZIP.

- On read, libgsf will treat the special -1 markers literally (0xffffffff for file offsets/lengths, 0xffff for the number of archive members), causing truncated output or reads from incorrect file offsets.
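
For reference, the ZIP64 convention in PKWARE's APPNOTE is that a 32-bit size or offset field stores the sentinel 0xFFFFFFFF, and a 16-bit member count stores 0xFFFF, when the real value has to be taken from the ZIP64 extended-information extra field (header ID 0x0001) or from the ZIP64 end-of-central-directory record.  A minimal sketch of the intended read-side handling (helper names are made up; this is not libgsf code):

    #include <stdint.h>
    #include <stdbool.h>

    /* A 32-bit field equal to 0xFFFFFFFF (or a 16-bit count equal to 0xFFFF)
     * is a sentinel, not a literal value. */
    static inline bool needs_zip64_32 (uint32_t v) { return v == 0xFFFFFFFFu; }
    static inline bool needs_zip64_16 (uint16_t v) { return v == 0xFFFFu; }

    /* Effective size/offset: take the stored 32-bit value literally unless it
     * is the sentinel, in which case use the 64-bit value parsed from the
     * ZIP64 extra field (ID 0x0001). */
    static uint64_t
    effective_u64 (uint32_t stored, uint64_t from_zip64_extra)
    {
        return needs_zip64_32 (stored) ? from_zip64_extra : stored;
    }

Taking the sentinels literally, as described above, is what leads to the truncated output and reads from bogus offsets.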
Comment 1 Morten Welinder 2014-06-25 14:38:12 UTC
I would guess that zip64 didn't exist when this code was written.
Do you have a pointer to documentation?

The library as a whole doesn't have this problem, I think.  We've been using
64-bit offsets from very early on.
Comment 2 Morten Welinder 2014-06-25 15:05:40 UTC
Documentation seems to be at

    http://www.pkware.com/documents/casestudies/APPNOTE.TXT

I don't see why we cannot have at least minimal support for zip64.
For starters, three issues:

1. Members >4G.
2. #Files >64k
3. Archive size >4G.

We should start at the read side where we can test with files made by
some other tool.  Can you provide small test cases for 1 and 2?
(I.e., something that compresses really well.)
Comment 3 Benjamin Gilbert 2014-06-26 03:42:48 UTC
Created attachment 279284 [details]
Zip file with member > 4 GB

I've gzipped the Zip file (!) to get around Bugzilla file size limits.
Comment 4 Benjamin Gilbert 2014-06-26 03:46:30 UTC
Created attachment 279285 [details]
Zip file with > 64k members

gzipped for Bugzilla file size limit.
Comment 5 Benjamin Gilbert 2014-06-26 03:49:10 UTC
Those were created with the zip command from Info-ZIP 3.0.
Comment 6 Morten Welinder 2014-11-21 01:08:39 UTC
(2) [many files] is now handled on the read side.  And things are faster.
(3) [large archive] is partially handled on the read side.
Comment 7 Morten Welinder 2014-11-21 04:03:12 UTC
(1) [large files] is now handled on the read side.

A pile of 64-bit issues were in the way.
Comment 8 Morten Welinder 2014-11-24 00:02:25 UTC
(2) [many files] is now handled on the write side too.
(3) [large archive] is handled.

The code for (1) [large files] is there, but not on by default.  We need
some magic to turn it on only when needed.
Comment 9 Morten Welinder 2014-11-28 00:39:52 UTC
This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
Comment 10 Benjamin Gilbert 2014-11-29 03:49:55 UTC
The ZIP64 output side has a few problems:

- The zip64 property is installed as PROP_DEFLATE_LEVEL, so the application cannot set it.

- There are some datatype issues with the zip64 flag: it's really an integer, but is declared gboolean in GsfOutfileZip and guint8 in zip_close_root.  The former is probably harmless, but the latter causes the "zip64 == -1" test in zip_close_root to always evaluate to false, so the ZIP64 end-of-central-directory record is always written in auto mode.  (The compiler doesn't warn about this because -Wtype-limits is not enabled; see the sketch after this list.)

- In auto mode, for both small and large files, libgsf writes corrupt local file headers: the sizes are 0xFFFFFFFF but the extra field is of type ZIP_DIRENT_EXTRA_FIELD_IGNORE.

- In !zip64 mode, libgsf will disable all ZIP64 extensions *except* the ZIP64 end of central directory record.  When the ZIP file requires ZIP64 extensions (e.g. a member is too large), libgsf will silently write corrupt metadata rather than failing.

- The code to write the data descriptor record is never executed because ZIP_DIRENT_FLAGS_HAS_DDESC is never set.  As a result, if the output file is not seekable, libgsf takes the zip_header_patch_sizes() path and dies on a critical error: "g_io_channel_seek_position: assertion 'channel->is_seekable' failed".  This did not occur in the previous release.

- With a non-seekable output, auto mode should force ZIP64 (or cause a fatal error).  The field sizes in the data descriptor block depend on the existence of the ZIP64 extra field, and the size of the member isn't yet known when writing the extra field.

- The call to gsf_output_write() in zip_ddesc_write() skips the ZIP_DDESC_SIGNATURE; is this intentional?

- The ZIP spec seems to say that if a data descriptor is written, the crc/usize/csize fields should be zero even in ZIP64 mode.
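
A minimal standalone illustration of the datatype point above (not libgsf code; it only shows the integer promotion that makes the unsigned comparison constant):

    #include <glib.h>
    #include <stdio.h>

    int
    main (void)
    {
        guint8 u = -1;   /* stored as 255 */
        gint8  s = -1;   /* stored as -1 */

        /* u promotes to the int 255, so this comparison is always false;
         * -Wtype-limits would flag it. */
        printf ("guint8 == -1: %s\n", u == -1 ? "true" : "false");
        /* s promotes to the int -1, so an "auto" marker of -1 works as intended. */
        printf ("gint8  == -1: %s\n", s == -1 ? "true" : "false");
        return 0;
    }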
Comment 11 Morten Welinder 2014-11-29 17:42:40 UTC
- The zip64 property is installed as PROP_DEFLATE_LEVEL, so the application
cannot set it.

Fixed.

- There are some datatype issues with the zip64 flag: it's really an integer,
but is declared gboolean in GsfOutfileZip and guint8 in zip_close_root.

gint8 was meant.  Fixed.

- In auto mode, for both small and large files, libgsf writes corrupt local
file headers: the sizes are 0xFFFFFFFF but the extra field is of type
ZIP_DIRENT_EXTRA_FIELD_IGNORE.

Fixed.

- In !zip64 mode, libgsf will disable all ZIP64 extensions *except* the ZIP64
end of central directory record.  When the ZIP file requires ZIP64 extensions
(e.g. a member is too large), libgsf will silently write corrupt metadata
rather than failing.

I don't understand.  Please elaborate.

- The code to write the data descriptor record is never executed because
ZIP_DIRENT_FLAGS_HAS_DDESC is never set.  As a result, if the output file is
not seekable, libgsf takes the zip_header_patch_sizes() path and dies on a
critical error: "g_io_channel_seek_position: assertion 'channel->is_seekable'
failed".  This did not occur in the previous release.

We were off spec and always used DDESC in the deflate case.  (And only then,
so the seek problem cannot be new.)  Do you have sample code that causes that
problem to exhibit itself?

- The call to gsf_output_write() in zip_ddesc_write() skips the
ZIP_DDESC_SIGNATURE; is this intentional?

Ah, no -- leftover debug junk.  I do note, however, that it is unclear
whether a signature should be written.  Fixed, but as you note not
currently reachable.


Note that even with all these problems, the generated archives passed
"unzip -t".  Clearly not an exhaustive test.
Comment 12 Morten Welinder 2014-11-29 20:20:39 UTC
We now use a DDESC if (and only if) we determine that we cannot seek.
This hasn't seen a lot of testing, though.
Comment 13 Benjamin Gilbert 2014-11-30 07:15:20 UTC
> - In !zip64 mode, libgsf will disable all ZIP64 extensions *except* 
> the ZIP64 end of central directory record. When the ZIP file requires
> ZIP64 extensions (e.g. a member is too large), libgsf will silently
> write corrupt metadata rather than failing.
> 
> I don't understand.  Please elaborate.

If the zip64 property is FALSE, and the application attempts to write data that will not fit in an old-style ZIP, libgsf should fail rather than writing a corrupt file.  (Otherwise, setting zip64 to FALSE wouldn't be safe.)  Currently, libgsf writes corrupt metadata just as it used to, but with the addition of a ZIP64 end of central directory record.

> We were off spec and always used DDESC in the deflate case.  (And
> only then, so the seek problem cannot be new.)  Do you have sample
> code that causes that problem to exhibit itself?

To test non-seekable sinks, I just changed gsf_create() in tools/gsf.c to do

	GIOChannel *chan = g_io_channel_unix_new(1);
	g_io_channel_set_encoding(chan, NULL, NULL);
	dest = gsf_output_iochannel_new(chan);

and then ran the tool as

	gsf createzip out.zip dir | cat > out.zip

> - The call to gsf_output_write() in zip_ddesc_write() skips the
> ZIP_DDESC_SIGNATURE; is this intentional?
> 
> Ah, no -- leftever debug junk.  I do note, however, that it is unclear
> whether a signature should be written.

See paragraph 4.3.9.4 of the spec.

> Note, that even with all these problems, the generated archives passed
> "unzip -t".  Clearly not an exhaustive test.

I'm not surprised.  I've been testing with a hex editor.

As of e420fc52, the Zip writer has the following problems:

- The zip64 local variable in zip_close_root is still guint8, as is the zip64 field in GsfZipDirent.  Both of these are causing conditionals to produce unintended results.

- In auto mode, when writing a small file, the central directory header always puts the usize/csize in the ZIP64 extra field.  To maintain consistency with the local file header, the central directory should include the extra field only if the sink is non-seekable.  (A layout sketch of the extra field follows this list.)

- In auto mode with a small file and a non-seekable sink, the ZIP64 end of central directory record and locator are not written, even though ZIP64 extra fields are used.

- As described above, invalid ZIP files can be written when zip64 == FALSE.
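
For reference on the local/central consistency point, a sketch of the ZIP64 extended-information extra field layout as given in APPNOTE section 4.5.3 (illustrative only, not libgsf documentation):

    /* ZIP64 extended-information extra field (header ID 0x0001):
     *
     *   2 bytes  header ID = 0x0001
     *   2 bytes  size of the data that follows
     *   8 bytes  original (uncompressed) size  \
     *   8 bytes  compressed size                | each present only when the
     *   8 bytes  offset of local header         | corresponding 32/16-bit field
     *   4 bytes  disk start number             /  holds its 0xFFFF.. sentinel
     *
     * Per the spec, the local-header copy, when present, must carry both size
     * fields, while the central-directory copy carries only the fields that
     * overflowed -- hence the need to keep the two headers consistent.
     */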
Comment 14 Morten Welinder 2014-11-30 19:29:56 UTC
We will now not allow a zip member to be written beyond 4G if zip64==FALSE.
It's possible that the compressed size will blow the limit, but it is not
going to be easy to fix.  Just don't do that.

We should also now add zip64 trailer and locator if a member uses zip64.
I am not entirely sure that is required, but it should not hurt.

I really hope the gint8 vs guint8 problem is gone now.

That leaves "In auto mode, when writing a small file, the central
directory header always puts the usize/csize in the ZIP64 extra field."
Comment 15 Benjamin Gilbert 2014-11-30 22:57:34 UTC
> We will now not allow a zip member to be written beyond 4G if
> zip64==FALSE. It's possible that the compressed size will blow the
> limit, but it is not going to be easy to fix.  Just don't do that.

Hmm.  Wouldn't it be possible to cause gsf_output_close to return FALSE by returning FALSE from zip_header_write/zip_close_root?  It'd be better than nothing, and the archive is going to be corrupt either way.

(Otherwise, the application will have to do its own checks: is it compressing a file that is almost 4 GB, whose csize might expand over the 4 GB boundary?  Is it compressing more than 65535 files?  Returning FALSE lets it catch these cases through the existing API.)
Comment 16 Morten Welinder 2014-12-01 00:08:58 UTC
It might be possible somewhere, but anyone who explicitly sets zip64
to "no" and proceeds to stuff really large files into it is asking for
trouble.

I can add a check to zip_output_block which is the only place that updates
->csize (except for "stored" files).

That error will propagate into zip_flush and then into zip_close_stream and
then into gsf_outfile_zip_close.

gsf (the command line tool) doesn't check the return value from
gsf_output_close.  I suspect most callers fail to do that.
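
A minimal sketch of a caller that does check the close status (illustrative only; it assumes gsf_init() has already been called and uses the stock libgsf entry points, not code from the gsf tool):

    #include <gsf/gsf-output.h>
    #include <gsf/gsf-output-stdio.h>
    #include <gsf/gsf-outfile.h>
    #include <gsf/gsf-outfile-zip.h>

    /* Returns FALSE if any step, including the closes, reports failure. */
    static gboolean
    write_member (const char *path, const guint8 *data, size_t len, GError **err)
    {
        GsfOutput *sink = gsf_output_stdio_new (path, err);
        if (!sink)
            return FALSE;

        GsfOutfile *zip = gsf_outfile_zip_new (sink, err);
        g_object_unref (sink);
        if (!zip)
            return FALSE;

        GsfOutput *member = gsf_outfile_new_child (zip, "payload.bin", FALSE);
        gboolean ok = gsf_output_write (member, len, data);

        /* gsf_output_close() returns a gboolean; ignoring it is what lets a
         * "too big without ZIP64" condition pass silently. */
        ok = gsf_output_close (member) && ok;
        g_object_unref (member);
        ok = gsf_output_close (GSF_OUTPUT (zip)) && ok;
        g_object_unref (zip);

        return ok;
    }

The specific failure, if any, can still be inspected with gsf_output_error() before the final unref.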
Comment 17 Morten Welinder 2014-12-01 00:11:29 UTC
I tried creating a 4G file from /dev/urandom and zipping it.  I get:

  compressed size:                                4295663804 bytes
  uncompressed size:                              4294967296 bytes

I.e., if I had taken 2 bytes less, this file would cause overflow in the
compressed size only.
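
For reference, the 32-bit limit is 0xFFFFFFFF = 4294967295 bytes, so:

  uncompressed: 4294967296 = limit + 1       -> overflows
  compressed:   4295663804 = limit + 696509  -> overflows

With an input 2 bytes smaller (4294967294 bytes), the uncompressed size fits
while the compressed size still overflows; presumably 2 bytes rather than 1
because 4294967295 is the 0xFFFFFFFF sentinel value itself.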
Comment 18 Morten Welinder 2014-12-01 00:23:45 UTC
csize check is in.

"In auto mode, when writing a small file, the central
directory header always puts the usize/csize in the ZIP64 extra field."
fixed, I think.
Comment 19 Benjamin Gilbert 2014-12-01 00:26:12 UTC
They presumably want to ensure compatibility with old readers (including older libgsf) and don't necessarily know in advance what they'll be zipping.

zip_close_root should also fail if e.g. there are too many files.
Comment 20 Benjamin Gilbert 2014-12-01 03:26:19 UTC
I carefully tested af8db4ad and found no problems other than comment 19.  4a4a8a8c introduces a regression: with a non-streaming sink in auto mode, small files written beyond the 4 GB boundary won't have the "relative offset of local header" ZIP64 extra field set properly in the central directory.  If you want to keep the use of the ZIP64 extra field consistent between the local file header and the central directory header, you'll need to reinstate the dirent->offset limit check when setting real_zip64.
Comment 21 Morten Welinder 2014-12-01 18:17:57 UTC
dirent's offset in central directory fixed.
Comment 22 Morten Welinder 2014-12-03 01:43:39 UTC
I am unconcerned that we currently can create zip64 files even if
requested not to.  It only happens in cases where a non-zip64 file
could not be created.

I have added the beginning of a test suite.  While not perfect, it
at least checks that what comes out is what we put in and that unzip
and zipinfo are happy with the files we create.  zip64 and non-seekable
sinks are tested.

Is there anything left here?
Comment 23 Benjamin Gilbert 2014-12-03 07:15:46 UTC
Nope.  Thanks for all your work.