Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

you should probably mention this library is not actually zip standards compliant #7

Open
greggman opened this issue Jan 12, 2021 · 10 comments

Comments

@greggman
Copy link

The reason other libraries read the TOC is because it's the source of truth about the contents of the file. Just like a hard drive has tables of contents but scanning every sector you might find lost files, so is zip. The entire reason the TOC is at the end of the file is so you can easily update small files by just appending data to the end of the zip file and writing a new TOC, no need to read the entire file and write a one. That is by design since zip was designed in the era of floppy disks when re-writing files, especially across multiple disks, could take 20-60 minutes.

It would be polite to tell people this library will fail on some percentage of zip files because it's not correctly reading the TOC.

@madler
Copy link
Owner

madler commented Jan 12, 2021

sunzip does read and use the central directory. It does not finalize the extraction of the files until it has found them in the central directory. sunzip is in fact compliant with the zip standard. It simply changes the order of operations so as to be able to read the zip file as a stream.

If you have a bug to report, it would be polite to provide an example file manifesting the bug.

@madler madler closed this as completed Jan 12, 2021
@greggman
Copy link
Author

Here you go. Unzips with all the normal TOC reading zip utils but not sunzip

issue.zip

@madler
Copy link
Owner

madler commented Jan 12, 2021

Thank you!

@madler madler reopened this Jan 12, 2021
@madler
Copy link
Owner

madler commented Jan 12, 2021

That is not a compliant zip file. There are two end-of-central-directory records. Per the appnote: 'A ZIP file MUST have only one "end of central directory record".'

@greggman
Copy link
Author

It is compliant. It was made with Info-Zip, It is read correctly even by pkzip2.04g in a dosbox from 1993. MacOS's built in zip support in finder reads it correctly. Windows built in zip support in explorer reads it correct. Winzip reads it correct. WinRaR reads it correctly.

But in any case here's a another without the middle central directory, works all the readers above, fails in sunzip.

issue-f.zip

@madler
Copy link
Owner

madler commented Jan 13, 2021

Thank you for the example. However you overwrote the central directory header instead of the end-of-central-directory record. You don't need to provide another one. I'll overwrite the correct thing.

How do you make these with Info-ZIP's zip?

Note that software being able to read it has nothing to do with compliance with the document.

By the way, that's not why the central directory is at the end. You don't need to consider adding to an existing zip file. Just consider making a new zip file. You don't want to put the central directory at the start because you don't know how big to make it until you have written all the entries. To put it at the beginning, you would need to know how many entries there will be and what all their file name lengths will be, needing two scans of the file system you are compressing from. (Not much memory back then.)

Also interestingly, there is nothing in the appnote that says that the central directory needs to be at the end, and in fact there are zip files with the central directory elsewhere. Only the end-of-central-directory record is expected to be at the end by the software you listed, though oddly the appnote doesn't require that either.

@greggman
Copy link
Author

greggman commented Jan 13, 2021

From the app note

4.1.9 ZIP files MAY be streamed, split into segments (on fixed or on
removable media) or "self-extracting". Self-extracting ZIP
files MUST include extraction code for a target platform within
the ZIP file.

From Info-Zip

How do I make a DOS (or other non-native) self-extracting archive under Unix?

The procedure is basically described in the UnZipSFX man page. First grab the appropriate UnZip binary distribution for your target platform (DOS, Windows, OS/2, etc.), as described above; we'll assume DOS in the following example. Then extract the UnZipSFX stub from the distribution and prepend as if it were a native Unix stub:

unzip unz552x3.exe unzipsfx.exe                 // extract the DOS SFX stub
cat unzipsfx.exe yourzip.zip > yourDOSzip.exe  // create the SFX archive
zip -A yourDOSzip.exe                          // fix up internal offsets

That's it. You can still test, update and delete entries from the archive; it's a fully functional zipfile.

What's in that self-extractor is entirely arbitrary.

@greggman
Copy link
Author

greggman commented Jan 15, 2021

If you're curious I contacted pkware, unfortunately their answers were wishy-washy.

They did seem to confirm that the end-of-central-directory-record is supposed to be at the end of the file

  1. Is a zip file with the end of central directory record not at the end of the file a valid zip file?

(No, but it will depend on how you interpret what is the end of a file in relation to where the 0x06054b50 record marker is found, see next comment.)

This is unclear. I tried padding a zip file with extra 0s up to 200k and pkzip2.04g still handled it fine. Other readers it depends. MacOS failed, Windows failed, WinRar succeeded, Info-Zip failed.

(The 0x06054b50 record allows for variable length data within the last field defined as a “comment” as recorded using the fields “.ZIP file comment length”, 2 bytes and “.ZIP file comment”, variable size. Any data up to the comment length would be considered part of the 0x06054b50 record and part of the ZIP file. The amount of data (this can be a lot) that may be placed as a “comment” could make it look like the 0x06054b50 is not at the end of the file. Anything beyond the length of the comment would not be considered correct.)

So if you saw a zip file with the end-of-central-directory-record not at the end of the file then it was not a valid zip file

They confirmed that a self extracting zip is a valid zip file

APPNOTE 4.1.9 says zip files may be self extracting

(CORRECT)

They didn't confirm whether or not having anything in the self extracting part that appears to be any of the zip records like the end-of-central-directory-record is a valid or not.

I asked for clarification of that point but have not heard back. But, it should be obvious that just writing code like

     if (value == 0x06054b50)

in the self extractor will possibly end up with that value in the self extractor and a forward scanner is likely to misinterpret it.

As a test I just built Info-Zip from source on linux, created a zip, prepended the info-zip self extractor, ran zip -A on it to fix the central-directory offsets. I checked that it self extracts correctly. It does. Checked other backward parsing tools unzipped it, they do, passed it to sunzip and sunzip failed.

self-extracting.zip

@madler
Copy link
Owner

madler commented Jan 16, 2021

Thanks for the investigation.

I will be updating sunzip to address many zip file variants and ill-formed zip files, as noted by your report.

In the meantime, consider this beauty that I just constructed:

nest.zip

There are no constraints on the contents of the zip file comment in the end-of-central-directory record. In the linked file, I put a zip file in the comment. If that zip file is searched from the end, like unzip does, it finds the zip file in the comment, with the entry named "1". There are 100 extra bytes at the start that could be considered to be a self-extractor, making it a valid zip file. If that zip file is read from the beginning, as sunzip does, it finds that the entire file is a valid zip file, with the single entry named "2". Its end record has a 100-byte comment at the end, ending that record exactly at the end of the file. Looked at either way, the zip file is entirely valid.

So does that zip file contain the entry "1" or the entry "2"?

@greggman
Copy link
Author

greggman commented Jan 16, 2021

I agree the APPNOTE is poorly written.

This is just speculation. I assume a comment is a string and back when PKZIP was invented it's likely the Phil Katz only thought about readable ASCII in that area. But I agree with you it's problematic in that the spec doesn't' cover the case that the data inside the comment happens to look like a zip file. 0x50 0x4b 0x05 0x06 are valid ASCII

¯\(ツ)

Another issue is can there be gaps between records. Clearly since a self extracting chunk fits in front there is certainly precedent. For example, following PKWare's claim that the diagram of chunks in the APPNOTE show the end-of-central-directory-record is at the end of the file, the same claim can be made that that diagram shows no self extracting portion in front.

There's this from my correspondence which doesn't help either

The format was originally intended to be written front-to-back so the central directory and end of central directory record could be written out last after all files included in the ZIP are known and written. If adding files, changes can applied without rewriting the entire file. This was how the original PKZIP application was designed to write .ZIP files. (supporting what I said at the top of this issue). When reading, it will read the ZIP file end of central directory first to locate the central directory and then seek to any files it needs to access.

That confirms to me that you can just update the TOC. But it doesn't answer if there can be gaps between records or unused records. It does answer that pkunzip would not care if there are gaps between records or unused records.

Some more poorly thought out spec

4.3.1 A ZIP file MUST contain an "end of central directory record". A ZIP
file containing only an "end of central directory record" is considered an
empty ZIP file. Files MAY be added or replaced within a ZIP file, or deleted.
A ZIP file MUST have only one "end of central directory record". Other
records defined in this specification MAY be used as needed to support
storage requirements for individual ZIP files.

Saying "Files MAY be added or replaced within a ZIP file, or deleted" has no place in the spec unless the replaced or deleted data is still in the file. Otherwise there is no such thing. There are just new zip files. There is just file FOO with ABC in it, and file FOO2 with ABCD (D added) and file FOO3 with BCD (A deleted). Where as that sentence does make sense IF you have to read the TOC and there can be records in the file the TOC doesn't reference (files deleted from the TOC, files not referenced by the TOC because a new version of the same file was appended and only that version is referenced by the TOC)

Sigh......... 😭

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants