
pixz | xz -d corrupts data #65

Open
JoernEngel opened this issue Apr 26, 2016 · 16 comments

@JoernEngel

I checked whether the pixz and xz formats were compatible. Looks like they are. But on a large test file, I don't get the same data back as before. In particular, the output file is larger than the input file.

All bytes up to the end of the input file are identical; the output file simply has additional data appended. That additional data seems to be derived from the input file, not all zeroes or the like.

$ ls -l foo
-rw-r--r-- 1 joern joern 4380151296 Apr 26 11:30 foo
$ cat foo | pixz -0 | xz -d > bar
$ ls -l bar
-rw-r--r-- 1 joern joern 4384062468 Apr 26 12:08 bar

pixz 1.0.2
xz (XZ Utils) 5.1.0alpha

Both from Debian, running on x86_64.
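
A quick way to confirm the "identical prefix plus trailing data" observation is to compare checksums rather than just sizes (a sketch reusing the file names above; assumes GNU coreutils):

$ cmp foo bar                               # should report EOF on foo rather than a differing byte
$ head -c "$(stat -c %s foo)" bar | md5sum  # checksum of bar truncated to foo's size
$ md5sum foo                                # should print the same checksum as the line above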

@JoernEngel
Author

While dealing with data corruption, can I suggest running American Fuzzy Lop against every combination of pixz and xz? People have found loads of bugs in compressors with relatively little effort.
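
A rough sketch of such a fuzzing run (the corpus directory and seed file are hypothetical, the binary should ideally be built with AFL's compiler wrappers, and afl-fuzz feeds the mutated input on stdin when no file placeholder is given):

$ mkdir corpus && cp small-valid.tar.xz corpus/
$ afl-fuzz -i corpus -o findings -- ./pixz -d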

Can I also suggest a major version change when the data corruption bug gets fixed? At this point we have decided not to use pixz at all. Not ever. The danger of someone installing a broken version somewhere and having to spend weeks to track down the problem trumps any speed gains.

@wookietreiber
Collaborator

@JoernEngel Please try to reproduce with version 1.0.6.

@wookietreiber
Collaborator

Also, @JoernEngel, is it possible for you to provide the file you are testing with as a download so we can work with it? Please also provide the md5 of the original, uncompressed version, and compress it with another compression tool such as gzip to exclude any xz-related issues.
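
For example (a sketch; the file name is hypothetical):

$ md5sum testfile > testfile.md5    # checksum of the original, uncompressed file
$ gzip -c testfile > testfile.gz    # gzip copy for the download, original left untouched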

@JoernEngel
Author

On Tue, Apr 26, 2016 at 12:41:20PM -0700, Christian Krause wrote:

@JoernEngel Please try to reproduce with version 1.0.6.

I tried and failed.

./configure
make
make install

There is no ./configure checked in. Autoconf is throwing errors:
configure.ac:11: error: possibly undefined macro: AM_INIT_AUTOMAKE
If this token and others are legitimate, please use m4_pattern_allow.
See the Autoconf documentation.

The file I picked contains confidential data. But seriously, I literally picked the first large file I had lying around and did the most obvious smoke test I could think of. If it was this easy for me to break things, you should be able to reproduce this with a random DVD image or similar.

Jörn

A surrounded army must be given a way out.
-- Sun Tzu

@wookietreiber
Collaborator

configure is in here. You probably downloaded one of the GitHub-generated tarballs. These do not contain the make dist generated files.
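
When building from a git checkout, something along these lines should regenerate the build system (a sketch; it assumes autoconf, automake and pkg-config are installed):

$ autoreconf -fiv    # regenerates ./configure and the Makefile.in files
$ ./configure
$ make
$ sudo make install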

@JoernEngel
Author

On Tue, Apr 26, 2016 at 01:02:40PM -0700, Christian Krause wrote:

configure is in here. You probably downloaded one of the GitHub-generated tarballs. These do not contain the make dist generated files.

I used the git tree. Anyway, 1.0.6 reproduced the problem; the file sizes are identical to the 1.0.2 sizes.

And I found yet another fun bug.
$ zcat foo.gz | ./pixz -0 | ./pixz -d | wc -c
can not seek in input: Illegal seek

Jörn

It's not that I'm so smart, it's just that I stay with problems longer.
-- Albert Einstein

@JoernEngel
Author

pixz -0 | pixz -d seems to work correctly, modulo the "Illegal Seek"
message. I suppose that explains how this bug slipped through.

The trouble is that files generated by pixz get detected as regular xz files, and using regular xz to decompress them yields an incorrect result. I'm not sure whether pixz is creating an incorrect archive or xz is violating the spec. I'm not even sure there is a spec.

But xz came first and pixz, for better or worse, has to be compatible.
Bug-for-bug compatible, if necessary. Sorry, but you have some work to
do.

Jörn

Measure. Don't tune for speed until you've measured, and even then
don't unless one part of the code overwhelms the rest.
-- Rob Pike

@wookietreiber
Collaborator

wookietreiber commented Apr 26, 2016

I just tried with a large file and couldn't reproduce:

$ dd if=file bs=1M | md5sum
4478+1 records in
4478+1 records out
4696129652 bytes (4.7 GB, 4.4 GiB) copied, 10.9025 s, 431 MB/s
44c5d7d956b71ba1724875c7b0648bf3  -

$ dd if=file bs=1M | pixz -0 | xz -d | md5sum
4478+1 records in
4478+1 records out
4696129652 bytes (4.7 GB, 4.4 GiB) copied, 619.726 s, 7.6 MB/s
44c5d7d956b71ba1724875c7b0648bf3  -

pixz 1.0.6
xz (XZ Utils) 5.2.2
liblzma 5.2.2

@wookietreiber
Collaborator

@vasi any insights?

@vasi
Owner

vasi commented Apr 26, 2016

@JoernEngel : Could it be that the thing you compressed is a tar file? When it sees a tarball, pixz will by default add an index, so that it's possible to seek to a specific file.

This is done in a safe way; nothing is "corrupted":

  • pixz checks that the file looks like a tarball before doing this; non-tar files won't get an index.
  • pixz uses libarchive to check whether something is a tarball, which is the same library that /usr/bin/tar uses on OS X and FreeBSD, so it's pretty robust.
  • pixz will remove this index when you decompress.
  • pixz adds the index after the tar end-of-file marker, so even if you decompress with xz, tar will just ignore the extra data.
  • You can turn off the indexing functionality with the -t flag.

If you've discovered a type of file that is not a tarball, but can fool libarchive's checks, please let us know.
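
To summarize the expected behavior (a sketch using the flags discussed above and a hypothetical foo.tar):

$ cat foo.tar | pixz -0 | pixz -d | md5sum    # index added and stripped again, matches the input
$ cat foo.tar | pixz -t -0 | xz -d | md5sum   # no index written, so plain xz reproduces the input
$ cat foo.tar | pixz -0 | xz -d | md5sum      # xz keeps the appended index, so the output is larger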

Also, please try to be kind. pixz is a tool that other people wrote for free, and just gave to you (and everyone else), and are now supporting for free as well. Nobody is forcing you to use it. If you want to be helpful, that's great! But if you want to act aggressively, you're welcome to find someone offering paid support instead of using our issue tracker.

@JoernEngel
Author

On Tue, Apr 26, 2016 at 02:00:45PM -0700, Dave Vasilevsky wrote:

@JoernEngel : Could it be that the thing you compressed is a tar file? When it sees a tarball, pixz will by default add an index, so that it's possible to seek to a specific file.


Indeed, the file is a tarball, and the extra data looks like filenames with some binary data (offsets?) in between.

I would still call this corruption. If the archive looks like regular
xz, gets processed by regular xz and the result is different from the
original input file, that is corruption.

Tarball mode clearly has its appeal. I can see why you did it. But I think it shouldn't be the default. If one variant results in worse performance and the other variant sometimes results in a catastrophe, the default should be slow and safe.

Historically there have been consumer flash devices that worked in similar ways. They assumed a FAT filesystem, would parse the FAT and erase blocks belonging to deleted files. That works great if you actually use FAT, and, given the many implementations with subtle incompatibilities, only if you use the correct variant.

The problem with autodetection is that things can go wrong and
eventually will go wrong.

Anyway, another interesting question is whether the pixz archive can be
"poisoned" in such a way that xz ignores the index, but correctly
decompresses everything else. But then you would have to change an
established format and depend on future versions of xz to handle
"poisoned" archives the same way. Might be too far into crazy land.

Jörn

Functionality is an asset, but code is a liability.
--Ted Dziuba

@JoernEngel
Author

On Tue, Apr 26, 2016 at 02:00:45PM -0700, Dave Vasilevsky wrote:

Also, please try to be kind. pixz is a tool that other people wrote for free, and just gave to you (and everyone else), and are now supporting for free as well. Nobody is forcing you to use it. If you want to be helpful, that's great! But if you want to act aggressively, you're welcome to find someone offering paid support instead of using our issue tracker.

Would you consider paid support?

I believe pixz is exactly what we need and would save us money. Sending
some of the saved money your way would only be fair.

Jörn

The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

@vasi
Owner

vasi commented Apr 26, 2016

Yeah, unfortunately the xz file format doesn't allow for arbitrary extra data to be carried along, like gzip does. Or at least I haven't found a way to do that.

While nowadays most people probably use pixz for its multi-core support, it was actually originally developed for the indexing feature! Once that was implemented, it turned out that parallelizing things was trivial, so that happened. But indexing is a pretty core part of pixz's mission. Sorry it doesn't suit you. You might want to create an alias pixz='pixz -t' or something, so you don't have to deal with it.
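
For example, in ~/.bashrc or your shell's equivalent (a sketch):

alias pixz='pixz -t'    # always skip the tarball index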

I do think it would be nice if pixz eventually supported a configuration file, so folks could say whether or not they wanted this feature. (Maybe also for features like number of CPUs! I'm sure some people don't appreciate us maxing out all their cores.)

@wookietreiber
Collaborator

wookietreiber commented Apr 26, 2016

@JoernEngel Would you please verify that the pixz -t -0 | xz -d round trip works as expected for your input file?

@JoernEngel
Author

On Tue, Apr 26, 2016 at 02:55:46PM -0700, Dave Vasilevsky wrote:

I do think it would be nice if pixz eventually supported a configuration file, so folks could say whether or not they wanted this feature. (Maybe also for features like number of CPUs! I'm sure some people don't appreciate us maxing out all their cores.)

Interesting. That might also explain why "pixz | xz -d" is slower than
"pixz". If xz keeps roughly one CPU busy, you have two effects costing
you performance. The first is the OS scheduler moving threads around.
The second is that one thread is always behind: if you have points in
your code that depend on all threads catching up and making roughly
equal progress, the faster threads go idle during those times.

Looks like "pixz | xz -d" keeps about 3.5 CPUs busy on my notebook,
while "pixz" keeps about 7.5 CPUs busy.
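
A rough way to reproduce that measurement (a sketch; user+sys CPU time divided by wall-clock time approximates the number of busy CPUs):

$ time sh -c 'cat foo | pixz -0 > /dev/null'
$ time sh -c 'cat foo | pixz -0 | xz -d > /dev/null'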

Anyway, thank you for this interesting piece of software. Extra thanks
for doing it in only 2250 lines.

Jörn

When in doubt, use brute force.
-- Ken Thompson

@JoernEngel
Author

On Tue, Apr 26, 2016 at 03:02:30PM -0700, Christian Krause wrote:

@JoernEngel Would you please verify that the pixz -t -0 | xz -d round trip works as expected?

Verified. "pixz -t -0 | xz -d" works correctly.

Jörn

The trouble with the world is that the stupid are cocksure and
the intelligent are full of doubt.
-- Bertrand Russell
