is pixz tar-append possible? #83

Open
eichin opened this issue Nov 16, 2019 · 9 comments

@eichin

eichin commented Nov 16, 2019

Some naive experiments didn't work, and I was wondering if this makes sense structurally - I'd like to add a small file to a (large) pixz-compressed tarfile, and was wondering if updating the index was possibly cheaper than rebuilding the whole archive.

@vasi
Owner

vasi commented Dec 27, 2019

Something like that ought to be possible, but a bit complicated. You'd need to truncate, append the data, and then rewrite the index.

@antmak

antmak commented Dec 30, 2019

Hi! I'm curious whether it's possible to place the index at the end of the file. This isn't a request, just a general discussion about formats.

I wrote a storage system for keeping tons of telemetry and media from industrial equipment, and an index placed at the end worked very well there. I understand that it was a special case, but it seems that most pixz use cases would fit that well.

@vasi
Owner

vasi commented Dec 30, 2019

Oh, the index does go near the end! A pixz file with 3 data blocks looks like this:

  1. XZ stream header
  2. Data block 1
  3. Data block 2
  4. Data block 3
  5. pixz file index (in XZ data block 4)
  6. XZ index
  7. XZ stream footer

See the XZ file format for more details on the XZ wrapper.

The problem is that to append to a pixz file, you need to move the pixz file index. So it’s not trivial.

@antmak

antmak commented Dec 30, 2019

Ah sorry, you're right. In that case, I'd like to understand in detail why adding a file to the archive is so costly. @eichin, how big is your index?

In my experience, yes, the index can grow to an unacceptable size (compared to the chunk of data being added). I used a 2-level index: several 1st-level indexes spread across the file, and a 2nd-level index (an "index of indexes") at the end of the file. It works OK if data is mostly added sequentially, but that may not be acceptable for a general-purpose archiver.

@abitrolly

What is the format of the index?

I am also curious how easy it is to remove a file from .tpxz? Would it be faster than a full recompression?

@vasi
Owner

vasi commented Nov 24, 2020

As mentioned above, there are actually two indexes in a pixz file: the XZ index and the pixz tar file index. You can read about the XZ index in the link above. The pixz tar file index isn't documented, but if you read the code, you can see it's basically a bunch of filename/offset pairs.

It sounds very difficult to remove a file from a .tpxz archive. You'd have to recompress the partial blocks on each end, and rewrite both the pixz tar file index and the XZ index. If you find yourself frequently wanting to remove files from compressed archives, there are probably better archive formats out there!

@abitrolly

According to the spec, the XZ index is just a list of (Unpadded Size, Uncompressed Size) pairs, used for getting and checking the sizes of the blocks. There is no place for user metadata there, so how does pixz add its own index while keeping the file compatible with the original .tar.xz?

I cannot read C code as easily as a text spec. If I knew the binary structure, I could try to experiment with that algorithm. The problem is more general than it seems: facebook/zstd#2396
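For experimenting with the XZ side of that structure, here is a minimal Python sketch that reads the XZ index of a single-stream .xz file by working backwards from the stream footer. It follows the footer and index layout of the XZ file format (12-byte footer with a Backward Size field, then an index of Unpadded Size / Uncompressed Size records). The function names `read_varint` and `read_xz_index` are made up for this example, and the sketch assumes one stream with no stream padding; it does not verify the index CRC32 and does not decode pixz's own file index.

```python
import lzma
import struct
import zlib

def read_varint(buf, pos):
    """Decode one XZ multibyte integer (7 payload bits per byte, low bits first)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7

def read_xz_index(data):
    """Return [(unpadded_size, uncompressed_size), ...] for a single-stream .xz file.

    Assumption: exactly one stream and no stream padding after the footer.
    """
    if data[:6] != b"\xfd7zXZ\x00":
        raise ValueError("missing XZ header magic")
    footer = data[-12:]
    if footer[10:12] != b"YZ":
        raise ValueError("missing XZ footer magic")
    crc, backward_size = struct.unpack("<II", footer[:8])
    # The footer CRC32 covers the Backward Size and Stream Flags fields.
    if zlib.crc32(footer[4:10]) != crc:
        raise ValueError("bad footer CRC32")
    index_size = (backward_size + 1) * 4       # stored value is size/4 - 1
    index = data[-12 - index_size:-12]
    if index[0] != 0x00:
        raise ValueError("missing index indicator byte")
    count, pos = read_varint(index, 1)
    records = []
    for _ in range(count):
        unpadded, pos = read_varint(index, pos)
        uncompressed, pos = read_varint(index, pos)
        records.append((unpadded, uncompressed))
    return records

payload = b"hello pixz\n" * 1000
records = read_xz_index(lzma.compress(payload))
for unpadded, uncompressed in records:
    print(f"block: unpadded={unpadded} uncompressed={uncompressed}")
```

Run against a pixz archive, the last record would presumably describe the block that holds the pixz file index, given the layout described earlier in this thread.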

@vasi
Owner

vasi commented Nov 24, 2020

pixz's tar-mode uses (abuses?) the fact that tar-files always end with a couple of blocks full of zeros, and tar ignores anything after it. So pixz's index can just go after the end-of-file blocks, and it preserves compatibility with tar, even if you're using plain old xz to decompress the archive.
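The "tar ignores anything after the end-of-archive blocks" behaviour can be checked with a short Python sketch. The `trailer` bytes below are a hypothetical stand-in for an index, not pixz's real index encoding, and `make_tar` is a helper invented for this example.

```python
import io
import tarfile

def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tf.addfile(info, io.BytesIO(content))
    return buf.getvalue()

plain = make_tar({"a.txt": b"alpha\n", "b.txt": b"beta\n"})

# Hypothetical trailing data placed after the end-of-archive zero blocks.
trailer = b"a.txt\x000\x00b.txt\x001024\x00"
with_trailer = plain + trailer

# The reader stops at the first all-zero block and never looks at the trailer.
with tarfile.open(fileobj=io.BytesIO(with_trailer), mode="r:") as tf:
    names = tf.getnames()
    content = tf.extractfile("b.txt").read()

print(names, content)
```

This only demonstrates the tar side of the trick with Python's `tarfile` reader; how pixz lays out and compresses its index is not modelled here.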

@Rogdham

Rogdham commented Apr 26, 2021

I'm not really interested in that use case, but I can share my thoughts anyways, it may help someone who is.


When you write to a plain tar file in append mode, here is what tar does:

  1. Finds the place where the next tar header would have been (this is within the tar archive, somewhere in the null-byte blocks at the end of the archive);
  2. Writes from that point on.
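The two steps above can be sketched in Python for an uncompressed tar. This is an illustration with invented helper names (`make_tar`, `append_offset`, `append_member`), not how GNU tar is implemented, and it assumes an archive of ordinary members whose data is padded to 512-byte blocks.

```python
import io
import tarfile

BLOCK = 512  # tar block size

def make_tar(files):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tf.addfile(info, io.BytesIO(content))
    return buf.getvalue()

def append_offset(tar_bytes):
    """Offset where the next tar header would start (start of the zero blocks)."""
    end = 0
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tf:
        for member in tf:
            padded = -(-member.size // BLOCK) * BLOCK   # round up to a full block
            end = max(end, member.offset_data + padded)
    return end

def append_member(tar_bytes, name, content):
    """Step 1: find the append point. Step 2: write from that point on."""
    cut = append_offset(tar_bytes)
    buf = io.BytesIO()
    buf.write(tar_bytes[:cut])            # keep everything before the zero blocks
    with tarfile.open(fileobj=buf, mode="w") as tf:   # writing continues at `cut`
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tf.addfile(info, io.BytesIO(content))
    return buf.getvalue()                 # closing wrote fresh end-of-archive blocks

plain = make_tar({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
cut = append_offset(plain)
grown = append_member(plain, "c.txt", b"gamma\n")
with tarfile.open(fileobj=io.BytesIO(grown), mode="r:") as tf:
    grown_names = tf.getnames()
print(cut, grown_names)
```

The bytes before the append point are reused unchanged, which is exactly what becomes hard once those bytes sit inside a compressed block.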

Let me illustrate how much work is needed to do tar-append in tpxz format, with the example:

A pixz file with 3 data blocks looks like this:

  1. XZ stream header
  2. Data block 1
  3. Data block 2
  4. Data block 3
  5. pixz file index (in XZ data block 4)
  6. XZ index
  7. XZ stream footer

Here is what you would need to do:

  1. Save XZ index in memory
  2. Save pixz index in memory
  3. Locate the place where the next tar record would be (it is probably inside data block 3, but could be in data block 2 in some edge cases); this means decompressing the blocks
  4. Rewrite that block (and add new blocks as needed), adding the tar data to append and updating the pixz file index as well as the XZ index in memory
  5. Write pixz file index (after compression)
  6. Write XZ index
  7. Write XZ stream footer (updated)
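As a rough illustration of steps 3 to 5, here is a toy Python model. It is not the pixz format: each "block" is an independent `lzma.compress` output rather than an XZ block inside one stream, the file index is a plain list of (name, tar offset) pairs, and the XZ index and stream footer are not modelled at all. The names `CHUNK`, `build`, `append`, and so on are invented for this sketch, and the archive is assumed to be non-empty.

```python
import io
import lzma
import tarfile

BLOCK = 512    # tar block size
CHUNK = 4096   # uncompressed bytes per toy block (tiny, for the demo)

def tar_member(name, content):
    """Header plus data, padded to a whole number of tar blocks."""
    info = tarfile.TarInfo(name)
    info.size = len(content)
    return info.tobuf() + content + bytes(-len(content) % BLOCK)

def pack(raw):
    """Compress raw bytes as independent CHUNK-sized blocks."""
    return [lzma.compress(raw[i:i + CHUNK]) for i in range(0, len(raw), CHUNK)]

def unpack(archive):
    return b"".join(lzma.decompress(b) for b in archive["blocks"])

def build(files):
    tar, index = b"", []
    for name, content in files:
        index.append((name, len(tar)))          # offset of the member's header
        tar += tar_member(name, content)
    return {"blocks": pack(tar + bytes(2 * BLOCK)), "index": index}

def append(archive, name, content):
    blocks, index = archive["blocks"], list(archive["index"])
    last_header = index[-1][1]
    first = last_header // CHUNK                # first block that must be reopened
    start = first * CHUNK
    # Step 3: decompress only the tail blocks and locate the append point.
    tail = b"".join(lzma.decompress(b) for b in blocks[first:])
    rel = last_header - start
    header = tarfile.TarInfo.frombuf(tail[rel:rel + BLOCK],
                                     tarfile.ENCODING, "surrogateescape")
    cut = last_header + BLOCK + header.size + (-header.size % BLOCK)
    # Step 4: rewrite the tail with the new member and fresh end-of-archive blocks.
    new_tail = tail[:cut - start] + tar_member(name, content) + bytes(2 * BLOCK)
    index.append((name, cut))
    # Step 5: the updated file index is emitted again after the data blocks.
    return {"blocks": blocks[:first] + pack(new_tail), "index": index}

old = build([("a.txt", b"a" * 5000), ("b.txt", b"b" * 100)])
new = append(old, "c.txt", b"gamma\n")
with tarfile.open(fileobj=io.BytesIO(unpack(new)), mode="r:") as tf:
    new_names = tf.getnames()
print(new_names, new["index"])
```

In this model the first block is reused byte for byte and only the block containing the last header is recompressed, which matches the observation that the cost depends on how much of the tail has to be reopened rather than on the size of the whole archive.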

In other words, you cannot just “insert” an XZ block; you would need to modify existing ones and overwrite the bytes that come after.

You will also run into the issue that tar itself does not seem willing to work with compressed files in append mode (which does make sense), so you would need to add the CLI arguments to pixz.

$ tar -Ipixz --append --file=foo.tpxz bar.txt
tar: Cannot update compressed archives
Try 'tar --help' or 'tar --usage' for more information.

So it’s not trivial.

I agree, that would be quite a bit of work.
