is pixz tar-append possible? #83

eichin · 2019-11-16T00:38:40Z

Some naive experiments didn't work, and I was wondering if this makes sense structurally - I'd like to add a small file to a (large) pixz-compressed tarfile, and was wondering if updating the index was possibly cheaper than rebuilding the whole archive.

vasi · 2019-12-27T07:10:37Z

Something like that ought to be possible, but a bit complicated. You'd need to truncate, append the data, and then rewrite the index.

antmak · 2019-12-30T03:59:13Z

Hi! I'm curious whether it's possible to place the index at the end of the file. This is not a request, just a general discussion about formats.

I wrote a storage system for keeping tons of telemetry and media from industrial equipment, and an index placed at the end worked very well there. I understand that it was a special case, but it seems that most of pixz's use cases would fit that well.

vasi · 2019-12-30T04:14:47Z

Oh, the index does go near the end! A pixz file with 3 data blocks looks like this:

XZ stream header
Data block 1
Data block 2
Data block 3
pixz file index (in XZ data block 4)
XZ index
XZ stream footer

See the XZ file format info for more details on the XZ wrapper.
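To make the layout a bit more concrete, here is a minimal Python sketch that decodes the 12-byte XZ stream footer and uses its Backward Size field to find where the XZ index starts. It is not pixz code. The helper name `parse_stream_footer` is made up for illustration, and the sketch assumes a single stream with no stream padding after the footer.

```python
import lzma
import struct
import zlib

FOOTER_MAGIC = b"YZ"  # last two bytes of every XZ stream footer


def parse_stream_footer(footer: bytes) -> dict:
    """Decode a 12-byte XZ stream footer (hypothetical helper, not part of pixz)."""
    if len(footer) != 12 or footer[10:12] != FOOTER_MAGIC:
        raise ValueError("not an XZ stream footer")
    stored_crc, stored_backward = struct.unpack("<II", footer[:8])
    # The CRC32 covers the Backward Size and Stream Flags fields (6 bytes).
    if zlib.crc32(footer[4:10]) != stored_crc:
        raise ValueError("footer CRC32 mismatch")
    return {
        # Real size of the XZ index in bytes, as defined by the XZ spec.
        "index_size": (stored_backward + 1) * 4,
        # Low 4 bits of the second Stream Flags byte select the check type.
        "check_type": footer[9] & 0x0F,
    }


# Build a small .xz stream in memory so the example is self-contained.
data = lzma.compress(b"example payload" * 1000,
                     format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32)
footer = parse_stream_footer(data[-12:])
index_start = len(data) - 12 - footer["index_size"]
print(footer, index_start)
```

In a pixz file, the block holding the pixz file index would sit directly before `index_start`, which is why everything from there on has to be rewritten on append.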

The problem is that to append to a pixz file, you need to move the pixz file index. So it’s not trivial.

antmak · 2019-12-30T04:36:34Z

Ah sorry, you're right. Then I would like to understand in detail why adding a file to the archive is so costly. @eichin, how big is your index?

In my experience, yes, the index can grow to an unacceptable size (compared to a chunk of added data). I used a 2-level index (several first-level indexes spread over the file, and a second-level index ("index of indexes") at the end of the file). It works OK if data is mostly added sequentially, but that may not be acceptable for a general-purpose archiver.

abitrolly · 2020-11-23T21:02:56Z

What is the format of the index?

I am also curious how easy it is to remove a file from .tpxz? Would it be faster than a full recompression?

vasi · 2020-11-24T08:40:37Z

As mentioned above, there are actually two indexes in a pixz file: the XZ index, and the pixz tar file index. You can read about the XZ index in the link above. The pixz tar file index isn't documented, but if you read the code, you can see it's basically a bunch of filename/offset pairs.

It sounds very difficult to remove a file from a .tpxz archive. You'd have to recompress the partial blocks on each end, and rewrite both the pixz tar file index and the XZ index. If you find yourself frequently wanting to remove files from compressed archives, there are probably better archive formats out there!

abitrolly · 2020-11-24T08:59:02Z

From the spec, the XZ index is just a list of pairs Unpadded Size | Uncompressed Size for getting and checking the size of decompressed blocks. There is no place for user metadata there, so how does pixz add its own index and keep the file compatible with the original .tar.xz?

I cannot read C code as freely as a text spec. If I knew the binary structure, I could try to experiment with that algorithm. The problem is more generic than it seems: facebook/zstd#2396
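For reference, the record list described in the spec can be read with a few lines of Python. This is only a sketch of the XZ index layout (index indicator, record count, then Unpadded Size / Uncompressed Size pairs as multibyte integers). It says nothing about the pixz file index, and the function names `read_vli` and `read_index_records` are invented here. It assumes a single stream without stream padding and does not verify the index CRC32.

```python
import lzma
import struct


def read_vli(buf: bytes, pos: int):
    """Decode one XZ multibyte integer: 7 bits per byte, low bits first."""
    value = 0
    for shift in range(0, 63, 7):  # at most 9 bytes
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
    raise ValueError("multibyte integer longer than 9 bytes")


def read_index_records(xz: bytes):
    """Return [(unpadded_size, uncompressed_size), ...] from the XZ index."""
    # Backward Size sits in bytes 4..8 of the 12-byte stream footer.
    (stored_backward,) = struct.unpack("<I", xz[-8:-4])
    index_size = (stored_backward + 1) * 4
    index = xz[len(xz) - 12 - index_size : len(xz) - 12]
    if index[0] != 0x00:
        raise ValueError("missing index indicator")
    count, pos = read_vli(index, 1)
    records = []
    for _ in range(count):
        unpadded, pos = read_vli(index, pos)
        uncompressed, pos = read_vli(index, pos)
        records.append((unpadded, uncompressed))
    return records


payload = b"x" * 5000
xz = lzma.compress(payload, format=lzma.FORMAT_XZ)
records = read_index_records(xz)
print(records)  # one (unpadded, uncompressed) pair per block
```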

vasi · 2020-11-24T16:16:25Z

pixz's tar-mode uses (abuses?) the fact that tar-files always end with a couple of blocks full of zeros, and tar ignores anything after it. So pixz's index can just go after the end-of-file blocks, and it preserves compatibility with tar, even if you're using plain old xz to decompress the archive.
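A quick way to see the "tar ignores anything after the end-of-file blocks" behaviour is the sketch below. It uses Python's `tarfile` module as a stand-in for tar, and the trailing bytes are a made-up placeholder, not the real pixz index encoding.

```python
import io
import tarfile


def make_tar(name: str, payload: bytes) -> bytes:
    """Build a one-member tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


plain = make_tar("a.txt", b"hello")

# Placeholder for extra data stored after the end-of-archive zero blocks.
with_trailer = plain + b"HYPOTHETICAL-INDEX-BYTES" * 10

# Reading stops at the first all-zero block, so the trailer is never parsed.
with tarfile.open(fileobj=io.BytesIO(with_trailer), mode="r:") as tar:
    names = tar.getnames()
print(names)
```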

Rogdham · 2021-04-26T19:27:59Z

I'm not really interested in that use case, but I can share my thoughts anyways, it may help someone who is.

When you write to a plain tar file in append mode, here is what tar does:

1. Find the place where the next tar header would have been (this is within the tar archive, somewhere in the null-byte blocks at the end of the archive);

2. Write from that point on.
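As a rough illustration of the first step on an uncompressed tar, the sketch below walks the 512-byte headers until it reaches the first all-zero block. The helper `find_append_offset` is hypothetical. It only reads the plain octal size field, so it ignores GNU base-256 sizes and sparse files.

```python
import io
import tarfile

BLOCK = 512  # tar block size


def find_append_offset(raw: bytes) -> int:
    """Return the offset where the next tar header would be written."""
    offset = 0
    while offset + BLOCK <= len(raw):
        header = raw[offset : offset + BLOCK]
        if header == bytes(BLOCK):
            return offset  # first all-zero block: end-of-archive marker
        # Member size is an octal string at bytes 124..136 of the header.
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
        # Skip the header plus the data, rounded up to whole blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    raise ValueError("no end-of-archive marker found")


buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo("a.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"hello"))
raw = buf.getvalue()

append_at = find_append_offset(raw)
print(append_at, len(raw))
```

In a tpxz file this offset lives in the decompressed stream, so finding it means decompressing at least the last data block.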

Let me illustrate how much work is needed to do tar-append in tpxz format, with the example:

A pixz file with 3 data blocks looks like this:

1. XZ stream header

2. Data block 1

3. Data block 2

4. Data block 3

5. pixz file index (in XZ data block 4)

6. XZ index

7. XZ stream footer

Here is what you would need to do:

1. Save the XZ index in memory

2. Save the pixz file index in memory

3. Locate the place where the next tar record would be (it is probably inside data block 3, but could be in data block 2 in some edge cases) - this means decompressing the blocks

4. Re-write that block (and add new blocks as needed), adding the tar data to append, and updating the pixz file index as well as the XZ index in memory

5. Write the pixz file index (after compression)

6. Write the XZ index

7. Write the XZ stream footer (updated)
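To give an idea of the bookkeeping involved, here is a small sketch that turns XZ index records into block offsets within the file, which is what you would need in order to know where to truncate before rewriting. The record values are invented, `block_offsets` is a hypothetical helper, and the assumption that the last block holds the pixz file index comes from the layout shown above.

```python
STREAM_HEADER_SIZE = 12  # XZ stream header is always 12 bytes


def block_offsets(records):
    """Map (unpadded_size, uncompressed_size) records to file offsets.

    Returns the start offset of each block and the offset of the XZ index.
    """
    offsets = []
    pos = STREAM_HEADER_SIZE
    for unpadded, _uncompressed in records:
        offsets.append(pos)
        # Each block is padded to a multiple of 4 bytes in the file.
        pos += (unpadded + 3) // 4 * 4
    return offsets, pos


# Made-up records: three data blocks, then the block with the pixz file index.
records = [(1000, 4096), (1001, 4096), (1002, 4096), (90, 300)]
offsets, xz_index_start = block_offsets(records)

# Everything from the last data block onward would have to be rewritten.
truncate_at = offsets[-2]
print(offsets, xz_index_start, truncate_at)
```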

In other words, you cannot just “insert” an XZ block: you would need to modify existing ones, and overwrite the bytes that come after.

Also, you will face the issue that tar itself does not seem willing to work with compressed files in append mode (which does make sense), so you would need to add the CLI arguments to pixz.

$ tar -Ipixz --append --file=foo.tpxz bar.txt
tar: Cannot update compressed archives
Try 'tar --help' or 'tar --usage' for more information.

So it’s not trivial.

I agree, that would be quite a bit of work.

Rogdham mentioned this issue Apr 26, 2021

Questions about tpxz / file index format #96

Open
