
Parse tar data backed up via stdin #2226

Open
Kidswiss opened this issue Mar 29, 2019 · 27 comments

@Kidswiss
Contributor

Output of restic version

restic 0.9.4 compiled with go1.11.4 on darwin/amd64

What should restic do differently? Which functionality do you think we should add?

If someone streams tar data to restic to do a backup:

tar -cf - -C /veryimportantfolder . | restic backup --stdin

The whole thing will be saved as a single file. This makes restoring a single file very tedious, as the whole tar has to be restored, and it gets more painful the larger the tar file is.

If restic parsed the tar file and "converted" the entries into restic-native file trees, it would be possible to create a virtual folder snapshot. This way a tar file is backed up, but single-file restore is still available.

What are you trying to do?

We use restic quite heavily in Kubernetes and OpenShift workloads where it's not always possible to give direct filesystem access to restic. So we stream quite a lot of tar files between containers to get the backups. This creates the problem described above.

This feature would complement #2123.

What do you think? Would something like this make sense?

Did restic help you or make you happy in any way?

Restic rocks :)

@cdhowie
Contributor

cdhowie commented Mar 29, 2019

Duplicate of #437.

@fd0
Member

fd0 commented Apr 27, 2019

Ah, I actually like the idea. We even have an abstraction layer now (fs.FS) which could perhaps be used to implement a tar file system.

@fd0 fd0 added the type: feature suggestion suggesting a new feature label Apr 27, 2019
@eikevons

eikevons commented May 1, 2019

This would also help in situations where firewall rules forbid connections from the system-to-be-backed-up to the backup storage, but not in the reverse direction. We can have a simple script on the system-to-be-backed-up that is invoked via ssh and tars to stdout, bypassing the need to make the whole system available through sshfs.

@alallier

@fd0 does that mean Restic plans to support streaming tar data to stdin?

@FiloSottile
Contributor

Something I do often on machines where I don't want to install software or credentials is ssh machine.home.arpa tar cv ~. It would be awesome to be able to pipe that into restic and have it understand it as a filesystem.

@jinnko

jinnko commented Nov 8, 2020

This would also be great for backing up volumes from within Docker, which also uses tar under the hood, for example:

docker cp running-or-stopped-container:/path/to/volume - | restic backup --stdin

@rawtaz rawtaz changed the title Parse tar data backupped via stdin Parse tar data backed up via stdin Nov 8, 2020
@Legion2

Legion2 commented Dec 31, 2020

I think it is also important that the tar stdin is not stored completely in RAM, because a huge backup would not fit. This would allow streaming data from a remote source into a backup without storing the source on the local file system.

@cdhowie
Contributor

cdhowie commented Jan 1, 2021

Note that this may not be a good solution for securely backing up remote systems. On a LAN it might work, but restic has no way to communicate to the sending side that it can skip a file based on the contents of the parent snapshot. The sender has to send every single byte regardless of what is already in the repository, and restic has to receive all of that data even if it is just going to discard it because the file didn't change. This could be incredibly slow over a WAN connection, and it also requires the sender to read all of the data from disk, which might be very slow.

This feature could be useful in some niche cases, but I would argue that it should not be used across the board for secure remote backups as it would be horribly inefficient. A different solution would be required to implement this efficiently.

@jcotton42

This would also be nice for things like postgres tar dumps, eg something like

pg_dump --format=tar | restic backup --stdin

@jniggemann
Contributor

This would also be nice for backing up proxmox VMs / LXCs
vzdump 103 --mode snapshot --stdout | restic backup --stdin

@wmertens

👍 on stdin backups for database dumps. It's a great way to make a clean DB backup that doesn't disturb the app since it only holds a read lock.

@Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and read only the parts it needs. Tar doesn't have an index.

@cipriancraciun

Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and read only the parts it needs. Tar doesn't have an index.

But then one needs to store the zip locally in order to have it mounted. (If one is dumping multi-TiB data sources, with tar one only needs the patience to stream it, whereas with zip one also needs to store it temporarily.)

The main use-case for streaming a tar but backing it up via restic as if it were a proper file-system is, as @FiloSottile has mentioned, being able to ssh into an untrusted server, create a full tar of the target file-system, stream it over ssh to a trusted staging server (one that perhaps doesn't have the storage capacity to temporarily store the tar), and feed it to restic.

@wmertens

But then one needs to store the zip locally in order to have it mounted

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

I understand the streaming use case; it's just that it seems a bit specific. Tar isn't the nicest format, and it won't support vzdump either, because that's not tar. OTOH, tar is really popular, so if restic were to support something like this, tar seems a good candidate.

@cipriancraciun

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

Given how restic chunks the data, backing up a large proper file-system versus a single zip with all the contents wouldn't yield the same boundaries, at least for the first and last chunk of each file.

Thus, if the zip creation is not deterministic, or if lots of small files keep changing, then the "single zip" route would just create lots of changed chunks, when in fact not that much has changed.

@wmertens

Given how restic chunks the data, backing up a large proper file-system versus a single zip with all the contents wouldn't yield the same boundaries, at least for the first and last chunk of each file.

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/

restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.
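A toy illustration of content-defined chunking (a plain byte-sum rolling hash over a small sliding window, deliberately much simpler than restic's actual Rabin fingerprint and ~1 MiB chunk target): because boundaries depend only on local content, a local change only moves the cut points near the change, and later boundaries resynchronize.

```go
package main

import "fmt"

// chunkBoundaries finds cut points with a toy rolling hash: a plain sum
// over a 16-byte sliding window, cutting when sum % 256 == 0. This is
// only to illustrate why boundaries resynchronize after a local change;
// it is NOT restic's algorithm.
func chunkBoundaries(data []byte) []int {
	const window = 16
	var sum int
	var cuts []int
	for i, b := range data {
		sum += int(b)
		if i >= window {
			sum -= int(data[i-window])
		}
		if i >= window && sum%256 == 0 {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

func main() {
	data := make([]byte, 4096)
	for i := range data {
		data[i] = byte(i*31 + 7) // deterministic pseudo-random content
	}
	before := chunkBoundaries(data)

	// Flip one byte near the start: only cut points whose window covers
	// index 10 can move; everything further along stays identical.
	data[10] ^= 0xFF
	after := chunkBoundaries(data)
	fmt.Println("cuts before:", len(before), "cuts after:", len(after))
}
```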

@cipriancraciun

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/
restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.

First of all, there is the issue of deterministic zip creation. If there are lots of small files and their order changes non-deterministically, then deduplication would certainly not work properly unless the chunk size is well below the average file size. (In the case of restic, the documentation states it aims at a 1 MiB chunk size, thus well above the average small-file size.)

Then there is the issue of the zip format itself. Each file's data is prefixed by a header which contains the modification time. Thus, if something touches a file (without changing the contents), that chunk will be seen as changed and not deduplicated. If restic operates on a proper file-system, the data is not stored again; only a new file entry is created.

Also, given that restic aims at chunks of 1 MiB in size, changing a file of 1 KiB would imply storing a new chunk (from the zip stream), thus 99.9% waste. On the other hand, if restic operates on a proper file-system, it would just store that 1 KiB and move on.

@wmertens

wmertens commented Sep 29, 2022

@cipriancraciun very good points and they also hold for tar.

You make a good case indeed for restic supporting GNU tar input as a virtual filesystem 👍

IMHO it would have to be behind a separate flag, though. If restic parsed any tar file as a subdirectory, there's no guarantee that it could regenerate the exact same tar file, and if the file were corrupted it would have to abort the backup.
I suppose it could retry a failed tar as a regular file when reading from disk, but not when reading from stdin.

@cipriancraciun

If it were to parse any tar file as a subdirectory, there's no guarantee that it can generate the exact same tar file, and if the file were corrupted it would have to abort the backup.

This is exactly what this ticket proposes: use tar over stdin as an alternative to walking the file-system (in essence, a tar contains all the metadata restic would obtain from the proper file-system). Thus, after restic consumes the tar and creates the snapshot, there would be no further mention of the initial tar, and the newly created snapshot would be identical to a similar snapshot created from the proper file-system.

@MichaelEischer
Member

That would essentially mean implementing borg import-tar for restic.

@allisonkarlitskaya

allisonkarlitskaya commented Sep 18, 2023

My idea for using this feature: Google Takeout works by giving you access to a series of very large .tgz files that you have to download. You can (and should) download those to your local computer/NAS, but if you want to do an offsite backup, you're going to be pushing a tonne of data over your slow home-internet upload speed.

Instead, you can temporarily create a very small/cheap VPS in the cloud somewhere near the storage box for your offsite backups and do something like:

curl https://path/to/takeout.tgz | gzip -d | restic backup --from-tar - ...

and have none of your data need to be stored on disk.

@lqb

lqb commented Apr 15, 2024

I'd like to migrate borg repositories to restic. I already found some issues mentioning tar c /path | restic backup --from-tar (#784, #1910) and tried borg export-tar ::archive - | restic backup --from-tar, which didn't work.
The reverse migration from restic to borg already works with restic dump aa69bb92 / | borg import-tar ::aa69bb92 -

@avonwyss

avonwyss commented May 3, 2024

One problem of the TAR format is that the size of each entry must be known in advance. So if you want, for instance, to back up multiple databases by combining the output of multiple command invocations into a TAR, you'd need to store the output of each database backup as a temporary file in order to create the TAR data to be piped to Restic. Since Restic itself does not require knowing the length in advance (--stdin), TAR support would not really solve the case of efficiently combining multiple program outputs into one data stream for backing up with Restic.

@remram44

remram44 commented May 3, 2024

@avonwyss Do you know a file format that would work?

@lqb

lqb commented May 3, 2024

@avonwyss what exactly? The size should be part of the tar headers.

Edit: https://git.savannah.gnu.org/cgit/tar.git/tree/src/tar.h

@avonwyss

avonwyss commented May 3, 2024

@lqb Yes, that's exactly the issue: it does not allow streaming files of unknown size.

@remram44 I'm afraid I don't know the perfect answer, especially since TAR is well supported in the *nix domain. Also note that the ZIP format has the same limitation in this regard (it requires known file sizes).

The HTTP protocol solved this issue in 1.1 with the chunked transfer encoding, where data is sent in chunks (each with a size header) and the stream ends when a chunk of size 0 is transmitted. This, together with HTTP headers, would allow transferring both metadata and data of files. (The newer HTTP versions also support streaming, but I think they are too complex and not well suited for such a task.)

Looking at the SMTP protocol, which is text-based, it uses a single dot on a line as the end-of-data marker. MIME uses a known separator to mark the end of data. With a text-based approach, detection of the end-of-data markers (or MIME separators) can easily be implemented in a very efficient manner using a DFA-based approach (e.g. a simple state machine, which is very cheap in terms of performance). These protocols are, however, text-based, and thus come with the cost of converting data to (usually) Base64 for transfer.

In the context of mail services, that problem has also been addressed for SMTP with the chunking extension (RFC 3030), which could actually work pretty well.

The downside of any such approach, or of a solution specific to Restic, is of course that other tools will most likely not have built-in support for generating compliant output, but it would enable developers to create dedicated multi-file backup support for Restic as needed.

@remram44

remram44 commented May 3, 2024

This is what I've found so far:
  • TAR: writing needs sizes upfront; reading is streamed (no random access), but earlier files might be overwritten by later entries
  • ZIP: writing needs sizes upfront; reading is streamed, but files might be overwritten or deleted (the central directory record has the accurate list and permits random access, but it sits at the end of the file)
  • MIME multipart/mixed archive: writing is streamed; reading is streamed (no random access); no support for UNIX permissions

You are right that a MIME archive (multipart/mixed) is the simplest standard format that allows streaming multiple files (without UNIX metadata). I found that using base64 is not actually required. This format is actually used by cloud-init in addition to emails and HTTP transfers.

Here's my test:
# Test data (binary)
mkdir dir
dd if=/dev/urandom of=file1.bin bs=1k count=2
dd if=/dev/urandom of=dir/file2.bin bs=1k count=1

# Create archive (I couldn't find a tool to do it)
printf -- 'MIME-Version: 1.0\nContent-Type: multipart/mixed;boundary=bb9aba0f55da\n\n' > archive.mime
printf -- '--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="file1.bin"\n\n' >> archive.mime
cat file1.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="dir/file2.bin"\n\n' >> archive.mime
cat dir/file2.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime

# Unpack archive
rm -rf testextract && mkdir testextract && (cd testextract && munpack ../archive.mime)
ls -lR testextract/
# testextract/:
# drwxrwxr-x 2 remram remram 4096 May  3 12:10 dir
# -rw------- 1 remram remram 2048 May  3 12:10 file1.bin
# -rw------- 1 remram remram    0 May  3 12:10 part1
#
# testextract/dir:
# -rw------- 1 remram remram 1024 May  3 12:10 file2.bin

I would say, however, that this concern seems like a separate step from this issue, and maybe we should move this discussion to another ticket (once the TAR support is implemented).

edit: Looking into the code, Restic wants the list of files first. So we'd need an archive format that lists all the files first, without sizes, then the file contents.

@avonwyss

avonwyss commented May 3, 2024

Thank you @remram44 for looking into the formats. Based on your edit, I guess that TAR (or ZIP or MIME) support is not something that fits easily into the existing codebase, since it requires the filenames upfront.

That being said, I came to this issue while looking for a solution to combine multiple STDIN backups into one snapshot, and issue #1873 has been closed in favor of this one. However, given the streaming nature of STDIN program output and the necessity to know the size in advance, I think that TAR support would not actually be a solution for the original issue. There is, however, another related issue, #2133, which is open.

Just to bring the different related topics together: there is also a pending pull request, #3405, which aims to implement merging of snapshots. This would actually allow building snapshots from archives such as TAR, ZIP, or MIME, and would also work for the multiple-STDIN-files case (by backing up single files into snapshots that are then merged). So maybe finishing the work on the merge functionality would be the way forward for all of these open issues?
