
Parse tar data backed up via stdin #2226

Open
Kidswiss opened this issue Mar 29, 2019 · 27 comments

@Kidswiss
Contributor

Output of restic version

restic 0.9.4 compiled with go1.11.4 on darwin/amd64

What should restic do differently? Which functionality do you think we should add?

If someone streams tar data to restic to do a backup:

tar -cf - -C /veryimportantfolder . | restic backup --stdin

The whole thing will be saved as a single file. This makes restoring a single file very tedious, as the whole tar has to be restored, and it gets more painful the larger the tar file is.

If restic parsed the tar file and "converted" the entries into restic-native file trees, it would be possible to create a virtual folder snapshot. This way a tar file is backed up, but single-file restore is still available.

What are you trying to do?

We use restic quite heavily in Kubernetes and OpenShift workloads where it's not always possible to give direct filesystem access to restic. So we stream quite a lot of tar files between containers to get the backups. This creates the problem described above.

This feature would complement #2123.

What do you think? Would something like this make sense?

Did restic help you or make you happy in any way?

Restic rocks :)

@cdhowie
Contributor

cdhowie commented Mar 29, 2019

Duplicate of #437.

@fd0
Member

fd0 commented Apr 27, 2019

Ah, I actually like the idea. We even have an abstraction layer now (fs.FS) which could perhaps be used to implement a tar file system.

@fd0 fd0 added the type: feature suggestion suggesting a new feature label Apr 27, 2019
@eikevons

eikevons commented May 1, 2019

This would also help in situations where firewall rules forbid connections from the system-to-be-backed-up to the backup storage, but not in the reverse direction. We can have a simple script on the system-to-be-backed-up that is invoked via ssh and tars to stdout, bypassing the need to make the whole system available through sshfs.

@alallier

@fd0 does that mean Restic plans to support streaming tar data to stdin?

@FiloSottile
Contributor

Something I do often on machines where I don't want to install software or credentials is ssh machine.home.arpa tar cv ~. It would be awesome to be able to pipe that into restic and have it understand it as a filesystem.

@jinnko

jinnko commented Nov 8, 2020

This would also be great for backing up volumes from within Docker, which also uses tar under the hood, for example:

docker cp running-or-stopped-container:/path/to/volume - | restic backup --stdin

@rawtaz rawtaz changed the title Parse tar data backupped via stdin Parse tar data backed up via stdin Nov 8, 2020
@Legion2

Legion2 commented Dec 31, 2020

I think it is also important that the tar stdin is not stored completely in RAM, because a huge backup would not fit. This would allow streaming data from a remote source into a backup without storing the source on the local file system.

@cdhowie
Contributor

cdhowie commented Jan 1, 2021

Note that this may not be a good solution for securely backing up remote systems. On a LAN it might work, but restic has no way to communicate to the sending side that it can skip a file based on the contents of the parent snapshot. The sender has to send every single byte regardless of what is already in the repository, and restic has to receive all of that data even if it is just going to discard it because the file didn't change. This could be incredibly slow over a WAN connection, and it also requires the sender to read all of the data from disk, which might be very slow.

This feature could be useful in some niche cases, but I would argue that it should not be used across the board for secure remote backups as it would be horribly inefficient. A different solution would be required to implement this efficiently.

@jcotton42

This would also be nice for things like postgres tar dumps, eg something like

pg_dump --format=tar | restic backup --stdin

@jniggemann
Contributor

This would also be nice for backing up proxmox VMs / LXCs
vzdump 103 --mode snapshot --stdout | restic backup --stdin

@wmertens

👍 on stdin backups for database dumps. It's a great way to make a clean DB backup that doesn't disturb the app since it only holds a read lock.

@Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and read only the parts it needs. Tar doesn't have an index.

@cipriancraciun

Kidswiss as for the tar case specifically, how about instead using zip to stdout with 0 compression, and then mounting the backup via FUSE? That should allow zip to directly access the index and read only the parts it needs. Tar doesn't have an index.

But then one needs to store the zip locally in order to have it mounted. (If one is dumping multi-TiB data sources, with tar one only needs the patience to stream it, whereas with zip one also needs to store it temporarily.)

The main use-case for streaming a tar but backing it up via restic as if it were a proper file-system is, as @FiloSottile has mentioned, being able to ssh into an untrusted server, create a full tar of the target file-system, stream it over ssh to a trusted staging server (one that perhaps doesn't have the storage capacity to temporarily store the tar), and feed it to restic.

@wmertens

But then one needs to store the zip locally in order to have it mounted

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

I understand the streaming use case; it's just that it seems a bit specific. Tar isn't the nicest format, and it won't support vzdump either, because that's not tar. OTOH, tar is really popular, so if restic were to support something like this, tar seems a good candidate.

@cipriancraciun

I meant, you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip)

Given how restic chunks the data, backing up a large proper file-system versus a single zip with all the contents wouldn't yield the same boundaries, at least for the first and last chunk of each file.

Thus, if the zip creation is not deterministic, or if lots of small files keep changing, then the "single zip" route would just create lots of changed chunks, when in fact not that much has changed.

@wmertens

Given how restic chunks the data, backing up a large proper file-system versus a single zip with all the contents wouldn't yield the same boundaries, at least for the first and last chunk of each file.

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/

restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.
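A toy illustration of content-defined chunking (a plain byte-sum rolling hash over a small sliding window, deliberately much simpler than restic's actual Rabin fingerprint and ~1 MiB chunk target): because boundaries depend only on local content, a local change only moves the cut points near the change, and later boundaries resynchronize.

```go
package main

import "fmt"

// chunkBoundaries finds cut points with a toy rolling hash: a plain sum
// over a 16-byte sliding window, cutting when sum % 256 == 0. This is
// only to illustrate why boundaries resynchronize after a local change;
// it is NOT restic's algorithm.
func chunkBoundaries(data []byte) []int {
	const window = 16
	var sum int
	var cuts []int
	for i, b := range data {
		sum += int(b)
		if i >= window {
			sum -= int(data[i-window])
		}
		if i >= window && sum%256 == 0 {
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

func main() {
	data := make([]byte, 4096)
	for i := range data {
		data[i] = byte(i*31 + 7) // deterministic pseudo-random content
	}
	before := chunkBoundaries(data)

	// Flip one byte near the start: only cut points whose window covers
	// index 10 can move; everything further along stays identical.
	data[10] ^= 0xFF
	after := chunkBoundaries(data)
	fmt.Println("cuts before:", len(before), "cuts after:", len(after))
}
```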

@cipriancraciun

Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/
restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.

First of all, there is the issue of deterministic zip creation. If there are lots of small files and their order changes non-deterministically, then deduplication would certainly not work properly unless the chunk size is well below the average file size. (In the case of restic, the documentation states it aims at a 1 MiB chunk size, thus well above the average small-file size.)

Then there is the issue of the zip format itself. Each file's data is prefixed by a header which contains the modification time. Thus, if something touches a file (without changing the contents), that chunk will be seen as changed and not deduplicated. If restic operates on a proper file-system, the data is not stored again; only a new file entry is created.

Also, given that restic aims at chunks of 1 MiB in size, changing a file of 1 KiB would imply storing a new chunk (from the zip stream), thus 99.9% waste. On the other hand, if restic operates on a proper file-system, it would just store that 1 KiB and move on.

@wmertens

wmertens commented Sep 29, 2022

@cipriancraciun very good points and they also hold for tar.

You make a good case indeed for restic supporting GNU tar input as a virtual filesystem 👍

IMHO it would have to be behind a separate flag, though. If restic parsed any tar file as a subdirectory, there's no guarantee that it could regenerate the exact same tar file, and if the file were corrupted it would have to abort the backup.
I suppose it could retry a failed tar as a regular file when reading from disk, but not when reading from stdin.

@cipriancraciun

If it were to parse any tar file as a subdirectory, there's no guarantee that it can generate the exact same tar file, and if the file were corrupted it would have to abort the backup.

This is exactly what this ticket proposes: use tar over stdin as an alternative to walking the file-system (in essence, a tar contains all the metadata restic would obtain from the proper file-system). Thus, after restic consumes the tar and creates the snapshot, there would be no further mention of the initial tar, and the newly created snapshot would be identical to a similar snapshot created from the proper file-system.

@MichaelEischer
Member

That would essentially mean implementing borg import-tar for restic.

@allisonkarlitskaya

allisonkarlitskaya commented Sep 18, 2023

My idea for using this feature: Google Takeout works by giving you access to a series of very large .tgz files that you have to download. You can (and should) download those to your local computer/NAS, but if you want to do an offsite backup, you're going to be pushing a tonne of data over your slow home-internet upload speed.

Instead, you can temporarily create a very small/cheap VPS in the cloud somewhere near the storage box for your offsite backups and do something like:

curl https://path/to/takeout.tgz | gzip -d | restic backup --from-tar - ...

and have none of your data need to be stored on disk.

@lqb

lqb commented Apr 15, 2024

I'd like to migrate borg repositories to restic. I already found some issues mentioning tar c /path | restic backup --from-tar (#784, #1910) and tried borg export-tar ::archive - | restic backup --from-tar, which didn't work.
The reverse migration from restic to borg already works with restic dump aa69bb92 / | borg import-tar ::aa69bb92 -

@avonwyss

avonwyss commented May 3, 2024

One problem of the TAR format is that the size of each entry must be known in advance. So if you want, for instance, to back up multiple databases by combining the output of multiple command invocations into a TAR, you'd need to store the output of each database backup as a temporary file in order to create the TAR data to be piped to Restic. Since Restic itself does not require knowing the length in advance (--stdin), TAR support would not really solve the case of efficiently combining multiple program outputs into one data stream for backing up with Restic.

@remram44

remram44 commented May 3, 2024

@avonwyss Do you know a file format that would work?

@lqb

lqb commented May 3, 2024

@avonwyss what exactly? The size should be part of the tar headers.

Edit: https://git.savannah.gnu.org/cgit/tar.git/tree/src/tar.h

@avonwyss

avonwyss commented May 3, 2024

@lqb Yes, that's exactly the issue: it does not allow streaming files of unknown size.

@remram44 I'm afraid I don't know the perfect answer, especially since TAR is well supported in the *nix domain. Also note that the ZIP format has the same limitation in this regard (it requires known file sizes).

The HTTP protocol solved this issue in 1.1 with the chunked transfer encoding, where data is sent in chunks (each with a size header) and the stream ends when a chunk of size 0 is transmitted. This, together with HTTP headers, would allow transferring both metadata and data of files. (The newer HTTP versions also support streaming, but I think they are too complex and not well suited for such a task.)

Looking at the SMTP protocol, which is text-based, it uses a single dot on a line as the end-of-data marker. MIME uses a known separator to mark the end of data. With a text-based approach, detection of the end-of-data markers (or MIME separators) can easily be implemented in a very efficient manner using a DFA-based approach (e.g. a simple state machine, which is very cheap in terms of performance). These protocols are, however, text-based, and thus come with the cost of converting data to (usually) Base64 for transfer.

In the context of mail services, that problem has also been addressed for SMTP with the chunking extension (RFC 3030), which could actually work pretty well.

The downside of any such approach, or of a solution specific to Restic, is of course that other tools will most likely not have built-in support for generating compliant output, but it would enable developers to create dedicated multi-file backup support for Restic as needed.

@remram44

remram44 commented May 3, 2024

This is what I've found so far:
  • TAR: writing needs sizes upfront; reading is streamed (no random access), but earlier files might be overwritten by later entries
  • ZIP: writing needs sizes upfront; reading is streamed, but files might be overwritten or deleted (the central directory record has the accurate list and permits random access, but it sits at the end of the file)
  • MIME multipart/mixed archive: writing is streamed; reading is streamed (no random access); no support for UNIX permissions

You are right that a MIME archive (multipart/mixed) is the simplest standard format that allows streaming multiple files (without UNIX metadata). I found that using base64 is not actually required. This format is actually used by cloud-init in addition to emails and HTTP transfers.

Here's my test:
# Test data (binary)
mkdir dir
dd if=/dev/urandom of=file1.bin bs=1k count=2
dd if=/dev/urandom of=dir/file2.bin bs=1k count=1

# Create archive (I couldn't find a tool to do it)
printf -- 'MIME-Version: 1.0\nContent-Type: multipart/mixed;boundary=bb9aba0f55da\n\n' > archive.mime
printf -- '--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="file1.bin"\n\n' >> archive.mime
cat file1.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="dir/file2.bin"\n\n' >> archive.mime
cat dir/file2.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime

# Unpack archive
rm -rf testextract && mkdir testextract && (cd testextract && munpack ../archive.mime)
ls -lR testextract/
# testextract/:
# drwxrwxr-x 2 remram remram 4096 May  3 12:10 dir
# -rw------- 1 remram remram 2048 May  3 12:10 file1.bin
# -rw------- 1 remram remram    0 May  3 12:10 part1
#
# testextract/dir:
# -rw------- 1 remram remram 1024 May  3 12:10 file2.bin

I would say, however, that this concern seems like a separate step from this issue, and maybe we should move this discussion to another ticket (once the TAR support is implemented).

edit: Looking into the code, Restic wants the list of files first. So we'd need an archive format that lists all the files first, without sizes, then the file contents.

@avonwyss

avonwyss commented May 3, 2024

Thank you @remram44 for looking into the formats. Based on your edit, I guess that TAR (or ZIP or MIME) support is not something that fits easily into the existing codebase, since it requires the filenames upfront.

That being said, I came to this issue while looking for a solution to combine multiple STDIN backups into one snapshot, and issue #1873 has been closed in favor of this one. However, given the streaming nature of STDIN program output and the necessity to know the size in advance, I think that TAR support would not actually be a solution for the original issue. There is, however, another related issue, #2133, which is open.

Just to bring the different related topics together: there is also a pending pull request, #3405, which aims to implement merging of snapshots. This would actually allow building snapshots from archives such as TAR, ZIP, or MIME, and would also work for the multiple-STDIN-files case (by backing up single files into snapshots that are then merged). So maybe finishing the work on the merge functionality would be the way forward for all of these open issues?
