Parse tar data backed up via stdin #2226
Comments
Duplicate of #437.
Ah, I actually like the idea. We even have an abstraction layer now (
This would also help in a situation where firewall rules forbid connections from the system-to-be-backed-up to the backup storage, but not the reverse direction. We can have a simple script on the system-to-be-backed-up that is invoked via ssh and
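A minimal sketch of that reverse-direction setup (hostnames, paths, repository location, and filenames below are illustrative placeholders, not from the thread): the backup host initiates the ssh connection, pulls a tar stream from the source, and feeds it to restic's stdin.

```shell
# Run on the backup host, which is allowed to reach the source machine.
# backup@source-host, /srv/data, and /srv/restic-repo are placeholders.
ssh backup@source-host 'tar -cf - -C /srv data' \
  | restic -r /srv/restic-repo backup --stdin --stdin-filename data.tar
```

The source machine only needs tar and an ssh server; restic and the repository credentials stay on the backup host.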
@fd0 does that mean restic plans to support streaming tar data to stdin?
Something I do often on machines where I don't want to install software or credentials is
This would also be great for backing up volumes from within docker which also uses tar under the hood, for example:
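A hedged sketch of that pattern (the volume name, image, and repository path are placeholders, not from the thread): a throwaway container tars the volume to stdout, and restic consumes the stream.

```shell
# "myvol" and /srv/restic-repo are placeholders; alpine is just a small
# image that ships tar. The container writes the tar stream to stdout.
docker run --rm -v myvol:/data alpine tar -cf - -C / data \
  | restic -r /srv/restic-repo backup --stdin --stdin-filename myvol.tar
```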
I think it is also important that the tar stream from stdin is not stored entirely in RAM, because a huge backup would not fit. This would allow streaming data from a remote source into a backup without storing the source on the local file system.
Note that this may not be a good solution for securely backing up remote systems. On a LAN it might work, but restic has no way to communicate to the sending side that it can skip a file based on the contents of the parent snapshot. The sender has to send every single byte regardless of what is already in the repository, and restic has to receive all of that data even if it is just going to discard it because the file didn't change. This could be incredibly slow over a WAN connection, and it also requires the sender to read all of the data from disk, which might be very slow. This feature could be useful in some niche cases, but I would argue that it should not be used across the board for secure remote backups as it would be horribly inefficient. A different solution would be required to implement this efficiently.
This would also be nice for things like Postgres tar dumps, e.g. something like pg_dump --format=tar | restic backup --stdin
This would also be nice for backing up Proxmox VMs/LXCs
👍 on stdin backups for database dumps. It's a great way to make a clean DB backup that doesn't disturb the app, since it only holds a read lock. @Kidswiss as for the tar case specifically, how about instead using
But then one needs to store the
The main use-case for streaming a
I meant: you mount the backup, which reveals the .zip file, and then you use zip on that (but you can of course use FUSE again to mount the .zip). I understand the streaming use case; it's just that it seems a bit specific. Tar isn't the nicest format, and it won't support
Given how restic chunks the data, backing up a large proper file system, or a single zip with all the contents, wouldn't yield the same boundaries, at least for the first and last chunk of each file. Thus, if the zip creation is not deterministic, or if lots of small files keep changing, the "single zip" route would create lots of changed chunks when in fact not much has changed.
Isn't that what the rolling hash is for? https://restic.net/blog/2015-09-12/restic-foundation1-cdc/ restic will find regions in the zip file that start a new boundary, and if you make a change in the zip file, it will only change a few chunks, especially if you also turn off zip compression.
First of all, there is the issue of deterministic
Then there is the issue of the
Also, given that
@cipriancraciun very good points, and they also hold for tar. You make a good case indeed for restic supporting GNU tar input as a virtual filesystem 👍 IMHO it would have to be behind a separate flag, though. If it were to parse any tar file as a subdirectory, there's no guarantee that it could generate the exact same tar file, and if the file were corrupted it would have to abort the backup.
This is exactly what this ticket proposes: to use
That would essentially mean implementing borg import-tar for restic.
My idea for using this feature: Google takeout works by giving you access to a series of very large .tgz files that you have to download. You can (and should) download those to your local computer/NAS, but if you want to do an offsite backup, you're going to be pushing a tonne of data over your slow home internet upload speeds. Instead, you can temporarily create a very small/cheap VPS in the cloud somewhere near the storagebox for your offsite backups and do something like:
without any of your data needing to be stored on disk.
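Such a relay pipeline might look like the following sketch (the URL, repository address, and filenames are placeholders, not from the thread): the archive is streamed from Google straight into the repository, so the VPS never holds it on disk.

```shell
# Run on the temporary VPS. $TAKEOUT_URL and the sftp repo address are
# placeholders; curl streams the download directly into restic's stdin.
curl -L "$TAKEOUT_URL" \
  | restic -r sftp:user@storagebox.example:/restic backup \
      --stdin --stdin-filename takeout-001.tgz
```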
I'd like to migrate borg repositories to restic. I already found some issues mentioning
One problem of the TAR format is that the size of each entry must be known in advance. So if you want, for instance, to back up multiple databases by combining the output of multiple command invocations into a TAR, you'd need to store the output of each database backup as a temporary file in order to create the TAR data to be piped to restic. Since restic does not seem to require knowing the length in advance (
@avonwyss Do you know a file format that would work?
@avonwyss what exactly? The size should be part of the tar headers. Edit: https://git.savannah.gnu.org/cgit/tar.git/tree/src/tar.h
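As a quick illustration of that point (a sketch assuming GNU or BSD tar; the ustar format is forced so the first 512-byte block is a plain file header): the size is a 12-byte octal field at offset 124 of each header, so tar must know it before emitting any file data.

```shell
# Create a 5-byte file and archive it in plain ustar format.
printf 'hello' > demo.txt
tar --format=ustar -cf demo.tar demo.txt
# The size field lives at header offset 124 (11 octal digits + NUL).
tail -c +125 demo.tar | head -c 11
# → 00000000005
```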
@lqb Yes, that's exactly the issue - it does not allow for streaming files of unknown size. @remram44 I'm afraid I don't know the perfect answer, especially since TAR is well supported in the *nix domain. Also note that the ZIP format has the same limitation in this regard (it requires known file sizes).
The HTTP protocol solved this issue in V1.1 with the chunked transfer encoding, where data is sent in chunks (each with a size header) and a chunk of size 0 ends the stream. This, together with HTTP headers, would allow transferring both metadata and data of files. (The newer HTTP versions also support streaming, but I think they are too complex and not well suited for such a task.)
Looking at the SMTP protocol, which is text-based, it uses a single dot on a line as an end-of-data marker. MIME uses a known separator to mark the end of data. With a text-based approach, detection of the end-of-data markers (or MIME separators) can easily be implemented in a very efficient manner using a DFA-based approach (i.e. a simple state machine, which is very cheap in terms of performance). These protocols are, however, text-based, and thus come with the cost of converting data to (usually) Base64 for transfer. In the context of mail services, that problem has also been addressed for SMTP with the CHUNKING extension (RFC 3030), which could actually work pretty well.
The downside of any such approach - or of a solution specific to restic - is of course that other tools will most likely not have built-in support for generating compliant output, but it would enable developers to create dedicated multi-file backup support through restic as needed.
This is what I've found so far:
You are right that a MIME archive (
Here's my test:
# Test data (binary)
mkdir dir
dd if=/dev/urandom of=file1.bin bs=1k count=2
dd if=/dev/urandom of=dir/file2.bin bs=1k count=1
# Create archive (I couldn't find a tool to do it)
printf -- 'MIME-Version: 1.0\nContent-Type: multipart/mixed;boundary=bb9aba0f55da\n\n' > archive.mime
printf -- '--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="file1.bin"\n\n' >> archive.mime
cat file1.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime
printf -- 'Content-type: application/octet-stream\nContent-disposition: inline; filename="dir/file2.bin"\n\n' >> archive.mime
cat dir/file2.bin >> archive.mime
printf -- '\n--bb9aba0f55da\n' >> archive.mime
# Unpack archive
rm -rf testextract && mkdir testextract && (cd testextract && munpack ../archive.mime)
ls -lR testextract/
# testextract/:
# drwxrwxr-x 2 remram remram 4096 May 3 12:10 dir
# -rw------- 1 remram remram 2048 May 3 12:10 file1.bin
# -rw------- 1 remram remram 0 May 3 12:10 part1
#
# testextract/dir:
# -rw------- 1 remram remram 1024 May 3 12:10 file2.bin

I would say however that this concern seems like a separate step from this issue, and maybe we should move this discussion to another ticket (once the TAR support is implemented).
edit: Looking into the code, restic wants the list of files first. So we'd need an archive format that lists all the files first, without sizes, then the file contents.
Thank you @remram44 for looking into the formats. Based on your edit, I guess that TAR (or ZIP or MIME) support is not something that fits easily into the existing codebase, since it requires the filenames upfront.
That being said, I came to this issue while looking for a solution to combine multiple STDIN backups into one snapshot, and issue #1873 has been closed in favor of this one. However, given the streaming nature of STDIN program output and the necessity to know the size in advance, I think that TAR support would not actually be a solution for the original issue. There is, however, another related issue, #2133, which is open.
Just to tie the different related topics together, there is also a pending pull request #3405 which aims to implement merging of snapshots. This would actually allow for building snapshots from archives such as TAR, ZIP or MIME, and would also work for the multiple-STDIN-files case (by backing up single files into snapshots which are then merged). So maybe finishing the work on the merge functionality would be the way forward for all of these open issues?
Output of restic version:
restic 0.9.4 compiled with go1.11.4 on darwin/amd64
What should restic do differently? Which functionality do you think we should add?
If someone streams tar data to restic to do a backup:

tar -cf - -C /veryimportantfolder . | restic backup --stdin
The whole thing will be saved as a single file. This makes restoring a single file very tedious, as the whole tar has to be restored, and it gets more painful the larger the tar file gets.
If restic parsed the tar file and "converted" the entries into restic-native file trees, it would be possible to create a virtual folder snapshot. This way a tar file is backed up, but single-file restore is still available.
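To make the pain concrete, a hedged sketch of a single-file restore as it works today (paths are placeholders; "stdin" is restic's default --stdin-filename): the entire archive has to come back before one entry can be unpacked.

```shell
# Restore the whole snapshot first -- the tar is one opaque blob.
restic -r /srv/restic-repo restore latest --target /tmp/full-restore
# Only then can a single entry be extracted from it.
tar -xf /tmp/full-restore/stdin veryimportantfolder/one-file.txt
```

With the proposed virtual folder snapshot, the second step would instead be a targeted restic restore of just that one path.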
What are you trying to do?
We use restic quite heavily in Kubernetes and OpenShift workloads where it's not always possible to give direct filesystem access to restic. So we stream quite a lot of tar files between containers to get the backups. This creates the problem described above.
This feature would complement #2123.
What do you think? Would something like this make sense?
Did restic help you or make you happy in any way?
Restic rocks :)