
Add ability to reference and use external data without adding to repo #4758

Open · eharris opened this issue Apr 8, 2024 · 3 comments

eharris commented Apr 8, 2024

Output of restic version

restic 0.16.4 (v0.16.4-0-g3786536dc) compiled with go1.21.8 on linux/amd64

What should restic do differently? Which functionality do you think we should add?

It would be great if restic could reference, index, and use (but not store in the repo) an external source of data, dramatically reducing the space needed for the repo when backing up very large datasets for which an existing, already-backed-up prior version of the data is always available.

This feature would be somewhat analogous to rsync --compare-dest, but at the block level, taking advantage of restic's deduplication against that immutable data source.
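For comparison, the file-level rsync analogue looks roughly like this (paths are purely illustrative):

```sh
# rsync skips files that are identical to those under --compare-dest,
# so only changed or new files end up in the destination.
rsync -a --compare-dest=/mnt/immutable-copy/ /data/current/ /backup/incremental/
```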

What are you trying to do? What problem would this solve?

I have a very large dataset (dozens of TB) stored in a zfs pool. An earlier instance of this zfs dataset has been fully backed up offsite using other methods and is still available as a snapshot within zfs. That earlier version of the data is safe and immutable, does not need to be backed up, and can always be provided to restic along with the repo when a restore is needed, even if the system being backed up by restic is catastrophically destroyed.

If it were possible to do something like restic backup /some/zfs-snapshot --reference-only, where those files/blocks were referenced and indexed but not actually stored in the repo, restic and cloud services could be used to store diff-only "incremental" backups of changes to this dataset, while keeping the repo very small in comparison to the dataset being backed up.

This would dramatically reduce the repo size and storage costs while still providing all the other benefits of restic.
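A rough sketch of the imagined workflow, assuming the hypothetical --reference-only flag from above (repository URL and paths are placeholders):

```sh
# One-time: index the immutable zfs snapshot as reference-only data
# (blocks are indexed but not uploaded to the repo).
restic -r s3:s3.example.com/bucket backup /pool/dataset/.zfs/snapshot/baseline --reference-only

# Ongoing: back up the live dataset; blocks already present in the
# reference index would be skipped, so only the diff is stored.
restic -r s3:s3.example.com/bucket backup /pool/dataset
```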

Did restic help you today? Did it make you happy in any way?

restic is a very cool piece of kit, and it's great for use on smaller datasets.

@MichaelEischer (Member)

While it sounds tempting for your use case, I'm not sure that it's a good idea to implement something like that in restic. If I understand you correctly, you're suggesting storing references to raw files in the repository (as opposed to using a second restic repository as the reference). That seems rather fragile, as restic would have to be prepared for changes to the reference data (location). In addition, the repository layer would need a second data path to access the reference data, custom code to create the initial index for that data, and lots of corner cases for prune etc.

eharris commented Apr 10, 2024

Yes, I do mean storing references to data in raw files.

I readily admit that this would make restic a bit more fragile in this specific use case. It would be an option that should only be used by people wearing the "I know what I'm doing and accept the consequences" hat. However, it would allow restic to be used in circumstances where having to maintain a complete extra copy of the entire dataset within a restic repo is space/cost prohibitive.

As for the second data path pointing to that raw data tree: yes, it would need to be stored, and it should potentially be modifiable at restore time. That doesn't seem like it would be too onerous to support, as the processing path should be largely the same. Prune shouldn't be complicated much either, as it would still need to do all the same processing with regard to the block metadata and would just bypass anything related to freeing the actual data blocks.

@MichaelEischer (Member)

The more I think about this suggestion, the more complex things become. A second data path alone is a nontrivial undertaking; on top of that, this one would use a completely different format. The pack files would need a new blob type to store the references to the underlying blobs (directly storing the references in the tree blobs would be even worse), which requires handling lots of new corner cases in quite a few places. If restore performance is to remain reasonable, the index would also require several changes that would likely regress the memory usage of other use cases (or duplicate a lot of code yet again). And initially adding the blob references to the repository would also require a significant amount of new code.

So no, this is anything but a simple change. We're talking about several thousand lines of code here, along with a new repository format version.

This definitely won't happen anytime soon and I'm not sure that it ever will. The additional complexity is just not worth it in my opinion.
