Add ability to reference and use external data without adding to repo #4758
Comments
While it sounds tempting for your use case, I'm not sure that it's a good idea to implement something like that in restic. If I understand you correctly, you're suggesting to store references to raw files in the repository (as opposed to using a second restic repository as reference). That seems rather fragile, as restic would have to prepare for changes to the reference data (location). In addition, the repository layer would need a second data path to access the reference data and custom code to create the initial index for that data, along with lots of corner cases for prune etc.
Yes, I do mean storing references to data in raw files. I readily admit that this would make restic a bit more fragile in this specific use case. It would be an option that should only be used by people wearing the "I know what I'm doing and accept the consequences" hat. However, it would allow restic to be used in circumstances where having to maintain a complete extra copy of the entire dataset within a restic repo is space/cost prohibitive. As for the second data path pointing to that raw data tree, yes, that would need to be stored, and potentially be modifiable at restore time. This doesn't seem like it would be too onerous to support, as the processing path should be largely the same. And prune shouldn't be complicated much either, as it would still need to do all the same processing with regard to the metadata of the blocks, and would just bypass anything to do with "free"ing the actual data blocks.
The more I think about this suggestion, the more complex things become. A second data path alone is a nontrivial undertaking; this one, however, would also use a completely different format. The pack files would need to store a new blob type to hold the references to the underlying blobs (directly storing the references in the tree blobs would be even worse). This requires lots of new corner cases in quite a few places. If the restore performance should remain reasonable, the index will also require several changes that likely regress the memory usage of other use cases (or duplicate a lot of code again). Initially adding the blob references to the repository also requires quite a significant amount of new code. So no, this is anything but a simple change. We're talking about several thousand lines of code here along with a new repository format version. This definitely won't happen anytime soon and I'm not sure that it ever will. The additional complexity is just not worth it in my opinion.
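To make the "second data path" and the fragility concern concrete, here is a minimal sketch in Python. It is not restic's code or index format: `PackLocation`, `ExternalLocation`, `read_external`, and `load_blob` are hypothetical names invented for illustration, and the only assumption carried over from restic is that a blob is identified by the SHA-256 hash of its content.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PackLocation:
    """Blob stored inside the repository (the existing data path)."""
    pack_id: str
    offset: int
    length: int


@dataclass(frozen=True)
class ExternalLocation:
    """Hypothetical: blob that lives in a raw file outside the repository."""
    path: str
    offset: int
    length: int


def read_external(blob_id, loc):
    """Read a referenced blob and verify that it still matches its ID."""
    with open(loc.path, "rb") as f:
        f.seek(loc.offset)
        data = f.read(loc.length)
    if hashlib.sha256(data).hexdigest() != blob_id:
        # The reference file was modified, truncated, or replaced.
        raise ValueError("reference data changed: " + loc.path)
    return data


def load_blob(index, blob_id, read_pack):
    """Dispatch between the two data paths based on the index entry type."""
    loc = index[blob_id]
    if isinstance(loc, ExternalLocation):
        return read_external(blob_id, loc)
    return read_pack(loc)  # caller-supplied reader for regular pack files
```

Even in this toy form, every consumer of the index has to handle two location types, and a restore fails as soon as a referenced file no longer contains the expected bytes at the recorded offset.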
Output of
restic version
restic 0.16.4 (v0.16.4-0-g3786536dc) compiled with go1.21.8 on linux/amd64
What should restic do differently? Which functionality do you think we should add?
It would be great if it were possible to have
restic
be able to reference, index, and use (but not store in the repo) a source of external data to dramatically reduce the space needed for the repo when backing up very large datasets where an existing and already backed up prior version of the dataset is always available. This feature would be somewhat analogous to
rsync --compare-dest
but at a block level, taking advantage of deduplication between that immutable data source and restic.
What are you trying to do? What problem would this solve?
I have a very large dataset (dozens of TB) stored in a zfs pool. An earlier instance of this zfs dataset has been fully backed up offsite using other methods, and is still available as a snapshot within zfs. That earlier version of the data is safe and immutable and does not need to be backed up, and can always be provided to
restic
along with the repo when needing to restore, even if the system being backed up by restic
is catastrophically destroyed. If there were the ability to do something like
restic backup /some/zfs-snapshot --reference-only
where those files/blocks were able to be referenced and indexed but not actually stored in the repo, this would allow for storing diff-only "incremental" backups of the changes to this dataset using restic
and cloud services, while allowing the repo to be kept very small in comparison to the dataset being backed up. This would dramatically reduce the repo size and storage costs while still providing all the other benefits of restic.
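The idea can be sketched roughly as follows. This is an illustration of the requested behaviour only, not of anything restic implements: `chunk_ids` and `plan_backup` are hypothetical helpers, and fixed-size chunks are used for brevity, whereas restic splits files with content-defined chunking.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, for illustration; real chunks are far larger


def chunk_ids(data, size=CHUNK_SIZE):
    """Return the SHA-256 hex digest of each fixed-size chunk of `data`."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def plan_backup(reference, current, size=CHUNK_SIZE):
    """Split `current` into chunks and decide what must be stored.

    Returns (stored, referenced): `stored` maps digest -> chunk bytes for
    chunks absent from the reference data, and `referenced` lists digests
    of chunks that already exist in the reference data and so would only
    be indexed, not written to the repo.
    """
    known = set(chunk_ids(reference, size))
    stored = {}
    referenced = []
    for i in range(0, len(current), size):
        chunk = current[i:i + size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in known:
            referenced.append(digest)
        else:
            stored.setdefault(digest, chunk)
    return stored, referenced
```

For example, `plan_backup(b"AAAABBBBCCCC", b"AAAAXXXXCCCC")` would store only the `XXXX` chunk and reference the other two, which is the "diff-only" effect described above.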
Did restic help you today? Did it make you happy in any way?
restic is a very cool piece of kit, and it's great for use on smaller datasets.