Implement Compression #21
Comments
When implementing this, add benchmarks and especially have a look at memory usage with -benchmem and benchcmp!
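For illustration, a minimal sketch of such a benchmark, run with `go test -bench . -benchmem`; the `compressChunk` helper and the use of gzip are hypothetical placeholders, not a decision about which algorithm restic should use:

```go
package compress_test

import (
	"bytes"
	"compress/gzip"
	"math/rand"
	"testing"
)

// compressChunk is a hypothetical helper standing in for whatever
// compression restic ends up using; here it is plain gzip.
func compressChunk(dst *bytes.Buffer, src []byte) error {
	w := gzip.NewWriter(dst)
	if _, err := w.Write(src); err != nil {
		return err
	}
	return w.Close()
}

func BenchmarkCompressChunk(b *testing.B) {
	// 1 MiB of pseudo-random data as a stand-in for an average chunk.
	chunk := make([]byte, 1<<20)
	rand.Read(chunk)

	b.ReportAllocs() // reports the same allocation figures as -benchmem
	b.SetBytes(int64(len(chunk)))
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := compressChunk(&buf, chunk); err != nil {
			b.Fatal(err)
		}
	}
}
```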
lz4, lzo, lzma, null. bz2 is rather slow.
snappy is fast with moderate compression |
If compression is done per-chunk, care should be taken that it doesn't leave restic backups open to watermarking/fingerprinting attacks. This is essentially the same problem we discussed related to fingerprinting the CDC deduplication process: With salted CDC, I assume compression would happen on each individual chunk, after splitting the problematic file into chunks. Restic chunks are in the range of 512 KB to 8 MB (but not evenly distributed - right?).
As always, a paranoid and highly unscientific stream of consciousness. Thoughts?
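For concreteness, a rough sketch of the chunk-then-compress flow described above, using restic's chunker package; the API details are recalled from memory and may differ, and gzip is only a stand-in for whatever algorithm would actually be chosen:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"

	"github.com/restic/chunker"
)

func main() {
	f, err := os.Open("problematic-file")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// A per-repository random polynomial salts the chunk boundaries.
	pol, err := chunker.RandomPolynomial()
	if err != nil {
		panic(err)
	}

	c := chunker.New(f, pol)
	buf := make([]byte, chunker.MaxSize) // chunk sizes fall between MinSize and MaxSize

	for {
		chunk, err := c.Next(buf)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}

		// Compress each chunk individually; the compressed length is what a
		// size-based fingerprinting attack would be able to observe.
		var out bytes.Buffer
		zw := gzip.NewWriter(&out)
		zw.Write(chunk.Data)
		zw.Close()
		fmt.Printf("chunk %d bytes -> %d bytes compressed\n", chunk.Length, out.Len())
	}
}
```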
Interesting. I don't understand how the attack you describe depends on whether compression is used; is compression even necessary for it? Doesn't this attack work both with and without compression? At the moment I'm thinking about how to implement #56. What are your thoughts on bundling together several blobs into a single file?
The exact workings of the CDC implementation are a bit unclear to me:
To answer your first question: Example: I see some potential solutions, perhaps you have more ideas:
Thanks for your explanation, I understand your scenario now. There will probably be no option to turn off deduplication in restic, because that's really hard to do given the current structure of the program; restic is built around CDC. Adding compression support is of low priority right now and not a goal for the alpha release. Your third idea will be implemented in #56 (bundling together multiple chunks), I'm working on that right now. And I'll probably add more documentation. Thanks again for bringing up this scenario!
I am not sure if I follow @cfcs - isn't the compressed size just like an incredibly bad hash? Given a compressed file size, the number of possible inputs that generate that file size is endless. But probably I just don't understand. Anyway. I just wanted to shamelessly point you to a modified deflate/gzip library I made. It might interest you that I put in a constant time compression mode, which enables ~150MB/s/core throughput on ANY data, which makes it almost invisible in backup scenarios. There is also a parallel gzip package that gzip-compresses bigger files on multiple cores.
@klauspost: Bear in mind that we are discussing the compressed sizes of individual chunks rather than files. A 100GB file with an average chunk size of 1MB will have about 100 x 1024 chunks, each leaking the compression ratio for a certain piece of the file. This results in much more statistical data than the size of a single compressed file, making it possible to compare a known plaintext to a compressed and chunked file even if the CDC salt (and thus the exact alignment borders of chunks) is unknown.
In my opinion, this information leak is only of minor importance, considering we have a seeded chunker (via the custom per-repo polynomial) and blob packing (where an attacker cannot see individual chunks, but only groups of chunks and the number of chunks in a group, via the cleartext header length). I think it's a good idea to offer disabling compression completely for speed or privacy concerns, but the default (when this is implemented) will probably be "enabled". @klauspost I will definitely look at your library, thanks for pointing it out!
I agree with your observations above, @fd0, but I'd like to add that I think there can be another important use case for selectively disabling compression for specific target files/directories/devices, for example when backing up multimedia formats that are already compressed, binary files with high entropy that do not compress well, or when doing incremental backups of encrypted volumes.
@cfcs - ok - from your text I thought you implied you could deduce the data. So for this to make any difference, you would need the original data, which is rare. Regarding non-compressible files, that is the reason I mentioned the constant time Huffman-only mode I put into deflate: as the name implies, it compresses all data at the same speed, and it automatically falls back to storing uncompressed data. So the maximum overhead is about 0.04%, even if the content is already compressed. Here are some benchmarks. The most applicable for backups is probably the "Medium Compression" one, which is 10GB of mixed content - compressible and uncompressible. Defaulting to gzip Huffman-only and having an option of specifying a more CPU-consuming compression level could make sense.
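A minimal sketch of what using such a Huffman-only level looks like, here via the standard library's compress/gzip, which exposes a `HuffmanOnly` level; the modified library mentioned above is intended as a drop-in replacement, so the throughput figures quoted refer to it rather than to the standard library:

```go
package main

import (
	"compress/gzip"
	"io"
	"log"
	"os"
)

func main() {
	// HuffmanOnly skips LZ77 matching entirely, so compression speed is
	// roughly constant regardless of how compressible the input is.
	zw, err := gzip.NewWriterLevel(os.Stdout, gzip.HuffmanOnly)
	if err != nil {
		log.Fatal(err)
	}
	defer zw.Close()

	// Compress whatever arrives on stdin and write it to stdout.
	if _, err := io.Copy(zw, os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```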
In my opinion it could be valuable to selectively enable/disable compression separately for data and tree objects. Especially large tree objects should compress really well.
I looked at some "points" where it could make sense to insert compression. The most transparent and versatile place would be somewhere between the repository and the backend. I briefly looked at modifying Repository.Encrypt and Repository.DecryptTo, but we don't have types there, and the varying sizes would make a mess of things.

My proposition is to implement compression and encryption as "backends" that both write to an underlying backend. This makes compression and encryption transparent to the repository. The reason we need to separate out encryption is that encrypted data doesn't compress (as you probably know).

repository.Repository

The compression algorithm can only be set on initialization, and everything except the configuration is assumed to be compressed with that algorithm.

repository.Config

The configuration cannot be compressed. We add a string that indicates the decompression type to use for everything. Empty ("") means uncompressed. Otherwise it is the last part of the package name of the compression library used. Note that compression levels can be changed between each run/type. There is no problem having a repo where some snapshots/types are deflate level 0 (store) and some are level 9 (best compression), as long as the decompressor is the same.

```
type Config struct {
	Version           uint        `json:"version"`
	ID                string      `json:"id"`
	ChunkerPolynomial chunker.Pol `json:"chunker_polynomial"`
+	Compression       string
}
```

Compression is added as a create parameter:

```
-func CreateConfig(r JSONUnpackedSaver) (Config, error) {
+func CreateConfig(r JSONUnpackedSaver, compression string) (Config, error) {
```

The backend is replaced after LoadConfig/CreateConfig. Here is an example of how it could look:

```
// SearchKey finds a key with the supplied password, afterwards the config is
// read and parsed.
func (r *Repository) SearchKey(password string) error {
	key, err := SearchKey(r, password)
	if err != nil {
		return err
	}

-	r.key = key.master
-	r.keyName = key.Name()
	r.Config, err = LoadConfig(r)
+	r.be, err = FindCompressor(r.Config.Compression, Encryption(key, r.be))
	return err
}
```

Compression Implementation

The compressor can implement selective/adjustable compression for some types. Since it is seen as a "backend", the compressed size will never be visible to the repository. The compressor must be symmetric with all settings.

Issues

HELPME/FIXME: "Packed" files seem like a problem, since encryption re-starts at each file. If encryption is moved to the backend, encryption will cover the entire blob, not each file. I assume that is a problem, and I have no good solution.

TODO: Find some good way for parameters/configuration to be sent to the compressor. Not needed for a first implementation.

TODO: Do we care about the on-disk sizes? If the above is implemented, the repository will not know them.
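To make the chaining idea above more concrete, here is a rough sketch of a compressing backend wrapper. The `Backend` interface shown is a simplified, hypothetical stand-in, not restic's actual backend interface, and deflate is used only as an example algorithm:

```go
package compress

import (
	"bytes"
	"compress/flate"
	"io"
)

// Backend is a simplified, hypothetical stand-in for restic's backend
// interface: blobs are saved and loaded under a string ID.
type Backend interface {
	Save(id string, rd io.Reader) error
	Load(id string) (io.ReadCloser, error)
}

// CompressingBackend deflates data on the way to the underlying backend and
// inflates it on the way back, so the layers above never see compressed sizes.
type CompressingBackend struct {
	be    Backend
	level int // e.g. flate.BestSpeed or flate.BestCompression
}

func New(be Backend, level int) *CompressingBackend {
	return &CompressingBackend{be: be, level: level}
}

func (c *CompressingBackend) Save(id string, rd io.Reader) error {
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, c.level)
	if err != nil {
		return err
	}
	if _, err := io.Copy(zw, rd); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}
	return c.be.Save(id, &buf)
}

func (c *CompressingBackend) Load(id string) (io.ReadCloser, error) {
	rd, err := c.be.Load(id)
	if err != nil {
		return nil, err
	}
	// Any deflate stream can be read back regardless of the level it was
	// written with, so levels may vary per run as long as the format stays
	// the same. (Closing the underlying reader is omitted for brevity.)
	return flate.NewReader(rd), nil
}
```

In the chain sketched above, this wrapper would sit between the repository and the encryption wrapper, since compressing already-encrypted data achieves nothing.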
Thanks for sharing your thoughts, here are mine: I think that compression and encryption need to be integrated with one another; I will think about this. At the moment, I don't like the thought of completely abstracting the compression and encryption layers from one another. As you already described, we must take care not to do stupid things, e.g. compress encrypted data. In addition, we should offer an option to disable compression for speed and/or security reasons.

Regarding encryption: There may be a non-crypto mode for restic later, but for now crypto isn't optional. Also, the packed files are a level of abstraction in themselves (e.g. stuff can be repacked), which doesn't work with abstracting crypto.

Regarding the different compression algorithms and options: I think we should select one algorithm (including a set of parameters) for data, and maybe a second one for compressing the JSON tree structures, but that's it. For all optional things (e.g. different configurable compression algorithms and/or parameters) I'd like to see a sound consideration weighing the benefit against the added complexity and the amount of additional code.

Please don't get me wrong, I'd like to add features to restic, but especially for changes that modify the on-disk format, I need very good arguments. In the general case of adding compression, I can see the benefit, but the complexity and changes to the on-disk format must be manageable.
You'll need different algorithms and parameters (and not just per repo, but even per backup run); there is no "best for every use case" compression. Just as a practical example: I do backups from my company server to my backup server at home, over a DSL uplink with ~700 kbit/s. To the same repository I also back up my laptop when I am at home, where I have a wireless "N" connection. Of course I don't want to slow down the connection by using lzma at a high level there; instead I want something very quick that doesn't slow things down at all, like lz4. I still want compression, though; not using lz4 would take about 2x the space.
That's all bikeshedding imho... Let's pick a reasonable default and not expose too much configuration; that would just introduce lots of complexity for little gain, imho.
Fair enough. I started looking through repository.go and found all the places you would insert a compression/decompression step, and the added complexity was not a good thing. By implementing it as a
Which is why backend chaining makes it so nice. You just leave out the encryption (or create a passthrough backend) and it seamlessly works.
This was the least intrusive way I can see. If there is a way to fix the 'packed file' issue, uncompressed repos remain fully backwards compatible; old clients will be able to operate on them as before alongside new ones. Compressed repos will obviously not, but we can change
Agree. Most algorithms (lzma/deflate) have plenty of flexibility within the same decompression format.
Some useful resources:

- To test compressibility there is DataSmoke: https://github.com/Bulat-Ziganshin/DataSmoke
- pcompress chooses a nice assortment of compression algorithms: https://github.com/moinakg/pcompress
- The squash compression abstraction library has a good list of algorithms and a benchmark: https://quixdb.github.io/squash/
- There is a text compression benchmark here: http://mattmahoney.net/dc/text.html
Hello @joerg, thank you for sharing your tests.
Seeing the test results, it seems that the CPU time needed by restic is 2x longer for local backup and 6x longer on restore. Not really good compared to tar.
Tar is not a compression algorithm; of course it is fast.
Seeing the test results, it seems that the CPU time used by restic is 2x slower in local backup and 6x slower on restore. Not really good compared to Tar.
I’m not completely sure of your point here. restic is slower than Tar, sure, but restic with compression is always faster than restic without, so restic would clearly benefit.
Tar is a useful comparison as a "best case on this hardware" baseline, but it lacks most of restic's other features (snapshots and data deduplication come to mind). Adding compression seems to only improve backup times, restore times, and storage costs, all things which are important for a backup product.
@joerg Can your friends open a Pull Request and make their patches for restic with compression publicly available? Which compression algorithm do they use?
@joerg @thedaveCA
Please mind that we are not using bare tar archives, but gzipped tar ones with a special parallel gzip implementation, otherwise archiving terabytes of data would take days instead of the "just" hours it takes right now: https://gist.github.com/joerg/b88bf1de0ce824894ffc38f597cfef5f#tarpigz
Compression is a no-go for encryption. It lets an attacker guess whether an encrypted repository contains a certain file, as a section (chunk) of a file compresses to the same size independent of the encryption key used. This is a very well known vulnerability of encryption protocols, and that is why compression was removed from TLS 1.3. Let's not create a known problem where there's none, shall we? (I think this problem was already mentioned, and probably not even once. Still this issue is open, and for this reason alone I feel it should be closed once and for all.)
Why are you spamming the issue? :( It has been discussed so many times that it's almost off-topic. You will not be FORCED to enable compression!! Moreover, I think your attack idea requires the attacker to be able to control the data to be compressed and encrypted (I am not sure though!): https://en.m.wikipedia.org/wiki/CRIME But in any case, even if it is a security concern, someone may want to use compression only for storage which is under their own control, simply to save storage space.
Having even an optional feature that weakens the encryption invites a false sense of security. Restic claims to be a secure backup program. Adding optional compression would void this promise, as you can't be secure sometimes; only all the time. And there will be CVE reports, for sure. Who wants that kind of "popularity" for their software? But I think that adding compression in a way that it will never be used together with encryption is a viable option. FWIW, in 2017 I made a demo where I stripped encryption from restic and showed that compression can be very effective, sometimes a hundredfold. IIRC compression can be added as some kind of wrapper, just like encryption, but I haven't looked at the code for a long time, so things may be harder these days, or easier.
Actually, CRIME needs to know the length of the ciphertext, which is basically impossible to obtain in restic.
CRIME needs that, but you don't. Imagine you're an investigative journalist given a set of top-secret files by your source. You back them up with encryption and no one will know you have these files. Now imagine you were not clever enough to disable compression: now everyone else who also happens to have these files can tell, just by judging the sizes of the compressed-then-encrypted chunks, that you have these top-secret files in this archive, without even needing to know the encryption key. This is very far from being secure. People can go to jail because of this "feature", get tortured, or worse.
Restic stores only packed chunks, so the size of chunks is not evident to someone not having the keys.
For those who want to know more about these security concerns, there is a nice paper describing them: http://www.iacr.org/cryptodb/archive/2002/FSE/3091/3091.pdf
That is correct. But if I understand correctly, that won't exactly help with efficient deduplication, since a compression algorithm may use a different vocabulary for each version of a file, resulting in very different binary output, which obviously won't deduplicate. Put otherwise, it only makes sense to compress the resulting chunks.
That's a relief. My point still stands: there are many ways one can add a hidden weakness into a program when it implements compression together with encryption, so best not to add one at all. Even the encryption experts deciding on TLS chose to remove compression; I guess they had similar reasoning.
btw.:
...
That is bullshit, because it only works with a small sample size. It can also still be possible to go to jail without compression, especially at some point in the future, when an attacker who has obtained your backup files might be able to brute-force them.
@sanmai, I do not get this example with
What is meant? That someone can guess that an encrypted snapshot has these files just by looking at the size? This assumes that the files are compressed alone, or together with other known files. But then the same guess can be done with an unencrypted snapshot. Actually, how about gzipping files before backing up? Does this open a security vulnerability too? I think this example is plain nonsense: if you claim you can determine whether a snapshot contains compressed versions of some (arbitrary) files known to you, you could as well determine whether it contains those files uncompressed. I do not believe compression can make encryption significantly less secure.
Most compression side-channel attacks involve several factors:

1. The attacker can make the victim compress data the attacker partially controls, alongside the secret.
2. The attacker can observe the size of the compressed (and encrypted) output.
3. The attacker-controlled data and the secret end up in the same compression context.
4. The attacker can repeat the process many times in quick succession to measure small differences in size.
Unlike web-based systems, in the vast majority of scenarios involving restic backups, (1) and (2) will rarely hold at the same time. Furthermore, for block-based compression (3) is not really guaranteed, and for most backup regimes (4) certainly does not hold. Because backup frequency is usually once a day or so, it would take thousands of years to manipulate data and monitor the compressed output size long enough to notice any significant differences, and that's assuming that no other data is changing, which in most cases it would be. If you were making backups where the output size is visible to an attacker, then you might want to consider disabling compression. Otherwise, there are really no practical attacks against it, and it wouldn't make restic less secure to have it enabled. Restic already does deduplication, which exposes it to the same theoretical attacks as compression side-channels anyway, and nobody has complained about this to my knowledge. The fact is, there are hundreds or thousands of users who would benefit from a compression feature with no downsides whatsoever. Can we please just leave this five-year-old issue to the developers who are working on it?
To be honest, I prefer the concept of restic, but I made tests for my use case (lots of CSV files and SQL dumps) and had to switch to borg. I tested with four generations of incremental backups: my files get a compression ratio of 7:1, and together with deduplication I achieve more than 20:1. I can't ignore that since, as already said, I pay for my online backup storage per GB.
Exactly. Slice a plain-text file into equal pieces, compress them, then encrypt. Slice again, compress and encrypt. As the sizes of encrypted files do not change AES-wise, you would see that in both cases you have a range of sizes that match each other like a fingerprint. They (and by "they" I mean mainly the administrations of oppressive regimes like Iran or Russia) can make a reasonable assumption that these files are present here, which therefore gives a reason, say, to continue torturing the suspect. I don't understand why y'all get so offended by these ideas; aren't they simple to understand? This ain't CRIME per se, is it? But as noted before by @viric, technically restic is not affected by these vulnerabilities, as the sizes of chunks are not visible without an encryption key. Yet if compression is added at some point, restic may not be affected now, but may become affected later.
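As a toy illustration of the "fingerprint" being described, here is a small sketch that slices a file into fixed-size pieces and records the compressed size of each piece. This is not restic's format; the file name, slice size, and use of gzip are illustrative assumptions, and the point is only that a length-preserving cipher would leave this sequence of sizes observable:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"os"
)

// sizeFingerprint slices data into fixed-size pieces, compresses each piece,
// and returns the sequence of compressed sizes. With a length-preserving
// cipher, an observer of the ciphertexts would see the same sequence.
func sizeFingerprint(data []byte, slice int) []int {
	var sizes []int
	for off := 0; off < len(data); off += slice {
		end := off + slice
		if end > len(data) {
			end = len(data)
		}
		var buf bytes.Buffer
		zw := gzip.NewWriter(&buf)
		zw.Write(data[off:end])
		zw.Close()
		sizes = append(sizes, buf.Len())
	}
	return sizes
}

func main() {
	data, err := os.ReadFile("top-secret.txt") // hypothetical input file
	if err != nil {
		panic(err)
	}
	fmt.Println(sizeFingerprint(data, 1<<20))
}
```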
Does adding compression expose Restic to any additional vulnerability, given that it already does deduplication?
If your concern is an attacker guessing at compressed block sizes to infer the uncompressed size, okay, but does compression make this worse? Wouldn't an attacker have the same basic information?
If an attacker could see the uncompressed AND compressed sizes of each file then identification might become more realistic, but this wouldn’t be possible in restic.
Ultimately the de-duplication already exposes you to every theoretical attack that I can see compression having an impact on, plus of course compression can be disabled to maintain the current state of affairs if this fits your situation better.
I simply don't understand why you discuss hypothetical security concerns about guessing a file's presence from the size of an encrypted chunk... You guys use ZIP or GZ? Then you should be fine. You think that Iranian authorities can guess my content by sizes? Then just don't use compression(!). That simply does not mean that compression should not be available.
I think we've covered all relevant angles of adding compression to restic, thank you very much for all your input. I think we should add compression and enable it by default, but allow users to disable compression. Please have patience until I have some more time to work on it. It feels to me this discussion is getting out of hand, so I'm locking this issue for now. If you'd like to continue this discussion, please head over to the forum. Thanks!
Implemented in #3666 🎉
This is a tracking issue for discussions and other issues/PRs related to the request to implement compression.
The following issues/PRs are related to this topic (and may therefore be closed in favor of this one):