
Duplicacy #14

Open
blackbit47 opened this issue Sep 7, 2022 · 2 comments

blackbit47 commented Sep 7, 2022

Hi @deajan,
Awesome work!

I took a look at your script and I have a few suggestions:

1-For backup and restore commands, please use the -threads option with 8 threads for your setup. It will significantly increase speed.

Increase -threads from 8 until you saturate the network link or see a decrease in speed.
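As a sketch, the flag would be passed like this (the revision number is a placeholder; check `duplicacy help backup` for the exact flags on your version):

```shell
# Hypothetical invocations; tune the thread count to your link and CPU.
duplicacy backup -stats -threads 8
duplicacy restore -r 1 -threads 8
```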

2-During init, please experiment with the chunk size:

-chunk-size, -c the average size of chunks (default is 4M)
-max-chunk-size, -max the maximum size of chunks (default is chunk-size*4)
-min-chunk-size, -min the minimum size of chunks (default is chunk-size/4)

With homogeneous data, you should see smaller backups and better deduplication. See the Chunk size details page.
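For example, a smaller 1M average chunk (with correspondingly scaled min/max) would be chosen at init time. The values, snapshot id, and storage URL below are placeholders, not recommendations:

```shell
# Chunk parameters are fixed when the storage is initialized and cannot
# be changed afterwards, so benchmark before settling on values.
duplicacy init -c 1M -min 256K -max 4M my-snapshot-id b2://my-bucket
```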

3-Some clarifications for your shopping list on Duplicacy:

1-Redundant index copies: Duplicacy doesn't use indexes (or a database).
2-Continue restore on bad blocks in repository: yes; see also Erasure Coding
3-Data checksumming: yes
4-Backup mounting as filesystem: No (there is a FUSE implementation PR, but it is unlikely to land short term)
5-File includes / excludes based on regexes: yes
6-Automatically excludes CACHEDIR.TAG(3) directories: No
7-Is metadata encrypted too?: yes
8-Can encrypted / compressed data be guessed (CRIME/BREACH style attacks)?: No
9-Can a compromised client delete backups?: No (with a public key and an immutable target; requires target setup)
10-Can a compromised client restore encrypted data?: No (with a public key)
11-Does the backup software support pre/post execution hooks?: yes, see Pre Command and Post Command Scripts
12-Does the backup software provide a crypto benchmark?: yes, there is a benchmark command.
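For reference, the benchmark is run from inside an initialized repository (see `duplicacy benchmark -h` for its tunables):

```shell
# Measures local disk read/write, chunk split/encrypt throughput, and
# upload/download speed against the configured storage backend.
duplicacy benchmark
```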

Important:

13-Duplicacy is serverless: less cost, less maintenance, less attack surface.
This also means that Duplicacy will always be a bit slower, since it has to do a storage listing before it uploads a particular chunk.
14-Duplicacy works with a ton of storage backends: infinitely scalable and more secure.
15-No indexes or databases.


16-You should test partial restores.
17-Test data should be a little more diverse. But I guess this is difficult.
Hope this helps a bit. Feel free to join the Forum.

Keep up the good work.

deajan added a commit that referenced this issue Sep 7, 2022

deajan commented Sep 7, 2022

I've updated the comparison table with your remarks.

13- Duplicacy is serverless: Less cost, less maintenance, less attack surface.
14- Duplicacy works with a ton of storage backends: Infinitely scalable and more secure.

Does Duplicacy have a preferred self-hosted backend?

15-No indexes or databases.

I'm a bit puzzled. Since there are data chunks, there needs to be a description somewhere of what they are linked to... something like an index...?

For now, I've added the -threads option for the next test round.

If I go the chunk size route, I'll have to do this for all backup solutions.

@blackbit47 (Author) commented:

Hi,

Indeed, the lack of an index or db is one of the most amazing design features of Duplicacy.
Let me quote from the Lock free deduplication algorithm page:

"What is novel about lock-free deduplication is the absence of a centralized indexing database for tracking all existing chunks and for determining which chunks are not needed any more. Instead, to check if a chunk has already been uploaded before, one can just perform a file lookup via the file storage API using the file name derived from the hash of the chunk. This effectively turns a cloud storage offering only a very limited set of basic file operations into a powerful modern backup backend capable of both block-level and file-level deduplication. More importantly, the absence of a centralized indexing database means that there is no need to implement a distributed locking mechanism on top of the file storage."
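The idea in that quote can be sketched in a few lines of shell (a simplified illustration, not Duplicacy's actual code): name each chunk after its content hash, and the "index lookup" becomes a plain file-existence check against the storage.

```shell
storage=$(mktemp -d)   # stands in for the remote file storage

upload_chunk() {
  # Content-addressed name, derived from the SHA-256 of the chunk data.
  name=$(printf '%s' "$1" | sha256sum | awk '{print $1}')
  if [ -e "$storage/$name" ]; then
    echo "skip $name"            # already uploaded: deduplicated for free
  else
    printf '%s' "$1" > "$storage/$name"
    echo "upload $name"
  fi
}

upload_chunk "chunk A"   # uploaded
upload_chunk "chunk B"   # uploaded
upload_chunk "chunk A"   # skipped -- no index consulted, just a file lookup
```

Because the check is a stateless storage lookup, several clients can back up to the same storage concurrently without coordinating through a shared database, which is why no distributed locking is needed.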
