
Feature request: Supporting deduplication for reference window larger than main memory #138

Open
wychen opened this issue Apr 27, 2023 · 0 comments


wychen commented Apr 27, 2023

I would like to propose a new feature that enhances the deduplication capabilities of DwarFS for large datasets when memory is limited. Currently, the --max-lookback-blocks (-B) mechanism requires reference blocks to be kept in memory. For datasets much larger than the available memory, however, keeping the entire dataset in memory is not practical, so deduplication opportunities can be missed because only a limited number of reference blocks are available. I understand that similarity sorting already keeps similar files close together to improve the compression ratio and duplicate matching, but duplicated segments can still fall outside the window of -B blocks. For example, file ordering cannot help with deduplication within a single huge file.

To address this issue, I propose an additional deduplication mechanism with the following approach:

  1. Before a reference block is discarded due to the -B capacity constraint, generate cryptographic (non-cyclic) hashes aligned with the saved cyclic hashes.

  2. When a cyclic hash match is found:
    a. If the reference is still within the -B in-memory reference blocks, keep the current behavior (extend the matching section byte-by-byte in both directions).
    b. If the reference is outside of the -B in-memory blocks, use the cryptographic hash to validate the match and extend the matching section in units of window_step rather than bytes, since the raw data of the reference is no longer available.

  This approach would allow deduplicating across a much larger window while maintaining a reasonable memory footprint (hashes instead of raw data outside the -B blocks), a near-zero false positive rate (on the order of 2^-128 per comparison with a 128-bit cryptographic hash), and only slightly coarser matching (window_step granularity instead of byte granularity). A rough sketch of the idea follows below.
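To make the idea more concrete, here is a minimal hypothetical sketch in Python (not DwarFS code; window_size, window_step, the blake2b stand-in for the cyclic hash, and the linear scan over in-memory blocks are all simplifying assumptions of mine):

```python
# Hypothetical sketch, not DwarFS code. window_size, window_step and the
# blake2b stand-in for the real rolling hash are illustrative assumptions.
import hashlib
from collections import deque

window_size = 4096       # bytes covered by one hash window
window_step = 1024       # spacing of recorded windows
max_lookback_blocks = 2  # stand-in for -B / --max-lookback-blocks

in_memory = deque()      # (block_id, raw bytes) of the newest -B reference blocks
evicted_index = {}       # weak hash -> [(block_id, offset, sha256 digest), ...]

def weak_hash(window: bytes) -> bytes:
    # Placeholder for the cyclic (rolling) hash used by the segmenter.
    return hashlib.blake2b(window, digest_size=4).digest()

def add_reference_block(block_id: int, data: bytes) -> None:
    in_memory.append((block_id, data))
    if len(in_memory) > max_lookback_blocks:
        # Step 1 of the proposal: before dropping the block, keep only its
        # window-aligned weak and cryptographic hashes.
        old_id, old_data = in_memory.popleft()
        for off in range(0, len(old_data) - window_size + 1, window_step):
            win = old_data[off:off + window_size]
            evicted_index.setdefault(weak_hash(win), []).append(
                (old_id, off, hashlib.sha256(win).digest()))

def find_match(window: bytes):
    """Step 2: in-memory hits allow byte-level extension; evicted hits are
    validated by SHA-256 and can only be extended in window_step units."""
    # 2a. In-memory reference blocks: raw data is available.
    for block_id, data in in_memory:
        off = data.find(window)
        if off != -1:
            return ("in-memory", block_id, off)
    # 2b. Evicted blocks: confirm the weak-hash hit with the strong hash.
    strong = hashlib.sha256(window).digest()
    for block_id, off, digest in evicted_index.get(weak_hash(window), []):
        if digest == strong:
            return ("hash-only", block_id, off)
    return None
```

The point of the sketch is only the split in find_match(): hits within the -B window can still be extended byte-by-byte against the raw data, while hits against evicted blocks are validated by their stored SHA-256 digest and can only be extended one window_step at a time.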

For these not-in-memory reference blocks, we may want to use a sparser window_step for both the cyclic and non-cyclic hashes to lower the memory footprint and reduce matching overhead. We might also want to use a longer cryptographic hash function, such as the SHA-256 used by ZFS, to further reduce the probability of false positives.
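To give a feel for the trade-off behind the window_step choice, here is a back-of-the-envelope sizing of such a hash-only index (all numbers are made up for illustration; the 48-byte entry assumes an 8-byte cyclic hash, a 32-byte SHA-256 digest, and 8 bytes for block id plus offset):

```python
# Illustrative sizing only; the dataset size, step sizes and the 48-byte
# entry layout (8-byte cyclic hash + 32-byte SHA-256 + 8-byte block/offset)
# are assumptions, not DwarFS numbers.
dataset_bytes = 4 * 2**40   # 4 TiB of reference data outside the -B window
entry_bytes = 8 + 32 + 8

for step in (4 * 2**10, 16 * 2**10, 64 * 2**10, 256 * 2**10):
    entries = dataset_bytes // step
    print(f"window_step={step // 2**10:>3} KiB -> {entries:>13,} entries, "
          f"~{entries * entry_bytes / 2**30:5.1f} GiB of hash index")
```

Even a modest increase of window_step shrinks the hash-only index dramatically, which is why a sparser step for these blocks seems reasonable.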

Implementing this feature could make it possible to use a reference window much larger than main memory, potentially improving deduplication effectiveness for large datasets on systems with limited memory. I'm just thinking out loud here, and I look forward to hearing your thoughts on this proposal and discussing its feasibility.
