
Feature request: Supporting deduplication for reference window larger than main memory #138

Open
wychen opened this issue Apr 27, 2023 · 0 comments


wychen commented Apr 27, 2023

I would like to propose a new feature that enhances the deduplication capabilities of DwarFS for large datasets when memory is limited. Currently, the --max-lookback-blocks (-B) mechanism requires reference blocks to be kept in memory. For datasets much larger than the available memory, however, keeping the entire dataset in memory is not practical, so deduplication opportunities can be missed because only a limited number of reference blocks are available. I understand that similarity sorting already keeps similar files close together to improve the compression ratio and duplicate matching, but duplicated segments can still fall outside the window of -B blocks. For example, file ordering cannot help with deduplication within a single huge file.

To address this issue, I propose an additional deduplication mechanism with the following approach:

  1. Before a reference block is discarded due to the -B capacity constraint, generate cryptographic (non-cyclic) hashes aligned with the saved cyclic hashes.

  2. When a cyclic hash match is found:
    a. If the reference is still within the -B in-memory reference blocks, keep the current behavior (extend the matching section byte-by-byte in both directions).
    b. If the reference is outside of the -B in-memory blocks, use the cryptographic hash to validate the match and extend the matching section in units of window_step rather than bytes, since the raw data of the reference is no longer available.

  This approach would allow deduplicating across a much larger window while maintaining a reasonable memory footprint (hashes instead of raw data outside the -B blocks), a near-zero false positive rate (on the order of 2^-128 per comparison with a 128-bit cryptographic hash), and only slightly coarser matching (window_step granularity instead of byte granularity). A rough sketch of the idea follows below.
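To make the idea more concrete, here is a minimal hypothetical sketch in Python (not DwarFS code; window_size, window_step, the blake2b stand-in for the cyclic hash, and the linear scan over in-memory blocks are all simplifying assumptions of mine):

```python
# Hypothetical sketch, not DwarFS code. window_size, window_step and the
# blake2b stand-in for the real rolling hash are illustrative assumptions.
import hashlib
from collections import deque

window_size = 4096       # bytes covered by one hash window
window_step = 1024       # spacing of recorded windows
max_lookback_blocks = 2  # stand-in for -B / --max-lookback-blocks

in_memory = deque()      # (block_id, raw bytes) of the newest -B reference blocks
evicted_index = {}       # weak hash -> [(block_id, offset, sha256 digest), ...]

def weak_hash(window: bytes) -> bytes:
    # Placeholder for the cyclic (rolling) hash used by the segmenter.
    return hashlib.blake2b(window, digest_size=4).digest()

def add_reference_block(block_id: int, data: bytes) -> None:
    in_memory.append((block_id, data))
    if len(in_memory) > max_lookback_blocks:
        # Step 1 of the proposal: before dropping the block, keep only its
        # window-aligned weak and cryptographic hashes.
        old_id, old_data = in_memory.popleft()
        for off in range(0, len(old_data) - window_size + 1, window_step):
            win = old_data[off:off + window_size]
            evicted_index.setdefault(weak_hash(win), []).append(
                (old_id, off, hashlib.sha256(win).digest()))

def find_match(window: bytes):
    """Step 2: in-memory hits allow byte-level extension; evicted hits are
    validated by SHA-256 and can only be extended in window_step units."""
    # 2a. In-memory reference blocks: raw data is available.
    for block_id, data in in_memory:
        off = data.find(window)
        if off != -1:
            return ("in-memory", block_id, off)
    # 2b. Evicted blocks: confirm the weak-hash hit with the strong hash.
    strong = hashlib.sha256(window).digest()
    for block_id, off, digest in evicted_index.get(weak_hash(window), []):
        if digest == strong:
            return ("hash-only", block_id, off)
    return None
```

The point of the sketch is only the split in find_match(): hits within the -B window can still be extended byte-by-byte against the raw data, while hits against evicted blocks are validated by their stored SHA-256 digest and can only be extended one window_step at a time.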

For these not-in-memory reference blocks, we may want to use a sparser window_step for both the cyclic and non-cyclic hashes to lower the memory footprint and reduce matching overhead. We might also want to use a longer cryptographic hash function, such as the SHA-256 used by ZFS, to further reduce the probability of false positives.
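To give a feel for the trade-off behind the window_step choice, here is a back-of-the-envelope sizing of such a hash-only index (all numbers are made up for illustration; the 48-byte entry assumes an 8-byte cyclic hash, a 32-byte SHA-256 digest, and 8 bytes for block id plus offset):

```python
# Illustrative sizing only; the dataset size, step sizes and the 48-byte
# entry layout (8-byte cyclic hash + 32-byte SHA-256 + 8-byte block/offset)
# are assumptions, not DwarFS numbers.
dataset_bytes = 4 * 2**40   # 4 TiB of reference data outside the -B window
entry_bytes = 8 + 32 + 8

for step in (4 * 2**10, 16 * 2**10, 64 * 2**10, 256 * 2**10):
    entries = dataset_bytes // step
    print(f"window_step={step // 2**10:>3} KiB -> {entries:>13,} entries, "
          f"~{entries * entry_bytes / 2**30:5.1f} GiB of hash index")
```

Even a modest increase of window_step shrinks the hash-only index dramatically, which is why a sparser step for these blocks seems reasonable.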

Implementing this feature could make it possible to use a reference window much larger than main memory, potentially improving deduplication effectiveness for large datasets on systems with limited memory. I'm just thinking out loud here, and I look forward to hearing your thoughts on this proposal and discussing its feasibility.
