Feature request: Improve block management for uncompressed blocks to save memory and enhance deduplication #139

Open
wychen opened this issue Apr 27, 2023 · 1 comment

wychen commented Apr 27, 2023

I would like to propose optimizing block management for uncompressed blocks in DwarFS. Currently, uncompressed blocks are treated the same way as compressed blocks: they are still loaded into memory, and they are read from disk sequentially, starting at the beginning of the block. This can be inefficient, especially when uncompressed blocks are accessed frequently. Allowing random access to a block without reading everything before the segment we need, or even not loading the block into memory at all, could save a significant amount of private memory.

mmap() could potentially enable efficient random access to uncompressed blocks and possibly eliminate the need to manually load them into memory entirely.
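
To illustrate the idea, here is a minimal sketch of reading one segment out of an uncompressed block through a memory mapping. It is written in Python, whose `mmap` module wraps the same system call, and it is not DwarFS code: `BLOCK_SIZE`, `read_segment_mmap`, and `make_demo_image` are hypothetical names, and the "image" is just a flat file of fixed-size blocks with no headers.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 1 << 16  # hypothetical block size, for illustration only


def read_segment_mmap(path, block_offset, seg_offset, seg_len):
    """Return seg_len bytes located seg_offset bytes into the block
    that starts at block_offset in the file at path."""
    with open(path, "rb") as f:
        # Map the file read-only. Pages are faulted in lazily by the kernel
        # and are backed by the page cache rather than by private memory.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = block_offset + seg_offset
            # Slicing touches only the pages that cover the requested range;
            # nothing before `start` has to be read first.
            return m[start:start + seg_len]


def make_demo_image(num_blocks=4):
    """Write a toy file of uncompressed blocks; block i is filled with byte i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_blocks):
            f.write(bytes([i]) * BLOCK_SIZE)
    return path


if __name__ == "__main__":
    image = make_demo_image()
    try:
        # Read 8 bytes from the middle of block 2 without loading the block.
        print(read_segment_mmap(image, 2 * BLOCK_SIZE, 1000, 8))
    finally:
        os.remove(image)
```

A real implementation would presumably keep the mapping open for the lifetime of the image instead of remapping on every read; the sketch remaps each time only to stay short.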

This feature would also benefit the mkdwarfs process. If uncompressed blocks do not occupy private memory, they would not need to count toward the --max-lookback-blocks (-B) quota, which could effectively enlarge the deduplication lookup window without increasing the memory footprint. This idea is orthogonal to the proposal in #138, and the two methods can be combined to further optimize the deduplication process. Matches in uncompressed blocks could still be extended with byte granularity, since mmap() allows for cheap random access.
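
The quota idea can be sketched as follows. This is only an illustration of the accounting, not how mkdwarfs is implemented: the `LookbackWindow` class, its methods, and the `is_mapped` flag are all hypothetical, and the sketch assumes that a memory-mapped uncompressed block costs no private memory.

```python
from collections import OrderedDict


class LookbackWindow:
    """Toy lookback window in which only blocks held in private memory
    count toward the quota (the role -B plays in the proposal)."""

    def __init__(self, max_private_blocks):
        self.max_private_blocks = max_private_blocks
        self.blocks = OrderedDict()  # block_id -> is_mapped, oldest first

    def _private_count(self):
        return sum(1 for mapped in self.blocks.values() if not mapped)

    def add(self, block_id, is_mapped):
        self.blocks[block_id] = is_mapped
        # Evict the oldest private blocks until the quota is met again.
        # Mapped blocks are never evicted here; a real implementation
        # would likely still want some upper bound on them.
        while self._private_count() > self.max_private_blocks:
            oldest = next(b for b, mapped in self.blocks.items() if not mapped)
            del self.blocks[oldest]

    def searchable(self):
        """Block ids that are still available for deduplication lookups."""
        return list(self.blocks)


if __name__ == "__main__":
    window = LookbackWindow(max_private_blocks=2)
    window.add(0, is_mapped=True)   # uncompressed, mmap-backed
    window.add(1, is_mapped=False)
    window.add(2, is_mapped=False)
    window.add(3, is_mapped=False)  # pushes block 1 out of the window
    window.add(4, is_mapped=True)
    print(window.searchable())
```

With a quota of two private blocks, four blocks remain searchable in this run, because the two mapped blocks are not charged against the quota.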

I hope this proposal makes sense and I look forward to hearing your thoughts on its feasibility.

Owner

mhx commented May 25, 2023

This is a great observation, and for the first case it's trivial to implement. I've got it working in a branch and will push the code once I've got a proper internet connection.
