Observations and Recommendations using this chunker. #8

Open
dbaarda opened this issue Feb 17, 2021 · 0 comments

Comments


dbaarda commented Feb 17, 2021

This is not so much a bug as a set of general observations and recommendations about using this chunker. I'm putting it here since this chunker is widely copied by chunker implementations in other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go; apologies if this is annoying. It comes out of some testing and analysis of chunking algorithms I did here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Observations related to this FastCDC implementation are:

  1. The average argument is not really the average block length. The actual average is very close to minimum + average/2, for reasons explained below.
  2. This implementation changes the FastCDC "transition point" to max(0, average - 1.5*minimum), which means the transition point falls at or below minimum whenever minimum >= 0.4*average.
  3. FastCDC gets most of its speed advantage from "cut-point-skipping" past minimum, and recommends minimum = 0.5*average or even minimum = average. So for the minimum settings FastCDC recommends for good speed, the first mask is never used and only the second mask is used.
  4. When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it then behaves like a plain exponential chunker with average set to half the value. An exponential chunker's actual average block length (excluding truncation at maximum) is minimum + average, hence this chunker's actual average of minimum + average/2, provided maximum is set large enough to make truncations rare (see the sketch after this list).
  5. Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when compared at the same actual average and minimum chunk sizes.
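To make observations 1, 2, and 4 concrete, here is a minimal Python sketch. It is not this library's code; it assumes a rolling hash whose masked bits are uniformly random per byte, so that in the only-second-mask regime each byte past the minimum matches the second (one bit easier) mask with probability about 2/average:

```python
import math
import random

def transition_point(minimum, average):
    # The modified "transition point" used by this implementation
    # (observation 2). It falls at or below minimum, disabling the
    # first mask entirely, whenever minimum >= 0.4*average.
    return max(0, average - 1.5 * minimum)

def sample_chunk_length(minimum, average, maximum):
    # Model of the only-second-mask regime: past the cut-point-skipped
    # minimum, each byte matches the second mask with probability
    # p ~= 2/average, so the distance to the cut is geometric with
    # mean average/2 (assumption: uniform random hash bits).
    p = 2.0 / average
    run = int(math.log(1.0 - random.random()) / math.log(1.0 - p)) + 1
    return min(minimum + run, maximum)

def mean_chunk_length(minimum, average, maximum, samples=100_000):
    # Monte Carlo estimate of the actual mean chunk length.
    return sum(sample_chunk_length(minimum, average, maximum)
               for _ in range(samples)) / samples

# With minimum=48KiB, average=64KiB, maximum=256KiB, the estimate comes
# out near minimum + average/2 = 80KiB, not the nominal 64KiB:
print(transition_point(49152, 65536))           # 0: first mask unused
print(mean_chunk_length(49152, 65536, 262144))  # ~81920
```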

With all of that in mind, based on my testing, the best minimum/average/maximum settings for this chunker for optimal deduplication with good speed are minimum=average/2 and maximum>2.5*average, giving an actual average chunk length of average. For faster chunking with slightly reduced deduplication, use minimum=average and maximum>3*average, giving an actual average chunk length of 1.5*average. Unfortunately this chunker throws an error if minimum >= average, so it doesn't let you do that directly, but you can get away with setting minimum = average - 1 (see the sketch below).
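If it helps, here is a purely illustrative helper that turns the actual mean chunk length you want into minimum/average/maximum arguments following the advice above. The name chunker_args is hypothetical and not part of this chunker's API:

```python
def chunker_args(target_mean, fast=False):
    # Hypothetical helper, not part of this library's API: derive the
    # minimum/average/maximum arguments from the actual mean chunk
    # length wanted, per the recommendations above.
    if fast:
        # minimum=average would give an actual mean of 1.5*average,
        # but the chunker rejects minimum >= average, so back off to
        # minimum = average - 1.
        average = (2 * target_mean) // 3
        return average - 1, average, 4 * average  # maximum > 3*average
    # minimum=average/2 gives an actual mean equal to `average`.
    average = target_mean
    return average // 2, average, 3 * average  # maximum > 2.5*average

# chunker_args(65536)            -> (32768, 65536, 196608)
# chunker_args(65536, fast=True) -> (43689, 43690, 174760)
```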
