Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SparseHll inserts incorrect value for high number of leading zeros #56

Open
jonhehir opened this issue Jun 24, 2022 · 0 comments
Open

Comments

@jonhehir
Copy link

jonhehir commented Jun 24, 2022

This is an edge case that may not be worth fixing.

SparseHll has a small bug that causes the wrong bucket value to be recorded when the number of leading zeros is within 6 of the maximum possible number. For example, with 4096 buckets (indexBitLength = 12), the maximum possible bucket value is 64 - indexBitLength + 1 = 53 (52 leading zeros). Values up to 46 are correctly recorded, but attempting to insert a hash with more 46 or more leading zeros records an incorrect value. This is reflected in the added unit tests in #55.

I have not thoroughly investigated why this is, but I believe the implementation of HyperLogLog++ in SparseHll differs slightly from that laid out in the original paper. (See in particular the encoding and decoding of hashes in the sparse setting, and note that 6 is equal to VALUE_BITS in SparseHll.) I suspect fixing this bug would require making breaking changes to the format of SparseHll. Meanwhile, the probability of this bug occurring is very remote, so its effects are virtually nil. Nonetheless, I figured it would be worthwhile to document this behavior for any future reference.

jonhehir added a commit to jonhehir/airlift that referenced this issue Jun 24, 2022
This commit makes no functional changes and only adds tests. Beyond merely improving test
coverage, this commit serves as partial documentation of one minor (but surprising) edge case
(prestodb#56) and as verification of behavior that was contested in prestodb#42.
highker pushed a commit that referenced this issue Jun 26, 2022
This commit makes no functional changes and only adds tests. Beyond merely improving test
coverage, this commit serves as partial documentation of one minor (but surprising) edge case
(#56) and as verification of behavior that was contested in #42.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant