Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cmSketch test) and add better cmSketch test #325

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

FrankReh
Copy link

One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered.

With this fix, the test passes seeming to indicate the seeds are
working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the counters of each row to ensure that each row isn't just essentially a copy of another row, where the only difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied to a bitwise XOR operation.

These new tests show a problem with the counter increment logic
within the cmSketch.Increment method which was most likely
introduced by commit f823dc4a.

A subsequent commit addresses the problems surfaced. But as the
discussion from issue #108 shows (discussion later moved to
https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
other ideas for incrementing the counters were considered
by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can commit a fix to the cmSketch increment method to get the new test to pass - if a maintainer agrees there is a problem to be fixed. See #108. I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered.

With this fix, the test passes seeming to indicate the seeds are
working as intended on the row creation. But there is a problem with
the actual increment method. A new Sketch test is added that goes
further and sorts the counters of each row to ensure that each row
isn't just essentially a copy of another row, where the only
difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied with a bitwise XOR operation.

These new tests show a problem with the counter increment logic
within the cmSketch.Increment method which was most likely
introduced by commit f823dc4a.

A subsequent commit addresses the problems surfaced. But as the
discussion from issue #108 shows (discussion later moved to
https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
other ideas for incrementing the counters were considered by
previous authors of the original Java code as well.

Problem

Solution

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the
counters of each row to ensure that each row isn't just essentially a
copy of another row, where the only difference is which position the
counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
to a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
    other ideas for incrementing the counters were considered
    by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can
commit a fix to the cmSketch increment method to get the new test to
pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108.
I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem with
    the actual increment method. A new Sketch test is added that goes
    further and sorts the counters of each row to ensure that each row
    isn't just essentially a copy of another row, where the only
    difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
with a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
    other ideas for incrementing the counters were considered by
    previous authors of the original Java code as well.
@CLAassistant
Copy link

CLAassistant commented Dec 23, 2022

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants