Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cmSketch) #326

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

FrankReh
Copy link

Fixes #108.

Actually there is one fix and one improvement.

Fix:

The fix addresses the initial concern raised in #108, namely that there
had been a prior commit that removed the influence of the row index i
on the index used to access the counter field. So while there were four
rows, the results were like there was just one row. The benefit of three
extra rows in the sketch was lost. This commit fixes that by using part
of the spread key differently for each row again. Now the minimum
that is computed by the Estimate function really has four distinct
values to work with, rather than four copies of the same counter value.

Improvement:

A few years ago I had the discussion in #108 and ended up getting xor
and shift operations from the Caffeine author and the
hash-prospector tool to be
able to index into the cmSketch rows much more uniformly, even when the
keys being given had low entropy.

In further testing, this spread the indexes being used much more
evenly, meaning there would be fewer times the counter value had to
be clipped at the max value, and this should logically make the value
returned by Estimate more accurate.

I used the hash-propector tool for 64 bit hashes to come up with the
order of xor and shift operations and the constants to use but given the
size of the search space, of course could not get an exhaustive answer.
And in the years since, the tool has improved. I see a lot of progress
was made for 32 bit hashes. Someone is welcome to look for new constants
for an even better result, or even if they just wanted to document where
the constants come from better than I can now.

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem.

A new Sketch test is added that goes to the trouble of sorting the
counters of each row to ensure that each row isn't just essentially a
copy of another row, where the only difference is which position the
counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
to a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ),
    other ideas for incrementing the counters were considered
    by previous authors as well.

Fix existing cmSketch test and add improved cmSketch test

(Marked as draft because this introduces a test that fails for now. I can
commit a fix to the cmSketch increment method to get the new test to
pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108.
I tried a few years ago.)

One Sketch test was intended to check that the seeds had caused each
sketch row to be unique but the way the test was iterating, the failure
wasn't going to be triggered.

    With this fix, the test passes seeming to indicate the seeds are
    working as intended on the row creation. But there is a problem with
    the actual increment method. A new Sketch test is added that goes
    further and sorts the counters of each row to ensure that each row
    isn't just essentially a copy of another row, where the only
    difference is which position the counters occupy.

And the new Sketch test is performed twice, once with high-entropy
hashes, because they come from a PRNG, and once with low-entropy hashes,
which in many cases will be normal, because they are all small numbers,
good for indexing, but not good for getting a spread when simply applied
with a bitwise XOR operation.

    These new tests show a problem with the counter increment logic
    within the cmSketch.Increment method which was most likely
    introduced by commit f823dc4.

    A subsequent commit addresses the problems surfaced. But as the
    discussion from issue dgraph-io#108 shows (discussion later moved to
    https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712),
    other ideas for incrementing the counters were considered by
    previous authors of the original Java code as well.
Fixes dgraph-io#108.

Actually there is one fix and one improvement.

Fix:

The fix addresses the initial concern raised in dgraph-io#108, namely that there
had been a prior commit that removed the influence of the row index `i`
on the index used to access the counter field. So while there were four
rows, the results were like there was just one row. The benefit of three
extra rows in the sketch was lost. This commit fixes that by using part
of the spread key differently for each row again. Now the minimum
that is computed by the `Estimate` function really has four distinct
values to work with, rather than four copies of the same counter value.

Improvement:

A few years ago I had the discussion in dgraph-io#108 and ended up getting xor
and shift operations from the Caffeine author and the
[hash-prospector](https://github.com/skeeto/hash-prospector) tool to be
able to index into the cmSketch rows much more uniformly, even when the
keys being given had low entropy.

In further testing, this spread the indexes being used much more
evenly, meaning there would be fewer times the counter value had to
be clipped at the max value, and this should logically make the value
returned by `Estimate` more accurate.

I used the hash-propector tool for 64 bit hashes to come up with the
order of xor and shift operations and the constants to use but given the
size of the search space, of course could not get an exhaustive answer.
And in the years since, the tool has improved. I see a lot of progress
was made for 32 bit hashes. Someone is welcome to look for new constants
for an even better result, or even if they just wanted to document where
the constants come from better than I can now.
@FrankReh
Copy link
Author

I don't understand the ci Send Coverage failure.

@FrankReh
Copy link
Author

I guess there is no-one there. Am closing.

But for anyone wanting to pick up this library, know there is still a bug in the cmSketch as I've tried to outline a couple times now.

@FrankReh FrankReh closed this Mar 15, 2023
@skrdgraph
Copy link

@FrankReh apologies for delays from our end ... I was looking into your CI failures. Could you please rebase your branch & pull the latest changes?

There is particular fix that we did to fix the CI issues: #330 (this one), if you rebase - this should take effect.

Let me know if you have any questions.

@FrankReh FrankReh reopened this Mar 16, 2023
@skrdgraph
Copy link

@FrankReh thanks for reopening the PR & rebasing it, I am approving a CI run on this.

@skrdgraph
Copy link

Okay - all checks are passing. We will now take a look at your PR and respond back - please be a little patient with us. I am going to add a few folks here.

@FrankReh
Copy link
Author

Hey. No problem. Thanks for getting the CI issue fixed.

It's been a while since I first found this problem - years ago. And then wanted to raise it again last year.

If someone looking at it wants a different way to approach it, they can take their own branch and change the fours rows in the sketch to one and they'll see they get exactly the same results. The sketch has lost its function because of the xor change that went in a few years ago. If you get that far, then you may want to change it some other way. There are lots of ways of making it functional again.

Thanks.

@Yiling-J
Copy link

Yiling-J commented May 4, 2023

Another optimization to sketch could be cache line fit, this may decrease hit ratio results slightly, around 1% in my Theine benchmark case. But will improve sketch performance when sketch size is very large, 3x faster in my benchmark case.

Here is a modified BenchmarkSketchEstimate case base on my Theine version, which use different size and introduce some randomness to keys:

func BenchmarkSketchEstimate(b *testing.B) {
	for _, size := range []int{10, 100, 1000, 10000, 100000, 1000000, 10000000, 50000000} {
		b.Run(strconv.Itoa(size), func(b *testing.B) {
			s := newCmSketch(int64(size))
			nums := []uint64{}
			for i := 0; i < 100000; i++ {
				h := xxhash.Sum64String(strconv.Itoa(rand.Intn(100000)))
				nums = append(nums, h)
			}
			b.ResetTimer()
			for n := 0; n < b.N; n++ {
				s.Estimate(nums[n%100000])
			}
		})
	}
}

And here is the results:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkSketchEstimate/10-12   132780583                8.706 ns/op
BenchmarkSketchEstimate/100-12  129838420                9.102 ns/op
BenchmarkSketchEstimate/1000-12                 139203577                8.629 ns/op
BenchmarkSketchEstimate/10000-12                139754286                8.552 ns/op
BenchmarkSketchEstimate/100000-12               100000000               10.33 ns/op
BenchmarkSketchEstimate/1000000-12              94156653                12.80 ns/op
BenchmarkSketchEstimate/10000000-12             22597843                56.92 ns/op
BenchmarkSketchEstimate/50000000-12             17342504                75.80 ns/op

@FrankReh
Copy link
Author

FrankReh commented May 4, 2023

Theine

That is interesting but I have to say, given how long I believe this problem has existed in the Ristretto code base now, and nobody has yet found the time to investigate the claim I made back in 2019, I would rather not muddy the waters on this one particular issue. I was guilty of trying to propose a solution that probably made the issue more complicated than it should have been.

I found the problem by just reading the open source and when I stumbled on this function, cmSketch::Estimate, I wondered how the val gotten from the first row would ever be different from the val from the 2nd, 3rd or 4th row so I added some logic to printout when they were different. They weren't. The logic was busted by the PR I've listed before: f823dc4. In particular, this diff:

// Estimate returns the value of the specified key.
 func (s *cmSketch) Estimate(hashed uint64) int64 {
-       l, r := uint32(hashed), uint32(hashed>>32)
        min := byte(255)
        for i := range s.rows {
-               // find the smallest counter value from all the rows
-               if v := s.rows[i].get((l + uint32(i)*r) & s.mask); v < min {
-                       min = v
+               val := s.rows[i].get((hashed ^ s.seed[i]) & s.mask)
+               if val < min {
+                       min = val
                }
        }

That changed code is still there, it's still wrong.

ristretto/sketch.go

Lines 72 to 82 in 34fd2d3

// Estimate returns the value of the specified key.
func (s *cmSketch) Estimate(hashed uint64) int64 {
min := byte(255)
for i := range s.rows {
val := s.rows[i].get((hashed ^ s.seed[i]) & s.mask)
if val < min {
min = val
}
}
return int64(min)
}

The rows[i].get takes an index and returns a value. The broken code doesn't use all the bits for determining which index to use, it just uses the same low bits (the ones that fit into s.mask) over and over, four times. It is slightly obfuscated by the fact the low bits that are used get moved around by a different xor value for each row, but that's doesn't change the fact that most of the input hash bits are being ignored. The original code tried to use all the hash bits in determining the index. One could argue it didn't do a great job but at least it was an attempt.

If folks are happy with the sketch results for the last four years, they might as well change the number of rows from 4 to 1, saving time and space.

@Yiling-J
Copy link

Yiling-J commented May 4, 2023

I think Caffeine's sketch(new cache line fit version) is very good. For people who want to fork ristretto and do some changes, maybe just copying Caffeine's sketch is enough. BTW @FrankReh did you run some hit ratio benchmarks? Because I can't reproduce ristretto's benchmark results in README, I want to know if some one also try to do that

@FrankReh
Copy link
Author

FrankReh commented May 4, 2023

did you run some hit ratio benchmarks?

yes, I did. But I can't speak to reproducing benchmark results so I can't help with you wanting to do an apples to apples comparison. I figure code changes faster than READMEs.

The current implementation doesn't benefit from four rows in the sketch - might as well make it one, or fix the indexing scheme. That's the only angle I can help with here.

@ben-manes
Copy link

ben-manes commented May 4, 2023

@vmg would probably be interested in this issue since Vitess integrated a fork of this cache and it contains this deficiency. This sketch is not based on Caffeine's (either its original or cache aligned version), with the former helping @FrankReh when working on this patch.

@FrankReh
Copy link
Author

FrankReh commented May 5, 2023

Thanks for the link to Caffeine's original @ben-manes .

For the maintainer's, I think it shows that the original Caffeine logic for determining the sketch index to use for each row (item is the hash, i is the row):

https://github.com/ben-manes/caffeine/blob/a6be555f0f2a44d33a9d7e52ea923622e373ac7f/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java#L190-L194

used the entire hash for each row (primarily through the multiplication operator), not just the same portion of the hash.

From that, it is easier to see how the mistake crept in. Someone could rationalize that XOR-ing the row seed is more efficient than adding the seed and then multiplying by the seed (and then even adding the high 32 bits to the low 32 bits), before masking for the table size. I think the multiplication step in particular is what gets all the hash bits used in determining which hash bits end up being at the low end when masked for the table size.


I'm guilty again of over complicating the explanation I think. To the maintainers, satisfy yourselves that there is a deficiency by actually checking whether the final min value is ever different from the value pulled out of the first row. You will find that they are always the same, so the three extra rows are adding nothing but taking time and space (meaning that the frequency being reported for a given item will be on the high side, unnecessarily). To get the benefit of four sketch rows, there are many calculations possible, each with speed and uniformity tradeoffs.

@vmg
Copy link
Contributor

vmg commented May 10, 2023

@ben-manes: Really appreciate the heads up. Will review this PR and possibly backport into Vitess this week. 🙇

@ben-manes
Copy link

great @vmg. I would recommend you prefer Caffeine's (either original or cache aligned), where @Yiling-J has a Go version that he uses (original and cache aligned). @FrankReh's fixes look great, just that probably fewer eyes and testing has been done on Ristretto's leading to these mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cmSketch not benefitting from four rows
5 participants