fix(cmSketch) #326

FrankReh · 2022-12-23T15:28:26Z

Fixes #108.

Actually there is one fix and one improvement.

Fix:

The fix addresses the initial concern raised in #108, namely that there
had been a prior commit that removed the influence of the row index i
on the index used to access the counter field. So while there were four
rows, the results were like there was just one row. The benefit of three
extra rows in the sketch was lost. This commit fixes that by using part
of the spread key differently for each row again. Now the minimum
that is computed by the Estimate function really has four distinct
values to work with, rather than four copies of the same counter value.

Improvement:

A few years ago I had the discussion in #108 and ended up getting xor
and shift operations from the Caffeine author and the
hash-prospector tool to be
able to index into the cmSketch rows much more uniformly, even when the
keys being given had low entropy.

In further testing, this spread the indexes being used much more
evenly, meaning there would be fewer times the counter value had to
be clipped at the max value, and this should logically make the value
returned by Estimate more accurate.

I used the hash-propector tool for 64 bit hashes to come up with the
order of xor and shift operations and the constants to use but given the
size of the search space, of course could not get an exhaustive answer.
And in the years since, the tool has improved. I see a lot of progress
was made for 32 bit hashes. Someone is welcome to look for new constants
for an even better result, or even if they just wanted to document where
the constants come from better than I can now.

One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered. With this fix, the test passes seeming to indicate the seeds are working as intended on the row creation. But there is a problem. A new Sketch test is added that goes to the trouble of sorting the counters of each row to ensure that each row isn't just essentially a copy of another row, where the only difference is which position the counters occupy. And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied to a bitwise XOR operation. These new tests show a problem with the counter increment logic within the cmSketch.Increment method which was most likely introduced by commit f823dc4. A subsequent commit addresses the problems surfaced. But as the discussion from issue dgraph-io#108 shows (discussion later moved to https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712 ), other ideas for incrementing the counters were considered by previous authors as well. Fix existing cmSketch test and add improved cmSketch test (Marked as draft because this introduces a test that fails for now. I can commit a fix to the cmSketch increment method to get the new test to pass - if a maintainer agrees there is a problem to be fixed. See dgraph-io#108. I tried a few years ago.) One Sketch test was intended to check that the seeds had caused each sketch row to be unique but the way the test was iterating, the failure wasn't going to be triggered. With this fix, the test passes seeming to indicate the seeds are working as intended on the row creation. But there is a problem with the actual increment method. A new Sketch test is added that goes further and sorts the counters of each row to ensure that each row isn't just essentially a copy of another row, where the only difference is which position the counters occupy. And the new Sketch test is performed twice, once with high-entropy hashes, because they come from a PRNG, and once with low-entropy hashes, which in many cases will be normal, because they are all small numbers, good for indexing, but not good for getting a spread when simply applied with a bitwise XOR operation. These new tests show a problem with the counter increment logic within the cmSketch.Increment method which was most likely introduced by commit f823dc4. A subsequent commit addresses the problems surfaced. But as the discussion from issue dgraph-io#108 shows (discussion later moved to https://discuss.dgraph.io/t/cmsketch-not-benefitting-from-four-rows/8712), other ideas for incrementing the counters were considered by previous authors of the original Java code as well.

Fixes dgraph-io#108. Actually there is one fix and one improvement. Fix: The fix addresses the initial concern raised in dgraph-io#108, namely that there had been a prior commit that removed the influence of the row index `i` on the index used to access the counter field. So while there were four rows, the results were like there was just one row. The benefit of three extra rows in the sketch was lost. This commit fixes that by using part of the spread key differently for each row again. Now the minimum that is computed by the `Estimate` function really has four distinct values to work with, rather than four copies of the same counter value. Improvement: A few years ago I had the discussion in dgraph-io#108 and ended up getting xor and shift operations from the Caffeine author and the [hash-prospector](https://github.com/skeeto/hash-prospector) tool to be able to index into the cmSketch rows much more uniformly, even when the keys being given had low entropy. In further testing, this spread the indexes being used much more evenly, meaning there would be fewer times the counter value had to be clipped at the max value, and this should logically make the value returned by `Estimate` more accurate. I used the hash-propector tool for 64 bit hashes to come up with the order of xor and shift operations and the constants to use but given the size of the search space, of course could not get an exhaustive answer. And in the years since, the tool has improved. I see a lot of progress was made for 32 bit hashes. Someone is welcome to look for new constants for an even better result, or even if they just wanted to document where the constants come from better than I can now.

FrankReh · 2022-12-23T22:50:22Z

I don't understand the ci Send Coverage failure.

FrankReh · 2023-03-15T21:09:45Z

I guess there is no-one there. Am closing.

But for anyone wanting to pick up this library, know there is still a bug in the cmSketch as I've tried to outline a couple times now.

skrdgraph · 2023-03-16T19:20:28Z

@FrankReh apologies for delays from our end ... I was looking into your CI failures. Could you please rebase your branch & pull the latest changes?

There is particular fix that we did to fix the CI issues: #330 (this one), if you rebase - this should take effect.

Let me know if you have any questions.

skrdgraph · 2023-03-16T19:53:46Z

@FrankReh thanks for reopening the PR & rebasing it, I am approving a CI run on this.

skrdgraph · 2023-03-16T19:54:43Z

Okay - all checks are passing. We will now take a look at your PR and respond back - please be a little patient with us. I am going to add a few folks here.

FrankReh · 2023-03-16T19:58:39Z

Hey. No problem. Thanks for getting the CI issue fixed.

It's been a while since I first found this problem - years ago. And then wanted to raise it again last year.

If someone looking at it wants a different way to approach it, they can take their own branch and change the fours rows in the sketch to one and they'll see they get exactly the same results. The sketch has lost its function because of the xor change that went in a few years ago. If you get that far, then you may want to change it some other way. There are lots of ways of making it functional again.

Thanks.

Yiling-J · 2023-05-04T08:52:07Z

Another optimization to sketch could be cache line fit, this may decrease hit ratio results slightly, around 1% in my Theine benchmark case. But will improve sketch performance when sketch size is very large, 3x faster in my benchmark case.

Here is a modified BenchmarkSketchEstimate case base on my Theine version, which use different size and introduce some randomness to keys:

func BenchmarkSketchEstimate(b *testing.B) {
	for _, size := range []int{10, 100, 1000, 10000, 100000, 1000000, 10000000, 50000000} {
		b.Run(strconv.Itoa(size), func(b *testing.B) {
			s := newCmSketch(int64(size))
			nums := []uint64{}
			for i := 0; i < 100000; i++ {
				h := xxhash.Sum64String(strconv.Itoa(rand.Intn(100000)))
				nums = append(nums, h)
			}
			b.ResetTimer()
			for n := 0; n < b.N; n++ {
				s.Estimate(nums[n%100000])
			}
		})
	}
}

And here is the results:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkSketchEstimate/10-12   132780583                8.706 ns/op
BenchmarkSketchEstimate/100-12  129838420                9.102 ns/op
BenchmarkSketchEstimate/1000-12                 139203577                8.629 ns/op
BenchmarkSketchEstimate/10000-12                139754286                8.552 ns/op
BenchmarkSketchEstimate/100000-12               100000000               10.33 ns/op
BenchmarkSketchEstimate/1000000-12              94156653                12.80 ns/op
BenchmarkSketchEstimate/10000000-12             22597843                56.92 ns/op
BenchmarkSketchEstimate/50000000-12             17342504                75.80 ns/op

FrankReh · 2023-05-04T12:29:12Z

Theine

That is interesting but I have to say, given how long I believe this problem has existed in the Ristretto code base now, and nobody has yet found the time to investigate the claim I made back in 2019, I would rather not muddy the waters on this one particular issue. I was guilty of trying to propose a solution that probably made the issue more complicated than it should have been.

I found the problem by just reading the open source and when I stumbled on this function, cmSketch::Estimate, I wondered how the val gotten from the first row would ever be different from the val from the 2nd, 3rd or 4th row so I added some logic to printout when they were different. They weren't. The logic was busted by the PR I've listed before: f823dc4. In particular, this diff:

// Estimate returns the value of the specified key.
 func (s *cmSketch) Estimate(hashed uint64) int64 {
-       l, r := uint32(hashed), uint32(hashed>>32)
        min := byte(255)
        for i := range s.rows {
-               // find the smallest counter value from all the rows
-               if v := s.rows[i].get((l + uint32(i)*r) & s.mask); v < min {
-                       min = v
+               val := s.rows[i].get((hashed ^ s.seed[i]) & s.mask)
+               if val < min {
+                       min = val
                }
        }

That changed code is still there, it's still wrong.

ristretto/sketch.go

Lines 72 to 82 in 34fd2d3

    
           // Estimate returns the value of the specified key. 
        
           func (s *cmSketch) Estimate(hashed uint64) int64 { 
        
           	min := byte(255) 
        
           	for i := range s.rows { 
        
           		val := s.rows[i].get((hashed ^ s.seed[i]) & s.mask) 
        
           		if val < min { 
        
           			min = val 
        
           		} 
        
           	} 
        
           	return int64(min) 
        
           }

The rows[i].get takes an index and returns a value. The broken code doesn't use all the bits for determining which index to use, it just uses the same low bits (the ones that fit into s.mask) over and over, four times. It is slightly obfuscated by the fact the low bits that are used get moved around by a different xor value for each row, but that's doesn't change the fact that most of the input hash bits are being ignored. The original code tried to use all the hash bits in determining the index. One could argue it didn't do a great job but at least it was an attempt.

If folks are happy with the sketch results for the last four years, they might as well change the number of rows from 4 to 1, saving time and space.

Yiling-J · 2023-05-04T12:55:22Z

I think Caffeine's sketch(new cache line fit version) is very good. For people who want to fork ristretto and do some changes, maybe just copying Caffeine's sketch is enough. BTW @FrankReh did you run some hit ratio benchmarks? Because I can't reproduce ristretto's benchmark results in README, I want to know if some one also try to do that

FrankReh · 2023-05-04T13:11:58Z

did you run some hit ratio benchmarks?

yes, I did. But I can't speak to reproducing benchmark results so I can't help with you wanting to do an apples to apples comparison. I figure code changes faster than READMEs.

The current implementation doesn't benefit from four rows in the sketch - might as well make it one, or fix the indexing scheme. That's the only angle I can help with here.

ben-manes · 2023-05-04T19:14:47Z

@vmg would probably be interested in this issue since Vitess integrated a fork of this cache and it contains this deficiency. This sketch is not based on Caffeine's (either its original or cache aligned version), with the former helping @FrankReh when working on this patch.

FrankReh · 2023-05-05T15:49:29Z

Thanks for the link to Caffeine's original @ben-manes .

For the maintainer's, I think it shows that the original Caffeine logic for determining the sketch index to use for each row (item is the hash, i is the row):

https://github.com/ben-manes/caffeine/blob/a6be555f0f2a44d33a9d7e52ea923622e373ac7f/caffeine/src/main/java/com/github/benmanes/caffeine/cache/FrequencySketch.java#L190-L194

used the entire hash for each row (primarily through the multiplication operator), not just the same portion of the hash.

From that, it is easier to see how the mistake crept in. Someone could rationalize that XOR-ing the row seed is more efficient than adding the seed and then multiplying by the seed (and then even adding the high 32 bits to the low 32 bits), before masking for the table size. I think the multiplication step in particular is what gets all the hash bits used in determining which hash bits end up being at the low end when masked for the table size.

I'm guilty again of over complicating the explanation I think. To the maintainers, satisfy yourselves that there is a deficiency by actually checking whether the final min value is ever different from the value pulled out of the first row. You will find that they are always the same, so the three extra rows are adding nothing but taking time and space (meaning that the frequency being reported for a given item will be on the high side, unnecessarily). To get the benefit of four sketch rows, there are many calculations possible, each with speed and uniformity tradeoffs.

vmg · 2023-05-10T16:04:24Z

@ben-manes: Really appreciate the heads up. Will review this PR and possibly backport into Vitess this week. 🙇

ben-manes · 2023-05-10T16:45:19Z

great @vmg. I would recommend you prefer Caffeine's (either original or cache aligned), where @Yiling-J has a Go version that he uses (original and cache aligned). @FrankReh's fixes look great, just that probably fewer eyes and testing has been done on Ristretto's leading to these mistakes.

FrankReh added 2 commits December 23, 2022 07:53

FrankReh requested review from akon-dey, joshua-goldstein and skrdgraph as code owners December 23, 2022 15:28

FrankReh closed this Mar 15, 2023

Merge branch 'main'

70ae355

FrankReh reopened this Mar 16, 2023

skrdgraph requested review from all-seeing-code, harshil-goel and mangalaman93 March 16, 2023 19:55

FrankReh mentioned this pull request Mar 3, 2024

sketch bug from 2019 hasn't been fixed outcaste-io/ristretto#12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cmSketch) #326

fix(cmSketch) #326

FrankReh commented Dec 23, 2022

FrankReh commented Dec 23, 2022

FrankReh commented Mar 15, 2023

skrdgraph commented Mar 16, 2023

skrdgraph commented Mar 16, 2023

skrdgraph commented Mar 16, 2023

FrankReh commented Mar 16, 2023

Yiling-J commented May 4, 2023

FrankReh commented May 4, 2023

Yiling-J commented May 4, 2023

FrankReh commented May 4, 2023

ben-manes commented May 4, 2023 •

edited

FrankReh commented May 5, 2023

vmg commented May 10, 2023

ben-manes commented May 10, 2023

fix(cmSketch) #326

Are you sure you want to change the base?

fix(cmSketch) #326

Conversation

FrankReh commented Dec 23, 2022

FrankReh commented Dec 23, 2022

FrankReh commented Mar 15, 2023

skrdgraph commented Mar 16, 2023

skrdgraph commented Mar 16, 2023

skrdgraph commented Mar 16, 2023

FrankReh commented Mar 16, 2023

Yiling-J commented May 4, 2023

FrankReh commented May 4, 2023

Yiling-J commented May 4, 2023

FrankReh commented May 4, 2023

ben-manes commented May 4, 2023 • edited

FrankReh commented May 5, 2023

vmg commented May 10, 2023

ben-manes commented May 10, 2023

ben-manes commented May 4, 2023 •

edited