Skip to content

ivoras/goestimators

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Fast Estimator Algorithms for Go: (Hyper|Super)?LogLog, Bloom filter

HyperLogLog (and LogLog and SuperLogLog) for Go

This algorithms in this family implement fast approximate counting of arbitrary data (here implemented as binary keys), with a trivially small memory footprint. Their workings are similar enough that the only point of difference is the summarisation step, where the approximation is calculated from internal data. The same internal data can be used for both LogLog, SuperLogLog and HyperLogLog approximation.

The number of LogLog buckets (which affects accuracy, but soon saturates to diminishing returns) is adjustable, with 1024 being the recommended default, and 32 being the smallest practical size. This number must be an integral power of 2.

This library was implemented from scratch, by interpreting the following papers and tutorials:

Any implementation errors are probably my own.

This library uses the Murmur3 hash function implementation from github.com/spaolacci/murmur3.

Example

ll, _ := NewLogLog(1024)
var buf := make([]byte, 8)
// Add 100000 random b-byte buffers to the set
for i := 0; i < 100000; i++ {
	rand.Read(buf)
	ll.Observe(buf)  // Observes the buffer, i.e. updates internal representation from it
}
est := ll.Estimate() // Returns the estimated number of unique observed buffers
fmt.Println("Estimate of a set of 100k random entries: ", est)

This example uses the plain old LogLog estimation algorithm, implemented in the Estimate() function. All three algorithms can be used from the same observation results:

  • Estimate() - LogLog (fastest)
  • SuperEstimate() - SuperLogLog
  • HyperEstimate() - HyperLogLog (slowest)

Performance

Measured on an i5-5200U, the benchmark results (using 1024 buckets) are:

BenchmarkLogLog-4                        2000000               712 ns/op
BenchmarkSuperLogLog-4                   2000000               763 ns/op
BenchmarkHyperLogLog-4                    500000              2367 ns/op
BenchmarkObservationLogLog-4            20000000                61.6 ns/op

Since HyperLogLog improves accuracy only slightly compared to LogLog and SuperLogLog, users should decide for themselves if they can accept the hit in performance.

Accuracy

A typical run of the test functions (using 1024 buckets) looks like this:

LogLogRandom_1M 918170
SuperLogLogRandom_1M 932190
HyperLogLogRandom_1M 1059335

These are the estimates for observing 1M sequential 8-byte buffers (i.e. reasonably unique) by the respective algorithms. YMMV.

Bloom filter

"Bloom filters" test for the absence of an element in a large set. If a Bloom filter returns negative, the element certainly isn't in the set. If it returns positive, the element might be in the set. Memory usage for this is trivially small.

This algorithm was implemented from scratch, mostly by reading the Wikipedia article. It uses the Murmur3 function for the 64-bit hash implementation.

For performance reasons, there are only two supported Bloom filter sizes: 256-bit (which is 32 bytes), and 65536-bit (8 KiB). The 256-bit Bloom filter implements two hash functions (k=2), and the 65536-bit one implements four hash functions (k=4).

Example

const N = 100
b := NewBloom(goestimators.Size256)

// Add N random 64-bit numbers to the set
var i uint64
buf := make([]byte, 8)
for i = 0; i < N; i++ {
	uint64ToBytes(i, buf)
	b.Observe(buf)
}

// Test for existance of N+10 elements
notfound := 0
for i = 0; i < N+10; i++ {
	uint64ToBytes(i, buf)
	if !b.Check(buf) {
		notfound++
	}
}

About

Fast Cardinality Estimation algorithms for Go: HyperLogLog/SuperLogLog/LogLog, Bloom

Topics

Resources

License

Stars

Watchers

Forks

Languages