Skip to content

A stream sampler extracts one or more sample sets, each with a given number of elements, from a stream. Each possible sample set (of the given size) has an equal probability of being extracted. A stream sampler is an online algorithm: The size of the input is unknown, and only one pass over the stream is possible.

License

LiorKogan/StreamSampler

Repository files navigation

StreamSampler

Header-only C++11 library

Copyright 2015 Lior Kogan (koganlior1@gmail.com)

Released under the Apache License, Version 2.0

--

A stream is a sequence of data elements made available over time. The number of elements in the stream is usually large and unknown a priori. A stream sampler maintains an up-to-date one or more sample sets, each has a fixed-size number of elements.

A stream sampler is an online algorithm: The size of the input is unknown, and only one pass over the stream is possible. The sample sets are always up-to-date.

We concentrate here on simple random sampling: for each sample set, each stream element (since the start of stream) has an exactly equal chance of being selected.

The following seven unweighted sampling without replacement reservoir randomized algorithms are implemented:

Algorithm R is the standard 'textbook algorithm'. Algorithms X,Y,Z,K,L and M offer huge performance improvement by drawing the number of stream elements to skip at each stage, so much less random numbers are generated, especially for very large streams. Z,K,L and M are typically 100's of times faster than R, while M is usually the most performant.

In all these papers, the algorithms were formulated such that the fetching of elements from the stream is controlled by the algorithm (An external function, GetNextElement(), is called from within the algorithms). Such flow-control is generally less suitable for real-world scenarios. In this implementation, the algorithms were reformulated such that a process can fetch elements from the stream, and a member-function of the stream sampler class (AddElement) should be called.

This implementation also extends the algorithms by supporting simultaneous extraction of any given number of independent sample sets.

Two versions of AddElement are implemented: one using copy semantics (AddElement(const ElementType& Element)) and one using move semantics (AddElement(ElementType&& Element)). Both functions return the number of future stream elements the caller should skip before calling AddElement again.

StreamSamplerTest contains a usage example: StreamSamplerExample(), a comparative performance benchmark function StreamSamplerPerformanceBenchmark() and a uniformity test function StreamSamplerTestUniformity().

About

A stream sampler extracts one or more sample sets, each with a given number of elements, from a stream. Each possible sample set (of the given size) has an equal probability of being extracted. A stream sampler is an online algorithm: The size of the input is unknown, and only one pass over the stream is possible.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published