Reservoir sampling

Reservoir sampling is a sampling technique used when you want a fixed-sized sample of a dataset with unknown size.

This is a Python implementation of based on this blog, using high-fidelity approximation to the reservoir sampling-gap distribution. This technique is really fast!

$ cat BILLIONS_OF_LINES.txt | ./reservoir.py 1000
#
# Creating a sample of 1000 elements
#
# If bored, you can hit ctrl+C to stop the sampling where it was.
# The signal will be caught and the sample will be printed on stdout.
line_154
line_786
# ... and 998 other lines selected with an equal probability in BILLIONS_OF_LINES.txt

You can get more details on what is going on under the hood:

$ head -n 5 letters.txt
A
B
C
D
E
$ cat letters.txt | ./reservoir.py 6 --verbose
> Adding element nb 0: 'A\n'
> Adding element nb 1: 'B\n'
> Adding element nb 2: 'C\n'
> Adding element nb 3: 'D\n'
> Adding element nb 4: 'E\n'
> Adding element nb 5: 'F\n'
> [p=6/7] Swap element nb 3: 'G\n' replaces 'D\n'
> [p=6/9] Swap element nb 5: 'I\n' replaces 'F\n'
> [p=6/10] Swap element nb 0: 'J\n' replaces 'A\n'
> [p=6/11] Swap element nb 2: 'K\n' replaces 'C\n'
> [p=6/16] Swap element nb 0: 'P\n' replaces 'J\n'
> [p=6/23] Swap element nb 4: 'W\n' replaces 'E\n'
> After skipping 1 line, swap element nb 5: 'Y\n' replaces 'I\n'
P
B
K
G
W
Y

If you want to make tests on bigger data sets, you can build arbitrarily large files using this (600Mo needed!):

for i in `seq 1 100000`; do echo $i ; done > 100_000.txt
for i in `seq 1 1000`; do cat 100_000.txt ; done > 100_000_000.txt

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
letters.txt		letters.txt
reservoir.py		reservoir.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

letters.txt

letters.txt

reservoir.py

reservoir.py

Repository files navigation

Reservoir sampling

About

Releases

Packages

Languages

License

alexprengere/reservoir

Folders and files

Latest commit

History

Repository files navigation

Reservoir sampling

About

Topics

Resources

License

Stars

Watchers

Forks

Languages