htimur/stream-sampler

#Stream Sampler

Stream sampler that picks a random (representative) sample of size k from a stream of values with unknown and possibly very large length.

###Reservoir Sampling

Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn't fit into main memory. The most common example was labelled Algorithm R by Jeffrey Vitter in his paper on the subject. (Source: Wikipedia)
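
As an illustration of the idea, here is a minimal Python sketch of Algorithm R. It is not this project's PHP implementation; the function name `reservoir_sample` and its parameters are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from stream.

    Hypothetical helper for illustration only (Algorithm R).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the item is kept only if the slot
            # falls inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample("THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG", 5)` returns a list of five characters. The stream is read once and only k items are held in memory, so the total length never has to be known in advance.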



##Usage

####Prerequisites

This project requires PHP 5.5 or later.

It has been tested with PHP 5.5, PHP 5.6, and HHVM.

####Parameters

| Name | Description |
| --- | --- |
| `--length`, `-k` | Sample length (default: 5) |
| `--input`, `-i` | Input data (string or URL) |
| `--random`, `-r` | Use a random string as input |

    $ bin/sampler -r
    $ bin/sampler -i THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG
    $ bin/sampler -i "http://www.random.org/strings/?num=100&len=8&digits=on&upperalpha=on&loweralpha=on&unique=on&format=plain&rnd=new"
    $ curl -s "http://www.random.org/strings/?num=100&len=8&digits=on&upperalpha=on&loweralpha=on&unique=on&format=plain&rnd=new" | bin/sampler --length=20

##Tests

You can run the tests locally with:

    $ phpunit

##TODO

Implement a strategy that switches to a different algorithm when k >= len(S) / 3 (reservoir sampling is inefficient in this case, and more suitable algorithms exist).
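
The TODO does not name the alternative algorithm. As one possible candidate, the Python sketch below uses selection sampling (Knuth's Algorithm S), which assumes the input length is known up front. The names `prefers_alternative` and `selection_sample` are hypothetical, and whether this is actually faster for a given input would need measuring.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def prefers_alternative(k: int, n: int) -> bool:
    """Threshold from the TODO: k >= n / 3, written without division."""
    return 3 * k >= n


def selection_sample(
    items: Sequence[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick exactly k of the n items in one pass, preserving input order.

    Assumption: len(items) is known, unlike in the streaming case.
    """
    rng = rng or random.Random()
    n = len(items)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(items)")
    chosen: List[T] = []
    for seen, item in enumerate(items):
        needed = k - len(chosen)
        if needed == 0:
            break  # sample complete, the rest need not be read
        # Keep the item with probability needed / (items not yet seen).
        if rng.random() * (n - seen) < needed:
            chosen.append(item)
    return chosen
```

A caller could check `prefers_alternative(k, n)` first and fall back to reservoir sampling when it returns `False` or when the length is unknown.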

##Feedback

Open an issue, send a PR, or drop me an email. I would love to hear from you!
