Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probabilistic Data Structures #95

Open
3 of 5 tasks
joshday opened this issue Nov 22, 2017 · 11 comments
Open
3 of 5 tasks

Probabilistic Data Structures #95

joshday opened this issue Nov 22, 2017 · 11 comments

Comments

@joshday
Copy link
Owner

joshday commented Nov 22, 2017

  • HyperLogLog
  • HyperLogLog ++
  • Count Min Sketch
  • Bloom Filter
  • MinHash
@joshday
Copy link
Owner Author

joshday commented May 24, 2022

Sure! That looks very interesting. I'll give the paper a closer look, but I'm not sure if I'll have the bandwidth to work on implementing it for a while.

@tdunning
Copy link

Let me see if I can get somewhere quickly from my Java version.

@tdunning
Copy link

So I have a working version for the cdf function. The quantile function isn't there yet.

It needs more tests. And it needs to conform to your API expectations. And I would like to add in a logHistogram implementation as well.

How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.

I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).

Is this what you are expecting?

@joshday
Copy link
Owner Author

joshday commented May 27, 2022

How did you imagine this should work? Right now I have a fit!(MergingDigest, Vector{<:Number}) method (and a non vector parallel) that adds data to the digest and a cdf(MergingDigest, value::Number) call to compute the empirical CDF of a particular value. I plan to add a quantile(MergingDigest, q::Number) version.

Check out the OnlineStat interface in the README here: https://github.com/joshday/OnlineStatsBase.jl. For quantile, just make sure you are extending the Statistics.quantile method. For cdf, we can't export it in order to avoid a conflict with Distributions.cdf, but we can point out that it exists as OnlineStats.cdf in the docs.

I currently have a length(MergingDigest) function, but that seems to be inconvenient on balance because it force a user to wrap a digest with Ref whenever broadcasting is used. I may retreat to a samples(MergingDigest) method or just document the idiom length(m.sketch).

As long as MergingDigest is a subtype of OnlineStat, you shouldn't need to worry about broadcasting/Refs, since this lives in OnlineStatsBase.jl:

Broadcast.broadcastable(o::OnlineStat) = Ref(o)

@tdunning
Copy link

That's very helpful. In the short-term, even before I step up to integrating I will add the broadcastable snippet directly.

Your comment about cdf being a collision makes me wonder if I shouldn't make a MergingDigest be a Distribution somehow. There is no real reason I couldn't support sampling from the empirical distribution that way.

@tdunning
Copy link

In fact, anything that supports a quantile operation can be sampled.

@joshday
Copy link
Owner Author

joshday commented May 27, 2022

In fact, anything that supports a quantile operation can be sampled.

A package that implements this would be pretty cool. e.g.

x = QuantileSampler(thing_with_quantile_method)

rand(x)

But for all I know this already exists in Distributions.jl or elsewhere.

Edit: Maybe never mind. You're thinking of this?:

quantile(thing, rand())

@tdunning
Copy link

See https://github.com/tdunning/TDigest/ for a beginning.

I think this does everything needed, but it isn't tested hard enough to make me super confident of that.

@tdunning
Copy link

Comments and criticisms are VERY welcome

@tdunning
Copy link

There are some issues still on the TDigest.jl largely around sort stability. Secondary problem is lack of mind share on my part due to day job.

Not surprisingly, the Julia implementation is simpler than the Java version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants