Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a faster and safer implementation of alias_sample! #927

Merged
merged 3 commits into from
Apr 12, 2024

Conversation

LilithHafner
Copy link
Contributor

Safer:

Before

julia> StatsBase.alias_sample!(rand(10), weights(randn(10)), rand(10))
10-element Vector{Float64}:
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.1984287484280587
 0.5676653052762575
 0.1984287484280587
 0.8567391687334422

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10))
ERROR: BoundsError: attempt to access 10-element Vector{Float64} at index [281471800382896] # This came from reading undef memory
Stacktrace:
 [1] throw_boundserror(A::Vector{Float64}, I::Tuple{Int64})
   @ Base ./essentials.jl:14
 [2] getindex
   @ ./essentials.jl:891 [inlined]
 [3] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Int64, Int64, Vector{Int64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/packages/StatsBase/ebrT3/src/sampling.jl:729
 [4] top-level scope
   @ REPL[10]:1

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10)) # Got "lucky" this time
10-element Vector{Float64}:
 0.07577419536126007
 0.9233876530591941
 0.1530016664475169
 0.07577419536126007
 0.07577419536126007
 0.3159766423652197
 0.1530016664475169
 0.6780730450968911
 0.01012788415877619
 0.6780730450968911

After

julia> StatsBase.alias_sample!(rand(10), weights(randn(10)), rand(10))
ERROR: ArgumentError: found negative weight -0.1164833812103052
Stacktrace:
 [1] checked_sum
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:437 [inlined]
 [2] AliasTables.AliasTable{UInt64, Int64}(weights::Weights{Float64, Float64, Vector{Float64}}; _normalize::Bool)
   @ AliasTables ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:81
 [3] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:78 [inlined]
 [4] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:76 [inlined]
 [5] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Float64, Float64, Vector{Float64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/dev/StatsBase/src/sampling.jl:719
 [6] top-level scope
   @ REPL[33]:1

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10))
ERROR: ArgumentError: all weights are zero
Stacktrace:
 [1] checked_sum
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:418 [inlined]
 [2] AliasTables.AliasTable{UInt64, Int64}(weights::Weights{Int64, Int64, Vector{Int64}}; _normalize::Bool)
   @ AliasTables ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:81
 [3] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:78 [inlined]
 [4] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:76 [inlined]
 [5] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Int64, Int64, Vector{Int64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/dev/StatsBase/src/sampling.jl:719
 [6] top-level scope
   @ REPL[38]:1

Faster (benchmarks from #695 (comment))

Before

julia> using Chairmarks

julia> @b sample(1:5030, StatsBase.Weights(rand(Float32,5030)), 141230, replace=true)
1.558 ms (19 allocs: 1.251 MiB)

julia> @b sample(1:5030, StatsBase.Weights(rand(Float64,5030)), 141230, replace=true)
1.575 ms (19 allocs: 1.270 MiB)

After

julia> @b sample(1:5030, StatsBase.Weights(rand(Float32,5030)), 141230, replace=true)
294.460 μs (12 allocs: 1.260 MiB)

julia> @b sample(1:5030, StatsBase.Weights(rand(Float64,5030)), 141230, replace=true)
296.418 μs (12 allocs: 1.280 MiB)

Closes #630 (AliasTables.jl uses @inbounds in sampling and contains a correctness proof that relies only on local information and basic properties of unsigned integers)

Closes #916 by making alias_sample! much faster than even using ifelse would. See https://aliastables.lilithhafner.com/dev/#Implementation-details for how.

See also: JuliaStats/Distributions.jl#1848

cc @devmotion

ap = Vector{Float64}(undef, n)
alias = Vector{Int}(undef, n)
make_alias_table!(wv, sum(wv), ap, alias)
at = AliasTable(wv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there is no reason to keep make_alias_table! (defined above) around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the public API of StatsBase, no, there's no reason. However there are some usages downstream of that internal function. Most notably, Distributions.jl, which would break if we remove make_alias_table! before we merge JuliaStats/Distributions.jl#1848. Once Distributions.jl no longer depends on this, I think it's worth removing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems Distributions might be the only more widely used package that relies on these internals? I was aware of potential problems with using an internal function when writing ConsistencyResampling, and would be fine with breaking it (and probably switching to AliasTables) 😛 But given the impact on Distributions, maybe it would be safer to deprecate make_alias_table! and remove it in a future breaking release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it would even be acceptable to deprecate it and remove it in a future non-breaking release once Distributions has a few releases out that don't use it.

Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me but let's wait a bit so others have time to chime in.

@LilithHafner
Copy link
Contributor Author

Bump, is there anyone in particular or some channel we should ping for feedback?

@devmotion
Copy link
Member

AFAIK there's no standard rule to follow. I just wanted to wait a few days in case someone has additional comments, but it seems nobody objected to the PR. IMO the PR is a clear improvement so I'll merge and leave further improvements or comments for future PRs.

@devmotion devmotion merged commit be809a8 into JuliaStats:master Apr 12, 2024
16 of 18 checks passed
@LilithHafner LilithHafner deleted the lh/alias-tables branch April 12, 2024 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

alias_sample! can be faster Add @inbounds in alias table sampling
2 participants