Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster algorithm for ordered sampling with replacement #913

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

Tortar
Copy link

@Tortar Tortar commented Jan 1, 2024

This is based on a classical result for example described here https://stats.stackexchange.com/questions/348358/a-fast-uniform-order-statistic-generator (and in the reference of the most upvoted answer). I wasn't able to find a reference describing its modification for sampling a finite population, but I adapted it to such a case.

In particular, the performance increase is substantial, more than 5 times faster when the performance increase stabilize, e.g.

This PR:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  461.274 ns (3 allocations: 2.62 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  3.527 ms (6 allocations: 22.89 MiB)

Main:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  1.564 μs (2 allocations: 1.75 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  21.273 ms (4 allocations: 15.26 MiB)

the switching point between this algorithm and the one implemented in main is set at k=10 because I found that empirically at that point the timings were almost equal.

Numerically it should be stable enough, but let me know what you think

@Tortar
Copy link
Author

Tortar commented Jan 1, 2024

Test that is failing is using the previous algorithm at lines 83-84 of sampling.jl

83    aa = Int.(sample(r, 10; ordered=true))
84    check_sample_wrep(aa, (3, 12), 0; ordered=true, rev=rev)

so I think it is unrelated right? It is maybe due to the difference in the random number consumed by the new algorithm

@Tortar
Copy link
Author

Tortar commented Jan 1, 2024

Actually it seems to me that the way tests are written is not very good because no rng is set, indeed trying to run those tests 100 times, sometimes gets to failure anyway (before and after this pr), should we set an rng?

edit: I tried to do it on those tests and it works, but it happens also with other parts of the sampling tests, if I loop over 100 times e.g.

direct_sample!([11:20;], zeros(Int, n, 3))
check_sample_wrep(a, (11, 20), 5.0e-3; ordered=false)

at lines 69-70 tests fails, so I guess establishing a rng for everything should be a good idea (actually maybe some confidence interval testing could be good practice in this case)

@Tortar
Copy link
Author

Tortar commented Jan 9, 2024

closing because I want to do same more experimentation with the algorithm before proposing it, you can find them here:https://github.com/Tortar/SortedRands.jl

edit: I conducted some local tests, everything seems good to me, wait to hear the opinion of someone else :-)

@Tortar Tortar closed this Jan 9, 2024
@Tortar Tortar deleted the patch-1 branch January 9, 2024 15:40
@Tortar Tortar restored the patch-1 branch January 14, 2024 01:40
@Tortar Tortar reopened this Jan 14, 2024
@Tortar
Copy link
Author

Tortar commented Apr 12, 2024

gentle bump, since #927 was merged, what about taking a look at another speed-up? :D

src/sampling.jl Outdated Show resolved Hide resolved
@Tortar Tortar requested a review from devmotion April 15, 2024 10:58
@Tortar
Copy link
Author

Tortar commented Apr 18, 2024

do you have any more review comments @devmotion? :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants