Faster algorithm for ordered sampling with replacement #913

Tortar · 2024-01-01T21:27:52Z

This is based on a classical result, described for example at https://stats.stackexchange.com/questions/348358/a-fast-uniform-order-statistic-generator (and in the reference cited by the most upvoted answer). I wasn't able to find a reference describing how to modify it for sampling from a finite population, so I adapted it to that case myself.

The performance increase is substantial: more than 5 times faster once the speed-up stabilizes, e.g.

This PR:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  461.274 ns (3 allocations: 2.62 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  3.527 ms (6 allocations: 22.89 MiB)

Main:

julia> using StatsBase, BenchmarkTools

julia> a = [1:1000;];

julia> @btime sample($a, 10^2, ordered=true);
  1.564 μs (2 allocations: 1.75 KiB)

julia> @btime sample($a, 10^6, ordered=true);
  21.273 ms (4 allocations: 15.26 MiB)

The switching point between this algorithm and the one implemented in main is set at k=10, because I found empirically that the timings are almost equal at that point.

Numerically it should be stable enough, but let me know what you think.
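
For readers who want to see the classical result in code, here is a minimal sketch. It is not the code from this PR: the function name `sorted_sample_sketch` is hypothetical, and the step that maps sorted uniforms to positions in a finite population is only one plausible adaptation. It relies on the fact that the running sums of k+1 exponential spacings, divided by their total, are distributed like the order statistics of k uniforms on (0, 1).

```julia
using Random

# Hypothetical helper for illustration only; not the function added by this PR.
function sorted_sample_sketch(rng::AbstractRNG, a::AbstractVector, k::Integer)
    n = length(a)
    # Running sums of k + 1 exponential spacings. Dividing the first k sums
    # by the total gives k values distributed like sorted uniforms on (0, 1).
    s = cumsum(randexp(rng, k + 1))
    total = s[end]
    out = Vector{eltype(a)}(undef, k)
    for j in 1:k
        u = s[j] / total
        # Assumed adaptation to a finite population: map the uniform to a
        # position in 1:n, clamping to guard against rounding at the top end.
        idx = min(n, floor(Int, u * n) + 1)
        out[j] = a[firstindex(a) + idx - 1]
    end
    return out
end

x = sorted_sample_sketch(MersenneTwister(1), collect(1:1000), 100)
```

Because the uniforms are generated already sorted, no sorting pass is needed, so the cost is linear in k.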


Tortar · 2024-01-01T22:39:19Z

The failing test uses the previous algorithm, at lines 83-84 of sampling.jl:

83    aa = Int.(sample(r, 10; ordered=true))
84    check_sample_wrep(aa, (3, 12), 0; ordered=true, rev=rev)

so I think it is unrelated, right? Maybe it is due to the new algorithm consuming random numbers differently.

Tortar · 2024-01-01T23:33:28Z

Actually, it seems to me that the tests are not written very well, because no rng is set: running those tests 100 times sometimes leads to a failure anyway (both before and after this PR). Should we set an rng?

edit: I tried setting one for those tests and it works, but the same happens in other parts of the sampling tests. For example, if I loop 100 times over

direct_sample!([11:20;], zeros(Int, n, 3))
check_sample_wrep(a, (11, 20), 5.0e-3; ordered=false)

at lines 69-70, the tests fail. So I guess setting an rng for everything would be a good idea (and some confidence-interval testing might actually be good practice in this case).
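
A rough sketch of both ideas, a seeded rng plus a confidence-interval style check, could look like the following. The helper `proportions_within_ci` and the width `z = 5.0` are assumptions made for illustration; this is not how `check_sample_wrep` is implemented, and `MersenneTwister` is used here only because it ships with Julia.

```julia
using Random

# Hypothetical check, not StatsBase's check_sample_wrep: every value in `vals`
# should appear with frequency close to 1/length(vals), within z standard errors.
function proportions_within_ci(x, vals; z = 5.0)
    n = length(x)
    p = 1 / length(vals)
    halfwidth = z * sqrt(p * (1 - p) / n)
    return all(v -> abs(count(==(v), x) / n - p) <= halfwidth, vals)
end

rng = MersenneTwister(1)        # fixed seed, so the draw below is reproducible
x = rand(rng, 11:20, 10^5)      # uniform draws with replacement from 11:20
ok = proportions_within_ci(x, 11:20)
```

With a fixed seed the test either always passes or always fails, and a wide interval keeps the chance of a spurious failure small even if the seed or the rng stream changes.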

Tortar · 2024-01-09T15:15:36Z

Closing because I want to do some more experimentation with the algorithm before proposing it; you can find the experiments here: https://github.com/Tortar/SortedRands.jl

edit: I conducted some local tests and everything seems good to me; waiting to hear someone else's opinion :-)

Tortar · 2024-04-12T23:43:18Z

Gentle bump: since #927 was merged, what about taking a look at another speed-up? :D

Tortar · 2024-04-18T10:05:41Z

do you have any more review comments @devmotion? :-)

Tortar added 4 commits January 1, 2024 22:16

Faster algorithm for ordered sampling with replacement

dba8ba3

Update sampling.jl

c922c90

let's keep only uniform_orderstat_sample! to see if it is numerically…

a7ef972

… stable

previous methodology then

b424eaf

Tortar added 5 commits January 1, 2024 23:46

Update sampling.jl

3ae19bf

use better test for small ordered sampling

c5d90b3

Update sampling.jl

e032965

Update sampling.jl

397b8b3

Update sampling.jl

1e9c9ec

try stablerng(1)

2a4683c

Tortar closed this Jan 9, 2024

Tortar deleted the patch-1 branch January 9, 2024 15:40

Tortar restored the patch-1 branch January 14, 2024 01:40

Tortar reopened this Jan 14, 2024

devmotion reviewed Apr 13, 2024

Tortar added 2 commits April 13, 2024 12:06

use cumsum

5367e69

Merge branch 'JuliaStats:master' into patch-1

c0ec10d

Tortar requested a review from devmotion April 15, 2024 10:58
