quantile is inefficient in common case of CI calculation #84

bkamins · 2021-08-30T18:34:59Z

Example:

julia> x=rand(10^4);

julia> @btime [quantile($x, 0.05), quantile($x, 0.95)]
  160.000 μs (5 allocations: 156.50 KiB)
2-element Vector{Float64}:
 0.04742732947951543
 0.9507927513232511

julia> @btime quantile($x, [0.05, 0.95])
  597.700 μs (4 allocations: 78.39 KiB)
2-element Vector{Float64}:
 0.04742732947951543
 0.9507927513232511

julia> x = rand(10^6);

julia> @btime quantile($x, [0.05, 0.95])
  92.449 ms (4 allocations: 7.63 MiB)
2-element Vector{Float64}:
 0.049906389147197056
 0.9498833775516955

julia> @btime [quantile($x, 0.05), quantile($x, 0.95)]
  18.751 ms (5 allocations: 15.26 MiB)
2-element Vector{Float64}:
 0.049906389147197056
 0.9498833775516955

julia> @btime quantile($x, [0.005, 0.995])
  98.948 ms (4 allocations: 7.63 MiB)
2-element Vector{Float64}:
 0.005030018622636852
 0.9949264399317231

julia> @btime [quantile($x, 0.005), quantile($x, 0.995)]
  17.697 ms (5 allocations: 15.26 MiB)
2-element Vector{Float64}:
 0.005030018622636852
 0.9949264399317231

but

julia> @btime quantile($x, [0.49, 0.51])
  12.591 ms (4 allocations: 7.63 MiB)
2-element Vector{Float64}:
 0.48882655343720904
 0.5090073220557704

julia> @btime [quantile($x, 0.49), quantile($x, 0.51)]
  22.728 ms (5 allocations: 15.26 MiB)
2-element Vector{Float64}:
 0.48882655343720904
 0.5090073220557704

The reason is that doing partial sort twice on tails if we are near the end is faster than doing one partial sort. Unfortunately this is a common case I would assume. Maybe we should introduce some optimization here?

The text was updated successfully, but these errors were encountered:

nalimilan · 2021-09-07T19:24:08Z

The reason is that doing partial sort twice on tails if we are near the end is faster than doing one partial sort

Why is that the case? I understand why there's no performance gain, but I wouldn't have expected it to be slower.

bkamins · 2021-09-07T19:48:55Z

The reason is here https://github.com/JuliaLang/Statistics.jl/blob/54f9b0d999813aa9fab039f632df222ffd2a96a8/src/Statistics.jl#L972

for one value lo is close to hi and you need to sort little if you are near the boundary. But for two values where one is very low and the other is very high you have to sort almost everything.

nalimilan · 2021-09-07T20:21:58Z

Wow, OK. We should probably just call sort! once for each quantile instead (which I thought we were doing). We should still benefit from the fact that we sorted some entries, right? Anyway the most common case is probably not to compute nearby quantiles (rather the contrary).

bkamins · 2021-09-08T06:03:27Z

Yes, but only of 2 quantiles. Probably for more than 2 quantiles it is better to do what we do now I think.

bkamins linked a pull request Sep 8, 2021 that will close this issue

Improve quantile performance #86

Open

nalimilan mentioned this issue Oct 15, 2021

Faster computation of quantiles in describe JuliaData/DataFrames.jl#2909

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quantile is inefficient in common case of CI calculation #84

quantile is inefficient in common case of CI calculation #84

bkamins commented Aug 30, 2021

nalimilan commented Sep 7, 2021

bkamins commented Sep 7, 2021

nalimilan commented Sep 7, 2021

bkamins commented Sep 8, 2021

quantile is inefficient in common case of CI calculation #84

quantile is inefficient in common case of CI calculation #84

Comments

bkamins commented Aug 30, 2021

nalimilan commented Sep 7, 2021

bkamins commented Sep 7, 2021

nalimilan commented Sep 7, 2021

bkamins commented Sep 8, 2021