Parallelizing `BallTree` Construction #132

SebastianAment · 2022-01-07T19:14:43Z

Overview

This PR parallelizes the construction of BallTree structures, achieving a speedup of roughly a factor of 5 (161.2 s down to 33.5 s) for n = 1_000_000 points with 8 threads.

The implementation uses @spawn and @sync, which requires raising the Julia compatibility entry to 1.3 and incrementing the minor version of this package.
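As a rough illustration of the `@spawn`/`@sync` pattern mentioned above, here is a minimal sketch of a recursive divide-and-conquer routine that spawns tasks only for large subproblems. It is not the code in this PR: `build_sketch!`, its arguments, and the threshold value are hypothetical stand-ins for the recursive tree build, and `@spawn` is assumed to be `Threads.@spawn`.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for a recursive tree build: each call handles the
# index range low:high, splits it in half, and recurses on both halves.
function build_sketch!(out::Vector{Int}, low::Int, high::Int, parallel_size::Int)
    if low == high
        out[low] = low   # leaf work: every index is written by exactly one task
        return nothing
    end
    mid = (low + high) ÷ 2
    if high - low + 1 >= parallel_size
        # Large subproblem: run the two halves as separate tasks and wait for both.
        @sync begin
            @spawn build_sketch!(out, low, mid, parallel_size)
            @spawn build_sketch!(out, mid + 1, high, parallel_size)
        end
    else
        # Small subproblem: task overhead would dominate, so recurse serially.
        build_sketch!(out, low, mid, parallel_size)
        build_sketch!(out, mid + 1, high, parallel_size)
    end
    return nothing
end

out = zeros(Int, 10_000)
build_sketch!(out, 1, length(out), 1024)
```

The two halves touch disjoint index ranges, so the tasks need no locking. The number of threads is fixed when Julia starts (for example with `julia -t 8`).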

Benchmarks

Setup

using NearestNeighbors
using BenchmarkTools
d = 100

On Master

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  1.244 ms (23 allocations: 174.83 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  372.398 ms (26 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  7.989 s (26 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  161.170 s (26 allocations: 1.66 GiB)

With this PR (updated after further edits with improved allocations)

n = 100;
X = randn(d, n);
@btime T = BallTree(X);
  813.417 μs (244 allocations: 189.97 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X);
  101.158 ms (25348 allocations: 18.70 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X);
  2.816 s (253697 allocations: 187.03 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X);
  33.461 s (2527680 allocations: 2.13 GiB)

Further, the PR still allows for sequential execution with the parallel = false keyword:

n = 100;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  1.090 ms (24 allocations: 174.06 KiB)

n = 10_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  362.205 ms (27 allocations: 16.95 MiB)

n = 100_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  8.262 s (27 allocations: 169.53 MiB)

n = 1_000_000;
X = randn(d, n);
@btime T = BallTree(X, parallel = false);
  150.437 s (25 allocations: 1.66 GiB)

Summary

The parallel implementation yields a speedup even for small datasets of n = 100 data points, and achieves a speedup of roughly a factor of 3 (7.989 s down to 2.816 s) for n = 100_000 points.
Compared to the sequential code, memory allocation is up by about 10% in size for n up to 100_000 and by about 28% for n = 1_000_000, and considerably in number. This is because the parallel code needs to allocate temporary arrays to avoid race conditions, while the sequential code reuses a single temporary. If allocations, rather than execution speed, are the concern, one can always use the parallel = false flag this PR provides.
The sequential option parallel = false maintains the same allocation behavior as the master branch and comparable performance. Notably, the sequential branch of this PR is consistently 20% faster than master on the n = 100 test case.

The experiments were run on a 2021 MacBook Pro with an M1 Pro and 8 threads.
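The ratios quoted in the summary can be recomputed from the listed timings. The snippet below only copies the `@btime` results from the benchmark listings and divides them; the variable names are chosen here for illustration.

```julia
# Timings in seconds, copied from the benchmark listings above.
n          = [100, 10_000, 100_000, 1_000_000]
master     = [1.244e-3, 372.398e-3, 7.989, 161.170]
parallel   = [813.417e-6, 101.158e-3, 2.816, 33.461]
sequential = [1.090e-3, 362.205e-3, 8.262, 150.437]

# Speedup relative to master (values above 1 mean faster than master).
speedup_parallel   = master ./ parallel
speedup_sequential = master ./ sequential

for i in eachindex(n)
    println("n = ", n[i],
            ": parallel ", round(speedup_parallel[i], digits = 2), "x",
            ", sequential ", round(speedup_sequential[i], digits = 2), "x")
end
```

These are single `@btime` results per configuration, so small differences between the sequential run and master should be read with some caution.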

KristofferC

Thanks for working on this.

> The parallel implementation yields a speedup even for small datasets of n = 100 data points,

But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?

Since the structure of creating a BallTree and a KDTree is pretty much the same, could the same approach be applied there?

You seem to have an extra commit not related to the tree building in this PR.

KristofferC · 2022-01-17T12:51:58Z

src/hyperspheres.jl

@@ -88,3 +57,14 @@ function create_bsphere(m::Metric,

    return HyperSphere(SVector{N,T}(center), rad)
 end
+
+@inline function interpolate(::M, c1::V, c2::V, x, d) where {V <: AbstractVector, M <: NormMetric}


Why move this function?

SebastianAment · 2022-01-17

I had two versions locally: the previous one, and this one without the array buffer variable ab. It turns out that in the sequential code, the compiler is able to get rid of the allocations without explicitly pre-allocating an ArrayBuffer variable. In the parallel code, sharing a single array buffer across tasks leads to race conditions, which is why I wrote this modification.

I can move it back to where it was in the file.
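To illustrate the race-condition point in general terms, here is a minimal sketch in which every task allocates its own temporary instead of writing into one shared scratch buffer. It is not the `interpolate` function from this PR: `midpoint!` and `midpoint_norms` are hypothetical helpers made up for the example.

```julia
using Base.Threads: @spawn

# Hypothetical helper: midpoint of two points, written into a caller-supplied buffer.
function midpoint!(buf::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    @. buf = (a + b) / 2
    return buf
end

# Each task allocates its own buffer, so no two tasks ever write the same memory.
# A single buffer shared by all tasks would avoid the allocations, but concurrent
# writes to it would be a data race.
function midpoint_norms(pairs)
    results = Vector{Float64}(undef, length(pairs))
    @sync for (i, (a, b)) in enumerate(pairs)
        @spawn begin
            buf = similar(a)                    # task-local temporary
            midpoint!(buf, a, b)
            results[i] = sqrt(sum(abs2, buf))   # each task writes its own slot
        end
    end
    return results
end

pairs = [([1.0, 2.0], [3.0, 4.0]), ([0.0, 0.0], [6.0, 8.0])]
r = midpoint_norms(pairs)
```

The cost of this design is one extra allocation per task, which is consistent with the higher allocation counts reported for the parallel benchmarks.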

KristofferC · 2022-01-17T12:56:34Z

src/ball_tree.jl

+                        high::Int,
+                        tree_data::TreeData,
+                        reorder::Bool,
+                        parallel::Val{true},


Using a Val and a separate function like this feels a bit awkward. Couldn't one just look at parallel_size in the original build_BallTree function and then decide whether to call the parallel function or the serial one?

SebastianAment · 2022-01-17

Using type dispatch on the parallel variable is important, because the compiler is able to get rid of temporary allocations during sequential execution. I can isolate the recursive component of the function though, and only use the Val(true) dispatch for that. If we only use a regular if statement on a Bool, performance during sequential execution will take a hit compared to the status quo.
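A minimal sketch of the dispatch pattern under discussion, assuming the Bool keyword is converted to a `Val` once at the entry point and only the inner kernel is specialized. The names `work` and `run_work` are hypothetical and the kernel is a plain sum, not the tree build from this PR.

```julia
# Serial kernel: contains no task-related code at all.
work(data, ::Val{false}) = sum(data)

# Parallel kernel: one half runs in a spawned task, the other on the current task.
function work(data, ::Val{true})
    mid = length(data) ÷ 2
    left = Threads.@spawn sum(view(data, 1:mid))
    right = sum(view(data, mid+1:length(data)))
    return fetch(left) + right
end

# Entry point: the Bool is lifted into the type domain exactly once, so the
# choice of kernel is resolved by dispatch rather than by a branch inside it.
run_work(data; parallel::Bool = true) = work(data, Val(parallel))

total_parallel = run_work(collect(1:100))
total_serial   = run_work(collect(1:100); parallel = false)
```

Because each method is compiled separately, the serial method never sees the spawning code path. Whether this removes allocations in a particular function is something to confirm with a benchmark rather than assume.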

SebastianAment · 2022-01-17T13:45:33Z

> The parallel implementation yields a speedup even for small datasets of n = 100 data points,

> But from what I understand, the parallel building only happens if the size is larger than DEFAULT_BALLTREE_MIN_PARALLEL_SIZE, which is 1024? What gives the speed improvement for small trees?

This was run with a prior version where parallel_size = 0. A larger parallel_size seems beneficial for larger problems, where parallelization plays a bigger role.

> Since the structure of creating a BallTree and a KDTree is pretty much the same, could the same approach be applied there?

I have a parallelized KDTree implementation locally too, but wanted to finish this one first. Do you prefer having everything in the same PR?

> You seem to have an extra commit not related to the tree building in this PR.

Yes, maybe this wasn't smart in retrospect. I thought at the time that this PR would be easy to merge and just built on top of it. Would you like me to edit the commit history of the current PR?

parallelizing knn and inrange searches

33ccb17

SebastianAment force-pushed the parallel-ball-tree branch 2 times, most recently from e40c3f0 to f6acba9 Compare January 8, 2022 11:25

parallelizing BallTree construction

18b93cb

SebastianAment force-pushed the parallel-ball-tree branch from f6acba9 to 18b93cb Compare January 8, 2022 13:31

KristofferC reviewed Jan 17, 2022
