MPISort

Don't put all your eggs in one basket!

Sorting $N$ elements spread across $P$ processors, where no single processor can hold all elements at once, is a difficult problem with very few open-source implementations (in C++ and Charm++). This library provides the mpisort! function for distributed MPI-based sorting algorithms following the standard Julia Base.sort! signature; at the moment, one optimised algorithm is provided:

SIHSort

Sampling with interpolated histograms sorting algorithm (pronounced sigh sort, like anything MPI-related), optimised for minimum inter-rank communication and memory footprint. Features:

  • Does not require the distributed data to fit into the memory of a single node. No IO either.
  • Works for any comparison-based data, with additional optimisations for numeric elements.
  • Optimised for minimum MPI communication; can use Julia threads on each shared-memory node.
  • The node-local arrays may have different sizes; sorting will try to balance the number of elements held by each MPI rank.
  • Works with any AbstractVector, including accelerators such as GPUs (see Note).
  • Implements the standard Julia sort! API, and naturally works for custom data, comparisons, orderings, etc.

Example

# File:   mpisort.jl
# Run as: mpiexec -n 4 julia --threads=2 mpisort.jl

using MPI
using MPISort
using Random


# Initialise MPI, get communicator for all ranks, rank index, number of ranks
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Generate local array on each MPI rank - even with different numbers of elements
rng = Xoshiro(rank)
num_elements = 50 + rank * 2
local_array = rand(rng, 1:500, num_elements)

# Sort arrays across all MPI ranks
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)

# Print each local array sequentially
for i in 0:nranks - 1
    rank == i && @show rank sorted_local_array alg.stats
    MPI.Barrier(comm)
end

Note: because the data is redistributed between ranks, the vector size must change - hence this differs from the in-place Base.sort!. The input vector is mutated, but another vector - with potentially different size and elements - is returned. This is the reason for the different function signature (mpisort! with a return value); however, it accepts exactly the same inputs as Base.sort!.
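
For illustration, the size change can be seen by comparing lengths before and after the call (a minimal sketch reusing the variables from the example above; n_before and n_after are just local names for this snippet):

# The local length may change after sorting, as elements are rebalanced across ranks
n_before = length(local_array)
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))
n_after = length(sorted_local_array)

@show n_before n_after      # n_after may differ from n_before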

Different sorting settings:

# Automatically uses MPI.COMM_WORLD as communicator; doesn't save sorting stats
sorted_local_array = mpisort!(local_array; alg=SIHSort())

# Reverse sorting; specify communicator explicitly
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm), rev=true)

# Specify key to sort by; see https://docs.julialang.org/en/v1/base/sort/
sorted_local_array = mpisort!(local_array; alg=SIHSort(), by=x->x["key"])

# Different ordering; see https://docs.julialang.org/en/v1/base/sort/#Alternate-orderings
sorted_local_array = mpisort!(local_array; alg=SIHSort(), order=Base.Order.Reverse)

# Save sorting stats
alg = SIHSort(comm)
sorted_local_array = mpisort!(local_array; alg=alg)

@show alg.stats.splitters               # `nranks - 1` elements splitting arrays between nodes
@show alg.stats.num_elements            # `nranks` integers specifying number of elements on each node

# Use different in-place local sorter
alg = SIHSort(comm, nothing)            # Default: standard Base.sort!
alg = SIHSort(comm, QuickSort)          # Specify algorithm, passed to Base.sort!(...; alg=<Algorithm>)
alg = SIHSort(comm, v -> mysorter!(v))  # Pass any function that sorts a local vector in-place
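
For instance, a custom local sorter could simply wrap Base.sort! with a non-default algorithm (my_local_sort! below is a hypothetical name used only in this sketch):

# Hypothetical in-place local sorter passed to SIHSort
my_local_sort!(v) = sort!(v; alg=InsertionSort)

alg = SIHSort(comm, my_local_sort!)
sorted_local_array = mpisort!(local_array; alg=alg)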

Communication and Memory Footprint

Only optimised collective MPI communication is used, in this order: Gather, Bcast, Reduce, Bcast, Alltoall, Allreduce, Alltoallv. I am not aware of a non-IO-based algorithm with less communication (if you know of one, please open an issue!).
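
As a purely illustrative schematic - not the library's actual implementation - the sequence above maps onto MPI.jl collectives roughly as follows, with placeholder payloads standing in for samples, histograms and elements:

# Illustrative only: the seven collectives in the order listed above, with dummy data
using MPI
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
P = MPI.Comm_size(comm)

samples = rand(4)                                          # local samples
gathered = MPI.Gather(samples, comm; root=0)               # 1. Gather: samples to root
splitters = rank == 0 ? sort(gathered)[1:P - 1] : zeros(P - 1)
MPI.Bcast!(splitters, comm; root=0)                        # 2. Bcast: candidate splitters
histogram = rand(0:10, P)                                  # local bucket counts
MPI.Reduce(histogram, +, comm; root=0)                     # 3. Reduce: global histogram on root
MPI.Bcast!(splitters, comm; root=0)                        # 4. Bcast: refined splitters
send_counts = fill(Cint(1), P)
recv_counts = MPI.Alltoall(UBuffer(send_counts, 1), comm)  # 5. Alltoall: exchange element counts
MPI.Allreduce(sum(send_counts), +, comm)                   # 6. Allreduce: e.g. global totals
sendbuf = VBuffer(rand(P), send_counts)
recvbuf = VBuffer(zeros(sum(recv_counts)), recv_counts)
MPI.Alltoallv!(sendbuf, recvbuf, comm)                     # 7. Alltoallv: redistribute elements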

If $N$ is the total number of elements spread out across $P$ MPI ranks, then the per-rank memory footprint of SIHSort is:

$$ k P + k P + P + 3(P - 1) + \frac{N + \epsilon}{P} $$

where $k$ is the number of samples extracted from each node; following [1], we use:

$$ k = 2 P \log_2 P $$

Except for the final redistribution into a single new array of length $\frac{N + \epsilon}{P}$, the memory footprint depends only on the number of nodes involved, hence it should scale to thousands of MPI ranks. Anyone got a spare 200,000 nodes to benchmark this?
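
As a quick back-of-the-envelope check (ignoring the small imbalance term $\epsilon$; the numbers below are only an example, not a benchmark), the formula can be evaluated directly:

# Example per-rank footprint, e.g. for 16 ranks holding 10^8 elements in total
P = 16
N = 100_000_000

k = 2 * P * log2(P)                      # samples per rank: 128
overhead = 2 * k * P + P + 3 * (P - 1)   # splitter/histogram bookkeeping: ~4,200 elements
final = N / P                            # redistributed local array: 6.25e6 elements

@show k overhead final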

Note on sorting multi-node GPU arrays

SIHSort is generic over the input array type, so it can work with GPU arrays - e.g. Julia CUDA.CuArray - and benefit from MPI-configured, optimised inter-GPU interconnects.

However, to be fully performant, it needs:

  • A single-node sort implementation - at the moment, only CUDA.CuArray has one; there is great potential in a KernelAbstractions.jl sorter - we really need one!
  • A fully GPU-based searchsortedlast implementation; we do not have one yet, so we rely on a binary search where each tested element is copied to the CPU (!), which of course is not great. More great potential for some optimised KernelAbstractions.jl kernels!

While it currently works, it is not ideal: sorting 1,000,000 Int32 values split across 2 MPI ranks takes ~0.015 s on my Intel i9 CPU and ~0.034 s on my NVIDIA Quadro RTX 4000 with Max-Q.
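
For reference, a minimal multi-node GPU sketch might look like the following, assuming a CUDA-aware MPI build, the CUDA.jl package and at least one visible GPU per rank (none of which this snippet configures for you):

# File:   gpusort.jl
# Run as: mpiexec -n 2 julia gpusort.jl
# Assumes a CUDA-aware MPI build and at least one visible GPU per rank

using MPI
using MPISort
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD

# Pin each rank to a GPU, round-robin over the visible devices
CUDA.device!(MPI.Comm_rank(comm) % length(CUDA.devices()))

# Device-resident local array, sorted across all ranks
local_array = CuArray(rand(Int32, 500_000))
sorted_local_array = mpisort!(local_array; alg=SIHSort(comm))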

References

This algorithm builds on prior art:

  • [1] Harsh V, Kale L, Solomonik E. Histogram sort with sampling - followed the main ideas and theoretical results, but with deterministic sampling and original communication and interpolation optimisations.
  • [2] Sundar H, Malhotra D, Biros G. HykSort: a new variant of hypercube quicksort on distributed memory architectures.
  • [3] Shi H, Schaeffer J. Parallel sorting by regular sampling.
  • [4] Solomonik E, Kale LV. Highly scalable parallel sorting.
  • [5] John Lapeyre, integer base-2 logarithm - https://github.com/jlapeyre/ILog2.jl.
  • [6] Byrne S, Wilcox LC, Churavy V. MPI.jl: Julia bindings for the Message Passing Interface - absolute heroes who made MPI a joy to use in Julia.

License

MPISort.jl is MIT-licensed. Enjoy.