Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Nyrio · 2024-03-19T21:02:58Z

Description

closes #1551

Changes:

Faster unstable thread sort using Parberry's pairwise method.
Faster unstable block merge sort using the unstable thread sort.
Faster unstable device segmented sort using the unstable warp/block merge sort.
Faster unstable device merge sort using the unstable block merge sort.
Introduce cub:Less and cub::Greater operators and provide specializations of compare-swap to improve fp32 key-only sorting performance using min/max instead of predicate.
Stable device merge sort and segmented sort were using unstable block merge sort APIs, now they're dispatching to the appropriate block-level APIs.

Testing improvements/bug fixes:

test_thread_sort was casting the values to KeyT.
test_warp_merge_sort was comparing sort-pairs-by-key to a lexicographic ref sort.
test_warp_merge_sort was comparing stable and unstable variants to a stable ref sort.
test_warp_merge_sort did not generate the values, they were all 0...
test_block_merge_sort was only testing stable APIs.
test_block_merge_sort was not testing inputs for which stability is relevant (drawing random int32_t keys has a very small chance of conflict).
test_device_segmented_sort tested the unstable sort against a stable reference sort.
Sort tests were only testing a trivial less operator, and thus not testing the genericity of the implementation.

Future work (in follow-up PRs):

Make device merge sort policy templated on IS_STABLE, and re-tune.
Make device segmented sort policy templated on IS_STABLE, and re-tune.
Look at the performance of the merging part in block merge sort (Batcher's odd-even mergesort was outperforming it in a quick micro-benchmark I did)

Misc notes:

I just found out about Catch2 segmented sort #1484, changes on the segmented sort tests will conflict.
I have hardcoded is_stable = true in the benchmarks but ideally, we'd want to benchmark both. What would be the best way to do that? (a) add a new set; (b) add a boolean axis; (c) a separate benchmark (quite redundant).

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

…erage

…erge sort, and fix many issues with warp/block merge sort tests

…ce merge sort

…ce segmented sort

copy-pr-bot · 2024-03-19T21:03:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Nyrio · 2024-03-19T21:04:55Z

cc @gevtushenko @elstehle @alliepiper

elstehle · 2024-03-20T09:41:31Z

/ok to test

Nyrio · 2024-03-20T14:35:40Z

There is an odd compiler error in C++17 builds in this code in block_merge_sort.cuh where I changed the if for CUB_IF_CONSTEXPR:

CUB_IF_CONSTEXPR(IS_LAST_TILE)
{
  #pragma unroll
  for (int item = 1; item < ITEMS_PER_THREAD; ++item)
  {
    ...
  }
}

block_merge_sort.cuh(230): error #607-D: this pragma must immediately precede a statement

I never saw that error before and am struggling to make a reproducer. I'm changing it back to a regular if statement for now.

Also CI ran into issues with constructing an std::vector of c2h::custom_type_t (in the warp merge sort tests) so I worked around that.

miscco · 2024-03-22T09:40:14Z

/ok to test

miscco · 2024-03-22T09:45:16Z

cub/cub/agent/agent_merge_sort.cuh

@@ -67,6 +67,7 @@ struct AgentMergeSortPolicy

 /// \brief This agent is responsible for the initial in-tile sorting.
 template <typename Policy,
+          bool IS_STABLE,


When adding API extensions I would strongly prefer if we can get away from raw booleans.

It is much harder to discern what true means in some API call deep in the code, as opposed to Stability::Stable or Stability::Unstable

That requires us to put a bit more work into the implementation but makes is much easier to work with the API

We have some examples for that here cub\cub\device\dispatch\tuning\tuning_reduce_by_key.cuh

cub/cub/device/device_segmented_sort.cuh

miscco · 2024-03-22T09:56:30Z

cub/cub/block/block_merge_sort.cuh

+                                       int valid_items,
+                                       KeyT oob_default)
+  {
+    if (IS_LAST_TILE)


I assume this is the one place you mentioned where CUB_IF_CONSTEXPR is having issues?

cub/cub/block/block_merge_sort.cuh

alliepiper · 2024-03-24T20:11:05Z

I just found out about #1484, changes on the segmented sort tests will conflict.

FYI, I'm updating the new tests in that PR to support unstable sorting. Should have something ready in the next couple of days.

alliepiper · 2024-03-27T19:45:37Z

#1484 now supports unstable sort for the new DeviceSegmentedSort tests. I verified it against this branch and the additional logic to check unstable results only adds ~5s to the total test times. I'll merge that PR into main it once CI is happy so you can pull it in here.

gevtushenko

@Nyrio thank you for the contribution! I'm sorry that it takes us so long to review it. The PR doesn't introduce any difference in codegen for stable sort. Regarding the unstable sort, preliminary benchmarks show about the same performance for built-in types, and about 4% speedup for complex data types (benchmarked on H100 and A6000 Ada). Is this expected improvement or you had a different workload in mind? If you have a different workload illustrating better speedup, we'd highly appreciate if you could contribute it with this PR. Apart from that, while @elstehle is looking at the algorithm itself, I've left a few minor comments below.

gevtushenko · 2024-03-24T07:29:17Z

cub/cub/device/dispatch/dispatch_merge_sort.cuh

@@ -376,6 +391,7 @@ template <typename KeyInputIteratorT,
          typename ValueIteratorT,
          typename OffsetT,
          typename CompareOpT,
+          bool IS_STABLE = true,


important: unfortunately, dispatch structure is part of CUB API. I'm afraid that the new template parameter should have to go after the selected policy, not to break existing code that relies on dispatch directly. If you don't want to duplicate policy selection code in every usage, you could add a DispatchStableMergeSort type alias with different order of arguments.

gevtushenko · 2024-03-24T07:31:10Z

cub/cub/thread/thread_sort.cuh

 #include <cuda/std/type_traits>

 CUB_NAMESPACE_BEGIN

+template <typename IntT>
+_CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr IntT NetworkDegree(IntT n, IntT m = IntT{1})


question: would you like this function to be part of CUB API? If so, it'll need its own tests and documentation. I'd suggest putting it into a detail namespace.

Indeed it should be in a detail namespace.

What is your opinion about Swap which existed before this PR: should I add documentation, or move to detail? I feel like it should not be part of the API, but that would be a breaking change.

gevtushenko · 2024-03-24T23:55:32Z

cub/cub/thread/thread_sort.cuh

+
+template <typename KeyT, typename ValueT, typename CompareOp>
+_CCCL_DEVICE _CCCL_FORCEINLINE void
+CompareSwap(KeyT& key_lhs, KeyT& key_rhs, ValueT& item_lhs, ValueT& item_rhs, CompareOp compare_op)


important: same note, if you want this function to be part of public API, we'll need docs and tests.

gevtushenko · 2024-03-25T00:01:37Z

cub/cub/thread/thread_sort.cuh

+SPECIALIZE_SORT_ASC(float)
+SPECIALIZE_SORT_DESC(::cuda::std::int32_t)
+SPECIALIZE_SORT_DESC(::cuda::std::uint32_t)
+SPECIALIZE_SORT_DESC(float)


suggestion: we'd probably like to avoid leaking this macro into user code. I'd suggest to undefine it after usage.

gevtushenko · 2024-03-25T00:02:39Z

cub/cub/thread/thread_sort.cuh

+  }
+}
+
+#define SPECIALIZE_SORT_ASC(T)                                                            \


suggestion: we don't know if users have the same macro or not. To avoid potential collisions, I'd suggest to add a CUB_ prefix.

gevtushenko · 2024-03-25T01:11:51Z

cub/cub/thread/thread_sort.cuh

+    CompareSwapMinMaxAsc(key_rhs, key_lhs);                                                  \
+  }
+
+SPECIALIZE_SORT_ASC(::cuda::std::int32_t)


question: I'm not sure why {u,}int32 and float are special. Do you think we could go with enable_if + is_arithmetic on CompareSwap instead of specializations?

For arithmetic types and keys only, using min and max takes two instructions per compare-swap, whereas the conditional version takes 3 (1 set predicate, 2 selections). For integers, the compiler does the optimization automatically, so this specialization is not strictly required, but for float32 it can't do the optimization because the behavior is slightly different with special cases like NaN.

Regarding NaN, the conditional version would not produce a sorted array if there are any NaNs, because comparisons with NaN always evaluate to false, breaking the rules of strict weak ordering:

IN: { 5, 4, NaN, 3, 2, 1 } OUT: { 4, 5, NaN, 1, 2, 3 }

With the min/max version, afaik if one input of CompareSwap is NaN it will duplicate the other, so a possible output of the sort would be:

IN: { 5, 4, NaN, 3, 2, 1 } OUT: { 1 2 3 3 4 5 }

The point is that I regard NaNs in the array as invalid inputs and prefer to use the fast implementation with 2 instructions instead of 3.

gevtushenko · 2024-03-25T01:20:02Z

cub/benchmarks/bench/merge_sort/keys.cu

+  using value_it_t         = value_t *;
+  using offset_t           = OffsetT;
+  using compare_op_t       = less_t;
+  constexpr bool is_stable = true;


important: I'd like the new algorithm to be benchmarked. Could you please copy this file into unstable directory with is_stable = false?

question: why copy the file and not parameterize for better code reuse?

Nyrio · 2024-04-02T10:31:48Z

about 4% speedup for complex data types

Is that DeviceMergeSort or DeviceSegmentedSort?

Is this expected improvement or you had a different workload in mind? If you have a different workload illustrating better speedup, we'd highly appreciate if you could contribute it with this PR.

It's in line with my expectations because the block merge sort is memory-bound, the bottleneck is the merging part in shared memory, so even if the per-thread sort issues fewer instructions, that does not affect the overall runtime much. The goals of the MR are: (a) to expose a more efficient thread sort, e.g. if the user wants to do a segmented sort of many small arrays of the same size, one array per thread with Parberry's pairwise sort is much faster than using CUB's segmented sort ; (b) to enable using more items/thread, as the quadratic cost would previously have prevented that.

I'm sorry that it takes us so long to review it.

No problem, thanks for the reviews. I will try to make some changes this week.

…d inflating the diff)

…n last place to avoid breaking API

Nyrio · 2024-04-04T18:41:22Z

I've made most of the requested changes. What remains to be done is adding the unstable benchmark.

elstehle

The first stage of the sorting network looks good to me. Currently going through the second stage.

cub/cub/thread/thread_sort.cuh

miscco · 2024-04-10T09:21:59Z

@Nyrio We have recently applied formatting to the cub subproject.

I have merged in main and applied formatting to your changes. I hope that should make this transition as painless as possible

miscco · 2024-04-10T09:22:09Z

/ok to test

Nyrio · 2024-04-11T13:33:12Z

Thanks @miscco for applying formatting. :)

I think I've made all the requested changes. @miscco and @gevtushenko to resolve discussions if you're satisfied with the changes.

Nyrio added 4 commits March 12, 2024 20:36

Thread sort: add specializations of compare-swap and improve test cov…

4c371b5

…erage

Add faster unstable thread sorting method, use it in unstable block m…

ed49097

…erge sort, and fix many issues with warp/block merge sort tests

Dispatch appropriately to stable or unstable block merge sort in devi…

eb3aa64

…ce merge sort

Dispatch appropriately to stable or unstable block merge sort in devi…

b65d71b

…ce segmented sort

Nyrio requested review from a team as code owners March 19, 2024 21:02

Nyrio requested review from elstehle and alliepiper March 19, 2024 21:02

Merge branch 'main' into thread-sorting-networks

168a1fa

Fix CI failures

2ce61dd

Merge branch 'main' into thread-sorting-networks

99fb9c8

miscco reviewed Mar 22, 2024

View reviewed changes

alliepiper mentioned this pull request Mar 27, 2024

Catch2 segmented sort #1484

Merged

gevtushenko reviewed Apr 1, 2024

View reviewed changes

Nyrio added 4 commits April 4, 2024 15:25

Merge branch 'main' into thread-sorting-networks

d1781c3

Move helper functions to detail namespace

78e8d2d

Improve const-correctness and selectively apply clang-format (to avoi…

45c838c

…d inflating the diff)

Replace boolean IS_STABLE with enum class and put new optional args i…

799258a

…n last place to avoid breaking API

Merge branch 'main' into thread-sorting-networks

739bcf3

elstehle reviewed Apr 5, 2024

View reviewed changes

cub/cub/thread/thread_sort.cuh Outdated Show resolved Hide resolved

cub/cub/thread/thread_sort.cuh Outdated Show resolved Hide resolved

Replace NetworkDegree with cub::Log2

6c32c4f

Nyrio and others added 2 commits April 5, 2024 16:07

Add unstable variants of merge sort and segmented sort

a11ed0e

Merge branch 'main' into pr/Nyrio/1552

ab7e7a6

Nyrio requested review from gevtushenko and miscco April 11, 2024 13:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Nyrio commented Mar 19, 2024

copy-pr-bot bot commented Mar 19, 2024

Nyrio commented Mar 19, 2024

elstehle commented Mar 20, 2024

Nyrio commented Mar 20, 2024 •

edited

miscco commented Mar 22, 2024

miscco Mar 22, 2024

miscco Mar 22, 2024

Nyrio Mar 22, 2024

alliepiper commented Mar 24, 2024

alliepiper commented Mar 27, 2024

gevtushenko left a comment

gevtushenko Mar 24, 2024

gevtushenko Mar 24, 2024

Nyrio Apr 4, 2024

gevtushenko Mar 24, 2024

gevtushenko Mar 25, 2024

gevtushenko Mar 25, 2024

gevtushenko Mar 25, 2024

Nyrio Apr 2, 2024 •

edited

gevtushenko Mar 25, 2024

Nyrio Apr 4, 2024

Nyrio commented Apr 2, 2024 •

edited

Nyrio commented Apr 4, 2024

elstehle left a comment

miscco commented Apr 10, 2024

miscco commented Apr 10, 2024

Nyrio commented Apr 11, 2024 •

edited

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Are you sure you want to change the base?

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Conversation

Nyrio commented Mar 19, 2024

Description

Checklist

copy-pr-bot bot commented Mar 19, 2024

Nyrio commented Mar 19, 2024

elstehle commented Mar 20, 2024

Nyrio commented Mar 20, 2024 • edited

miscco commented Mar 22, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alliepiper commented Mar 24, 2024

alliepiper commented Mar 27, 2024

gevtushenko left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nyrio Apr 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nyrio commented Apr 2, 2024 • edited

Nyrio commented Apr 4, 2024

elstehle left a comment

Choose a reason for hiding this comment

miscco commented Apr 10, 2024

miscco commented Apr 10, 2024

Nyrio commented Apr 11, 2024 • edited

Nyrio commented Mar 20, 2024 •

edited

Nyrio Apr 2, 2024 •

edited

Nyrio commented Apr 2, 2024 •

edited

Nyrio commented Apr 11, 2024 •

edited