-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552
base: main
Are you sure you want to change the base?
Conversation
…erge sort, and fix many issues with warp/block merge sort tests
…ce segmented sort
/ok to test |
There is an odd compiler error in C++17 builds in this code in CUB_IF_CONSTEXPR(IS_LAST_TILE)
{
#pragma unroll
for (int item = 1; item < ITEMS_PER_THREAD; ++item)
{
...
}
}
I never saw that error before and am struggling to make a reproducer. I'm changing it back to a regular Also CI ran into issues with constructing an |
/ok to test |
cub/cub/agent/agent_merge_sort.cuh
Outdated
@@ -67,6 +67,7 @@ struct AgentMergeSortPolicy | |||
|
|||
/// \brief This agent is responsible for the initial in-tile sorting. | |||
template <typename Policy, | |||
bool IS_STABLE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding API extensions I would strongly prefer if we can get away from raw booleans.
It is much harder to discern what true
means in some API call deep in the code, as opposed to Stability::Stable
or Stability::Unstable
That requires us to put a bit more work into the implementation but makes is much easier to work with the API
We have some examples for that here cub\cub\device\dispatch\tuning\tuning_reduce_by_key.cuh
int valid_items, | ||
KeyT oob_default) | ||
{ | ||
if (IS_LAST_TILE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is the one place you mentioned where CUB_IF_CONSTEXPR
is having issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
FYI, I'm updating the new tests in that PR to support unstable sorting. Should have something ready in the next couple of days. |
#1484 now supports unstable sort for the new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Nyrio thank you for the contribution! I'm sorry that it takes us so long to review it. The PR doesn't introduce any difference in codegen for stable sort. Regarding the unstable sort, preliminary benchmarks show about the same performance for built-in types, and about 4% speedup for complex data types (benchmarked on H100 and A6000 Ada). Is this expected improvement or you had a different workload in mind? If you have a different workload illustrating better speedup, we'd highly appreciate if you could contribute it with this PR. Apart from that, while @elstehle is looking at the algorithm itself, I've left a few minor comments below.
@@ -376,6 +391,7 @@ template <typename KeyInputIteratorT, | |||
typename ValueIteratorT, | |||
typename OffsetT, | |||
typename CompareOpT, | |||
bool IS_STABLE = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: unfortunately, dispatch structure is part of CUB API. I'm afraid that the new template parameter should have to go after the selected policy, not to break existing code that relies on dispatch directly. If you don't want to duplicate policy selection code in every usage, you could add a DispatchStableMergeSort
type alias with different order of arguments.
cub/cub/thread/thread_sort.cuh
Outdated
#include <cuda/std/type_traits> | ||
|
||
CUB_NAMESPACE_BEGIN | ||
|
||
template <typename IntT> | ||
_CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr IntT NetworkDegree(IntT n, IntT m = IntT{1}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: would you like this function to be part of CUB API? If so, it'll need its own tests and documentation. I'd suggest putting it into a detail namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it should be in a detail
namespace.
What is your opinion about Swap
which existed before this PR: should I add documentation, or move to detail
? I feel like it should not be part of the API, but that would be a breaking change.
|
||
template <typename KeyT, typename ValueT, typename CompareOp> | ||
_CCCL_DEVICE _CCCL_FORCEINLINE void | ||
CompareSwap(KeyT& key_lhs, KeyT& key_rhs, ValueT& item_lhs, ValueT& item_rhs, CompareOp compare_op) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: same note, if you want this function to be part of public API, we'll need docs and tests.
cub/cub/thread/thread_sort.cuh
Outdated
SPECIALIZE_SORT_ASC(float) | ||
SPECIALIZE_SORT_DESC(::cuda::std::int32_t) | ||
SPECIALIZE_SORT_DESC(::cuda::std::uint32_t) | ||
SPECIALIZE_SORT_DESC(float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: we'd probably like to avoid leaking this macro into user code. I'd suggest to undefine it after usage.
cub/cub/thread/thread_sort.cuh
Outdated
} | ||
} | ||
|
||
#define SPECIALIZE_SORT_ASC(T) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: we don't know if users have the same macro or not. To avoid potential collisions, I'd suggest to add a CUB_ prefix.
cub/cub/thread/thread_sort.cuh
Outdated
CompareSwapMinMaxAsc(key_rhs, key_lhs); \ | ||
} | ||
|
||
SPECIALIZE_SORT_ASC(::cuda::std::int32_t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: I'm not sure why {u,}int32 and float are special. Do you think we could go with enable_if
+ is_arithmetic
on CompareSwap
instead of specializations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For arithmetic types and keys only, using min and max takes two instructions per compare-swap, whereas the conditional version takes 3 (1 set predicate, 2 selections). For integers, the compiler does the optimization automatically, so this specialization is not strictly required, but for float32 it can't do the optimization because the behavior is slightly different with special cases like NaN.
Regarding NaN, the conditional version would not produce a sorted array if there are any NaNs, because comparisons with NaN always evaluate to false, breaking the rules of strict weak ordering:
IN: { 5, 4, NaN, 3, 2, 1 }
OUT: { 4, 5, NaN, 1, 2, 3 }
With the min/max version, afaik if one input of CompareSwap is NaN it will duplicate the other, so a possible output of the sort would be:
IN: { 5, 4, NaN, 3, 2, 1 }
OUT: { 1 2 3 3 4 5 }
The point is that I regard NaNs in the array as invalid inputs and prefer to use the fast implementation with 2 instructions instead of 3.
using value_it_t = value_t *; | ||
using offset_t = OffsetT; | ||
using compare_op_t = less_t; | ||
constexpr bool is_stable = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
important: I'd like the new algorithm to be benchmarked. Could you please copy this file into unstable directory with is_stable = false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why copy the file and not parameterize for better code reuse?
Is that
It's in line with my expectations because the block merge sort is memory-bound, the bottleneck is the merging part in shared memory, so even if the per-thread sort issues fewer instructions, that does not affect the overall runtime much. The goals of the MR are: (a) to expose a more efficient thread sort, e.g. if the user wants to do a segmented sort of many small arrays of the same size, one array per thread with Parberry's pairwise sort is much faster than using CUB's segmented sort ; (b) to enable using more items/thread, as the quadratic cost would previously have prevented that.
No problem, thanks for the reviews. I will try to make some changes this week. |
…d inflating the diff)
…n last place to avoid breaking API
I've made most of the requested changes. What remains to be done is adding the unstable benchmark. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first stage of the sorting network looks good to me. Currently going through the second stage.
@Nyrio We have recently applied formatting to the cub subproject. I have merged in main and applied formatting to your changes. I hope that should make this transition as painless as possible |
/ok to test |
Thanks @miscco for applying formatting. :) I think I've made all the requested changes. @miscco and @gevtushenko to resolve discussions if you're satisfied with the changes. |
Description
closes #1551
Changes:
cub:Less
andcub::Greater
operators and provide specializations of compare-swap to improve fp32 key-only sorting performance using min/max instead of predicate.Testing improvements/bug fixes:
test_thread_sort
was casting the values to KeyT.test_warp_merge_sort
was comparing sort-pairs-by-key to a lexicographic ref sort.test_warp_merge_sort
was comparing stable and unstable variants to a stable ref sort.test_warp_merge_sort
did not generate the values, they were all 0...test_block_merge_sort
was only testing stable APIs.test_block_merge_sort
was not testing inputs for which stability is relevant (drawing random int32_t keys has a very small chance of conflict).test_device_segmented_sort
tested the unstable sort against a stable reference sort.Future work (in follow-up PRs):
IS_STABLE
, and re-tune.IS_STABLE
, and re-tune.Misc notes:
is_stable = true
in the benchmarks but ideally, we'd want to benchmark both. What would be the best way to do that? (a) add a new set; (b) add a boolean axis; (c) a separate benchmark (quite redundant).Checklist