Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an efficient unstable thread sort, use it in unstable block/device merge/segmented sorts, and improve tests #1552

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

Nyrio
Copy link
Contributor

@Nyrio Nyrio commented Mar 19, 2024

Description

closes #1551

Changes:

  • Faster unstable thread sort using Parberry's pairwise method.
  • Faster unstable block merge sort using the unstable thread sort.
  • Faster unstable device segmented sort using the unstable warp/block merge sort.
  • Faster unstable device merge sort using the unstable block merge sort.
  • Introduce cub:Less and cub::Greater operators and provide specializations of compare-swap to improve fp32 key-only sorting performance using min/max instead of predicate.
  • Stable device merge sort and segmented sort were using unstable block merge sort APIs, now they're dispatching to the appropriate block-level APIs.

Testing improvements/bug fixes:

  • test_thread_sort was casting the values to KeyT.
  • test_warp_merge_sort was comparing sort-pairs-by-key to a lexicographic ref sort.
  • test_warp_merge_sort was comparing stable and unstable variants to a stable ref sort.
  • test_warp_merge_sort did not generate the values, they were all 0...
  • test_block_merge_sort was only testing stable APIs.
  • test_block_merge_sort was not testing inputs for which stability is relevant (drawing random int32_t keys has a very small chance of conflict).
  • test_device_segmented_sort tested the unstable sort against a stable reference sort.
  • Sort tests were only testing a trivial less operator, and thus not testing the genericity of the implementation.

Future work (in follow-up PRs):

  • Make device merge sort policy templated on IS_STABLE, and re-tune.
  • Make device segmented sort policy templated on IS_STABLE, and re-tune.
  • Look at the performance of the merging part in block merge sort (Batcher's odd-even mergesort was outperforming it in a quick micro-benchmark I did)

Misc notes:

  • I just found out about Catch2 segmented sort #1484, changes on the segmented sort tests will conflict.
  • I have hardcoded is_stable = true in the benchmarks but ideally, we'd want to benchmark both. What would be the best way to do that? (a) add a new set; (b) add a boolean axis; (c) a separate benchmark (quite redundant).

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Nyrio Nyrio requested review from a team as code owners March 19, 2024 21:02
Copy link

copy-pr-bot bot commented Mar 19, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Nyrio
Copy link
Contributor Author

Nyrio commented Mar 19, 2024

cc @gevtushenko @elstehle @alliepiper

@elstehle
Copy link
Collaborator

/ok to test

@Nyrio
Copy link
Contributor Author

Nyrio commented Mar 20, 2024

There is an odd compiler error in C++17 builds in this code in block_merge_sort.cuh where I changed the if for CUB_IF_CONSTEXPR:

CUB_IF_CONSTEXPR(IS_LAST_TILE)
{
  #pragma unroll
  for (int item = 1; item < ITEMS_PER_THREAD; ++item)
  {
    ...
  }
}
block_merge_sort.cuh(230): error #607-D: this pragma must immediately precede a statement

I never saw that error before and am struggling to make a reproducer. I'm changing it back to a regular if statement for now.


Also CI ran into issues with constructing an std::vector of c2h::custom_type_t (in the warp merge sort tests) so I worked around that.

@miscco
Copy link
Collaborator

miscco commented Mar 22, 2024

/ok to test

@@ -67,6 +67,7 @@ struct AgentMergeSortPolicy

/// \brief This agent is responsible for the initial in-tile sorting.
template <typename Policy,
bool IS_STABLE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding API extensions I would strongly prefer if we can get away from raw booleans.

It is much harder to discern what true means in some API call deep in the code, as opposed to Stability::Stable or Stability::Unstable

That requires us to put a bit more work into the implementation but makes is much easier to work with the API

We have some examples for that here cub\cub\device\dispatch\tuning\tuning_reduce_by_key.cuh

cub/cub/device/device_segmented_sort.cuh Outdated Show resolved Hide resolved
int valid_items,
KeyT oob_default)
{
if (IS_LAST_TILE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is the one place you mentioned where CUB_IF_CONSTEXPR is having issues?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

cub/cub/block/block_merge_sort.cuh Outdated Show resolved Hide resolved
cub/cub/block/block_merge_sort.cuh Show resolved Hide resolved
@alliepiper
Copy link
Collaborator

I just found out about #1484, changes on the segmented sort tests will conflict.

FYI, I'm updating the new tests in that PR to support unstable sorting. Should have something ready in the next couple of days.

@alliepiper
Copy link
Collaborator

#1484 now supports unstable sort for the new DeviceSegmentedSort tests. I verified it against this branch and the additional logic to check unstable results only adds ~5s to the total test times. I'll merge that PR into main it once CI is happy so you can pull it in here.

Copy link
Collaborator

@gevtushenko gevtushenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nyrio thank you for the contribution! I'm sorry that it takes us so long to review it. The PR doesn't introduce any difference in codegen for stable sort. Regarding the unstable sort, preliminary benchmarks show about the same performance for built-in types, and about 4% speedup for complex data types (benchmarked on H100 and A6000 Ada). Is this expected improvement or you had a different workload in mind? If you have a different workload illustrating better speedup, we'd highly appreciate if you could contribute it with this PR. Apart from that, while @elstehle is looking at the algorithm itself, I've left a few minor comments below.

@@ -376,6 +391,7 @@ template <typename KeyInputIteratorT,
typename ValueIteratorT,
typename OffsetT,
typename CompareOpT,
bool IS_STABLE = true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: unfortunately, dispatch structure is part of CUB API. I'm afraid that the new template parameter should have to go after the selected policy, not to break existing code that relies on dispatch directly. If you don't want to duplicate policy selection code in every usage, you could add a DispatchStableMergeSort type alias with different order of arguments.

#include <cuda/std/type_traits>

CUB_NAMESPACE_BEGIN

template <typename IntT>
_CCCL_HOST_DEVICE _CCCL_FORCEINLINE constexpr IntT NetworkDegree(IntT n, IntT m = IntT{1})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: would you like this function to be part of CUB API? If so, it'll need its own tests and documentation. I'd suggest putting it into a detail namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it should be in a detail namespace.

What is your opinion about Swap which existed before this PR: should I add documentation, or move to detail? I feel like it should not be part of the API, but that would be a breaking change.


template <typename KeyT, typename ValueT, typename CompareOp>
_CCCL_DEVICE _CCCL_FORCEINLINE void
CompareSwap(KeyT& key_lhs, KeyT& key_rhs, ValueT& item_lhs, ValueT& item_rhs, CompareOp compare_op)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: same note, if you want this function to be part of public API, we'll need docs and tests.

SPECIALIZE_SORT_ASC(float)
SPECIALIZE_SORT_DESC(::cuda::std::int32_t)
SPECIALIZE_SORT_DESC(::cuda::std::uint32_t)
SPECIALIZE_SORT_DESC(float)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: we'd probably like to avoid leaking this macro into user code. I'd suggest to undefine it after usage.

}
}

#define SPECIALIZE_SORT_ASC(T) \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: we don't know if users have the same macro or not. To avoid potential collisions, I'd suggest to add a CUB_ prefix.

CompareSwapMinMaxAsc(key_rhs, key_lhs); \
}

SPECIALIZE_SORT_ASC(::cuda::std::int32_t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I'm not sure why {u,}int32 and float are special. Do you think we could go with enable_if + is_arithmetic on CompareSwap instead of specializations?

Copy link
Contributor Author

@Nyrio Nyrio Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arithmetic types and keys only, using min and max takes two instructions per compare-swap, whereas the conditional version takes 3 (1 set predicate, 2 selections). For integers, the compiler does the optimization automatically, so this specialization is not strictly required, but for float32 it can't do the optimization because the behavior is slightly different with special cases like NaN.

Regarding NaN, the conditional version would not produce a sorted array if there are any NaNs, because comparisons with NaN always evaluate to false, breaking the rules of strict weak ordering:

IN:  { 5, 4, NaN, 3, 2, 1 }
OUT: { 4, 5, NaN, 1, 2, 3 }

With the min/max version, afaik if one input of CompareSwap is NaN it will duplicate the other, so a possible output of the sort would be:

IN:  { 5, 4, NaN, 3, 2, 1 }
OUT: { 1 2 3 3 4 5 }

The point is that I regard NaNs in the array as invalid inputs and prefer to use the fast implementation with 2 instructions instead of 3.

using value_it_t = value_t *;
using offset_t = OffsetT;
using compare_op_t = less_t;
constexpr bool is_stable = true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important: I'd like the new algorithm to be benchmarked. Could you please copy this file into unstable directory with is_stable = false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why copy the file and not parameterize for better code reuse?

@Nyrio
Copy link
Contributor Author

Nyrio commented Apr 2, 2024

about 4% speedup for complex data types

Is that DeviceMergeSort or DeviceSegmentedSort?

Is this expected improvement or you had a different workload in mind? If you have a different workload illustrating better speedup, we'd highly appreciate if you could contribute it with this PR.

It's in line with my expectations because the block merge sort is memory-bound, the bottleneck is the merging part in shared memory, so even if the per-thread sort issues fewer instructions, that does not affect the overall runtime much. The goals of the MR are: (a) to expose a more efficient thread sort, e.g. if the user wants to do a segmented sort of many small arrays of the same size, one array per thread with Parberry's pairwise sort is much faster than using CUB's segmented sort ; (b) to enable using more items/thread, as the quadratic cost would previously have prevented that.

I'm sorry that it takes us so long to review it.

No problem, thanks for the reviews. I will try to make some changes this week.

@Nyrio
Copy link
Contributor Author

Nyrio commented Apr 4, 2024

I've made most of the requested changes. What remains to be done is adding the unstable benchmark.

Copy link
Collaborator

@elstehle elstehle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first stage of the sorting network looks good to me. Currently going through the second stage.

cub/cub/thread/thread_sort.cuh Outdated Show resolved Hide resolved
cub/cub/thread/thread_sort.cuh Outdated Show resolved Hide resolved
@miscco
Copy link
Collaborator

miscco commented Apr 10, 2024

@Nyrio We have recently applied formatting to the cub subproject.

I have merged in main and applied formatting to your changes. I hope that should make this transition as painless as possible

@miscco
Copy link
Collaborator

miscco commented Apr 10, 2024

/ok to test

@Nyrio
Copy link
Contributor Author

Nyrio commented Apr 11, 2024

Thanks @miscco for applying formatting. :)

I think I've made all the requested changes. @miscco and @gevtushenko to resolve discussions if you're satisfied with the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[FEA]: Add efficient unstable thread sort
5 participants