Allow iterators in cub::DeviceRadixSort #868

Open
zasdfgbnm opened this issue Mar 17, 2021 · 6 comments · May be fixed by NVIDIA/cub#374
Labels
cub For all items related to CUB


@zasdfgbnm

Currently, cub::DeviceRadixSort only supports operating on pointers:

template <typename KeyT, typename ValueT>
static CUB_RUNTIME_FUNCTION cudaError_t SortPairs(
    void         *d_temp_storage,
    size_t       &temp_storage_bytes,
    const KeyT   *d_keys_in,
    KeyT         *d_keys_out,
    const ValueT *d_values_in,
    ValueT       *d_values_out,
    int           num_items,
    int           begin_bit = 0,
    int           end_bit = sizeof(KeyT) * 8,
    cudaStream_t  stream = 0,
    bool          debug_synchronous = false);

It would be good if d_values_in could be an iterator.

One use case is pytorch/pytorch#53841. In that PR, we are working on a sorting problem where the input keys are random numbers and the input values are 0, 1, 2, 3, ..., N. Currently, we have to generate a memory buffer to store these values, which is not optimal. It would be nice if we could do something like:

cub::CountingInputIterator<int> iter(0);
cub::DeviceRadixSort::SortPairs(..., /*d_values_in=*/iter, /*d_values_out=*/buffer, ...);
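
For reference, this is roughly what the workaround looks like today. A minimal sketch, assuming device pointers d_keys_in, d_keys_out, d_values_in, d_values_out and a length num_items; using thrust::sequence to fill the buffer is illustrative, not the only option:

#include <cub/cub.cuh>
#include <thrust/sequence.h>
#include <thrust/execution_policy.h>

// Materialise 0, 1, ..., N-1 into a scratch buffer, because SortPairs
// currently only accepts a raw pointer for d_values_in.
thrust::sequence(thrust::device, d_values_in, d_values_in + num_items);

// Usual two-phase cub call: the first call only computes temp_storage_bytes.
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys_in, d_keys_out,
                                d_values_in, d_values_out, num_items);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys_in, d_keys_out,
                                d_values_in, d_values_out, num_items);
cudaFree(d_temp_storage);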
@RaulPPelaez

Have you tried the thrust::sort functions? They allow you to do what you are asking for. For example:
https://thrust.github.io/doc/group__sorting_gaec4e3610a36062ee3e3d16607ce5ad80.html
I'm just curious whether you have any timings comparing cub and thrust for your use case. I would guess thrust ends up calling cub anyway.
If you are worried about temporary memory allocation when using thrust, you can use the execution policy argument, as shown here:
https://github.com/NVIDIA/thrust/blob/main/examples/cuda/custom_temporary_allocation.cu
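
A minimal sketch of that route, assuming device pointers d_keys and d_values of length num_items and a cudaStream_t stream; thrust::cuda::par.on selects the stream, and a custom allocator can be attached to the same policy as in the linked example:

#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/system/cuda/execution_policy.h>

// Fill the values with 0, 1, ..., N-1, then sort both ranges by key.
thrust::sequence(thrust::cuda::par.on(stream), d_values, d_values + num_items);
thrust::sort_by_key(thrust::cuda::par.on(stream), d_keys, d_keys + num_items, d_values);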

@zasdfgbnm
Author

@RaulPPelaez Yes, I am aware of thrust::sort. I was actually using thrust::sort and want to migrate to cub. I don't have performance numbers, but thrust's sort performs some device-host synchronizations, which is what motivated this migration.

@alliepiper
Collaborator

I agree that this should be done if possible, and I can look into prioritizing it after I finish updating our benchmarking infrastructure.

Pinging @dumerrill and @canonizer since they know the radix sorting code best -- are y'all aware of any reasons this wouldn't work? Other device algorithms support iterators, so I'm curious whether there's a reason DeviceRadixSort is different.

@elstehle
Collaborator

elstehle commented Mar 18, 2021

Pinging @dumerrill and @canonizer since they know the radix sorting code best -- are y'all aware of any reasons this wouldn't work? Other device algorithms support iterators, so I'm curious whether there's a reason DeviceRadixSort is different.

The main difference is that DeviceRadixSort is a multi-pass algorithm. One could account for taking an arbitrary iterator, but the implementation would require special treatment.
Internally, DeviceRadixSort currently uses a DoubleBuffer with two pointer members, swapping the two buffers with each sorting pass. To account for an iterator, the first sorting pass would take the iterator as input and write the materialised results to memory; from then on, the usual DoubleBuffer logic can proceed.
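
For context, a minimal sketch of the existing DoubleBuffer-based overload; buffer names are illustrative, and d_temp_storage is assumed to have been sized via the usual two-call pattern:

#include <cub/cub.cuh>

// Each pass reads from Current() and writes to Alternate(), then the
// selector flips, so both buffers must be real, writable device memory.
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);
cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys, d_values, num_items);
// After the sort, d_keys.Current() and d_values.Current() point at the results.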

Two options to implement this:

  • materialise the input iterator prior to the sort (iirc, this is what thrust::sort does): less efficient, but easy to implement (sketched below).
  • give the first sorting pass special treatment: more implementation effort and longer compilation times (the first pass needs a different template specialisation), but more efficient, saving N memory writes.
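
A minimal sketch of the first option as a standalone wrapper; the helper name, the scratch-buffer parameter, and the use of thrust::copy are illustrative assumptions, not part of cub:

#include <cub/cub.cuh>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical helper: materialise an arbitrary value iterator into
// d_values_scratch, then defer to the existing pointer-based overload.
template <typename KeyT, typename ValueT, typename ValueInputIt>
cudaError_t SortPairsFromIterator(void *d_temp_storage, size_t &temp_storage_bytes,
                                  const KeyT *d_keys_in, KeyT *d_keys_out,
                                  ValueInputIt d_values_in, ValueT *d_values_scratch,
                                  ValueT *d_values_out, int num_items,
                                  cudaStream_t stream = 0)
{
  // Skip the copy during the initial size query (d_temp_storage == nullptr).
  if (d_temp_storage != nullptr)
  {
    thrust::copy(thrust::cuda::par.on(stream),
                 d_values_in, d_values_in + num_items, d_values_scratch);
  }
  return cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                         d_keys_in, d_keys_out,
                                         d_values_scratch, d_values_out,
                                         num_items, 0, int(sizeof(KeyT) * 8), stream);
}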

@alliepiper
Collaborator

Thanks -- that's a good point.

The double buffer approach has some other issues, too -- the current implementation casts away the const-ness of the input and then writes to the const input memory as part of the double buffering. We should fix that when addressing this issue, since the fix would reuse the same code path.

@zasdfgbnm
Author

I am working on this in NVIDIA/cub#374

@jrhemstad jrhemstad added the cub For all items related to CUB label Feb 22, 2023
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
@miscco miscco removed their assignment Dec 6, 2023