This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Draft of segmented reduce optimization #578

Open
wants to merge 7 commits into
base: main


gevtushenko
Collaborator

This PR applies a technique similar to the one used in the segmented sort algorithm. Segments are partitioned into size categories, and a different thread group is applied to each category. While optimizing segmented reduction, I introduced a warp reduce agent and generalized the reduce agent implementation. Below are the speedups for small segment sizes; the best speedup is about 66x:
[Plot: small segments]

Medium-size segments experience minor slowdowns, but this can be addressed by further tuning:
[Plot: medium segments]

Large segments are not affected by the optimization:
[Plot: large segments]
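
For readers unfamiliar with the approach, here is a minimal host-side sketch of the partitioning idea described at the top of this comment. It is not the code in this PR: the thresholds `small_max` and `medium_max`, the `SegmentPartition` struct, and the `partition_segments` function are all hypothetical names and values chosen for illustration, and the real implementation partitions on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical thresholds; the tuned values used in the PR are not stated here.
constexpr int small_max  = 32;   // assumption: at most this many items counts as "small"
constexpr int medium_max = 1024; // assumption: at most this many items counts as "medium"

struct SegmentPartition
{
  std::vector<int> ids;       // segment ids ordered as: small | medium | large
  std::size_t num_small  = 0; // ids[0 .. num_small) are small segments
  std::size_t num_medium = 0; // the next num_medium ids are medium segments
};

// Groups segment ids by size category.
// `offsets` holds num_segments + 1 entries; segment i spans [offsets[i], offsets[i + 1]).
SegmentPartition partition_segments(const std::vector<int> &offsets)
{
  const std::size_t num_segments = offsets.empty() ? 0 : offsets.size() - 1;

  SegmentPartition result;
  result.ids.resize(num_segments);
  std::iota(result.ids.begin(), result.ids.end(), 0);

  auto segment_size = [&](int id) { return offsets[id + 1] - offsets[id]; };

  // First pass: move small segments to the front, keeping their relative order.
  auto small_end = std::stable_partition(result.ids.begin(),
                                         result.ids.end(),
                                         [&](int id) { return segment_size(id) <= small_max; });

  // Second pass: among the remaining segments, move medium ones ahead of large ones.
  auto medium_end = std::stable_partition(small_end,
                                          result.ids.end(),
                                          [&](int id) { return segment_size(id) <= medium_max; });

  result.num_small  = static_cast<std::size_t>(small_end - result.ids.begin());
  result.num_medium = static_cast<std::size_t>(medium_end - small_end);
  return result;
}
```

After partitioning, each range of `ids` could be handed to a kernel that uses a thread group suited to that category, which is the part that the warp reduce agent and the generalized reduce agent cover in this PR.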

The commits also include an attempt to fuse small-segment reduction with the partitioning stage. This variant doesn't perform as well. My guess is that it slows down the decoupled look-back in the partitioning stage or lowers its occupancy, which leads to an overall slowdown.

To avoid breaking stream capture (if it is in use), I added a separate check for it. We might need to exercise stream capture mode in our tests later.

@gevtushenko added the "P2: nice to have" (Desired, but not necessary) label on Sep 30, 2022
@gevtushenko
Collaborator Author

I experimented with a deterministic version of the large-segment optimization that assigns a number of CTAs to each segment. The optimization is quite expensive in terms of memory: it requires about num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)) bytes. The speedup disappears as soon as there are about 16 large segments (the exact number depends on the number of SMs), so I don't think it's worth it. Just in case, I pushed and then reverted this optimization.
[Plot: large segments]
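
To put the memory estimate above in concrete terms, here is a small sketch that evaluates it for a given accumulator type. The function name `deterministic_large_segment_bytes` is hypothetical and exists only for this illustration; the formula is the approximate one quoted in this comment, not an exact accounting of the temporary storage.

```cpp
#include <cstddef>

// Approximate extra temporary storage in bytes, using the estimate
// num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT)).
// Hypothetical helper, not part of the PR.
template <typename AccumulatorT>
constexpr std::size_t deterministic_large_segment_bytes(std::size_t num_segments)
{
  return num_segments * (sizeof(int) + 4 * sizeof(AccumulatorT));
}

// Compile-time sanity check with a 1-byte accumulator type.
static_assert(deterministic_large_segment_bytes<char>(3) == 3 * (sizeof(int) + 4),
              "estimate should scale linearly with the number of segments");
```

On a platform where `int` and `float` are both 4 bytes, `deterministic_large_segment_bytes<float>(1 << 20)` evaluates to 1,048,576 * 20 = 20,971,520 bytes (20 MiB), and the cost is paid for every segment even though only the large ones benefit.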
