[SYCL 2020] Add support for ridiculous SYCL 2020 reduction API #1453

Draft: wants to merge 6 commits into base: develop


@illuhad illuhad commented May 9, 2024

This PR adds support for the ridiculous SYCL 2020 reduction API. We already had some reduction support, but it was incomplete and, most importantly, implemented at the backend-specific level of the kernel launchers. As a result, the existing support did not scale to the more recently added backends (OpenCL and Level Zero) or to our current main compilation flow, the generic SSCP compiler.

Overall, the SYCL 2020 reduction API is quite ridiculous because it integrates a high-level feature (reductions) directly into the low-level kernel launch API, which creates all sorts of massive software engineering challenges purely by choice, not by necessity. The unconstrained generality of this feature also requires massive effort to cover all of the different cases. Nor does it give users who actually want or need it any control over critical behavior, such as scratch allocation and deallocation. So you're going to have to trust that your implementation does something reasonable.

This PR reimplements reductions at a higher level, and maps them to the reduction engine that was introduced for stdpar support.

In more detail:

  • Adds missing interfaces (reduction accepting a buffer as argument)
  • Adds type traits to check whether identities are known
  • Aligns SYCL functional objects with SYCL 2020 specification
  • Adds support for reductions over binary operators without a known identity
  • Adds optimization to map more work to each work item in high-level parallel for, similarly to what stdpar already does. This was not present in the old implementation and can provide massive performance improvements.
  • Uses the scratch allocation caching infrastructure that was introduced for stdpar; this should have massively lower overheads compared to the old reduction implementation.
  • Implements initialize_to_identity property and aligns default behavior with the SYCL 2020 specification by reducing on top of existing output values by default.
  • Most importantly, reductions now work with generic target on all backends.
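
To make a few of the points above concrete, here is a minimal host-only sketch in plain C++. It is not the code from this PR and it does not use any SYCL or AdaptiveCpp headers. All names (`has_known_identity_sketch`, `known_identity_sketch`, `reduce_coarsened`, `finalize_reduction`) are hypothetical and only loosely modeled on the SYCL 2020 concepts. It assumes an associative and commutative operator, simulates work items with a sequential loop, and shows three ideas: a trait that reports whether an identity is known, folding several elements per work item, and combining with the existing output value unless the caller asks for it to be discarded.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for (operator, type) pairs whose identity
// this sketch can name. Everything else counts as "identity unknown".
template <class Op, class T>
struct has_known_identity_sketch : std::false_type {};
template <class T>
struct has_known_identity_sketch<std::plus<T>, T> : std::is_arithmetic<T> {};
template <class T>
struct has_known_identity_sketch<std::multiplies<T>, T> : std::is_arithmetic<T> {};

// Hypothetical identity lookup, defined only for the pairs listed above.
template <class Op, class T>
struct known_identity_sketch;
template <class T>
struct known_identity_sketch<std::plus<T>, T> { static constexpr T value = T(0); };
template <class T>
struct known_identity_sketch<std::multiplies<T>, T> { static constexpr T value = T(1); };

// Starting value of a partial result: the identity if it is known,
// otherwise "nothing yet".
template <class T, class Op>
std::optional<T> initial_partial() {
  if constexpr (has_known_identity_sketch<Op, T>::value)
    return known_identity_sketch<Op, T>::value;
  else
    return std::nullopt;
}

// Each simulated work item folds a strided slice of the input, so one item
// handles many elements. The partial results are then combined.
// Assumes op is associative and commutative.
template <class T, class Op>
std::optional<T> reduce_coarsened(const std::vector<T>& in,
                                  std::size_t num_items, Op op) {
  assert(num_items > 0);
  std::optional<T> total;
  for (std::size_t item = 0; item < num_items; ++item) {
    std::optional<T> partial = initial_partial<T, Op>();
    for (std::size_t i = item; i < in.size(); i += num_items)
      partial = partial ? static_cast<T>(op(*partial, in[i])) : in[i];
    if (partial)
      total = total ? static_cast<T>(op(*total, *partial)) : *partial;
  }
  return total;
}

// Writes the result. By default the reduced value is combined with whatever
// is already stored in out; with initialize_to_identity the old value is
// discarded. Simplification of this sketch: if there is no reduced value
// (empty input, unknown identity), out is left untouched.
template <class T, class Op>
void finalize_reduction(T& out, const std::optional<T>& reduced, Op op,
                        bool initialize_to_identity) {
  if (!reduced) return;
  out = initialize_to_identity ? *reduced : static_cast<T>(op(out, *reduced));
}
```

For example, summing `{1, 2, 3, 4, 5}` into an output that already holds 100 gives 115 by default and 15 when `initialize_to_identity` is `true`. A max lambda has no entry in the trait, so it takes the `std::optional` path without ever needing an identity.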

Limitations:

  • Reduction overloads with span remain unimplemented. Implementing those might be possible via marray but it seems like a pretty pointless feature to me.
  • The case where insufficient local memory is available (e.g. due to a massive user-provided reduction data type) remains unhandled and unsupported. If you want to do a reduction over a data type that is like 1k in size, reevaluate your life choices.
  • Reductions over data types without a default constructor are unsupported. If you want to do this, I hereby ban you and your entire family line for 5 generations into the future, and retroactively, 5 generations in the past, from ever using AdaptiveCpp and programming in general.
  • Reductions with the omp.library-only target will face a substantial performance regression. This is because implementing reductions turned out to be feasible only with the simplifying assumption that we always have the ndrange kernel execution model. This model is, however, very inefficient on omp.library-only by design.
  • Because of the dependency of the implementation on the ndrange model, the previously existing (but barely documented or advertised) reduction support in the hierarchical and scoped kernel execution models had to be dropped.

Draft because this needs way more testing for all the different cases. It's only very lightly tested at the moment.


illuhad commented May 22, 2024

@fodinabor The new headers introduced in this PR cause clang 12 with omp.accelerated to segfault in SubCfgFormation.cpp:926. It seems some of the iterators there are null (?). When printing the BBs I also noticed that some don't seem to have a parent module, which may be related to this issue (?). Do you have an idea what might be going on?


@nilsfriess nilsfriess left a comment


Found two typos while trying to compile the CTS reduction test suite.
