
v2023.06.0

@rhornung67 rhornung67 released this 21 Aug 22:23
e5b2102

This release contains new features, bug fixes, and build improvements.

Please download the RAJAPerf-v2023.06.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

  • New features and usage changes:

    • User and developer documentation, formerly in the top-level README.md file (rendered on the GitHub project home page), has been expanded and moved to Sphinx documentation hosted on ReadTheDocs.
    • Caliper integration and annotations have been added in addition to Adiak metadata fields. Basic documentation is available in the User Guide. More documentation along with a tutorial will be available in the future.
    • Execution policies for many RAJA variants of GPU kernels were changed to take advantage of recent performance improvements in RAJA, where we make better use of compile-time knowledge of block sizes. This brings RAJA variants into closer alignment with GPU base variants (see the first sketch following this list).
    • The feature called 'Teams' has been changed to 'Launch' to be consistent with the RAJA feature.
    • A runtime option was added to change the memory space used for kernel data allocations. This allows us to compare performance using different memory spaces. Please see the user documentation or the '-h' help option output for details.
    • Warmup kernels were restructured so that only those relevant to the kernels selected to run are executed.
    • New kernels have been added:
      * Basic_COPY8, which allows us to explore what bandwidth looks like with more memory accesses per iteration
      * Apps_MASS3DEA, which represents local element matrix assembly operations in finite element applications
      * Apps_ZONAL_ACCUMULATION_3D, which has the same data access patterns as Apps_NODAL_ACCUMULATION_3D, but without the need for atomics
      * Basic_ARRAY_OF_PTRS, which involves a use case where a kernel captures an array and uses a runtime-sized portion of it. This pattern exhibits different performance behavior for CUDA vs. HIP (see the second sketch following this list).
      * Apps_EDGE3D, which computes the summed mass + stiffness matrix of low order edge bases (relevant to MHD discretizations)
    • Added new command line options:
      * '--align' which allows one to change the alignment of host memory allocations.
      * '--disable_warmup' which allows one to turn off warmup kernels if desired.
      * '--tunings' or '-t', which allows a user to specify which blocksize tunings to run for GPU kernel variants. Please see the '-h' help output for more information.
      * '--gpu_stream_0', which allows a user to switch between GPU stream zero and the RAJA default stream.
    • Also, the output of the '--help' or '-h' command line option was reorganized and improved for readability and clarity.
    • All 'loop_exec' RAJA execution policy usage has been replaced with the RAJA 'seq_exec' policy (see the third sketch following this list). The 'loop_exec' policy in RAJA is now deprecated and will be removed in the next non-patch RAJA release.
    • An environment variable 'RAJA_PERFSUITE_UNIT_TEST' has been added that allows one to select a single kernel to run via an alternative mechanism to the command line.
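
    A minimal sketch of the compile-time block size pattern mentioned above, assuming a CUDA build of RAJA; the kernel body, function name, and block-size values are illustrative, not RAJAPerf source:

    ```cpp
    // Illustrative only (not RAJAPerf source): a RAJA forall kernel that uses a
    // compile-time GPU block size. The policy RAJA::cuda_exec<block_size> bakes
    // the block size into the type, letting the compiler specialize the kernel
    // instead of treating the block size as a runtime value.
    #include "RAJA/RAJA.hpp"

    template < size_t block_size >
    void daxpy_cuda(double* y, const double* x, double a, int N)
    {
      // x and y are assumed to be device-accessible (e.g., device or unified memory)
      RAJA::forall< RAJA::cuda_exec<block_size> >(
          RAJA::RangeSegment(0, N),
          [=] RAJA_DEVICE (int i) {
            y[i] += a * x[i];
          });
    }

    // Explicit instantiations for two block-size tunings, selectable at runtime
    // in the Suite via the '--tunings' / '-t' option.
    template void daxpy_cuda<128>(double*, const double*, double, int);
    template void daxpy_cuda<256>(double*, const double*, double, int);
    ```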
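
    The Basic_ARRAY_OF_PTRS use case can be illustrated with a similar hedged sketch; the struct name, array bound, and kernel body below are hypothetical, chosen only to show a lambda capturing a fixed-size array of pointers by value while using a runtime-sized portion of it:

    ```cpp
    // Illustrative only (not RAJAPerf source): the lambda captures a fixed-size
    // array of pointers by value but only touches the first 'num_arrays' entries,
    // where 'num_arrays' is a runtime value. The whole array travels in the kernel
    // argument buffer, and CUDA and HIP show different performance for this pattern.
    #include "RAJA/RAJA.hpp"

    constexpr int max_arrays = 26;        // hypothetical compile-time upper bound

    struct PtrArray {
      double* ptrs[max_arrays];           // device-accessible pointers
    };

    void sum_arrays(double* y, PtrArray x, int num_arrays, int N)
    {
      RAJA::forall< RAJA::cuda_exec<256> >(
          RAJA::RangeSegment(0, N),
          [=] RAJA_DEVICE (int i) {       // 'x' (the full array) is captured by value
            double sum = 0.0;
            for (int a = 0; a < num_arrays; ++a) {   // runtime-sized portion
              sum += x.ptrs[a][i];
            }
            y[i] = sum;
          });
    }
    ```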
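
    The 'loop_exec' to 'seq_exec' change amounts to a one-line policy swap in sequential RAJA variants; a minimal illustrative example (not RAJAPerf source):

    ```cpp
    // Illustrative only (not RAJAPerf source): the sequential policy swap.
    #include "RAJA/RAJA.hpp"

    void daxpy_seq(double* y, const double* x, double a, int N)
    {
      // Before: RAJA::forall< RAJA::loop_exec >( ... )   (now deprecated)
      RAJA::forall< RAJA::seq_exec >(
          RAJA::RangeSegment(0, N),
          [=] (int i) {
            y[i] += a * x[i];
          });
    }
    ```
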
  • Build changes / improvements:

    • The RAJA submodule has been updated to v2023.06.1.
    • The BLT submodule has been updated to v0.5.3, which is the version used by the RAJA submodule version.
    • Moved the RAJA Perf Spack package to the RADIUSS Spack Configs project, where it will be curated and upstreamed to Spack, as is done for packages of other RAJA-related projects.
    • The Apps_COUPLE kernel has been removed from the default build because it was incomplete and required lightweight device-side support for complex arithmetic. It may be resurrected and re-added to the Suite at some point.
  • Bug fixes / improvements:

    • Fix an issue related to improper initialization of the reduction variable in OpenMP variants of the Lcals_FIRST_MIN kernel. Interestingly, the issue only appeared at the larger core counts available on newer multi-core architectures (a minimal illustrative sketch appears below).
    • Fix issue in Lcals_FIRST_MIN kernel where base CUDA and HIP variants were using an output array before it was initialized.
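
    A minimal sketch of the reduction-initialization bug class fixed in the OpenMP variants of Lcals_FIRST_MIN; the struct, reduction declaration, and kernel body are illustrative, not the actual RAJAPerf code:

    ```cpp
    // Illustrative only (not RAJAPerf source): a min-loc style OpenMP reduction.
    // The reduction variable must be seeded with a valid value/index from the data
    // before the parallel region; a stale or garbage initial value can otherwise
    // win the final combine on some thread counts.
    #include <omp.h>

    struct MinLoc { double val; int loc; };

    // Combine two candidates, preferring the smaller value and, on ties, the
    // smaller (earlier) index so the *first* minimum is reported.
    #pragma omp declare reduction(minloc : MinLoc :                            \
        omp_out = (omp_in.val < omp_out.val ||                                 \
                   (omp_in.val == omp_out.val && omp_in.loc < omp_out.loc))    \
                  ? omp_in : omp_out)                                          \
        initializer(omp_priv = omp_orig)

    int first_min(const double* x, int N)   // assumes N >= 1
    {
      MinLoc m { x[0], 0 };   // the fix: initialize the reduction variable from the data

      #pragma omp parallel for reduction(minloc : m)
      for (int i = 0; i < N; ++i) {
        if (x[i] < m.val) { m.val = x[i]; m.loc = i; }
      }
      return m.loc;
    }
    ```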