Releases: LLNL/RAJAPerf

v2023.06.0

21 Aug 22:23
e5b2102

This release contains new features, bug fixes, and build improvements.

Please download the RAJAPerf-v2023.06.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

  • New features and usage changes:

    • User and developer documentation, formerly in the top-level README.md file (rendered on the GitHub project home page), has been expanded and moved to Sphinx documentation hosted on ReadTheDocs.
    • Caliper integration and annotations have been added in addition to Adiak metadata fields. Basic documentation is available in the User Guide. More documentation along with a tutorial will be available in the future.
    • Execution policies for many RAJA variants of GPU kernels were changed to take advantage of recent performance improvements in RAJA, where we make better use of compile-time knowledge of block sizes. This brings RAJA variants into closer alignment with GPU base variants.
    • The feature called 'Teams' has been changed to 'Launch' to be consistent with the RAJA feature.
    • A runtime option was added to change the memory space used for kernel data allocations. This allows us to compare performance using different memory spaces. Please see the user documentation or the '-h' help option output for details.
    • Warmup kernels were restructured so that only those relevant to the kernels selected to run are executed.
    • New kernels have been added:
      * Basic_COPY8, which allows us to explore what bandwidth looks like with more memory accesses per iteration
      * Apps_MASS3DEA, which represents local element matrix assembly operations in finite element applications
      * Apps_ZONAL_ACCUMULATION_3D, which has the same data access patterns as Apps_NODAL_ACCUMULATION_3D, but without the need for atomics
      * Basic_ARRAY_OF_PTRS, which involves a use case where a kernel captures an array and uses a runtime sized portion of it. This pattern exhibits different performance behavior for CUDA vs. HIP.
      * Apps_EDGE3D, which computes the summed mass + stiffness matrix of low order edge bases (relevant to MHD discretizations)
    • Added new command line options (a usage sketch follows this list):
      * '--align' which allows one to change the alignment of host memory allocations.
      * '--disable_warmup' which allows one to turn off warmup kernels if desired.
      * '--tunings' or '-t', which allows a user to specify which block size tunings to run for GPU kernel variants. Please see the '-h' help output for more information.
      * '--gpu_stream_0', which allows a user to switch between GPU stream zero and the RAJA default stream.
    • Also, the '--help' or '-h' command line option output was reorganized and improved for readability and clarity.
    • All 'loop_exec' RAJA execution policy usage has been replaced with the RAJA 'seq_exec' policy (see the snippet after this list). The 'loop_exec' policy is now deprecated in RAJA and will be removed in the next non-patch RAJA release.
    • An environment variable 'RAJA_PERFSUITE_UNIT_TEST' has been added that allows one to select a single kernel to run via an alternative mechanism to the command line.
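
A usage sketch combining several of the new options follows. The flag spellings come from the notes above, but the executable name, the argument values, and whether each flag takes a value are assumptions; run with '-h' for the authoritative syntax.

```
# Run only the "block_256" GPU tunings (tuning name assumed), using GPU
# stream zero, 4096-byte-aligned host allocations, and no warmup kernels.
./bin/raja-perf.exe --tunings block_256 --gpu_stream_0 --align 4096 --disable_warmup

# Select a single kernel via the new environment variable instead of the
# command line (the value format shown is an assumption).
RAJA_PERFSUITE_UNIT_TEST=DAXPY ./bin/raja-perf.exe
```

The 'loop_exec' to 'seq_exec' change mentioned above amounts to a policy swap in each RAJA variant. A minimal sketch with a generic DAXPY-style loop (illustrative, not copied from the Suite sources):

```cpp
#include "RAJA/RAJA.hpp"

void daxpy(double* y, const double* x, double a, int N)
{
  // Previously RAJA::forall<RAJA::loop_exec>(...); loop_exec is now
  // deprecated in RAJA, so the Suite uses seq_exec instead.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
                               [=](RAJA::Index_type i) { y[i] += a * x[i]; });
}
```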
  • Build changes / improvements:

    • The RAJA submodule has been updated to v2023.06.1.
    • The BLT submodule has been updated to v0.5.3, which is the version used by the RAJA submodule version.
    • Moved the RAJA Perf Spack package to the RADIUSS Spack Configs project, where it will be curated and upstreamed to Spack like the packages for other RAJA-related projects.
    • The Apps_COUPLE kernel has been removed from the default build because it was incomplete and required lightweight device-side support for complex arithmetic. It may be resurrected and re-added to the Suite at some point.
  • Bug fixes / improvements:

    • Fix issue related to improper initialization of the reduction variable in OpenMP variants of the Lcals_FIRST_MIN kernel. Interestingly, the issue only appeared at the larger core counts available on newer multi-core architectures.
    • Fix issue in Lcals_FIRST_MIN kernel where base CUDA and HIP variants were using an output array before it was initialized.

v2022.10.0

12 Jan 22:47
57ee53e

This release contains new features, bug fixes, and build improvements.

Please download the RAJAPerf-v2022.10.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

Notable changes include:

  • Release version name change:

    • Following the naming scheme for coordinated RAJA Portability Suite releases, this release of the RAJA Performance Suite is v2022.10.0 to indicate that it corresponds to the v2022.10.x releases of RAJA and camp.

    • We've been doing coordinated releases of RAJA Portability Suite projects (RAJA, Umpire, CHAI, and camp) for a while, and we changed the version naming scheme for those projects to reflect that. For example, the version number for the last release of these projects is v2022.10.x, meaning the release occurred in October 2022. The intent is that the v2022.10.x project releases are consistent in terms of their dependencies and they are tested together. The 'x' patch version number is applied to each project independently if a bugfix or other patch is needed. Any combination of v2022.10.x versioned libraries should be compatible.

  • New features and usage changes:

    • Add CONVECTION3DPA finite element kernel.
    • Add basic memory operation kernels MEMSET and MEMCPY.
  • Build changes / improvements:

    • Improved CI testing, including using test infrastructure in RAJA to eliminate redundancies.
    • Fix 'make install' so that executable is installed as well.
    • Update all submodules to be consistent with the RAJA v2022.10.4 release, including RAJA itself.
  • Bug fixes / improvements:

    • Fix race condition in FIRST_MIN kernel (Thanks C. Robeck from AMD).
    • Fix broken OpenMP target variant of REDUCE_STRUCT kernel.
    • Fix MPI hang that occurred when rank zero did not enter a barrier because no path name was given for creating output directories.
    • Support long double in MPI all-reduce operations even when the MPI implementation does not support long double.
    • Fix message printing to be rank zero only.

v0.12.0

02 May 19:53
388c1d7

This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.

Please download the RAJAPerf-v0.12.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

Notable changes include:

  • New features / API changes:

    • Add command line options to exclude individual kernels and/or variants, and kernels using specified RAJA features. Please use '-h' option to see available options and what they do.
    • Add command line option to output min, max, and/or average of kernel timing data over number of passes through the suite. Please use '-h' option to see available options and what they do.
    • Added basic MPI support, which enables the code to run on multiple MPI ranks simultaneously. This makes analysis of node performance more realistic since it mimics how real applications exercise memory bandwidth, for example.
    • Add a new checksum calculation for verifying correctness of results generated by kernel variants. The new algorithm uses a weighting scheme that reduces bias toward later elements in the result arrays, and employs a Kahan sum to reduce error in the summation of many terms (see the sketch after this list).
    • Added support for running multiple GPU block size "tunings" of kernels so that experiments can be run to assess how kernel performance depends on block size for different programming models and hardware architectures. By default, the Suite will run all tunings when executed, but a subset of tunings may be chosen at runtime via command line arguments.
    • Add DIFFUSION3DPA kernel, which is a high-order FEM kernel that stresses shared memory usage.
    • Add NODAL_ACCUMULATION_3D and DAXPY_ATOMIC kernels which exercise atomic operations in cases with few or unlikely collisions.
    • Add REDUCE_STRUCT kernel, which tests compilers' ability to optimize load operations when using data arrays accessed through pointer members of a struct.
    • Add REDUCE_SUM kernel so we can more easily compare reduction implementations.
    • Add SCAN, INDEXLIST, and INDEXLIST_3LOOP kernels that include scan operations, and operations to create lists of indices based on where a condition is satisfied by elements of a vector (common type of operation used in mesh-based physics codes).
    • Following improvements in RAJA, removed unused execution policies in RAJA "Teams" kernels: DIFFUSION3DPA, MASS3DPA, MAT_MAT_SHARED. Kernel implementations are unchanged.
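
A minimal sketch of the two ideas behind the new checksum, a per-element weight and Kahan (compensated) summation, is shown below. The Suite's actual weighting scheme is not reproduced here; the weight used below is an illustrative assumption.

```cpp
#include <cstddef>

// Compensated (Kahan) summation of weighted elements: the compensation
// term c recovers low-order bits lost in each addition, reducing error
// when summing many terms.
double checksum(const double* data, std::size_t len)
{
  double sum = 0.0;
  double c = 0.0;  // running compensation
  for (std::size_t i = 0; i < len; ++i) {
    double w = 1.0 + static_cast<double>(i) / static_cast<double>(len);  // assumed weighting
    double y = w * data[i] - c;
    double t = sum + y;
    c = (t - sum) - y;  // the part of y lost in the addition to sum
    sum = t;
  }
  return sum;
}
```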
  • Build changes/improvements

    • Updated versions of RAJA and BLT submodules:
      • RAJA is at SHA-1 commit 87a5cac, a few commits ahead of the v2022.03.0 release; the post-release changes are used here for CI testing improvements.
      • BLT is at v0.5.0.
      See the release documentation for those libraries for details.
    • With this release, the RAJA Perf Suite requires C++14 (due to use of RAJA v2022.03.0).
    • With this release, the RAJA Perf Suite requires CMake 3.14.5 or newer.
    • BLT v0.5.0 includes improved support for ROCm/HIP builds. Although the CMAKE_HIP_ARCHITECTURES option for specifying the HIP target architecture is not available until CMake 3.21, it is supported in the new BLT version and works with all versions of CMake (see the configure sketch after this list).
  • Bug fixes/improvements:

    • Fixed index ordering in GPU variants of HEAT_3D kernel, which was preventing coalesced memory accesses.
    • Squashed warnings related to unused variables.
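
For HIP builds, the target architecture can then be specified at configure time as sketched below. The architecture value and source path are illustrative; ENABLE_HIP is the usual BLT option name.

```
cmake -DENABLE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a ../RAJAPerf
```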

v0.11.0

01 Sep 20:42
22ac1de

This release adds new kernels and new features, and resolves some issues. The new kernels exercise RAJA features that are not used in pre-existing kernels.

Please download the RAJAPerf-v0.11.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

Notable changes include:

  • Update RAJA submodule to v0.14.0 release.
  • Update BLT submodule to v0.4.1 release (the same one used in RAJA v0.14.0).
  • New kernels added:
    • 'Basic' group: MAT_MAT_SHARED, PI_ATOMIC, PI_REDUCE
    • 'Apps' group: HALOEXCHANGE, HALOEXCHANGE_FUSED, MASS3DPA
    • New 'Algorithm' group added, containing kernels SORT and SORTPAIRS
  • New Lambda_CUDA and Lambda_HIP variants added to various kernels to help isolate performance issues when they are observed.
  • Default problem size for all kernels is now ~1M so that problem size is consistent across all kernels. Please refer to the Suite documentation on the main GitHub page for a discussion of problem size definitions.
  • Execution of all GPU kernel variants has been modified (RAJA execution policies, base variant launches) to allow arbitrary problem sizes to be run.
  • New runtime options (see the example after this list):
    • Option to run kernels with a specified size. This makes it easier to run scaling studies with the Suite.
    • Option to filter kernels to run based on which RAJA features they use.
  • More kernel information output added, such as features, iterations per rep, kernels per rep, bytes per rep, and FLOPs per rep. This and other information is printed to the screen before the Suite is run and is also output to a new CSV report file. Please see Suite documentation on main GitHub page for details.
  • Additional warmup kernels enabled to initialize internal RAJA data structures so that initial kernel execution timings are more realistic.
  • Error checking for base GPU variants added to catch launch failures where they occur.
  • Compilation of RAJA exercises, examples, and tests is disabled by default. This makes compilation times much faster for users who do not want to build those parts of RAJA. These things can be enabled, if desired, with a CMake option.
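
A sketch of the kinds of runs the new options enable is shown below. The flag spellings '--size', '--outdir', and '--features' are assumptions for illustration, as is the executable name; run with '-h' for the actual option names.

```
# Scaling study: run the Suite at several problem sizes (flag names assumed).
for sz in 100000 1000000 10000000; do
  ./bin/raja-perf.exe --size ${sz} --outdir size_${sz}
done

# Run only the kernels that exercise a particular RAJA feature (name assumed).
./bin/raja-perf.exe --features Reduction
```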

v0.10.0

17 Nov 20:50
6bf725a

This release changes the way kernel variants are managed to handle cases where not all kernels implement all variants and where not all variants apply to all kernels. Future releases of the RAJA Performance Suite will include such kernels and variants. The README documentation visible on the main project page describes the new process to add new kernels and variants, which is a fairly minor perturbation to what existed previously.

Please download the RAJAPerf-v0.10.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

v0.9.0

04 Nov 21:34
064dd17

This release adds HIP variants (baseline and RAJA) for each kernel in the suite.

Please download the RAJAPerf-v0.9.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

v0.8.0

02 Nov 20:54
94c65b2

This release updates the RAJA submodule to v0.13.0 and the BLT submodule to match the version used in RAJA, and also fixes some issues.

The main changes in this release are:

  • Most RAJA::kernel execution policies used in nested-loop kernels in this suite have been updated to newer RAJA usage, in which 'Lambda' statements specify which arguments are used in each lambda (see the sketch after this list).
  • Fixes to the RAJA OpenMP target back-end allow all OpenMP target kernels in this suite to compile and execute properly.
  • Kernel variant fixes and a timing data fix pointed out by individuals who submitted issues.
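
A minimal sketch of the newer RAJA::kernel usage, in which the 'Lambda' statement names the segment arguments the lambda receives (policy and loop body are illustrative, not copied from the Suite):

```cpp
#include "RAJA/RAJA.hpp"

void init2d(double* a, int Ni, int Nj)
{
  using EXEC_POL = RAJA::KernelPolicy<
    RAJA::statement::For<1, RAJA::seq_exec,      // outer loop over segment 1 (j)
      RAJA::statement::For<0, RAJA::seq_exec,    // inner loop over segment 0 (i)
        // Lambda 0 receives the values of segments 0 (i) and 1 (j), in order.
        RAJA::statement::Lambda<0, RAJA::Segs<0, 1>>
      >
    >
  >;

  RAJA::kernel<EXEC_POL>(
    RAJA::make_tuple(RAJA::RangeSegment(0, Ni), RAJA::RangeSegment(0, Nj)),
    [=](RAJA::Index_type i, RAJA::Index_type j) {
      a[j * Ni + i] = static_cast<double>(i + j);
    });
}
```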

Please download the RAJAPerf-v0.8.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

v0.7.0

10 Feb 23:19
a6ef027

This release updates the RAJA submodule to v0.11.0 and the BLT submodule to v0.3.0.

Please download the RAJAPerf-v0.7.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

v0.6.0 Release

19 Dec 21:47
21e476f

This release contains two new variants of each kernel and several new kernels.

  • The new variants are sequential-lambda and OpenMP-lambda. Like the baseline variants, they do not use RAJA, but they use lambda expressions for the loop bodies. The hope is that these variants can help determine whether performance issues stem from RAJA internals or from compilers struggling to optimize code containing lambda expressions (see the sketch after this list).

  • New kernels appear in the Basic and Lcals kernel subsets.
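
A minimal sketch of how a lambda variant differs from a baseline variant (illustrative DAXPY-style body, not copied from the Suite sources):

```cpp
// Baseline sequential variant: a plain raw loop.
void daxpy_base(double* y, const double* x, double a, int N)
{
  for (int i = 0; i < N; ++i) { y[i] += a * x[i]; }
}

// Sequential-lambda variant: the same loop, but with the body captured in
// a lambda as it would be in the RAJA variant. This isolates any cost of
// optimizing through the lambda, without involving RAJA itself.
void daxpy_lambda(double* y, const double* x, double a, int N)
{
  auto body = [=](int i) { y[i] += a * x[i]; };
  for (int i = 0; i < N; ++i) { body(i); }
}
```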

Please download the RAJAPerf-v0.6.0.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.

v0.5.2 Release

31 Oct 20:13
2da5e27

This release contains updates to use RAJA v0.10.0 as well as some bug fixes.

Please download the RAJAPerf-v0.5.2.tar.gz file below. The others will not work due to the way RAJAPerf uses git submodules.