

# Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Baodi Shan

TotalEnergies EP Research & Technology US, LLC  
 Houston, Texas, USA  
 Stony Brook University  
 Stony Brook, New York, USA  
 baodi.shan@stonybrook.edu

Mauricio Araya-Polo

TotalEnergies EP Research & Technology US, LLC  
 Houston, Texas, USA

**Abstract**—Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPGPUs latest generations on relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for NVIDIA Ampere (A100) and Hopper (GH200) architectures. Performance results yield useful insights into the behavior of this type of algorithms for these new accelerators. This knowledge can be leveraged by many scientific applications which involve stencils computations. Further, evaluation of three different programming models: CUDA, OpenACC, and OpenMP target offloading is conducted on aforementioned accelerators. We extensively study the performance and portability of various kernels under each programming model and provide corresponding optimization recommendations. Furthermore, we compare the performance of different programming models on the mentioned architectures. Up to 58% performance improvement was achieved against the previous GPGPU’s architecture generation for an highly optimized kernel of the same class, and up to 42% for all classes. In terms of programming models, and keeping portability in mind, optimized OpenACC implementation outperforms OpenMP implementation by 33%. If portability is not a factor, our best tuned CUDA implementation outperforms the optimized OpenACC one by 2.1 $\times$ .

## I. INTRODUCTION

Currently, HPC-based on General Purpose Graphic Processing Units (GPGPUs) is mostly replacing Central Processing Units (CPUs)-based computing, thus becoming the primary source of computational power for supercomputing systems. In the latest TOP500 supercomputer rankings released in Nov. 2023, fifteen out of the twenty fastest systems use GPUs as accelerators for supercomputing. Therefore is relevant to continue updating and optimizing workloads that rely on this kind of accelerators.

The evolution of GPGPUs architectures is driven by different markets/applications needs, it is unknown which and to what extent the new features might be useful for specific numerical workflows, this is another reason to evaluate how well-established workflows map to new hardware. In this work’s case, the target application is subsurface characterization ([28]), through wave equation solving, which at the core

sports high-order stencil computations. Research on stencil computing continuously produces technical advances ([3], [9], [24], [25]) given its importance for many scientific and industrial applications([4]), from weather prediction ([5]) to earthquake modeling ([15]).

In this work, through performance evaluation under different programming models and data sizes, profiling analysis, and comparison of hardware parameters between GPGPUs, we provide suggestions for developers. Thus, this work has the following main contributions:

- By evaluating our stencil computation program on the NVIDIA GH200, we explored the potential impact of the new feature of CUDA programming model based on the latest generation of NVIDIA GPGPU, thread block cluster, and analyzed the causes of performance changes.
- Based on the NVIDIA Hopper architecture, we proposed optimizations for the stencil computation program and provided optimization suggestions for subsequent developers.
- On the comparison between OpenACC’s `async` and OpenMP’s `nowait`, we proposed a new optimization strategy for asynchronously executing parallel regions, enabling the code to fully utilize the ability of the GPGPUs multiple streams to execute concurrently. At the same time, we also proposed optimization strategies at the compilation level for the stencil computation program on the OpenACC programming model. Compared with the original code, the performance of the new code has been improved by up to 30%.
- We compared the performance and portability of three different GPGPU-based programming models, and evaluated and analysed the performance and changes of the three models on different generations of NVIDIA GPGPUs. Based on our evaluation results and analysis, we have provided corresponding suggestions to developers of scientific programs.
- We compared the power consumption of NVIDIA Am-

pere architecture and Hopper under three different programming models. This reflected the relationship between energy consumption and performance.

The paper is organized as follows: Section II introduces the concept of stencil computation. Section III describes the system overview and programming models used in our evaluations. Section IV showcases various implementations and optimization for the stencil computation program under the CUDA programming model, compares their performance on A100 and GH200, and provides optimization recommendations based on these results. Section V elaborates on the implementation of the the stencil computation program using OpenACC and OpenMP, presents our newest optimization scheme and its performance on the A100 and GH200. Section VI compared the portability of three different programming models, evaluated them on three generations of NVIDIA GPU's architectures and provides recommendations for GPU program developers on selecting programming models, considering both program portability and performance differences on NVIDIA Hopper architecture.

## II. STENCIL COMPUTATION AND RELATED WORK

The stencil pattern under analysis in this work helps to compute the differential operators required by Finite Difference (FD) scheme to solve an acoustic isotropic approximation of the wave equation. The base implementation was introduced in [14] and optimized versions tailored for GPGPUs in presented in [21]. The spatial part of the wave equation is discretized using a 25-point stencil in 3D (8<sup>th</sup> order in space), with four points in each direction as well as the centre point:

$$\nabla^2 \mathbf{u}(x, y, z) \approx \sum_{m=0}^4 c_{xm} [\mathbf{u}(i+m, j, k) + \mathbf{u}(i-m, j, k)] + \\ c_{ym} [\mathbf{u}(i, j+m, k) + \mathbf{u}(i, j-m, k)] + \\ c_{zm} [\mathbf{u}(i, j, k+m) + \mathbf{u}(i, j, k-m)] \quad (1)$$

where  $c_{xm}, c_{ym}, c_{zm}$  are discretization parameters and  $\mathbf{u}$  the wavefield.

Basically, the computing pattern in Equation 1 is computed per every grid point of the computational domain and used to update the wavefield per time iteration. For instance, for a 3D computing domain of size  $1000^3$ , the pattern in Equation 1 is computed 1E09 times per time step, if the simulation iterate 1000 time steps, then pattern is computed 1E12 times in total. Thus, Equation 1 represents the more computationally challenging step of solving the wave equation (by FD) since the memory access pattern overwhelms traditional memory hierarchies, due to its low re-use, and the sparse in-memory location of the required elements to compute the central point of the stencil. Further, the computational domain is surrounded by CPML-like ([11]) boundary condition, which is implemented as part of the inner domain loops.

Multiple works introduce implementations and optimization strategies for stencil computations, relevant to our work

references are described as follows. The strategy of overlapped tiling employs time skewing to amplify the arithmetic intensity of parallel stencil computations. This is done by exchanging redundant computation along the boundaries of overlapped tiles for a decrease in the required memory bandwidth [12], [7]. This approach is particularly effective on GPUs due to the high cost of loading data from a GPU's global memory compared to data-parallel computation. Moreover, redundant computation can be overlapped with data accesses to help conceal memory latency. While overlapped tiling has been demonstrated to enhance the performance of low-order stencils on GPUs, for high-order stencils, the redundant computation escalates rapidly when skewed across multiple time steps by the width of a high-order stencil. The time skewing approach, as discussed in [29], [30], enhances the performance of stencil computations by augmenting data reuse and cache locality. This is achieved by skewing one or more data dimensions by the time dimension, enabling the computation of several time steps for a tile while the values are retained in cache. This method has found extensive application in CPUs, as evidenced by [10] and [23]. An alternative strategy for accelerating computation with time skewing is split tiling, as described in [6]. Instead of using overlapped tiles, which can induce significant amounts of redundant computation, split tiling computes points in two distinct phases. The initial phase computes tiles in parallel as hypertrapezoids that taper along the time dimension. Once all tiles from the first phase have been computed, a second phase fills in the missing points in the time dimension.

Nguyen et al. [16] propose a 3.5D blocking algorithm that blends 2.5D spatial blocking with 1D temporal blocking. 2.5D spatial blocking involves blocking in a 2D plane and streaming along a third dimension to increase data reuse, storing active 2D planes in GPU shared memory. In the 3.5D variant, they use time skewing to advance the computation for multiple time steps before writing data back to the global memory. Although the 3.5D algorithm performs exceptionally well on CPUs, the 1D temporal blocking introduces two potential implementation challenges for high-order stencils with boundary conditions on GPUs: barrier synchronizations and limited parallelism.

## III. EVALUATION SETUP

### A. System Overview

The primary testing platform is a NVIDIA Grace Hopper Superchip with NVIDIA GH200. In the comparative evaluations, we also utilized computing nodes equipped with A100 and T4. Refer to Table I for the hardware and software specifications of the systems.

### B. New in NVIDIA Hopper Architecture

The NVIDIA Hopper Architecture introduces a new memory indexing layer, termed "thread block cluster", which needs to be analyzed properly. The motivation of this feature is to facilitates data locality control at a granular level exceeding a

TABLE I: System Specifications

|                  | GH200        | A100          | T4                    |
|------------------|--------------|---------------|-----------------------|
| CPU              | NVIDIA Grace | AMD EPYC 7F52 | Intel(R) Xeon(R) 5217 |
| GPU              | NVIDIA GH200 | NVIDIA A100   | NVIDIA T4             |
| Cores            | 16896        | 6912          | 2560                  |
| GRAM             | 96 GB        | 40 GB         | 16 GB                 |
| Memory Bandwidth | 4TB/s        | 1555 GB/s     | 320 GB/s              |
| CUDA Compiler    | 12.2         | 12.0          |                       |
| OpenACC Compiler |              | NVHPC (nvc)   |                       |
| OpenMP Compiler  |              | NVHPC (nvc)   |                       |
| GPU Driver       | 535.129.03   | 525.105.17    |                       |
| Compiler Arch    | sm_90        | sm_80         | sm_75                 |



(a) NVIDIA GPGPU without thread block clusters



(b) NVIDIA Hopper with thread block clusters

Fig. 1: Thread-block-to-thread-block data exchange

solitary thread block on an individual Streaming Multiprocessor (SM). It augments the CUDA programming model, incorporating an extra hierarchical level, which includes threads, thread blocks, thread block clusters, and grids. Thread block clusters facilitate simultaneous execution of numerous thread blocks over several SMs, fostering synchronization, and cooperative data retrieval and exchange. Figure 1 illustrates various data exchange schemes occurring in NVIDIA GPGPUs, both with and without thread block clusters.

### C. Programming Models

**CUDA** is a parallel computing platform by NVIDIA [17]. It allows developers to use a CUDA-enabled GPU for general purpose processing. In CUDA programming, developers write functions, known as “kernel”, in a language similar to C/C++, which are then executed on the GPU for faster computations, especially beneficial for tasks like matrix operations, image processing, and machine learning.

**OpenACC** is a programming standard designed to simplify parallel computing, particularly for programming with accelerators like GPUs. It offers a directive-based programming model that lets developers indicate which parts of the code should run in parallel by inserting directives, without needing to specify how to parallelize in detail. The primary goal of

OpenACC ([19]) is to provide a simple way to optimize and parallelize applications with minimal effort from the developer.

**OpenMP** is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. OpenMP ([8], [20]) provides a simple and flexible interface for developing parallel applications for platforms ranging from desktop computers to supercomputers. OpenMP’s target directive is part of its support for offloading computation to accelerators, such as GPUs [1], [13], [26], [27]. Currently, in the implementation of OpenMP in LLVM, the OpenMP target offloading support NVIDIA GPU, AMD GPU, Intel Phi and remote devices [22].

## IV. EXPLORATION AND OPTIMIZATION OF CUDA IMPLEMENTATIONS

The first GPU programming model we explore is CUDA. Our program encompasses a diverse array of kernel implementations, each of which has been extensively described in Sai et al. [21]. In the present document, we confine our discussion to their capacity to accommodate to nuanced idiosyncrasies inherent to distinct generations of GPGPUs. We selected the best kernels in-class from it. The selection criteria and detailed description of these CUDA kernels will be discussed in subsection IV-A

This section elaborates on the following aspects: Firstly, we explore several implementations of kernels under the CUDA model. Then, we present the performance and profiling results of each kernel implementation on NVIDIA Ampere and NVIDIA Hopper Architectures. Next, we discuss the following issues separately and provide corresponding suggestions: 1. The impact of the newly introduced Thread Block Cluster on program performance on NVIDIA Hopper and its causes. 2. The differences in the performance of the best kernels on NVIDIA Ampere and NVIDIA Hopper Architectures and their causes.

### A. Evaluated CUDA Kernels

As mentioned in the section II, we will focus on analyzing the following four implementations: *gmem*, *smem*, *st\_reg\_fixed*, and *st\_semi*.

As delineated in Section II, our data domain encompasses two distinct regions: the inner region (computational region) and the Perfectly Matched Layer (PML) boundaries. The inner region constitutes a cubic grid situated at the core of the data domain, while the PML region signifies the volume interposed between the inner region and the boundaries of the data domain. Figure 2 shows the data domain decomposition strategy in our stencil computation program. In our discussion of implementation strategies and performance characteristics on the CUDA model, our primary focus is on the performance of the inner region.

In our various implementations, we primarily include the following strategies. The first is “3D Blocking Using Global Memory Only” which is denoted as *gmem*; the second strategy is “3D Blocking Using Shared Memory,” which is denoted as *smem*; the third strategy is “2.5D Streaming Fixed Registers,”



Fig. 2: Data domain decomposition.



Fig. 3: Blocking strategies: (left) 3D blocking, (right) 2.5D blocking.

which we denote as *st\_reg\_fixed*; the fourth strategy is “2.5D Streaming Semi-stencil,” which was initially introduced for CPUs ([2]), which purpose is to increase memory reuse, it is denoted as *st\_semi*.

In order to better illustrate the different optimization strategies, we first introduce the two blocking strategies adopted in our stencil computation program. Our stencil computation program essentially deals with a 3D problem, so the basic blocking strategy is 3D blocking. In 3D Blocking, each data region is segmented into 3D blocks aligned with the axis. In order to determine the optimal block dimensions, we use fixed values in each run to simplify trials with varying values. For stencil computations on GPUs, each block is associated with a kernel launch with a 3D thread block, the thread dimensions of which match the block dimensions. All points within the block and their halos are explicitly copied into the GPU’s on-chip memory prior to any kernel initiation. While in 2.5D blocking strategy, we divide the data domain along the inner two dimensions, X and Y, and conduct a streaming computation along the outermost Z dimension. We initiate kernels with 2D thread blocks, the dimensions of which correspond to the 2D planes. Difference of these two blocking strategies could be seen in Figure 3.

In our kernel descriptions, the symbol  $R$  represents the width of the halo, which is half the spatial order of the stencil. In the acoustic isotropic simulations conducted in our experiments, the value of  $R$  is set to 4. Additionally, let

$N_x$ ,  $N_y$ , and  $N_z$  denote the extents of the input data region along the  $X$ ,  $Y$ , and  $Z$  axes, respectively. In the context of 3D blocks, the notation  $(x, y, z)$  is used to indicate the 3D coordinates, representing both a point location within a 3D block and the thread within a kernel thread block. Similarly, for 2D planes, the notation  $(x, y)$  is employed to specify a point in the 2D plane and identify the corresponding thread.

In the mentioned strategies, *gmem* means 3D blocking using global memory only. During the execution of the 25-point stencil kernel, each thread retrieves its own point and 4 neighboring points along each direction of every axis. To optimize performance, we ensure a favorable memory access pattern when fetching stencil points directly from global memory. This is achieved by storing the 3D grid data as a flat 1D array. Specifically, we prioritize global memory coalescing for the innermost dimension,  $X$ , to enhance data retrieval efficiency.

The *smem* is utilized to denote the methodology of 3D blocking via shared memory specifically for the array  $u$ , the definition of which is explicated in Equation 1. This strategy is a modification of the previously discussed 3D blocking employing global memory. It retains the same 3D blocking strategy for each data region. However, in contrast to performing computations directly on data procured from the global memory, this approach retrieves the  $u$  array from the global memory, stores it in the shared memory, and conducts the stencil computation on data procured from the shared memory. The total quantity of points procured in this scenario is  $D_x \times D_y \times D_z$  for a block and  $(D_x \times D_y + D_x \times D_z + D_y \times D_z) \times R \times 2$  for halos surrounding the block. In the context of high order stencils, it is imperative to consider the halo size to ascertain that both the block and the halo are accommodated within the shared memory.

Pertaining to the “Fixed Registers” strategy, it represents a variant from the “Shifting Registers” approach, a 2.5D streaming approach. In the “Shifting Registers” strategy, we maintain the points of the current  $XY$ -subplane in shared memory. However, as we stream along the  $z$ -axis, we utilize registers for the points along the  $z$ -axis. Unlike shared memory, where data procured from one thread is accessible by other threads within the same block, registers are solely accessible by the current thread. Given that we are streaming along the  $z$ -axis, the data from the  $z$ -axis loaded for one thread is not required by other threads. Consequently, we allocate a shared memory space to accommodate  $(Dx+2R) \times (Dy+2R)$  points for the currently active plane. The shared memory footprint in comparison to the previous method is  $1 : (2R + 1)$ . For high-order stencils,  $R$  is large, thereby leading to a significant reduction in shared memory usage. However, in the “Fixed Registers” strategy, we ascertain that the values in the registers remain “fixed” rather than “shifted” and update the necessary data via read and write operations between shared memory and global memory.

As for “Semi-stencil” [12], it separates the computation into two phases: forward and backward. The algorithm reads  $R+1$  points in one dimension and performs calculations as if they were the left side of the stencil in the forward phase, storing



Fig. 4: Roofline plot of CUDA kernels on A100

partial results. In the backward phase, the points are treated as the right side of the stencil, and the final result is written back. This approach changes the load-to-store ratio, reducing the number of loads. This implementation further integrates with a 2.5D streaming approach.

#### B. Performance Results and Profiling on CUDA Kernels

The grid size used for all the experiments in this section is  $1024^3$  and the number of time iterations is 1000. Table II shows execution time for different CUDA kernels, due to the requirement that the size of thread block cluster dimension must be a multiple of the GPU block in the corresponding dimension, we have chosen different dimension values for different CUDA kernels.

Table III and Table IV show profiling results, the tables focus on memory and computation information respectively. We selected the fastest version with the thread block cluster feature under each CUDA implementation and marked it as “*opt*”.

We will focus on the relationship between changes in GPU memory read and write volume and throughput in subsubsection IV-C1, we specifically annotated the changes in read and write volume and throughput brought about by the thread block cluster feature and represented these changes in parentheses.

Finally, Figure 4 and Figure 5 show the roofline plot of the CUDA kernels on A100 and GH200. In order to avoid confusion due to large amount of information in the roofline model graph, the information per kernel represent the average of the three different memory levels. In the roofline plot of GH200, the position of points is noticeably closer to the upper-left corner compared to the plot of A100. This signifies



Fig. 5: Roofline plot of CUDA kernels on GH200

higher computational performance and efficiency. With the analysis from profiling, it can be concluded that the memory performance enhancement brought by the Hopper architecture significantly aids memory-bound programs like stencil computation. Additionally, apart from the *gmem* kernel, the other kernels and their corresponding optimized versions exhibit substantial overlap. This observation further substantiates the limited performance improvement brought about by thread block clustering for highly optimized code of this nature.

#### C. Analysis and Recommendations on Performance Differences among Different Scenarios

*1) CUDA kernel with and without thread block clusters on NVIDIA Hopper Architecture:* As introduced in the NVIDIA Hopper white paper [18], the primary purpose of the new NVIDIA Hopper Architecture’s thread block cluster is to optimize data transfers between adjacent thread blocks, thereby enhancing performance. This new feature positively impact should be mainly reflected in memory operations, not in computational performance. The results in Table III and Table IV exactly confirm this point. According to Table III, it can be seen that for *gmem*, *reg*, and *semi* kernels, when the thread block cluster is adopted, the read and write volume of the GPU memory significantly decreases, while the L1 and L2 cache hit rate increase.

The exception is the kernel using shared memory, where the L1 cache hit rate doesn’t increase but decrease. But in this case, the benefits introduced by the improvement in GPU memory read and write operations are sufficient to lead this specific kernel to an enhanced performance.

However, overall, the introduction of thread block clusters in our CUDA kernel implementation for the stencil computation

TABLE II: Execution time for selective CUDA kernels on NVIDIA Ampere, NVIDIA Hopper without and with thread block cluster. The numbers in angle brackets indicate the dimension of the thread block cluster. Figures in bold font represents the best performance in each column. The number in parentheses represents the performance improvement ratio of the optimal kernel in the column compared to the optimal kernel of A100.

| Kernel       | A100 [s]      | GH200,without thread block cluster [s] | GH200, with thread block cluster [s]                                |
|--------------|---------------|----------------------------------------|---------------------------------------------------------------------|
| gmem         | 27.058        | 11.710                                 | <1,3,1>12.041<br><1,1,3>11.315                                      |
| smem         | 23.740        | 11.635                                 | <1,3,1>12.300<br><1,1,3>11.751                                      |
| st_reg_fixed | 19.908        | <b>9.445(41.5%)</b>                    | <1,2> <b>9.434(41.6%)</b><br><1,4>9.733<br><2,1>9.455<br><4,1>9.776 |
| st_semi      | <b>16.154</b> | 9.654                                  | <1,2>10.435<br><1,4>10.757<br><2,1>10.799<br><4,1>11.585            |

TABLE III: Profiling results of memory information of CUDA kernels on A100 and GH200. The numbers in the brackets represent the percentage change in memory throughput and device memory between adjacent data.

| Kernel       | Device    | Memory Throughput Rate | Memory Throughput(GB/s) | L1 Hit Rate | L2 Hit Rate | Device Memory Read/Write(GB) |
|--------------|-----------|------------------------|-------------------------|-------------|-------------|------------------------------|
| gmem         | A100      | 52.45%                 | 815.59                  | 74.95%      | 47.01%      | 22.15                        |
|              | GH200     | 59.57%                 | 2400.00                 | 74.91%      | 33.05%      | 22.14                        |
|              | GH200-opt | 44.67%                 | 1800.00                 | 75.11%      | 44.67%      | <b>17.23(-22.18%)</b>        |
| smem         | A100      | 60.88%                 | 946.96                  | 11.64%      | 47.28%      | 22.44                        |
|              | GH200     | 55.57%                 | 2240                    | 11.61%      | 33.45%      | 22.42                        |
|              | GH200-opt | 40.65%                 | 1640                    | 12.20%      | 46.30%      | <b>17.42(-22.31%)</b>        |
| st_reg_fixed | A100      | 39.87%                 | 621.87                  | 32.09%      | 56.07%      | 34.40                        |
|              | GH200     | 27.80%                 | 1110                    | 27.98%      | 59.36%      | 23.55                        |
|              | GH200-opt | 27.28%                 | 1100                    | 27.28%      | 60.59%      | <b>23.06(-2.08%)</b>         |
| st_semi      | A100      | 33.53%                 | 519.70                  | 33.46%      | 52.36%      | 26.12                        |
|              | GH200     | 23.63%                 | 950.54                  | 34.46%      | 46.71%      | 22.39                        |
|              | GH200-opt | 23.26%                 | 935.73                  | 34.47%      | 48.16%      | <b>21.91(-2.14%)</b>         |

did not bring significant performance benefits. In our CUDA code, the shared memory located in the thread blocks is allocated and planned with fine granularity. We have minimized the data exchange between shared memory in different thread blocks, and almost every thread block performs computations only from its corresponding shared memory. Therefore, the benefits of thread block clusters are negligible.

2) *Optimal (“opt”) implementation and its contributing factors:* According to Table II, on NVIDIA Hopper Architecture, the optimal implementation is *st\_reg\_fixed*, while for Ampere, the optimal one is *st\_semi*.

The key factor behind this is the improved read and write management. In *st\_semi*, the advantage of the strategy lies in optimizing the balance between loading and storing. The data primarily resides in shared memory, and the benefits provided by the new generation GPU are mainly related to improvements in shared memory. On the other hand, *st\_reg\_fixed* relies on data exchanges between registers, shared memory, and global memory. Therefore, based on the improvement observed in NVIDIA Hopper compared to NVIDIA Ampere as mentioned in [18], there are significant improvements in shared memory size, memory read/write speed, and register

size. As a result, *st\_reg\_fixed* benefits more than *st\_semi*.

From Table III, it can be observed that on the GH200, the memory read and write operations of *st\_reg\_fixed* exhibit a decrease of 31.54% compared to A100. However, this decrease is only 14.28% in the case of *st\_semi*, despite a similar improvement in throughput for both cases. This result explains why the performance of the *st\_reg\_fixed* kernel is more advantageous on the GH200 as compared to the *st\_semi* version. On GH200, the difference in performance due to differences about memory is even more pronounced.

Based on the analysis above, our recommendation for CUDA developers is to follow a strategy that fully exploits memory hierarchy, rather than only relying on block clustering, unless their algorithm has obvious locality patterns to be exploited.

## V. EXPLORATION AND OPTIMIZATION OF OPENACC AND OPENMP TARGET OFFLOADING

In this section, we will present and analyze the evaluation results of our stencil computation program on two general parallel computing programming models, OpenACC

TABLE IV: Profiling results of compute information of CUDA kernels on A100 and GH200.

| Kernel       | Device    | Compute Throughput | SM Frequency  | Elapsed Cycles | Theoretical Occupancy |                     | Achieved  |                     |
|--------------|-----------|--------------------|---------------|----------------|-----------------------|---------------------|-----------|---------------------|
|              |           |                    |               |                | Occupancy             | Active Warps Per SM | Occupancy | Active Warps Per SM |
| gmem         | A100      | 57.26%             | 764,838,253   | 20,763,880     | 100                   | 64                  | 88        | 56                  |
|              | GH200     | 77.86%             | 1,528,630,472 | 14,133,370     | 100                   | 64                  | 82        | 52                  |
|              | GH200-opt | 75.05%             | 1,528,421,734 | 14,666,940     | 100                   | 64                  | 76        | 49                  |
| smem         | A100      | 67.21%             | 765,095,453   | 18,110,315     | 100                   | 64                  | 92        | 59                  |
|              | GH200     | 71.14%             | 1,529,208,988 | 15,339,753     | 100                   | 64                  | 87        | 55                  |
|              | GH200-opt | 67.00%             | 1,527,931,405 | 16,294,783     | 100                   | 64                  | 81        | 52                  |
| st_reg_fixed | A100      | 32.29%             | 767,236,890   | 42,435,694     | 50                    | 32                  | 47        | 30                  |
|              | GH200     | 36.22%             | 1,529,787,652 | 32,207,850     | 50                    | 32                  | 47        | 30                  |
|              | GH200-opt | 36.29%             | 1,529,783,280 | 32,144,449     | 50                    | 32                  | 46        | 30                  |
| st_semi      | A100      | 34.85%             | 762,292,692   | 38,308,107     | 50                    | 32                  | 48        | 31                  |
|              | GH200     | 31.58%             | 1,529,861,717 | 36,041,202     | 50                    | 32                  | 49        | 31                  |
|              | GH200-opt | 31.77%             | 1,529,839,704 | 35,830,759     | 50                    | 32                  | 48        | 31                  |

```
#pragma acc parallel
#pragma acc loop gang vector collapse(3)
for (llint i = x3; i < x4; ++i) {
    for (llint j = y3; j < y4; ++j) {
        for (llint k = z3; k < z4; ++k) {
            # Computation
        }
    }
}
```

Fig. 6: OpenACC code for inner region

```
#pragma omp target teams distribute \
parallel for simd collapse(3)
for (llint i = x3; i < x4; ++i) {
    for (llint j = y3; j < y4; ++j) {
        for (llint k = z3; k < z4; ++k) {
            # Computation
        }
    }
}
```

Fig. 7: OpenMP target offloading code for inner region

and OpenMP. We will first demonstrate the different implementations of these two programming models and the new optimization strategies we have proposed for this stencil computation program. Then, we will provide the evaluation and analysis of the profiling on NVIDIA Ampere and NVIDIA Hopper Architectures.

#### A. Implementations and Optimization on OpenACC and OpenMP Target Offloading

We adopted the same data domain decomposition as shown in Figure 2, which is identical to that under the CUDA model. The unoptimized inner region implementations of the program in OpenACC and OpenMP are similar, both implementing parallelism at the outermost loop of the 3D data as shown in Figure 6 and Figure 7. The OpenACC implementation uses the directive `#pragma acc loop gang vector collapse(3)`, while the OpenMP target offloading implementation uses `#pragma omp target teams distribute parallel for simd collapse(3)`.

However, both of these implementations face issues of memory bandwidth bottlenecks and an inability to hide latency, problems which asynchronous computation can effectively address. In CUDA, we typically use the concept of streams to implement asynchronous instructions provided by the host. In OpenACC and OpenMP, we cannot manage streams with the same fine granularity as in CUDA. Alternatively, both OpenACC and OpenMP provide methods for implementing stream asynchronicity. For OpenACC, the `async` clause can

be used to specify the CUDA stream ID. In the case of OpenMP, the `nowait` clause can be used to dispatch target tasks to the GPU, allowing the GPU to manage its own scheduling. In this process, each target task will be executed asynchronously.

1) *Runtime Optimization 1: Fine-Grained Concurrent Kernels:* In OpenACC, we adjust the parallel region from the outermost loop to the second layer loop, and then use the third layer loop to iterate over the stream IDs, thereby distributing the computation of the parallel region to different streams. This approach allows for different parallel regions to be executed asynchronously, particularly enabling memory read/write tasks and compute tasks from different parallel regions to effectively hide latency caused by insufficient bandwidth. However, adjusting the original parallel region from the third layer loop to the second layer loop means that the number of times the parallel region is executed, i.e., the number of CUDA kernels, has increased by several orders of magnitude compared to the original version (the specific increase depends on the size of the data). This increase in number implies an increase in kernel launch or creation of CUDA streaming overhead, thus posing a potential risk of performance degradation.

Due to the limited number of CUDA kernels that NVIDIA GPUs can execute simultaneously, excessive CUDA streaming does not lead to greater overlap, meaning it does not bring effective concurrency. Therefore, we limited the number of CUDA streams available in OpenACC to 2 and ensured that the CUDA Stream IDs between two adjacent kernels



Fig. 8: Tracing results for OpenACC (Grid Size:  $1024^3$ , 1000 iterations)



Fig. 9: Tracing results for OpenMP target offloading (Grid Size:  $1024^3$ , 1000 iterations)



Fig. 10: Performance of OpenACC with and without *async* on A100

are different. In this way, we can reduce the overhead of launching streams while ensuring concurrency. The tracing results provided by the NVIDIA Nsight System is shown in Figure 8. It can be seen that the program mainly uses two CUDA streams to perform the inner part of the computation, while other streams handle boundary computation.

Figure 10 shows performance of the *async* version of OpenACC and the original version of OpenACC under different grid sizes. In order to better reflect the computing power of the stencil computation program under different grid sizes, we did not use the number of floating-point operations per second (FLOPS) as the evaluation standard, but chose the number of grids processed per second (grid/s) as the performance reference indicator. It can be seen that when the grid size is large, the original version reach the performance bottleneck and the throughput does not increase with the grid size. But the asynchronous execution of parallel regions can effectively hide the latency caused by memory read/write and kernel launch, thereby improving performance.

In OpenMP target offloading, there is no directive equivalent to that in OpenACC that can specify CUDA streams.

The asynchronous execution can only be indirectly achieved through the *nowait* clause.

Initially, the same code optimization as in the OpenACC version was used, i.e. reducing the target task to the second layer loop and making it execute asynchronously. However, compared with OpenACC, the reduced OpenMP kernel does not yield any performance improvement. On the contrary, we found that the same “small kernel” can be executed in 35us with OpenACC, while OpenMP requires more than 101us to complete. In terms of total kernel time, the OpenMP code using this optimization strategy has no improvement at all, but rather a significant decrease. The Table V shows the performance comparison between OpenMP kernel and OpenACC kernel under the same kernel. It can be seen that OpenMP target offloading shows significantly weaker performance when facing this relatively small kernel.

On the one hand, this behavior mainly comes from the differences in how OpenACC and OpenMP handle the working mechanisms of CUDA kernels. In OpenACC, the kernel code region is compiled into machine code by the compiler and then directly run on the NVIDIA GPU. However, the OpenMP target offloading region includes some necessary device runtime when running on the NVIDIA GPU. In OpenMP, this runtime code includes but is not limited to kernel launching, fine-grained memory management, and potential multi-device support. Particularly, with regards to fine-grained memory management, OpenMP target offloading supports loading necessary data into the shared memory of the NVIDIA GPU, which OpenACC does not support. This runtime support provided by OpenMP can effectively improve the program’s running efficiency in some cases, but it often backfires for the small kernels in this optimization scheme. This is because the additional runtime overhead on the GPU is hard to be hidden by the kernel code, and thus the runtime overhead becomes an unavoidable part of the execution time.

On the other hand, OpenMP target offloading has a non-negligible overhead for the creation and deletion of CUDA streams. Since the *nowait* clause cannot specify the CUDA stream ID, OpenMP target offloading can create at most as many CUDA streams as there are kernels. Figure 9 shows the tracing result of the OpenMP target offloading program in the Nsight System. It can be seen that the program has launched

TABLE V: The kernel performance of asynchronous execution on the second layer loop in OpenMP and OpenACC.

|         | Kernel Time(us) | Cycles | Performance(GFLOPs) | Arithmetic Intensity |
|---------|-----------------|--------|---------------------|----------------------|
| OpenMP  | 101.18          | 109660 | 381.254             | 0.88                 |
| OpenACC | 35.74           | 38805  | 1079.255            | 0.88                 |

nearly 1000 CUDA streams, which brought significant overhead.

2) *Runtime Optimization 2: Coarse-Grained Concurrent Kernels:* The second performance optimization approach is to keep the original size of the target task unchanged, and make the inner loop and the boundary task execute asynchronously. Since the boundary task is significantly smaller than the inner task, the performance improvement yield by this optimization method is relatively small.



Fig. 11: Performance of OpenMP Target Offloading with and without *nowait* on A100

The Figure 11 shows the changes in execution time of the original version and the asynchronous version under different grid sizes. Unlike the optimization approach of reducing the kernel to make it execute asynchronously in OpenACC, this optimization method does not change the kernel itself. Therefore, in subsequent discussions, we will only discuss the optimized kernel and no longer distinguish between the two versions of the kernel.

3) *Compilation Optimization:* In addition, we further improved the performance of OpenACC programs by optimizing the number of registers. The number of registers affects the performance of GPU programs. On the one hand, excessive use of registers may limit the number of parallel threads, thereby affecting performance; on the other hand, when there are not enough registers to store all variables, some variables need to be stored in local memory (which is part of global memory), which we refer to as spillover storage and loading. Excessive spillover storage and loading will also lead to a decrease in

performance. Table VI shows the impact of different register caps on the performance of OpenACC-*async* programs.

#### B. Performance and Profiling Results on OpenACC and OpenMP

Table VII shows execution time for OpenACC, OpenACC-*async*, OpenMP on A100 and GH200. The grid size used in the experiments is  $1024^3$ , and the kernel allocates in total about 22.1 GB GPU memory. The number of the experiments time steps is 1000.

Table VIII and Table IX show the profiling of the memory information and computation information respectively. Same as Table II, these two profiling results are based on grid size of  $1024^3$  and 1000 iterations.

#### C. Analysis of OpenACC and OpenMP on A100 and GH200

The performance gap between OpenMP and OpenACC originates both from memory management and from differences in computational efficiency. As shown in Table VIII, the memory throughput of OpenMP is significantly less than that of the OpenACC version, and yet the volume of memory read/write operations is not less than that of the OpenACC version, thus, memory read/write operations will introduce significant latency. As shown in Table IX, the elapsed cycles of OpenMP kernel are also significantly higher than the OpenACC version.

In addition, in OpenACC, the program can relatively automatically configure GPU-related settings, such as automatically adjusting the GPU's grid size according to the size of the data. OpenMP developers can manually select the grid size through `num_teams`. However, compared to OpenACC's automatic setting of Grid Size, choosing a Grid Size is difficult, and the optimal value often changes with the size of the data. In Figure 10, the curve of OpenACC is smoother than the curve of OpenMP in Figure 11, which could prove that OpenACC can make fuller and more reasonable use of GPU resources.

## VI. PROGRAMMING MODELS COMPARISON

#### A. Portability of Different Programming Models

In terms of performance alone, within the NVIDIA GPU-based programming models, the CUDA model, thanks to its fine-grained control over instructions and memory, as well as the vendor-specific implementation of certain instructions (such as `__fmaf_rn`), is far ahead when optimized. However, the development of scientific computing programs needs to take other costs into consideration, where portability and development difficulty cannot be ignored. Table X shows the line counts of different kernel implementations, and it can be seen that OpenACC and OpenMP Target Offloading far surpass the CUDA model in terms of portability. At the

TABLE VI: Register spills and performance differences caused by different register limits in OpenACC

| Max Register | Stack Frame | Spill Stores | Spill Loads | $1024^3$         | $512^3$           |
|--------------|-------------|--------------|-------------|------------------|-------------------|
| No Setup     | 0           | 0            | 0           | 9.97122 s        | 1.12388 s         |
| 100          | 0           | 0            | 0           | 9.96777 s        | 1.12309 s         |
| 80           | 0           | 0            | 0           | 9.60158 s        | 1.03545 s         |
| 60           | 8           | 8            | 8           | <b>9.55082 s</b> | <b>0.954919 s</b> |
| 40           | 72          | 68           | 68          | 13.7608 s        | 1.61103 s         |

 TABLE VII: Execution time of OpenACC and OpenMP programs on A100 and GH200 (Grid Size:  $1024^3$ , 1000 iterations)

| Programming Model | Device | Execution Time(s) |
|-------------------|--------|-------------------|
| OpenACC           | A100   | 53.188            |
|                   | GH200  | 23.196            |
| OpenACC-async     | A100   | 44.222            |
|                   | GH200  | 19.229            |
| OpenMP            | A100   | 58.568            |
|                   | GH200  | 29.527            |

same time, considering that OpenMP Target Offloading code supports a variety of accelerators, we believe that the directive-based OpenMP Target Offloading programming model would be an excellent choice for developers of high-performance computing programs.

#### B. Exploration of Performance of Different Generations of NVIDIA GPUs on Multiple Programming Models

In addition to the NVIDIA Ampere(A100) and Hopper Architecture(GH200), we evaluate the performance of three models on the NVIDIA Turing Architecture(T4) and compared the execution time for 1000 iterations when the grid size is  $700^3$ , the results of which can be seen in Table XI. We also calculated the performance differences of different models on different generations of GPUs and made some interesting findings. As can be seen from the last three lines of Table XI, as new generations of GPGPUs are introduced, the gap between the three programming models is gradually narrowing. Compared to the Turing, the performance of the OpenACC version on the Hopper has increased by  $13.0\times$ , while the performance of the OpenMP version has even achieved an  $14.4\times$  increase; the optimized CUDA version, however, only saw a  $9.4\times$  improvement with respect Turing reference.

Although the performance of OpenMP target offloading and OpenACC is  $3\times$  and  $2\times$  the optimized CUDA version, this is a result obtained at the expense of portability. The execution time of the optimized CUDA version in the chart is selected from a massive number of CUDA implementations, and selecting the optimal CUDA version requires a costly grid-like search, further in terms of code length or debugging difficulty. Therefore, we believe that OpenMP and OpenACC, whose performance continues to improve along with newer GPGPUs generations and compiler progress, are becoming development options worth considering when portability is a must.

In terms of architectural comparison, the rate of improvement ( $ri$ ), defined as

$$ri(Arch_{i+1}|Arch_i, prog\_model) = \frac{time\ Arch_i}{time\ Arch_{i+1}} \quad (1)$$

was remarkable for  $ri(\text{Ampere}|\text{Turing}, \text{CUDA}) = 5.6$ , but less pronounced for  $ri(\text{Hopper}|\text{Ampere}, \text{CUDA}) = 1.7$ . The highest  $ri$  in Table XI is achieved by OpenMP between Turing and Ampere architectures (7.7).

#### VII. COMPARISON OF POWER CONSUMPTION

In this section, we evaluated the power consumption of NVIDIA Ampere Architecture and NVIDIA Hopper Architecture. Figure 12 shows the energy consumption curves of two different GPGUs across three distinct programming models. The grid size of this evaluation is  $1024^3$ , and the number of timesteps for CUDA is 10000, and for OpenACC/OpenMP is 5000.

The collected data provided insightful revelations regarding the energy efficiency of the A100 and GH200 GPUs in different programming models. Initial observations revealed that the GH200 GPU, while not exhibiting superior energy efficiency, tends to consume more energy for equivalent computational tasks compared to the A100. This increased energy consumption is particularly evident in OpenMP offloading implementations, where the GH200's architecture, it seems does not prioritize power efficiency. However, a notable advantage of the GH200 is its time to solution efficiency, where it significantly reduces computing time for similar tasks.

On the CUDA programming model side, both GPUs showed a significant increase in power consumption during the peak computational phases of the stencil computations. However, the GH200 GPU maintained a more stable energy profile.

During the OpenACC power consumption evaluation (Figure 12.b), unlike its behavior with CUDA implementations, the GH200 exhibited greater fluctuations in energy consumption compared to the A100. This observation suggests that NVIDIA's newer Hopper architecture may not be as refined in energy management for OpenACC as it is for Ampere. Simultaneously, it was noted that the performance of the A100 remained consistently at its peak across both CUDA and OpenACC implementations. However, the GH200 only approached its theoretical performance peak with the OpenACC-async version. This indicates that CUDA's computational optimizations are highly advanced, and memory bandwidth has become the primary limiting factor in program performance.

The outcomes observed with OpenMP (Figure 12.c) are akin to those witnessed with CUDA (Figure 12.a), wherein the

TABLE VIII: Profiling results of memory information of OpenACC and OpenMP code on NVIDIA A100 and GH200 GPUs.

| Kernel        | Device | Memory Throughput Rate | Memory Throughput(GB/s) | L1 Hit Rate | L2 Hit Rate | Device Memory Read/Write(Bytes) |
|---------------|--------|------------------------|-------------------------|-------------|-------------|---------------------------------|
| OpenACC       | A100   | 60.95%                 | 947<br>2,422            | 42.97%      | 55.16%      | 47,612,893,312                  |
|               | GH200  | 60.22%                 |                         | 43.44%      | 45.31%      | 45,280,361,216                  |
| OpenACC-async | A100   | 56.18%                 | 872<br>2,097            | 48.93%      | 48.41%      | 46,209,280                      |
|               | GH200  | 52.26%                 |                         | 46.13%      | 43.74%      | 44,710,400                      |
| OpenMP        | A100   | 54.39%                 | 845<br>1,208            | 47.04%      | 54.51%      | 46,086,569,472                  |
|               | GH200  | 30.03%                 |                         | 46.01%      | 56.66%      | 32,045,994,240                  |

TABLE IX: Profiling results of compute information of OpenACC and OpenMP code on NVIDIA A100 and GH200 GPUs.

| Kernel        | Device | Compute Throughput | SM Frequency     | Elapsed Cycles | Theoretical Occupancy |                     | Achieved  |                     |
|---------------|--------|--------------------|------------------|----------------|-----------------------|---------------------|-----------|---------------------|
|               |        |                    |                  |                | Occupancy             | Active Warps Per SM | Occupancy | Active Warps Per SM |
| OpenACC       | A100   | 40.87%             | 764,999,157.62   | 38,427,322     | 37                    | 24                  | 37        | 24                  |
|               | GH200  | 45.15%             | 1,529,167,787.04 | 28,593,676     | 37                    | 24                  | 37        | 24                  |
| OpenACC-async | A100   | 47.21%             | 765,073,370.74   | 40,523         | 31                    | 20                  | 27        | 17                  |
|               | GH200  | 48.40%             | 1,508,094,031.53 | 32,408         | 37                    | 24                  | 32        | 20                  |
| OpenMP        | A100   | 42.83%             | 764,998,995.17   | 41,677,024     | 25                    | 16                  | 24        | 15                  |
|               | GH200  | 58.30%             | 1,529,855,754.88 | 40,585,290     | 25                    | 16                  | 23        | 14                  |



Fig. 12: Power consumption of NVIDIA Ampere and NVIDIA Hopper across different programming models

TABLE X: Lines of code of different GPGPU programming models implemented kernels

| Programming Model | Lines of Code |
|-------------------|---------------|
| CUDA-gmem         | 138           |
| CUDA-smem         | 145           |
| CUDA-st_reg_fixed | 142           |
| CUDA-st_semi      | 172           |
| OpenACC           | 71            |
| OpenMP            | 69            |

TABLE XI: Execution of OpenACC and OpenMP code on three generations of NVIDIA GPGPUs

|                           | Turing   | Ampere  | Hopper |
|---------------------------|----------|---------|--------|
| CUDA                      | 31.775s  | 5.706s  | 3.364s |
| OpenACC                   | 81.718s  | 11.260s | 6.276s |
| OpenMP                    | 139.358s | 18.012s | 9.651s |
| (OpenACC - CUDA)/OpenACC  | 0.611    | 0.493   | 0.464  |
| (OpenMP - CUDA)/OpenMP    | 0.772    | 0.683   | 0.651  |
| (OpenACC - OpenMP)/OpenMP | 0.414    | 0.375   | 0.349  |

energy consumption fluctuations of the GH200 are relatively stable. Overall, plenty of room for the application and SW stack to improve and take better advantage of the Hopper architecture. For instance, for the CUDA implementation, time to solution reduced from 340s to 200s ( $1.7\times$ ) but the power consumption raised from 250W to 510W ( $2\times$ ), so at least  $0.3\times$  (power to time solution ratio) to catch-up.

## VIII. CONCLUSION

Performance evaluation of optimized 3D stencil kernels was conducted on three GPU programming models for the latest generation GPGPUs, providing optimization suggestions tailored to each programming model. Our findings indicate that the GH200 demonstrates performance improvements of up to 58% compared to the previous GPU generation.

Simultaneously, we proposed CUDA stream-based asynchronous execution strategies for the OpenACC and OpenMP of stencil computation programs, resulting in performance enhancements of up to 30% over the original versions.

We also compared the performance and portability of the three programming models on multiple GPGPU generations. Our observations reveal that as GPU architecture and perfor-

mance progress, the performance gap between OpenMP, OpenACC, and optimized CUDA versions narrows. Particularly, in scenarios demanding portability, OpenMP and OpenACC are gradually becoming viable alternatives.

## REFERENCES

- [1] S. Bak, C. Bertoni, S. Boehm, R. D. Budiardja, B. M. Chapman, J. Doerfert, M. Eisenbach, H. Finkel, O. R. Hernandez, J. Huber, S. Iwasaki, V. Kale, P. R. C. Kent, J. Kwack, M. Lin, P. Luszczek, Y. Luo, B. Pham, S. Pophale, K. Ravikumar, V. Sarkar, T. Scogland, S. Tian, and P. K. Yeung, “Openmp application experiences: Porting to accelerated nodes,” *Parallel Comput.*, vol. 109, p. 102856, 2022. [Online]. Available: <https://doi.org/10.1016/j.parco.2021.102856>
- [2] R. de la Cruz and M. Araya-Polo, “Algorithm 942: Semi-stencil,” *ACM Trans. Math. Softw.*, vol. 40, no. 3, apr 2014. [Online]. Available: <https://doi.org/10.1145/2591006>
- [3] A. Denzler, G. F. Oliveira, N. Hajinazar, R. Bera, G. Singh, J. Gómez-Luna, and O. Mutlu, “Casper: Accelerating stencil computations using near-cache processing,” *IEEE Access*, vol. 11, pp. 22 136–22 154, 2023.
- [4] A. Dubey, “Stencils in scientific computations,” in *Proceedings of the Second Workshop on Optimizing Stencil Computations*, ser. WOSC ’14. New York, NY, USA: Association for Computing Machinery, 2014, p. 57. [Online]. Available: <https://doi.org/10.1145/2686745.2686756>
- [5] O. Fuhrer, C. Osuna, X. Lapillonne, T. Gysi, B. Cumming, M. Bianco, A. Arteaga, and T. Schulthess, “Towards a performance portable, architecture agnostic implementation strategy for weather and climate models,” *Supercomputing Frontiers and Innovations: an International Journal*, vol. 1, no. 1, pp. 45–62, Apr. 2014.
- [6] T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan, and S. Verdoolege, “Split tiling for gpus: Automatic parallelization using trapezoidal tiles,” in *Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units*, ser. GPGPU-6. New York, NY, USA: Association for Computing Machinery, 2013, p. 24–31. [Online]. Available: <https://doi.org/10.1145/2458523.2458526>
- [7] J. Holewinski, L.-N. Pouchet, and P. Sadayappan, “High-performance code generation for stencil computations on gpu architectures,” in *Proceedings of the 26th ACM International Conference on Supercomputing*, ser. ICS ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 311–320. [Online]. Available: <https://doi.org/10.1145/2304576.2304619>
- [8] J. Huber, M. Cornelius, G. Georgakoudis, S. Tian, J. M. M. Diaz, K. Dinel, B. Chapman, and J. Doerfert, “Efficient execution of openmp on gpus,” in *2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)*, 2022, pp. 41–52.
- [9] M. Jacquelin, M. Araya-Polo, and J. Meng, “Scalable distributed high-order stencil computations,” in *SC22: International Conference for High Performance Computing, Networking, Storage and Analysis*. IEEE, 2022, pp. 1–13.
- [10] G. Jin, J. Mellor-Crummey, and R. Fowler, “Increasing temporal locality with skewing and recursive blocking,” in *Proceedings of the 2001 ACM/IEEE Conference on Supercomputing*, 2001, pp. 43–43.
- [11] D. Komatitsch and R. Martin, “An unsplit convolutional perfectly matched layer improved at grazing incidence for the seismic wave equation,” *Geophysics*, vol. 72, 09 2007.
- [12] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan, “Effective automatic parallelization of stencil computations,” in *Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation*, ser. PLDI ’07. New York, NY, USA: Association for Computing Machinery, 2007, p. 235–244. [Online]. Available: <https://doi.org/10.1145/1250734.1250761>
- [13] W. Lu, B. Shan, E. Raut, J. Meng, M. Araya-Polo, J. Doerfert, A. M. Malik, and B. Chapman, “Towards efficient remote openmp offloading,” in *International Workshop on OpenMP*. Springer, 2022, pp. 17–31.
- [14] J. Meng, A. Atle, H. Calandra, and M. Araya-Polo, “Minimod: A finite difference solver for seismic modeling,” 2020.
- [15] P. Moczo, J. Kristek, and M. Gális, *The Finite-Difference Modelling of Earthquake Motions: Waves and Ruptures*. Cambridge University Press, 2014.
- [16] A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey, “3.5-d blocking optimization for stencil computations on modern cpus and gpus,” in *SC’10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis*. IEEE, 2010, pp. 1–13.
- [17] NVIDIA, “Cuda zone,” 2023, accessed: August 15, 2023. [Online]. Available: <https://developer.nvidia.com/cuda-zone>
- [18] NVIDIA, “Nvidia h100 tensor core gpu architecture,” 2023, accessed: August 15, 2023. [Online]. Available: <https://resources.nvidia.com/en-us-tensor-core>
- [19] OpenACC-Standard.org, “Openacc,” 2023, accessed: August 14, 2023. [Online]. Available: <https://www.openacc.org/>
- [20] OpenMP.org, “Openmp,” 2023, accessed: August 14, 2023. [Online]. Available: <https://www.openmp.org/>
- [21] R. Sai, J. Mellor-Crummey, X. Meng, M. Araya-Polo, and J. Meng, “Accelerating high-order stencils on gpus,” in *2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)*, 2020, pp. 86–108.
- [22] B. Shan, M. Araya-Polo, A. M. Malik, and B. Chapman, “Mpibased remote openmp offloading: A more efficient and easy-to-use implementation,” in *Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores*, ser. PMAM’23, 2023, p. 50–59. [Online]. Available: <https://doi.org/10.1145/3582514.3582519>
- [23] Y. Song and Z. Li, “New tiling techniques to improve cache temporal locality,” in *Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation*, ser. PLDI ’99. New York, NY, USA: Association for Computing Machinery, 1999, p. 215–228. [Online]. Available: <https://doi.org/10.1145/301618.301668>
- [24] B. Sun, M. Li, H. Yang, J. Xu, Z. Luan, and D. Qian, “Adapting combined tiling to stencil optimizations on sunway processor,” *CCF Transactions on High Performance Computing*, pp. 1–12, 2023.
- [25] Q. Sun, Y. Liu, H. Yang, Z. Jiang, Z. Luan, and D. Qian, “Stencil-mart: Predicting optimization selection for stencil computations across gpus,” in *2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)*. IEEE, 2022, pp. 875–885.
- [26] S. Tian, B. Chapman, and J. Doerfert, “Exploring the limits of generic code execution on gpus via direct (openmp) offload,” in *OpenMP: Advanced Task-Based, Device and Compiler Programming*, S. McIntosh-Smith, M. Klemm, B. R. de Supinski, T. Deakin, and J. Klinkenberg, Eds. Cham: Springer Nature Switzerland, 2023, pp. 179–192.
- [27] S. Tian, J. Chesterfield, J. Doerfert, and B. Chapman, “Experience report: Writing a portable gpu runtime with openmp 5.1,” in *OpenMP: Enabling Massive Node-Level Parallelism*, S. McIntosh-Smith, B. R. de Supinski, and J. Klinkenberg, Eds. Cham: Springer International Publishing, 2021, pp. 159–169.
- [28] T. Tylor-Jones and L. Azevedo, *A Practical Guide to Seismic Reservoir Characterization*. Springer Nature, 2023.
- [29] D. Wonnacott, “Using time skewing to eliminate idle time due to memory bandwidth and network limitations,” in *Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000*, 2000, pp. 171–180.
- [30] D. Wonnacott, “Achieving scalable locality with time skewing,” *International Journal of Parallel Programming*, vol. 30, 03 1999.