

Sai Jayanth Kalisi  
Van Tha Bik Lian

UW - EE 524 GPU  
GPU Compute Final Project - Part 2: Design Specification

Focus: CUDA Kernel Tuning Challenge (All Kernels Provided)

## PART 1 (3D STENCIL SWEEP)

Stencil patterns are a combination of maps with a local gather over a fixed set of offsets. Every output of a stencil is a function of some neighborhood of elements in the input collection. Its local neighborhood structure exposes opportunities for data usage and optimization of data locality. Inputs are typically divided into a set of partially overlapping strips or regions so that neighbors can be accessed. Outputs are divided into non-overlapping regions so they can be safely written independently in parallel. Stencils are extremely common in image processing and scientific simulations. They are often 2D or 3D (can be extended to higher dimensions), and computationally; they feature high memory traffic, and low arithmetic intensity, and are often memory bound when computation uses all global memory accesses. In this project we study 3D stencils, in particular, we will look at FOUR 3D stencil kernels applied on a given dataset and iteratively fine-tune each kernel to get the best performance.

All our builds, releases, timing results, and profiling will be obtained from the AWS AppStream instance with the Turing T4 GPU. We will use NVIDIA's Insight Compute to evaluate our kernels throughout our iterative fine-tuning process. Below are the FOUR kernels we aim to optimize for this part of the project:

**ALL THESE KERNELS are 7-POINT 3D Stencil | WE ARE NOT CHANGING  
STENCIL\_RADIUS\_3D THROUGHOUT OUR ITERATIVE PROCESS**

### **S1. stencil\_3d\_basic\_kernel**

- This is a basic implementation of the 3D sweep stencil kernel using GLOBAL MEMORY ACCESS
- Takes a uniform NxNxN cubic 3D data input grid

### **S2. stencil\_3d\_tiled\_smem\_kernel**

- 3D Stencil Sweep with shared memory tiling

### **S3. stencil\_3d\_nonsqr\_tiled\_smem\_kernel**

- 3D Stencil Sweep with shared memory tiling. Non-square processing for better GMEM coalescing

### **S4. stencil\_3d\_thread\_coarsening\_kernel**

- 3D Stencil Sweep - uses shared memory tiling as well as thread coarsening.

## PART 2 (HISTOGRAMS)

Histograms display the number count/percentage of occurrences of data values in a dataset and can be used to extract notable features/patterns from large datasets. Histograms can be used to summarize key data distribution characteristics, extract features for object recognition, fraud detection, and so on. Computing each count or bin is an operation that can be done in parallel. In this part of the project we study histograms, in particular, we will look at FOUR kernels for histograms, applied on a given dataset, and iteratively fine-tune each kernel to get the best performance.

All our builds, releases, timing results, and profiling will be obtained from the AWS AppStream instance with the Turing T4 GPU. We will use NVIDIA's Insight Compute to evaluate our kernels throughout our iterative fine-tuning process. Below are the FOUR kernels we aim to optimize for this part of the project:

#### H1. histo\_basic\_kernel

- Basic GMEM implementation. Uses atomicAdd to accumulate bin values (privatization - takes care of race conditions)

#### H2. histo\_priv\_smem\_kernel

- SMEM implementation (NO TILING) - also uses atomicAdd (privatization)

#### H3. histo\_coarsen\_interleaved\_kernel

- Uses SHARED MEM + coarsening with interleaved partitioning. Uses atomicAdd (privatization)

#### H4. histo\_coarsen\_contiguous\_kernel

- Uses SHARED MEM + coarsening with contiguous partitioning (each segment assigned to a thread). Uses atomicAdd (privatization)

### PART 3 (Reduction)

Reduction of an array is useful for summarizing all values of that array into a single value and for example, finding the mean, max, min, product, median, etc. It is beneficial for statistical analysis of large swaths of data, in machine and deep learning, physics and engineering simulations, rendering of images, and massively parallel algorithms. In particular, the reduction we intend to implement is that of an aggregate sum across all values in the array provided. In

All our builds, releases, timing results, and profiling will be obtained from the AWS AppStream instance with the Turing T4 GPU. We will use NVIDIA's Insight Compute to evaluate our kernels throughout our iterative fine-tuning process. Below are the FOUR kernels we aim to optimize for this part of the project:

**All kernels are generalized to handle compute beyond a single thread block/SM - \_\_syncthreads() have worked for threads in the same block/SM. However, there is no way to perform barrier sync between blocks. FOR ALL kernels below, we use atomics to sync between blocks and SMS at the L2 or GMEM level.**

#### R1. reduce\_sum\_poor\_gmem\_multiblock\_kernel

- GMEM implementation of the reduced sum is the multi-block version, which uses atomics (privatization for partial sums).

#### R2. reduce\_sum\_improved\_gmem\_multiblock\_kernel

- This kernel has different strides compared to R1 to improve control+mem divergence. Still uses GMEM, and atomics for partial sums

#### R3. reduce\_sum\_segmented\_multiblock\_kernel

- Implements reduce sum with shared memory usage. The first reduction pass uses GMEM then SMEM thereafter. Atomics is still used for partial sums.

#### R4. reduce\_sum\_segmented\_coarsened\_multiblock\_kernel

- Mostly the same as R3. Incorporates CFACT for coarsening.

## 2 & 3. Tuning/Optimization Table - Initial Results

Please see link:

[https://docs.google.com/spreadsheets/d/1nPIY7L3BHG6XiCasgFNKrFSTcV0i-WUeaGxuU5gesas/edit?  
usp=sharing](https://docs.google.com/spreadsheets/d/1nPIY7L3BHG6XiCasgFNKrFSTcV0i-WUeaGxuU5gesas/edit?usp=sharing)

The above link provides a list of launch configurations (including tile sizes, coarsening factors, implementation types).

There are two tabs in the above link, one for GPU times, and another for CPU times. We hope to fully populate this graph by the end of this project, and show our analysis accordingly.

Also, please see the link above for the initial timings and results ran by insight compute. **Please make sure to see both tabs.**

## 4. Provide Nsight Compute profiling results for each kernel type set {3D Stencil, Histogram, Reduction}

# HISTOGRAM

| ID | Estimated Speedup | Function Name          | Demangled Name         | Duration<br>(444640) | Runtime Improvement<br>(239694) | Compute Throughput | Memory Throughput | # Registers   | Grid S |
|----|-------------------|------------------------|------------------------|----------------------|---------------------------------|--------------------|-------------------|---------------|--------|
| 0  | 53.94             | histo_basic_kernel     | histo_basic_kernel_    | 322.53 (+956.5...    | 173.97                          | 2.51 (-91.18%      | 3.69 (-83.97%     | 16 (+0.00%, - |        |
| 1  | 47.88             | histo_priv_global_k... | histo_priv_global_k... | 18.11 (-83.01...     | 8.66                            | 31.98 (+51.75...   | 15.99 (-19.90...  | 16 (+0.00%, - |        |
| 2  | 47.28             | histo_priv_smem_k...   | histo_priv_smem_k...   | 44.03 (-56.03...     | 20.82                           | 28.00 (+26.84...   | 21.86 (+18.18...  | 16 (+0.00%, - |        |
| 3  | 66.43             | histo_coarsen_cont...  | histo_coarsen_cont...  | 27.07 (-74.07...     | 17.98                           | 24.93 (+9.14...    | 35.43 (+134.59... | 16 (+0.00%, - |        |
| 4  | 55.51             | histo_coarsen_i...     | histo_coarsen_i...     | 32.90 (-68.04...     | 18.26                           | 28.87 (+32.10...   | 18.87 (-1.93...   | 16 (+0.00%, - |        |

## HISTOGRAM: BASIC GMEM KERNEL

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Long Scoreboard Stalls**  
Est. Speedup: 53.94%  
On average, each warp of this kernel spends 192.0 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 53.9% of the total average of 356.0 cycles between issuing two instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value     | Guidance                                                      |
|----------------------------------------|-----------|---------------------------------------------------------------|
| smsp_issue_active.avg.per_cycle_active | 0.0194977 | Increase the average number of instructions issued per cycle  |
| smsp_average_long_scoreboard           | 192.047   | Decrease the number of cycles spent in long scoreboard stalls |

**Achieved Occupancy**  
Est. Speedup: 13.07%  
The difference between calculated theoretical (100.0%) and measured achieved occupancy (86.9%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the [CUDA Best Practices Guide](#) for more details on optimizing occupancy.

The following table lists the metrics that are key performance indicators:

| Metric Name                                       | Value   | Guidance                                                               |
|---------------------------------------------------|---------|------------------------------------------------------------------------|
| sm_warpes_active.avg.pct_of_peak_sustained_active | 86.9284 | Increase the achieved occupancy towards the theoretical limit (100.0%) |

**L2 Slices Workload Imbalance**  
Est. Speedup: 11.32%  
One or more L2 Slices have a much higher number of active cycles than the average number of active cycles. Maximum instance value is 86.70% above the average, while the minimum instance value is 28.49% below the average.

The following table lists the metrics that are key performance indicators:

| Metric Name           | Value   | Guidance                                                                                       |
|-----------------------|---------|------------------------------------------------------------------------------------------------|
| its_cycles_active.avg | 36005.3 | Balancing the number of active cycles across L2 Slices would result in a more optimized kernel |

## HISTOGRAM: PRIVATE GLOBAL KERNEL

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Achieved Occupancy**  
Est. Speedup: 47.80% The difference between calculated theoretical (100.0%) and measured achieved occupancy (52.2%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the [CUDA Best Practices Guide](#) for more details on optimizing occupancy.

The following table lists the metrics that are key performance indicators:

| Metric Name                                     | Value   | Guidance                                                               |
|-------------------------------------------------|---------|------------------------------------------------------------------------|
| sm_warp_active.avg_pct_of_peak_sustained_active | 52.2042 | Increase the achieved occupancy towards the theoretical limit (100.0%) |

**SMSPs Workload Imbalance**  
Est. Speedup: 9.97% One or more SMSPs have a much higher number of active cycles than the average number of active cycles. Additionally, other SMSPs have a much lower number of active cycles than the average number of active cycles. Maximum instance value is 13.59% above the average, while the minimum instance value is 14.26% below the average.

The following table lists the metrics that are key performance indicators:

| Metric Name            | Value   | Guidance                                                                                   |
|------------------------|---------|--------------------------------------------------------------------------------------------|
| smsp_cycles_active.avg | 7168.06 | Balancing the number of active cycles across SMSPs would result in a more optimized kernel |

**SMs Workload Imbalance**  
Est. Speedup: 5.68% One or more SMs have a much lower number of active cycles than the average number of active cycles. Maximum instance value is 7.00% above the average, while the minimum instance value is 13.23% below the average.

The following table lists the metrics that are key performance indicators:

| Metric Name          | Value   | Guidance                                                                                 |
|----------------------|---------|------------------------------------------------------------------------------------------|
| sm_cycles_active.avg | 7926.25 | Balancing the number of active cycles across SMs would result in a more optimized kernel |

## HISTOGRAM: PRIVATE SMEM KERNEL

**Uncoalesced Shared Accesses**  
Est. Speedup: 47.28% This kernel has uncoalesced shared accesses resulting in a total of 41229 excessive wavefronts (51% of the total 80293 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations. The [CUDA Best Practices Guide](#) has an example on optimizing shared memory accesses.

The following table lists the metrics that are key performance indicators:

| Metric Name                                   | Value | Guidance                                           |
|-----------------------------------------------|-------|----------------------------------------------------|
| derived_memory_l1_wavefronts_shared_excessive | 41229 | Reduce the number of excessive wavefronts in L1TEX |

**Long Scoreboard Stalls**  
Est. Speedup: 45.90% On average, each warp of this kernel spends 10.1 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 45.9% of the total average of 22.1 cycles between issuing two instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value    | Guidance                                                      |
|----------------------------------------|----------|---------------------------------------------------------------|
| smsp_issue_active.avg_per_cycle_active | 0.280212 | Increase the average number of instructions issued per cycle  |
| smsp_average_long_scoreboard           | 10.1471  | Decrease the number of cycles spent in long scoreboard stalls |

**Achieved Occupancy**  
Est. Speedup: 25.89% The difference between calculated theoretical (100.0%) and measured achieved occupancy (74.1%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the [CUDA Best Practices Guide](#) for more details on optimizing occupancy.

The following table lists the metrics that are key performance indicators:

| Metric Name                                     | Value   | Guidance                                                               |
|-------------------------------------------------|---------|------------------------------------------------------------------------|
| sm_warp_active.avg_pct_of_peak_sustained_active | 74.1107 | Increase the achieved occupancy towards the theoretical limit (100.0%) |

## HISTOGRAM: COARSEN CONTIGUOUS KERNEL

**L1TEX Global Load Access Pattern**  
Est. Speedup: 66.43% The memory access pattern for global loads from L1TEX might not be optimal. On average, only 2.0 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [Source Counters](#) section for uncoalesced global loads.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                   | Value   | Guidance                                                                  |
|---------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| smsp_sas_average_data_bytes_per_sector_mem_global_op_id_ratio | 2       | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg_pct_of_peak_sustained_elapsed            | 70.8594 | The higher the L1TEX throughput the more severe the issue becomes         |

**Long Scoreboard Stalls**  
Est. Speedup: 48.32% On average, each warp of this kernel spends 11.1 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 48.3% of the total average of 23.0 cycles between issuing two instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value    | Guidance                                                      |
|----------------------------------------|----------|---------------------------------------------------------------|
| smsp_issue_active.avg_per_cycle_active | 0.298234 | Increase the average number of instructions issued per cycle  |
| smsp_average_long_scoreboard           | 11.1219  | Decrease the number of cycles spent in long scoreboard stalls |

**Uncoalesced Shared Accesses**  
Est. Speedup: 45.58% This kernel has uncoalesced shared accesses resulting in a total of 41276 excessive wavefronts (57% of the total 73030 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations. The [CUDA Best Practices Guide](#) has an example on optimizing shared memory accesses.

The following table lists the metrics that are key performance indicators:

| Metric Name                                   | Value | Guidance                                           |
|-----------------------------------------------|-------|----------------------------------------------------|
| derived_memory_l1_wavefronts_shared_excessive | 41276 | Reduce the number of excessive wavefronts in L1TEX |

## HISTOGRAM: COARSEN INTERLEAVED KERNEL

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Long Scoreboard Stalls**  
Est. Speedup: 55.51%

On average, each warp of this kernel spends 11.1 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 55.5% of the total average of 19.9 cycles between issuing two instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value    | Guidance                                                      |
|----------------------------------------|----------|---------------------------------------------------------------|
| smsp_issue_active.avg_per_cycle_active | 0.316803 | Increase the average number of instructions issued per cycle  |
| smsp_average_long_scoreboard           | 11.0545  | Decrease the number of cycles spent in long scoreboard stalls |

**Uncoalesced Shared Accesses**  
Est. Speedup: 51.28%

This kernel has uncoalesced shared accesses resulting in a total of 41229 excessive wavefronts (57% of the total 72969 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source locations. The [CUDA Best Practices Guide](#) has an example on optimizing shared memory accesses.

The following table lists the metrics that are key performance indicators:

| Metric Name                                   | Value | Guidance                                           |
|-----------------------------------------------|-------|----------------------------------------------------|
| derived_memory_l1_wavefronts_shared_excessive | 41229 | Reduce the number of excessive wavefronts in L1TEX |

**Tail Effect**  
Est. Speedup: 50.00%

A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 85 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 23.0%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid. See the [Hardware Model](#) description for more details on launch configurations.

The following table lists the metrics that are key performance indicators:

| Metric Name                     | Value   | Guidance                                                                          |
|---------------------------------|---------|-----------------------------------------------------------------------------------|
| launch_waves_per_multiprocessor | 1.53125 | Decrease the number of partial waves (the fractional part of the number of waves) |

## REDUCTION

This table shows all results in the report. Use the column headers to sort the results in this report. Double-click a result to see detailed metrics. Double-click on demangled names to rename it.

| ID | Estimated Speedup | Function Name     | Demangled Name    | Duration<br>(1.25593e+07) | Runtime Improvement<br>(5.57707e+06) | Compute Throughput | Memory Throughput | # Registers | Grid S |
|----|-------------------|-------------------|-------------------|---------------------------|--------------------------------------|--------------------|-------------------|-------------|--------|
| 0  | 54.41             | reduce_sum_po_    | reduce_sum_po_    | 4.87 (+5,371.39%)         | 2.65 38.96 (+67.52%)                 | 19.75 (+3.05%)     | 16 (+0.00%, -)    | 16          |        |
| 1  | 35.35             | reduce_sum_impro_ | reduce_sum_impro_ | 2.26 (+2,436.78%)         | 0.80 32.48 (+39.31%)                 | 31.17 (+62.61%)    | 16 (+0.00%, -)    | 16          |        |
| 2  | 61.47             | reduce_sum_smem_  | reduce_sum_smem_  | 2.14 (+2,382.77%)         | 1.31 68.19 (+193.21%)                | 68.19 (+255.77%)   | 16 (+0.00%, -)    | 16          |        |
| 3  | 24.22             | reduce_sum_segm_  | reduce_sum_segm_  | 2.34 (+2,531.05%)         | 0.57 62.26 (+167.68%)                | 62.26 (+224.80%)   | 16 (+0.00%, -)    | 16          |        |
| 4  | 26.18             | reduce_sum_segm_  | reduce_sum_segm_  | 0.96 (+981.00%)           | 0.25 73.89 (+217.76%)                | 73.89 (+285.49%)   | 16 (+0.00%, -)    | 16          |        |

## REDUCTION: POOR GMEM KERNEL (MULTIBLOCK)

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Uncoalesced Global Accesses**  
Est. Speedup: 54.41%

This kernel has uncoalesced global accesses resulting in a total of 31500000 excessive sectors (66% of the total 47984375 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source locations. The [CUDA Programming Guide](#) has additional information on reducing uncoalesced device memory accesses.

The following table lists the metrics that are key performance indicators:

| Metric Name                                            | Value    | Guidance                                        |
|--------------------------------------------------------|----------|-------------------------------------------------|
| derived_memory_l2_theoretical_sectors_global_excessive | 3.15e+07 | Reduce the number of excessive wavefronts in L2 |

**L1TEX Global Load Access Pattern**  
Est. Speedup: 29.62%

The memory access pattern for global loads from L1TEX might not be optimal. On average, only 8.0 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [Source Counters](#) section for uncoalesced global loads.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                    | Value   | Guidance                                                                  |
|----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| smsp_sass_average_data_bytes_per_sector_mem_global_op_id_ratio | 8.00195 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput_avg_pct_of_peak_sustained_elapsed             | 39.5029 | The higher the L1TEX throughput the more severe the issue becomes         |

**L1TEX Global Store Access Pattern**  
Est. Speedup: 29.62%

The memory access pattern for global stores to L1TEX might not be optimal. On average, only 8.0 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [Source Counters](#) section for uncoalesced global stores.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                    | Value   | Guidance                                                                  |
|----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| smsp_sass_average_data_bytes_per_sector_mem_global_op_id_ratio | 8.00391 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput_avg_pct_of_peak_sustained_elapsed             | 39.5029 | The higher the L1TEX throughput the more severe the issue becomes         |

## REDUCTION: IMPROVED GMEM KERNEL (MULTI BLOCK)

**Barrier Stalls**  
Est. Speedup: 35.35%

On average, each warp of this kernel spends 8.5 cycles being stalled waiting for sibling warps at a CTA barrier. A high number of warps waiting at a barrier is commonly caused by diverging code paths before a barrier. This causes some warps to wait a long time until other warps reach the synchronization point. Whenever possible, try to divide up the work into blocks of uniform workloads. If the block size is 512 threads or greater, consider splitting it into smaller groups. This can increase eligible warps without affecting occupancy, unless shared memory becomes a new occupancy limiter. Also, try to identify which barrier instruction causes the most stalls, and optimize the code executed before that synchronization point first. This stall type represents about 35.4% of the total average of 24.0 cycles between issuing two instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value    | Guidance                                                     |
|----------------------------------------|----------|--------------------------------------------------------------|
| smsp_issue_active.avg.per_cycle_active | 0.326821 | Increase the average number of instructions issued per cycle |
| smsp_average_barrier                   | 8.49425  | Decrease the number of cycles spent in barrier stalls        |

**FP32 Non-Fused Instructions**  
Est. Speedup: 2.34%

This kernel executes 0 fused and 1062500 non-fused FP32 instructions. By converting pairs of non-fused instructions to their [@fused](#) higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                | Value      | Guidance                                                                              |
|------------------------------------------------------------|------------|---------------------------------------------------------------------------------------|
| sass_inst_executed_per_opcode                              | 1.0625e+06 | Decrease the number of non-fused floating-point instructions (FADD, FMUL, DADD, DMUL) |
| sm_pipe_fma_cycles_active.avg.pct_of_peak_sustained_active | 4.67365    | The higher the utilization of the pipeline the more severe the issue becomes          |

**L1TEX Global Load Access Pattern**  
Est. Speedup: 0.18%

The memory access pattern for global loads from L1TEX might not be optimal. On average, only 31.7 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [Source Counter](#) section for uncoalesced global loads.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                    | Value   | Guidance                                                                  |
|----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| smsp_sass_average_data_bytes_per_sector.mem_global_op_id_ratio | 31.6828 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg.pct_of_peak_sustained_elapsed             | 17.7421 | The higher the L1TEX throughput the more severe the issue becomes         |

## REDUCTION: SMEM (SINGLE BLOCK)

**L1TEX Global Store Access Pattern**  
Est. Speedup: 61.47%

The memory access pattern for global stores to L1TEX might not be optimal. On average, only 4.0 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [Source Counter](#) section for uncoalesced global stores.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                    | Value   | Guidance                                                                  |
|----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| smsp_sass_average_data_bytes_per_sector.mem_global_op_st_ratio | 4       | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg.pct_of_peak_sustained_elapsed             | 70.2562 | The higher the L1TEX throughput the more severe the issue becomes         |

**Thread Divergence**  
Est. Speedup: 27.44%

Instructions are executed in warps, which are groups of 32 threads. Optimal instruction throughput is achieved if all 32 threads of a warp execute the same instruction. The chosen launch configuration, early thread completion, and divergent flow control can significantly lower the number of active threads in a warp per cycle. This kernel achieves an average of 32.0 threads being active per cycle. This is further reduced to 19.1 threads per warp due to predication. The compiler may use predication to avoid an actual branch. Instead, all instructions are scheduled, but a per-thread condition code or predicate controls which threads execute the instructions. Try to avoid different execution paths within a warp when possible.

The following table lists the metrics that are key performance indicators:

| Metric Name                                               | Value   | Guidance                                                                |
|-----------------------------------------------------------|---------|-------------------------------------------------------------------------|
| smsp_thread_inst_executed_pred_on_per_inst_executed_ratio | 19.1243 | Increase the number of predicated-on threads per instruction towards 32 |
| smsp_thread_inst_executed_per_inst_executed_ratio         | 31.9669 | Increase the number of threads per instruction towards 32               |

**FP32 Non-Fused Instructions**  
Est. Speedup: 6.33%

This kernel executes 0 fused and 5500000 non-fused FP32 instructions. By converting pairs of non-fused instructions to their [@fused](#) higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                | Value   | Guidance                                                                              |
|------------------------------------------------------------|---------|---------------------------------------------------------------------------------------|
| sass_inst_executed_per_opcode                              | 5.5e+06 | Decrease the number of non-fused floating-point instructions (FADD, FMUL, DADD, DMUL) |
| sm_pipe_fma_cycles_active.avg.pct_of_peak_sustained_active | 12.6615 | The higher the utilization of the pipeline the more severe the issue becomes          |

## REDUCTION: SEGMENTED, SMEM (MULTI BLOCK)

**Thread Divergence**  
Est. Speedup: 24.22%

Instructions are executed in warps, which are groups of 32 threads. Optimal instruction throughput is achieved if all 32 threads of a warp execute the same instruction. The chosen launch configuration, early thread completion, and divergent flow control can significantly lower the number of active threads in a warp per cycle. This kernel achieves an average of 32.0 threads being active per cycle. This is further reduced to 19.5 threads per warp due to predication. The compiler may use predication to avoid an actual branch. Instead, all instructions are scheduled, but a per-thread condition code or predicate controls which threads execute the instructions. Try to avoid different execution paths within a warp when possible.

The following table lists the metrics that are key performance indicators:

| Metric Name                                               | Value   | Guidance                                                                |
|-----------------------------------------------------------|---------|-------------------------------------------------------------------------|
| smsp_thread_inst_executed_pred_on_per_inst_executed_ratio | 19.5495 | Increase the number of predicated-on threads per instruction towards 32 |
| smsp_thread_inst_executed_per_inst_executed_ratio         | 31.968  | Increase the number of threads per instruction towards 32               |

**L2 Slices Workload Imbalance**  
Est. Speedup: 6.52%

One or more L2 Slices have a much higher number of active cycles than the average number of active cycles. Maximum instance value is 9.13% above the average, while the minimum instance value is 2.50% below the average.

The following table lists the metrics that are key performance indicators:

| Metric Name           | Value       | Guidance                                                                                       |
|-----------------------|-------------|------------------------------------------------------------------------------------------------|
| lts_cycles_active.avg | 1.42981e+06 | Balancing the number of active cycles across L2 Slices would result in a more optimized kernel |

**FP32 Non-Fused Instructions**  
Est. Speedup: 6.24%

This kernel executes 0 fused and 5500000 non-fused FP32 instructions. By converting pairs of non-fused instructions to their [@fused](#) higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                | Value   | Guidance                                                                              |
|------------------------------------------------------------|---------|---------------------------------------------------------------------------------------|
| sass_inst_executed_per_opcode                              | 5.5e+06 | Decrease the number of non-fused floating-point instructions (FADD, FMUL, DADD, DMUL) |
| sm_pipe_fma_cycles_active.avg.pct_of_peak_sustained_active | 12.482  | The higher the utilization of the pipeline the more severe the issue becomes          |

## REDUCTION: SEGMENTED, COARSENED, SMEM (MULTI BLOCK)

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Thread Divergence**  
Est. Speedup: 26.18%

Instructions are executed in warps, which are groups of 32 threads. Optimal instruction throughput is achieved if all 32 threads of a warp execute the same instruction. The chosen launch configuration, early thread completion, and divergent flow control can significantly lower the number of active threads in a warp per cycle. This kernel achieves an average of 31.9 threads being active per cycle. This is further reduced to 20.7 threads per warp due to predication. The compiler may use predication to avoid an actual branch. Instead, all instructions are scheduled, but a per-thread condition code or predicate controls which threads execute the instructions. Try to avoid different execution paths within a warp when possible.

The following table lists the metrics that are key performance indicators:

| Metric Name                                               | Value   | Guidance                                                                |
|-----------------------------------------------------------|---------|-------------------------------------------------------------------------|
| smsp_thread_inst_executed_pred_on_per_inst_executed_ratio | 20.6614 | Increase the number of predicated-on threads per instruction towards 32 |
| smsp_thread_inst_executed_per_inst_executed_ratio         | 31.935  | Increase the number of threads per instruction towards 32               |

**FP32 Non-Fused Instructions**  
Est. Speedup: 7.54%

This kernel executes 0 fused and 3000000 non-fused FP32 instructions. By converting pairs of non-fused instructions to their [@fused](#) higher-throughput equivalent, the achieved FP32 performance could be increased by up to 50% (relative to its current performance). Check the Source page to identify where this kernel executes FP32 instructions.

The following table lists the metrics that are key performance indicators:

| Metric Name                                                | Value   | Guidance                                                                              |
|------------------------------------------------------------|---------|---------------------------------------------------------------------------------------|
| sass_inst_executed_per_opcode                              | 3e+06   | Decrease the number of non-fused floating-point instructions (FADD, FMUL, DADD, DMUL) |
| sm_pipe_fma_cycles_active.avg.pct_of_peak_sustained_active | 15.0892 | The higher the utilization of the pipeline the more severe the issue becomes          |

# 3D STENCIL

| ID | Estimated Speedup | Function Name        | Demangled Name       | (5.38997e+07) | (1.93739e+07) | Compute Throughput | Memory Throughput | # Registers | Grid Size   | Block Size |
|----|-------------------|----------------------|----------------------|---------------|---------------|--------------------|-------------------|-------------|-------------|------------|
| 0  | 29.71             | stencil_3d_basi...   | stencil_3d_basi...   | 13.34         | 3.96          | 67.25              | 69.25             | 24          | 24, 192, 1- | 32, 4, -   |
| 1  | 37.83             | stencil_3d_nonsqr... | stencil_3d_nonsqr... | 29.68         | 18.47         | 49.61              | 49.61             | 22          | 18, 256, 2- | 32, 4, -   |
| 2  | 38.34             | stencil_3d_thread... | stencil_3d_thread... | 12.88         | 4.94          | 67.29              | 67.29             | 32          | 18, 18, -   | 32, 32,    |

## 3D STENCIL: BASIC GMEM

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Achieved Occupancy** The difference between calculated theoretical (100.0%) and measured achieved occupancy (70.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the [@ CUDA Best Practices Guide](#) for more details on optimizing occupancy.  
**Est. Speedup: 29.71%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                     | Value   | Guidance                                                               |
|-------------------------------------------------|---------|------------------------------------------------------------------------|
| sm_warp_active.avg_pct_of_peak_sustained_active | 70.2875 | Increase the achieved occupancy towards the theoretical limit (100.0%) |

**L1TEX Global Load Access Pattern** The memory access pattern for global loads from L1TEX might not be optimal. On average, only 25.0 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads.  
**Est. Speedup: 15.16%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                                     | Value   | Guidance                                                                  |
|-----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| sm_sp_sass_average_data_bytes_per_sector_mem_global_op_id_ratio | 24.9627 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg_pct_of_peak_sustained_elapsed              | 68.9394 | The higher the L1TEX throughput the more severe the issue becomes         |

**Uncoalesced Global Accesses** This kernel has uncoalesced global accesses resulting in a total of 7803000 excessive sectors (5% of the total 170105400 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source locations. The [@ CUDA Programming Guide](#) has additional information on reducing uncoalesced device memory accesses.  
**Est. Speedup: 3.30%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                            | Value     | Guidance                                        |
|--------------------------------------------------------|-----------|-------------------------------------------------|
| derived_memory_l2_theoretical_sectors_global_excessive | 7.803e+06 | Reduce the number of excessive wavefronts in L2 |

## 3D STENCIL: NONSQUARE SMEM

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**Long Scoreboard Stalls** On average, each warp of this kernel spends 6.1 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 37.8% of the total average of 16.2 cycles between issuing two instructions.  
**Est. Speedup: 37.83%**

The following table lists the metrics that are key performance indicators:

| Metric Name                            | Value    | Guidance                                                      |
|----------------------------------------|----------|---------------------------------------------------------------|
| smsp_issue_active.avg.per_cycle.active | 0.340049 | Increase the average number of instructions issued per cycle  |
| smsp_average_long_scoreboard           | 6.13466  | Decrease the number of cycles spent in long scoreboard stalls |

**Achieved Occupancy** The difference between calculated theoretical (100.0%) and measured achieved occupancy (68.3%) can be the result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block as well as across blocks of the same kernel. See the [@ CUDA Best Practices Guide](#) for more details on optimizing occupancy.  
**Est. Speedup: 31.68%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                     | Value   | Guidance                                                               |
|-------------------------------------------------|---------|------------------------------------------------------------------------|
| sm_warp_active.avg_pct_of_peak_sustained_active | 68.3183 | Increase the achieved occupancy towards the theoretical limit (100.0%) |

**L1TEX Global Load Access Pattern** The memory access pattern for global loads from L1TEX might not be optimal. On average, only 20.3 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads.  
**Est. Speedup: 18.73%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                                     | Value   | Guidance                                                                  |
|-----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| sm_sp_sass_average_data_bytes_per_sector_mem_global_op_id_ratio | 20.3185 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg_pct_of_peak_sustained_elapsed              | 51.3078 | The higher the L1TEX throughput the more severe the issue becomes         |

## 3D STENCIL: COARSENED, SMEM

The following performance optimization opportunities were discovered for this result. Follow the rule links to see more context on the Details page.  
Note: Speedup estimates provide upper bounds for the optimization potential of a kernel assuming its overall algorithmic structure is kept unchanged.

**L1TEX Global Load Access Pattern** The memory access pattern for global loads from L1TEX might not be optimal. On average, only 14.1 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [# Source Counter](#) section for uncoalesced global loads.  
**Est. Speedup: 38.34%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                                     | Value   | Guidance                                                                  |
|-----------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| sm_sp_sass_average_data_bytes_per_sector_mem_global_op_id_ratio | 14.0631 | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg_pct_of_peak_sustained_elapsed              | 68.4015 | The higher the L1TEX throughput the more severe the issue becomes         |

**L1TEX Global Store Access Pattern** The memory access pattern for global stores to L1TEX might not be optimal. On average, only 25.5 of the 32 bytes transmitted per sector are utilized by each thread. This could possibly be caused by a stride between threads. Check the [# Source Counter](#) section for uncoalesced global stores.  
**Est. Speedup: 13.89%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                                  | Value   | Guidance                                                                  |
|--------------------------------------------------------------|---------|---------------------------------------------------------------------------|
| sm_sp_sass_average_data_bytes_per_sector_mem_global_op_ratio | 25.5    | Increase the average number of bytes utilized per sector towards 32 bytes |
| l1tex_throughput.avg_pct_of_peak_sustained_elapsed           | 68.4015 | The higher the L1TEX throughput the more severe the issue becomes         |

**Uncoalesced Global Accesses** This kernel has uncoalesced global accesses resulting in a total of 6698592 excessive sectors (9% of the total 75907296 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source locations. The [@ CUDA Programming Guide](#) has additional information on reducing uncoalesced device memory accesses.  
**Est. Speedup: 8.63%**

The following table lists the metrics that are key performance indicators:

| Metric Name                                            | Value       | Guidance                                        |
|--------------------------------------------------------|-------------|-------------------------------------------------|
| derived_memory_l2_theoretical_sectors_global_excessive | 6.69859e+06 | Reduce the number of excessive wavefronts in L2 |