


| Compute Throughput [%] | Memory Throughput [%] | # Registers [register/thread] |
|------------------------|-----------------------|-------------------------------|
| 62.90                  | 94.40                 | 32                            |

Much more work to do per element than in vector add  
(higher AI, i.e. arithmetic intensity)
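A rough sketch of the arithmetic-intensity gap between the two kernels, assuming float32 operands and square N×N matrices. `N = 1024` is a made-up size (the notes do not record the real problem size), and the matmul figure is an upper bound that assumes each matrix crosses the memory interface exactly once.

```python
BYTES_PER_FLOAT = 4  # assumption: float32 operands


def vector_add_intensity():
    """FLOP per byte for c[i] = a[i] + b[i]: 1 add, 2 loads + 1 store."""
    flops = 1
    bytes_moved = 3 * BYTES_PER_FLOAT
    return flops / bytes_moved


def matmul_intensity_upper_bound(n):
    """FLOP per byte for C = A @ B (n x n) if A, B and C each move exactly once."""
    flops = 2 * n**3  # one multiply-add (counted as 2 FLOPs) per inner-loop step
    bytes_moved = 3 * n * n * BYTES_PER_FLOAT
    return flops / bytes_moved  # simplifies to n / 6


N = 1024  # hypothetical size, not taken from the profile
print(f"vector add: {vector_add_intensity():.3f} FLOP/byte")
print(f"matmul (ideal reuse, N={N}): {matmul_intensity_upper_bound(N):.1f} FLOP/byte")
```

Vector add is stuck at 1/12 FLOP/byte whatever the size, while the matmul bound grows with N. That fits the pattern of high cache traffic and near-zero DRAM traffic in the tables.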

Probably more use of L1/L2 as opposed to just streaming from DRAM (higher data reuse).  
Also uses some strided access (bad).

2x fewer register adds

| Compute (SM) Throughput [%] | Memory Throughput [%] | L1/TEX Cache Throughput [%] | L2 Cache Throughput [%] | DRAM Throughput [%] |
|-----------------------------|-----------------------|-----------------------------|-------------------------|---------------------|
| 62.90                       | 94.40                 | 95.23                       | 16.19                   | 0.48                |

\* L1-bound \*

Small working set  
lives in L1  
DRAM (L2) low usage



~63% of the time the SM is doing loads/stores (L/S) to L1

~32% of the time it is doing math

The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only accesses an average of 1.4 sectors out of the possible 4 sectors per cache line. Check the [Source Counters](#) section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory request.
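To see how a strided pattern leads to few sectors used per cache line, here is a small model of one warp's load. It uses the 128-byte line and 32-byte sector sizes quoted above. It assumes 32 lanes, 4-byte elements, a base address of 0 and a stride of at least 1. The function name and the strides tried are illustrative only, not the kernel's real access pattern.

```python
SECTOR_BYTES = 32   # sector size quoted in the profiler message
LINE_BYTES = 128    # cache-line size quoted in the profiler message
WARP_SIZE = 32      # assumption: 32 lanes per warp


def warp_access_footprint(stride_elems, elem_bytes=4):
    """Count sectors and cache lines touched when lane i reads element i * stride."""
    addrs = [lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]
    sectors = {a // SECTOR_BYTES for a in addrs}
    lines = {a // LINE_BYTES for a in addrs}
    useful_bytes = WARP_SIZE * elem_bytes
    return {
        "sectors": len(sectors),
        "lines": len(lines),
        "sectors_per_line": len(sectors) / len(lines),
        "useful_fraction": useful_bytes / (len(sectors) * SECTOR_BYTES),
    }


for stride in (1, 2, 16, 32):
    print(stride, warp_access_footprint(stride))
```

In this model a unit stride touches one line and uses all 4 of its sectors. A stride of 32 floats touches 32 lines and uses only 1 sector in each.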

#### ► Key Performance Indicators

The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the [Source Counters](#) section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory request.


Strided access  
is killing us:  
so much waste.
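A quick way to put a number on the waste, using only the averages from the two messages above (1.4 sectors for loads, 2.0 for stores, out of 4 per cache line). Reading the ratio as "share of each touched line that is actually used" is my interpretation, not a metric the tool reports.

```python
SECTORS_PER_LINE = 4  # 128-byte line / 32-byte sectors, per the profiler text


def sector_utilization(avg_sectors_used):
    """Fraction of a touched cache line's sectors that the kernel accesses."""
    return avg_sectors_used / SECTORS_PER_LINE


load_util = sector_utilization(1.4)   # loads: 1.4 of 4 sectors
store_util = sector_utilization(2.0)  # stores: 2.0 of 4 sectors

print(f"loads:  {load_util:.0%} of each touched line used")
print(f"stores: {store_util:.0%} of each touched line used")
```

So loads use about a third of each cache line they touch and stores about half.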

#### ► Scheduler Statistics

Summary of the activity of the schedulers issuing instructions. Each scheduler maintains a pool of warps that it can issue instructions for. The upper bound of warps in the pool (Theoretical Warps) is limited by the launch configuration. On every cycle each scheduler checks the state of the allocated warps in the pool (Active Warps). Active warps that are not stalled (Eligible Warps) are ready to issue their next instruction. From the set of eligible warps the scheduler selects a single warp from which to issue one or more instructions (Issued Warp). On cycles with no eligible warps, the scheduler can skip and no instruction is issued. Having many skipped issue slots indicates poor latency hiding.

| Metric                              | Value |
|-------------------------------------|-------|
| Active Warps Per Scheduler [warp]   | 14.90 |
| Eligible Warps Per Scheduler [warp] | 2.15  |
| Issued Warp Per Scheduler           | 0.35  |
| No Eligible [%]                     | 65.00 |
| One or More Eligible [%]            | 35.00 |

⚠ Issue Slot Utilization: Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 2.9 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of 14.90 active warps per scheduler, but only an average of 2.15 warps were eligible per cycle. Eligible warps are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps, avoid possible load imbalances due to highly different execution durations per warp. Reducing stalls indicated on the [Warp State Statistics](#) and [Source Counters](#) sections can help, too.
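The numbers in this message can be cross-checked against the scheduler metrics with a few divisions. A minimal sketch, with function and key names of my own choosing:

```python
MAX_WARPS_PER_SCHEDULER = 16  # from the Issue Slot Utilization message


def scheduler_summary(active_warps, eligible_warps, issued_per_cycle):
    """Derive simple ratios from the Scheduler Statistics values."""
    return {
        # Average cycles between issued instructions on one scheduler.
        "cycles_per_issue": 1.0 / issued_per_cycle,
        # Share of resident warps that are ready to issue on a given cycle.
        "eligible_share_of_active": eligible_warps / active_warps,
        # How full the warp pool is relative to its upper bound.
        "pool_fill": active_warps / MAX_WARPS_PER_SCHEDULER,
    }


summary = scheduler_summary(active_warps=14.90, eligible_warps=2.15,
                            issued_per_cycle=0.35)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

1 / 0.35 is about 2.86, which matches the reported "every 2.9 cycles". The pool is about 93% full, but only around 14% of those warps are eligible on a given cycle, so the problem is stalls, not occupancy.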



Very low issue-slot utilization:  
warps are stalled waiting on uncoalesced loads/stores (LSU stalls)



Again stalled on LD(ST)

Wavefronts at 94.4%: overhead

| Metric                      | Vector Add      | Naive Matmul               |
|-----------------------------|-----------------|----------------------------|
| <b>Primary Bottleneck</b>   | DRAM Bandwidth  | LSU Instruction Throughput |
| <b>Arithmetic Intensity</b> | Low             | High                       |
| <b>DRAM Utilization</b>     | High            | Near Zero (0.48%)          |
| <b>L1/L2 Efficiency</b>     | Low (Streaming) | High (Reuse/Hit Rate)      |
| <b>Access Pattern</b>       | Coalesced       | Strided / Uncoalesced      |