


| ns]  | Compute Throughput [%] | Memory Throughput [%] | # Registers [register] |
|------|--------------------|-------------------|------------------------|
| 0.00 | 91.56              | 36.53             | 122                    |
| 0.00 | 91.59              | 36.54             | 122                    |

Notably: compute throughput went from 63 $\rightarrow$ 70 $\rightarrow$ 92 w/ cuBLAS

Likely due to:
- register tiling (massive arithmetic intensity, AI; uses 122 registers)
- async
- double buffering
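The register-tiling point can be made concrete with a little arithmetic. This is a minimal sketch, assuming each thread computes a TM x TN block of outputs and, per step along the K dimension, reads TM values of A and TN values of B from shared memory into registers. The names `fma_per_smem_load`, `tm`, and `tn` are illustrative and not taken from the profiled kernel.

```python
def fma_per_smem_load(tm: int, tn: int) -> float:
    """FMAs issued per shared-memory load for one K step of a TM x TN register tile."""
    loads = tm + tn   # one fragment of A (TM values) plus one fragment of B (TN values)
    fmas = tm * tn    # outer product accumulated into TM*TN registers
    return fmas / loads


# Hypothetical tile shapes, chosen only to show the trend.
for tm, tn in [(1, 1), (4, 4), (8, 8)]:
    ratio = fma_per_smem_load(tm, tn)
    print(f"{tm}x{tn} tile: {ratio:.1f} FMA per load, {tm * tn} accumulator registers")
```

Larger tiles raise the math-to-load ratio but cost more accumulator registers per thread, which fits the high register count in the notes.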

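High-level overview of the throughput for compute and memory resources of the GPU. For each unit, the throughput reports the achieved percentage of utilization with respect to the theoretical maximum. Breakdowns show the throughput for each individual sub-metric of Compute and Memory to clearly identify the highest contributor. High-level overview of the utilization for compute and memory resources of the GPU presented as a roofline chart.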

| Metric                      | Value | Adjacent label (cut off) |
|-----------------------------|-------|--------------------------|
| Compute (SM) Throughput [%] | 91.59 | Dur                      |
| Memory Throughput [%]       | 36.54 | Elap                     |
| L1/TEX Cache Throughput [%] | 38.57 | SM                       |
| L2 Cache Throughput [%]     | 13.19 | SM                       |
| DRAM Throughput [%]         | 2.32  | DR                       |

High Throughput: The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To further improve performance, work will likely need to be shifted from the most utilized to another unit.

Compute Bottlenecks: Detect bottlenecks arising from compute capabilities.

much much  
cover  
tasks  
for stuff  
from fram  
much more

Registers are the new 'cache'.



Much less SMEM ↔ register pressure.  
Much fewer loads/stores (L/S), since everything is in registers.



18% occupancy, compared to my 94%.

Naive bottleneck: global memory (loads)

Tiled bottleneck: SMEM to register bus

cuBLAS bottleneck: Math
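The shift of the bottleneck from memory to math can be illustrated with a simple roofline calculation. This is a sketch only: it models a single memory level, and the device limits `PEAK_GFLOPS` and `BW_GB_PER_S` as well as the arithmetic-intensity values are hypothetical numbers chosen for illustration, not measurements from these kernels.

```python
def attainable_gflops(ai_flop_per_byte: float, peak_gflops: float, bw_gb_per_s: float) -> float:
    """Roofline model: performance is capped by math peak or by memory bandwidth."""
    return min(peak_gflops, ai_flop_per_byte * bw_gb_per_s)


def bound(ai_flop_per_byte: float, peak_gflops: float, bw_gb_per_s: float) -> str:
    """Name the limiting resource for a given arithmetic intensity."""
    return "memory" if ai_flop_per_byte * bw_gb_per_s < peak_gflops else "math"


# Hypothetical device limits (assumptions, not from the profile above).
PEAK_GFLOPS = 10_000.0
BW_GB_PER_S = 500.0

for label, ai in [("low AI", 0.25), ("medium AI", 8.0), ("high AI", 64.0)]:
    perf = attainable_gflops(ai, PEAK_GFLOPS, BW_GB_PER_S)
    print(f"{label}: {perf:.0f} GFLOP/s, {bound(ai, PEAK_GFLOPS, BW_GB_PER_S)}-bound")
```

With these assumed limits the ridge point sits at 20 FLOP/byte: kernels below it are limited by data movement, kernels above it by math throughput.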

122 registers/thread $\rightarrow$ only 18% occupancy $\rightarrow$ much more ILP
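The link between register count and occupancy can be estimated with a back-of-the-envelope calculation. This is a rough sketch under assumed hardware limits (65536 registers per SM, 64 resident warps per SM, per-thread registers rounded up to a multiple of 8); these defaults and the function name `register_limited_occupancy` are assumptions for illustration, not values read from the profile. It gives only the register-imposed upper bound and ignores shared memory and block-size limits.

```python
def register_limited_occupancy(
    regs_per_thread: int,
    regs_per_sm: int = 65536,       # assumed register file size per SM
    max_warps_per_sm: int = 64,     # assumed resident-warp limit per SM
    warp_size: int = 32,
    alloc_granularity: int = 8,     # assumed per-thread allocation granularity
) -> float:
    """Upper bound on occupancy imposed by register usage alone."""
    # Round the per-thread register count up to the allocation granularity.
    allocated = -(-regs_per_thread // alloc_granularity) * alloc_granularity
    # Number of warps whose registers fit in the register file.
    warps = min(max_warps_per_sm, regs_per_sm // (allocated * warp_size))
    return warps / max_warps_per_sm


for regs in (32, 64, 122):
    print(f"{regs} regs/thread -> at most {register_limited_occupancy(regs):.0%} occupancy")
```

Under these assumptions, 122 registers per thread caps occupancy at 25%. A lower measured value would be consistent with an additional limit that this sketch does not model. With so few resident warps, latency has to be hidden by independent instructions within each thread, which is where the ILP comes in.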

