



62.9 for naive      94.4 for  
naive

| High-level overview of compute and memory resource utilization; breakdowns identify the highest contributor | Value |
|----------------------------------------------------------------------------------------------------------------------------------------|-------|
| Compute (SM) Throughput [%]                                                                                                            | 70.60 |
| Memory Throughput [%]                                                                                                                  | 84.18 |
| L1/TEX Cache Throughput [%]                                                                                                            | 88.80 |
| L2 Cache Throughput [%]                                                                                                                | 8.95  |
| DRAM Throughput [%]                                                                                                                    | 0.78  |

| Compute Throughput Breakdown               |       |
|--------------------------------------------|-------|
| SM: Inst Executed Pipe Lsu [%]             | 70.60 |
| SM: Mio2rf Writeback Active [%]            | 40.13 |
| SM: Issue Active [%]                       | 39.28 |
| SM: Inst Executed [%]                      | 39.25 |
| SM: Mio Inst Issued [%]                    | 36.95 |
| SM: Pipe Fma Cycles Active [%]             | 26.54 |
| SM: Pipe Alu Cycles Active [%]             | 12.44 |
| SM: Inst Executed Pipe Adu [%]             | 6.60  |
| SM: Mio Pg Read Cycles Active [%]          | 3.25  |
| SM: Mio Pg Write Cycles Active [%]         | 3.25  |
| SM: Pipe Tensor Cycles Active [%]          | 0.86  |
| SM: Pipe Shared Cycles Active [%]          | 0.86  |
| SM: Inst Executed Pipe Uniform [%]         | 0.85  |
| SM: Inst Executed Pipe Cbu Pred On Any [%] | 0.05  |
| SM: Inst Executed Pipe Ipa [%]             | 0     |
| SM: Inst Executed Pipe Fp16 [%]            | 0     |
| SM: Inst Executed Pipe Tex [%]             | 0     |
| SM: Pipe Fp64 Cycles Active [%]            | 0     |
| SM: Inst Executed Pipe Xu [%]              | 0     |
| IDC: Request Cycles Active [%]             | 0     |

88.80 → slightly less than 89.28 in naive 21  
 up from 68 in naive  
 (due to screen refresh)  
 (but we're using screen now)  
 up from 32% in naive  
 (lots of register file writes/reads)  
 down from 31% in naive (likely due to)

(each of vectorized  
loads)

L1 hit rate

88 → 1.7 b/c of

string



down b/c less strain  
on L1

up b/c  
more strain +  
L1 units constantly  
working

|                               | Instructions | Requests   | Wavefronts | % Peak | Bank Conflicts |
|-------------------------------|--------------|------------|------------|--------|----------------|
| Shared Load                   | 41,943,040   | 41,943,040 | 50,365,622 | 76.90  | 12,418         |
| Shared Load Matrix            | 0            | 0          | 0          | 0      | 264,385        |
| Shared Store                  | 2,097,152    | 2,097,152  | 2,363,830  | 3.61   | 4,216          |
| Shared Store From Global Load | 0            | 0          | 0          | 0      | 0              |
| Shared Atomic                 | 0            | 0          | 0          | 0      | 0              |
| Other                         | 0            | 0          | 0          | 0      | 0              |
| Total                         | 44,040,192   | 44,040,192 | 62,864,619 | 80.72  | 281,021        |
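
One way to read the shared-memory table above is as a ratio of the third numeric column (wavefronts, in Nsight Compute's terminology) to requests. The sketch below only redoes that arithmetic with the Shared Load and Shared Store values from the table; the dictionary layout and the helper name `wavefronts_per_request` are illustrative assumptions, not part of the profiler output.

```python
# Values copied from the Shared Load and Shared Store rows of the table above.
shared_load = {"requests": 41_943_040, "wavefronts": 50_365_622}
shared_store = {"requests": 2_097_152, "wavefronts": 2_363_830}


def wavefronts_per_request(row):
    """Average number of wavefronts the hardware needed per request (hypothetical helper)."""
    return row["wavefronts"] / row["requests"]


load_ratio = wavefronts_per_request(shared_load)
store_ratio = wavefronts_per_request(shared_store)

print(f"shared load:  {load_ratio:.3f} wavefronts per request")
print(f"shared store: {store_ratio:.3f} wavefronts per request")
```

Both ratios come out modestly above 1. Whether that indicates serialization depends on the access width per thread, which these notes do not state, so the numbers are a starting point rather than a verdict.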

very good  
arithmetic intensity  
Strm 1x load/lse 20x

very good  
strain usage  
for strain → mrs

| L2 Cache                    |           |           |             |                  |          |               |                   |            |             | L2 Cache                    |                  |          |                  |                    |                  |          |                  |            |            |   |
|-----------------------------|-----------|-----------|-------------|------------------|----------|---------------|-------------------|------------|-------------|-----------------------------|------------------|----------|------------------|--------------------|------------------|----------|------------------|------------|------------|---|
|                             | Requests  | Sectors   | Sectors/Req | % Peak           | Hit Rate | Bytes         | Throughput        | Sectors    | Sectors/Req | % Peak                      | Hit Rate         | Bytes    | Throughput       | Sectors            | Sectors/Req      | % Peak   | Hit Rate         | Bytes      | Throughput |   |
| L1TEX Store                 | 37768     | 131072    | 4           | 0.19             | 100      | 4,194,140,304 | 7,941,40,3554     | 65,536     | 131,072     | 2                           | 0.12             | 100      | 4,194,304        | 4,873,46,396,57    | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Atomic ALU            | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Atomic CAS            | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Reduction             | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Total                 | 2,095,668 | 8,881,712 | 4,000       | 0.98             | 94.48    | 248,205,1984  | 567,743,241,40436 | 11,975,825 | 16,226,719  | 1.35                        | 10.76            | 97.39    | 519,255,008      | 603,562,762,987,70 | 0                | 0        | 0                | 0          | 0          | 0 |
| GCC Total                   | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| ECC Total                   | -         | -         | -           | -                | -        | -             | -                 | -          | -           | -                           | -                | -        | -                | -                  | -                | -        | -                | -          | -          |   |
| L2 Fabric Total             | 98,134    | 344,938   | 3,511       | 0.75             | 92.78    | 1,101,7356    | 70,841,88,519,985 | 35,680     | 19,798      | 1.37                        | 0.47             | 72.38    | 1,123,7659       | 1,108,385,698,71   | 0                | 0        | 0                | 0          | 0          | 0 |
| GPU Total                   | 2,035,233 | 8,455,370 | 1,97        | 0.31             | 94.00    | 27,579,340    | 528,386,22,740,29 | 12,192,003 | 16,389,549  | 1.36                        | 0.47             | 72.38    | 550,065,366,500  | 6,177,059,450,75   | 0                | 0        | 0                | 0          | 0          | 0 |
| L2 Cache Eviction Policies  |           |           |             |                  |          |               |                   |            |             | L2 Cache Eviction Policies  |                  |          |                  |                    |                  |          |                  |            |            |   |
|                             | First     | Hit Rate  | Last        | Hit Rate         | Normal   | Hit Rate      | Normal            | Hit Rate   | Last        | Hit Rate                    | Normal           | Hit Rate | Normal           | Hit Rate           | Normal           | Hit Rate | Normal           | Hit Rate   | Normal     |   |
| L1TEX Load                  | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Store                 | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Atomic                | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L1TEX Total                 | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| L2 Fabric Total             | 0         | 0         | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| GPU Total                   | 2,565     | 160       | 0           | 0                | 0        | 0             | 0                 | 0          | 0           | 0                           | 0                | 0        | 0                | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| Device Memory               |           |           |             |                  |          |               |                   |            |             | Device Memory               |                  |          |                  |                    |                  |          |                  |            |            |   |
|                             | Sectors   | % Peak    | Bytes       | % Peak           | Bytes    | Throughput    | Sectors           | % Peak     | Bytes       | Throughput                  | Sectors          | % Peak   | Bytes            | Throughput         | Sectors          | % Peak   | Bytes            | Throughput |            |   |
| Load                        | 262,148   | 0.78      | 8,388,736   | 15,866,52,000,42 | 0        | 202,148       | 0                 | 0          | 0           | 0                           | 9,151,478,008,55 | 0        | 9,151,478,008,55 | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| Store                       | 0         | 0         | 0           | 0                | 0        | 0             | 262,148           | 0.78       | 8,388,736   | 15,866,52,000,42            | 0                | 202,148  | 0.48             | 8,388,736          | 9,151,478,008,55 | 0        | 9,151,478,008,55 | 0          |            |   |
| Total                       | 262,148   | 0.78      | 8,388,736   | 15,866,52,000,42 | 0        | 202,148       | 0                 | 0          | 0           | 0                           | 9,151,478,008,55 | 0        | 9,151,478,008,55 | 0                  | 0                | 0        | 0                | 0          | 0          |   |
| Scheduler Statistics        |           |           |             |                  |          |               |                   |            |             | Scheduler Statistics        |                  |          |                  |                    |                  |          |                  |            |            |   |
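
Several cells in the memory tables above are garbled, so a quick consistency check helps show which values can be trusted. The sketch below assumes the 32-byte sector size used by NVIDIA GPUs and compares sector counts against reported byte counts for two rows whose figures are still legible; the row labels and the `rows` structure are illustrative.

```python
SECTOR_BYTES = 32  # assumed sector size (32 bytes on NVIDIA GPUs)

# (sectors, reported bytes) taken from legible cells in the tables above.
rows = {
    "device memory load": (262_148, 8_388_736),
    "L2 L1TEX store": (131_072, 4_194_304),
}

results = {}
for name, (sectors, reported_bytes) in rows.items():
    computed = sectors * SECTOR_BYTES
    results[name] = computed == reported_bytes
    status = "ok" if results[name] else "mismatch"
    print(f"{name}: {sectors} sectors -> {computed} bytes ({status})")
```

Both rows agree with the sector arithmetic, which suggests these particular figures survived extraction intact even though the neighbouring throughput cells did not.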

in L1tex

sum is now

faster

In native L2 scales we

responsibly

faster is faster and scales given reports

No. divisible

→ sum

(65) nine to

37 in fleet  
due to latency hiding (


