

| Result  | Kernel                          | Size                     | Time    | Cycles    | GPU                       | SM Frequency | Process            | Attributes |
|---------|---------------------------------|--------------------------|---------|-----------|---------------------------|--------------|--------------------|------------|
| Current | 535 - convolutionConstantKernel | (32, 32, 1024)x(8, 8, 1) | 1.27 ms | 1,391,144 | 0 - NVIDIA A100-SXM4-40GB | 1.09 GHz     | [3874072] xray_cnn |            |

#### GPU Speed Of Light Throughput

High-level overview of the throughput for compute and memory resources of the GPU. For each unit, the throughput reports the achieved percentage of utilization with respect to the theoretical maximum. Breakdowns show the throughput for each individual sub-metric of Compute and Memory to clearly identify the highest contributor. High-level overview of the utilization for compute and memory resources of the GPU presented as a roofline chart.

|                             |       |                          |            |
|-----------------------------|-------|--------------------------|------------|
| Compute (SM) Throughput [%] | 87.60 | Duration [ms]            | 1.27       |
| Memory Throughput [%]       | 69.94 | Elapsed Cycles [cycle]   | 1391144    |
| L1/TEX Cache Throughput [%] | 70.10 | SM Active Cycles [cycle] | 1387877.61 |
| L2 Cache Throughput [%]     | 37.58 | SM Frequency [GHz]       | 1.09       |
| DRAM Throughput [%]         | 13.64 | DRAM Frequency [GHz]     | 1.21       |
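
As a rough cross-check of the table above, the reported duration can be approximated from the elapsed cycle count and the SM frequency. This is a minimal sketch, assuming the rounded 1.09 GHz figure is close to the actual average clock; the variable names are illustrative and not part of the report.

```python
# Values copied from the Speed Of Light table.
elapsed_cycles = 1_391_144
sm_frequency_ghz = 1.09  # rounded in the report, so the result is approximate

# cycles / (cycles per second) gives seconds; scale to milliseconds.
duration_ms = elapsed_cycles / (sm_frequency_ghz * 1e9) * 1e3

print(f"estimated duration: {duration_ms:.3f} ms (report: 1.27 ms)")
```

The estimate lands within about 0.01 ms of the reported 1.27 ms; the remaining gap is consistent with the frequency being rounded to two decimals.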

High Throughput: The kernel is using more than 80.0% of the available compute or memory performance of the device. To improve performance further, work will likely need to be shifted from the most utilized unit to another unit. Start by analyzing workloads in the [Compute Workload Analysis](#) section.

Roofline Analysis: The ratio of peak float (fp32) to double (fp64) performance on this device is 2:1. The kernel achieved 6% of this device's fp32 peak performance and 0% of its fp64 peak performance. See the [Kernel Profiling Guide](#) for more details on roofline analysis.

#### Floating Point Operations Roofline



#### PM Sampling

Timeline view of PM metrics sampled periodically over the workload duration. Data is collected across multiple passes. Use this section to understand how workload behavior changes over its runtime.

|                                   |       |                          |   |
|-----------------------------------|-------|--------------------------|---|
| Maximum Sampling Interval [cycle] | 20000 | # Pass Groups            | 4 |
| Maximum Buffer Size [Mbyte]       | 1.05  | Dropped Samples [sample] | 0 |

#### Compute Workload Analysis

Detailed analysis of the compute resources of the streaming multiprocessors (SM), including the achieved instructions per clock (IPC) and the utilization of each available pipeline. Pipelines with very high utilization might limit the overall performance.

|                                   |      |                      |       |
|-----------------------------------|------|----------------------|-------|
| Executed Ipc Elapsed [inst/cycle] | 3.50 | SM Busy [%]          | 87.80 |
| Executed Ipc Active [inst/cycle]  | 3.51 | Issue Slots Busy [%] | 87.80 |
| Issued Ipc Active [inst/cycle]    | 3.51 |                      |       |

High Utilization: ALU, which executes integer and logic operations, is the highest-utilized pipeline (66.5%), both based on active cycles (taking into account the rates of its different instructions) and based on the number of executed instructions. The pipeline is well-utilized, but might become a bottleneck if more work is added. Comparing the two measures, the overall pipeline utilization appears to be caused by frequent, low-latency instructions. See the [Kernel Profiling Guide](#) or hover over the pipeline name to understand the workloads handled by each pipeline. The [Instruction Statistics](#) section shows the mix of executed instructions in this kernel.

#### Key Performance Indicators

#### Memory Workload Analysis

Detailed analysis of the memory resources of the GPU. Memory can become a limiting factor for the overall kernel performance when fully utilizing the involved hardware units (Mem Busy), exhausting the available communication bandwidth between those units (Max Bandwidth), or by reaching the maximum throughput of issuing memory instructions (Mem Pipes Busy). Detailed chart of the memory units. Detailed tables with data for each memory unit.

|                                 |        |                      |       |
|---------------------------------|--------|----------------------|-------|
| Memory Throughput [Gbyte/s]     | 212.05 | Mem Busy [%]         | 69.94 |
| L1/TEX Hit Rate [%]             | 49.56  | Max Bandwidth [%]    | 29.79 |
| L2 Hit Rate [%]                 | 98.52  | Mem Pipes Busy [%]   | 36.30 |
| L2 Compression Success Rate [%] | 0      | L2 Compression Ratio | 0     |
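
The reported memory throughput can be related to the DRAM Throughput percentage from the Speed Of Light section. The sketch below assumes a peak DRAM bandwidth of 1555 Gbyte/s for the A100-SXM4-40GB; that figure comes from the public product specification, not from this report, so treat it as an assumption.

```python
# Value from the Memory Workload Analysis table.
memory_throughput_gbs = 212.05

# ASSUMPTION: peak DRAM bandwidth of the A100-SXM4-40GB (not stated in the report).
assumed_peak_dram_gbs = 1555.0

# Share of the assumed peak that the kernel actually used.
dram_pct = 100 * memory_throughput_gbs / assumed_peak_dram_gbs

print(f"DRAM utilization: {dram_pct:.2f} % (report: 13.64 %)")
```

Under that assumption the two figures agree, which suggests the kernel is far from DRAM-bound and that most traffic is served by the caches.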

L1TEX Global Load Access Pattern: The memory access pattern for global loads from L1TEX might not be optimal. On average, only 15.4 of the 32 bytes transmitted per sector are utilized by each thread. This could be caused by a stride between threads. Check the [Source Counters](#) section for uncoalesced global loads.
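
To illustrate how a stride between threads lowers the bytes used per sector, the following sketch models one warp of 32 threads, each loading a 4-byte element, and counts how many requested bytes fall into each 32-byte sector that must be transferred. The function name, the element size, and the strides tried are hypothetical; the report does not say which access pattern the kernel actually uses.

```python
def bytes_used_per_sector(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Average number of requested bytes per sector touched by one warp."""
    # Byte address of the element loaded by each thread in the warp.
    starts = [t * stride_elems * elem_bytes for t in range(warp_size)]
    # Every distinct byte the warp actually requests.
    used = {s + b for s in starts for b in range(elem_bytes)}
    # Every sector that has to be transferred to serve those bytes.
    sectors = {byte // sector_bytes for byte in used}
    return len(used) / len(sectors)

for stride in (1, 2, 8):
    print(f"stride {stride}: {bytes_used_per_sector(stride):.1f} bytes used per sector")
```

In this model a unit stride uses all 32 bytes of every sector, while a stride of two elements uses 16, which is in the neighborhood of the reported 15.4. That similarity is only suggestive; the Source Counters section is the place to confirm the real pattern.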

#### Shared Load Bank Conflicts

Est. Speedup: 34.42%. The memory access pattern for shared loads might not be optimal and causes, on average, a 2.1-way bank conflict across all 18874368 shared load requests. This results in 19054462 bank conflicts, which represent 49.10% of the overall 38809974 wavefronts for shared loads. Check the [Source Counters](#) section for uncoalesced shared loads.

#### Shared Store Bank Conflicts

Est. Speedup: 17.48%. The memory access pattern for shared stores might not be optimal and causes, on average, a 1.5-way bank conflict across all 6291456 shared store requests. This results in 2312690 bank conflicts, which represent 24.94% of the overall 9273396 wavefronts for shared stores. Check the [Source Counters](#) section for uncoalesced shared stores.
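
The figures quoted by the shared load and shared store rules can be reproduced from the raw counts. This is a small sketch with a hypothetical helper name; it assumes the "N-way" figure is wavefronts per request and the percentage is conflicts per wavefront, which matches the reported numbers.

```python
def bank_conflict_stats(requests, wavefronts, conflicts):
    """Return (average conflict ways, conflict share of wavefronts in percent)."""
    ways = wavefronts / requests                 # wavefronts needed per request
    conflict_pct = 100 * conflicts / wavefronts  # wavefronts caused by conflicts
    return ways, conflict_pct

# Counts copied from the two rule messages.
load_ways, load_pct = bank_conflict_stats(18_874_368, 38_809_974, 19_054_462)
store_ways, store_pct = bank_conflict_stats(6_291_456, 9_273_396, 2_312_690)

print(f"shared loads:  {load_ways:.1f}-way, {load_pct:.2f} % of wavefronts")
print(f"shared stores: {store_ways:.1f}-way, {store_pct:.2f} % of wavefronts")
```

A conflict-free access would need one wavefront per request, so roughly half of the shared load wavefronts and a quarter of the shared store wavefronts are overhead.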

#### Memory Chart

Chart not reproduced here. Legend: values show transfer size; inactive units are greyed out.



#### Scheduler Statistics

Summary of the activity of the schedulers issuing instructions. Each scheduler maintains a pool of warps that it can issue instructions for. The upper bound of warps in the pool (Theoretical Warps) is limited by the launch configuration. On every cycle each scheduler checks the state of the allocated warps in the pool (Active Warps). Active warps that are not stalled (Eligible Warps) are ready to issue their next instruction. From the set of eligible warps the scheduler selects a single warp from which to issue one or more instructions (Issued Warp). On cycles with no eligible warps, the issue slot is skipped and no instruction is issued. Having many skipped issue slots indicates poor latency hiding.

|                                     |       |                          |       |
|-------------------------------------|-------|--------------------------|-------|
| Active Warps Per Scheduler [warp]   | 14.67 | No Eligible [%]          | 12.19 |
| Eligible Warps Per Scheduler [warp] | 5.57  | One or More Eligible [%] | 87.81 |
| Issued Warp Per Scheduler           | 0.88  |                          |       |
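
The scheduler figures are tied together by simple arithmetic: a scheduler issues on every cycle that has at least one eligible warp. The sketch below assumes four schedulers per SM, which is not stated in this table but is consistent with the SMSP cycle totals reported further down.

```python
# Value from the Scheduler Statistics table.
no_eligible_pct = 12.19

# ASSUMPTION: four schedulers (SM sub-partitions) per SM on this GPU.
schedulers_per_sm = 4

# Fraction of cycles on which a scheduler issued a warp.
issued_warp_per_scheduler = 1 - no_eligible_pct / 100

# Each scheduler issues at most one warp per cycle, so the per-SM issue
# rate is the per-scheduler rate times the number of schedulers.
issued_ipc_per_sm = schedulers_per_sm * issued_warp_per_scheduler

print(f"issued warp per scheduler: {issued_warp_per_scheduler:.2f}")
print(f"issued IPC per SM:         {issued_ipc_per_sm:.2f}")
```

Both results line up with the report: 0.88 issued warps per scheduler and an issued IPC of about 3.51 in the Compute Workload Analysis section.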

#### Warp State Statistics

Analysis of the states in which all warps spent cycles during the kernel execution. The warp states describe a warp's readiness or inability to issue its next instruction. The warp cycles per instruction define the latency between two consecutive instructions. The higher the value, the more warp parallelism is required to hide this latency. For each warp state, the chart shows the average number of cycles spent in that state per issued instruction. Stalls do not always impact overall performance, nor are they completely avoidable. Only focus on stall reasons if the schedulers fail to issue every cycle. When executing a kernel with mixed library and user code, these metrics show the combined values.

|                                              |       |                                          |       |
|----------------------------------------------|-------|------------------------------------------|-------|
| Warp Cycles Per Issued Instruction [cycle]   | 16.71 | Avg. Active Threads Per Warp             | 32    |
| Warp Cycles Per Executed Instruction [cycle] | 16.71 | Avg. Not Predicated Off Threads Per Warp | 28.78 |
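
The warp cycles per issued instruction can be approximated from the Scheduler Statistics values: if a scheduler holds a given number of active warps and issues at a given rate, each warp waits on average their ratio between two of its instructions. This is a sketch under that assumption, using 1 minus the No Eligible share as the issue rate.

```python
# Values from the Scheduler Statistics and Warp State Statistics tables.
active_warps_per_scheduler = 14.67
no_eligible_pct = 12.19
not_predicated_off_threads = 28.78
warp_size = 32

# Warps issued per cycle by one scheduler.
issue_rate = 1 - no_eligible_pct / 100

# Average cycles between two consecutive instructions of the same warp.
warp_cycles_per_instruction = active_warps_per_scheduler / issue_rate

# Share of threads in a warp that are not predicated off on average.
thread_utilization_pct = 100 * not_predicated_off_threads / warp_size

print(f"warp cycles per issued instruction: {warp_cycles_per_instruction:.2f}")
print(f"threads doing useful work:          {thread_utilization_pct:.1f} %")
```

The first value comes out close to the reported 16.71 cycles. The second shows that predication leaves about a tenth of the threads idle per instruction even though all 32 threads of each warp are active.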

#### Instruction Statistics

Statistics of the executed low-level assembly instructions (SASS). The instruction mix provides insight into the types and frequency of the executed instructions. A narrow mix of instruction types implies a dependency on few instruction pipelines, while others remain unused. Using multiple pipelines allows hiding latencies and enables parallel execution. Note that 'Instructions/Opcode' and 'Executed Instructions' are measured differently and can differ if cycles are spent in system calls.

|                              |           |                                                 |            |
|------------------------------|-----------|-------------------------------------------------|------------|
| Executed Instructions [inst] | 526385152 | Avg. Executed Instructions Per Scheduler [inst] | 1218484.15 |
| Issued Instructions [inst]   | 526404719 | Avg. Issued Instructions Per Scheduler [inst]   | 1218529.44 |
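
The instruction counts above connect to the IPC values in the Compute Workload Analysis section. The sketch below recomputes them from the totals; it assumes IPC is reported per SM and that each of the 108 SMs has four schedulers, neither of which is spelled out in the report.

```python
# Values copied from the report.
executed_instructions = 526_385_152
issued_instructions = 526_404_719
num_sms = 108
elapsed_cycles = 1_391_144
sm_active_cycles = 1_387_877.61

# ASSUMPTION: four schedulers per SM.
schedulers = num_sms * 4

# Instructions per cycle per SM, over elapsed and over active cycles.
ipc_elapsed = executed_instructions / num_sms / elapsed_cycles
ipc_active = executed_instructions / num_sms / sm_active_cycles

# Average instruction counts handled by a single scheduler.
avg_executed_per_scheduler = executed_instructions / schedulers
avg_issued_per_scheduler = issued_instructions / schedulers

print(f"IPC elapsed: {ipc_elapsed:.2f}, IPC active: {ipc_active:.2f}")
print(f"executed per scheduler: {avg_executed_per_scheduler:.2f}")
print(f"issued per scheduler:   {avg_issued_per_scheduler:.2f}")
```

All four results match the reported values after rounding, which supports the per-SM and four-scheduler assumptions.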

#### NVLink Topology

NVLink Topology diagram shows logical NVLink connections with transmit/receive throughput.

#### NVLink Tables

Detailed tables with properties for each NVLink.

#### NUMA Affinity

Non-uniform memory access (NUMA) affinities based on compute and memory distances for all GPUs.

#### Launch Statistics

Summary of the configuration used to launch the kernel. The launch configuration defines the size of the kernel grid, the division of the grid into blocks, and the GPU resources needed to execute the kernel. Choosing an efficient launch configuration maximizes device utilization.

|                                        |          |                                              |                 |
|----------------------------------------|----------|----------------------------------------------|-----------------|
| Grid Size                              | 1048576  | Function Cache Configuration                 | CachePreferNone |
| Registers Per Thread [register/thread] | 31       | Static Shared Memory Per Block [byte/block]  | 0               |
| Block Size                             | 64       | Dynamic Shared Memory Per Block [byte/block] | 400             |
| Threads [thread]                       | 67108864 | Driver Shared Memory Per Block [Kbyte/block] | 1.02            |
| Waves Per SM                           | 303.41   | Shared Memory Configuration Size [Kbyte]     | 65.54           |
| Uses Green Context                     | 0        | # SMs [SM]                                   | 108             |
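
The launch figures follow from the grid and block dimensions shown in the report header, (32, 32, 1024)x(8, 8, 1). This sketch recomputes them; the value of 32 resident blocks per SM is taken from the Occupancy section, and the formula for waves per SM is an assumption that happens to reproduce the reported number.

```python
from math import prod

# Launch dimensions from the report header.
grid_dim = (32, 32, 1024)
block_dim = (8, 8, 1)

num_sms = 108
blocks_per_sm = 32  # limit reported in the Occupancy section

grid_size = prod(grid_dim)        # number of blocks
block_size = prod(block_dim)      # threads per block
threads = grid_size * block_size  # total threads launched

# Number of full "rounds" of resident blocks each SM has to process.
waves_per_sm = grid_size / (num_sms * blocks_per_sm)

print(grid_size, block_size, threads, round(waves_per_sm, 2))
```

With more than 300 waves per SM, any tail effect from a partially filled last wave is small relative to the total runtime.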

#### Occupancy

Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is the percentage of the hardware's ability to process warps that are actively in use. Higher occupancy does not always result in higher performance. Low occupancy, however, always reduces the ability to hide latencies, resulting in overall performance degradation. Large discrepancies between the theoretical and the achieved occupancy during execution typically indicate highly imbalanced workloads.

|                                        |       |                                |    |
|----------------------------------------|-------|--------------------------------|----|
| Theoretical Occupancy [%]              | 100   | Block Limit Registers [block]  | 32 |
| Theoretical Active Warps per SM [warp] | 64    | Block Limit Shared Mem [block] | 42 |
| Achieved Occupancy [%]                 | 92.08 | Block Limit Warps [block]      | 32 |
| Achieved Active Warps Per SM [warp]    | 58.93 | Block Limit SM [block]         | 32 |
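
The block limits in the table can be reproduced from the launch configuration. The sketch below is an approximation with several labeled assumptions that are not stated in the report: 65536 registers per SM, registers allocated per warp in units of 256, and shared memory allocated per block in units of 128 bytes.

```python
import math

# Values from the Launch Statistics and Occupancy tables.
threads_per_block = 64
regs_per_thread = 31
smem_per_block = 400 + 1024  # dynamic + driver shared memory, in bytes
smem_config_bytes = 65536    # 65.54 Kbyte configuration size
max_warps_per_sm = 64
max_blocks_per_sm = 32
achieved_warps_per_sm = 58.93

# ASSUMPTIONS about the hardware (not given in the report).
regs_per_sm = 65536
reg_alloc_unit = 256   # registers, allocated per warp
smem_alloc_unit = 128  # bytes, allocated per block

def round_up(value, unit):
    return math.ceil(value / unit) * unit

warps_per_block = math.ceil(threads_per_block / 32)
regs_per_block = warps_per_block * round_up(regs_per_thread * 32, reg_alloc_unit)

limit_registers = regs_per_sm // regs_per_block
limit_shared_mem = smem_config_bytes // round_up(smem_per_block, smem_alloc_unit)
limit_warps = max_warps_per_sm // warps_per_block
limit_sm = max_blocks_per_sm

# The tightest limit decides how many blocks can be resident on an SM.
blocks = min(limit_registers, limit_shared_mem, limit_warps, limit_sm)
theoretical_occupancy = 100 * blocks * warps_per_block / max_warps_per_sm
achieved_occupancy = 100 * achieved_warps_per_sm / max_warps_per_sm

print(limit_registers, limit_shared_mem, limit_warps, limit_sm)
print(f"theoretical {theoretical_occupancy:.0f} %, achieved {achieved_occupancy:.2f} %")
```

Under these assumptions the limits come out as 32, 42, 32 and 32, matching the table. Registers, warps, and the per-SM block limit all bind at the same point, so only a change that relaxes all three would raise the number of resident blocks.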

#### GPU and Memory Workload Distribution

Analysis of workload distribution in active cycles of SM, SMP, SMSP, L1 & L2 caches, and DRAM.

|                                    |            |                                    |            |
|------------------------------------|------------|------------------------------------|------------|
| Average SM Active Cycles [cycle]   | 1387877.61 | Average L1 Active Cycles [cycle]   | 1387877.61 |
| Average L2 Active Cycles [cycle]   | 1330443.34 | Average SMSP Active Cycles [cycle] | 1387639.61 |
| Average DRAM Active Cycles [cycle] | 210488.40  | Total SM Elapsed Cycles [cycle]    | 150232230  |
| Total L1 Elapsed Cycles [cycle]    | 150232230  | Total L2 Elapsed Cycles [cycle]    | 106715520  |
| Total SMSP Elapsed Cycles [cycle]  | 600928920  | Total DRAM Elapsed Cycles [cycle]  | 61748992   |
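
A few relationships between these counters can be checked directly. The sketch below assumes that the "Total ... Elapsed Cycles" values are sums over all units of that type, so dividing by the unit count gives a per-unit figure; the report does not state this explicitly.

```python
# Values from the workload distribution table and the Speed Of Light section.
avg_sm_active_cycles = 1_387_877.61
elapsed_cycles = 1_391_144
total_sm_elapsed = 150_232_230
total_smsp_elapsed = 600_928_920
num_sms = 108

# Share of the kernel's duration during which an average SM was active.
sm_active_pct = 100 * avg_sm_active_cycles / elapsed_cycles

# ASSUMPTION: totals are summed over units, so this is elapsed cycles per SM.
per_sm_elapsed = total_sm_elapsed / num_sms

# Number of SMSPs per SM implied by the two totals.
smsp_per_sm = total_smsp_elapsed / total_sm_elapsed

print(f"SM active: {sm_active_pct:.2f} % of elapsed cycles")
print(f"elapsed cycles per SM: {per_sm_elapsed:.0f}")
print(f"SMSPs per SM: {smsp_per_sm:.0f}")
```

The SMs are active for nearly the whole kernel, the per-SM elapsed count is within a fraction of a percent of the reported elapsed cycles, and the totals imply four SMSPs per SM.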

#### Source Counters

Source metrics, including branch efficiency and sampled warp stall reasons. Warp Stall Sampling metrics are periodically sampled over the kernel runtime. They indicate when warps were stalled and couldn't be scheduled. See the documentation for a description of all stall reasons. Only focus on stalls if the schedulers fail to issue every cycle.