

# Occupancy explained through the AMD RDNA™ architecture

## François Guthmann








Originally posted December 20, 2023

If you're working with GPUs, chances are you've heard the term *occupancy* thrown around in the context of shader performance. You might have heard that it helps hide memory latency without being sure exactly what that means. If that's the case, you are exactly where you should be! In this blog post we will try to demystify this metric. We will first talk a bit about the hardware architecture to understand where the metric comes from. We will then explain the factors that can limit occupancy, both statically at compile time and dynamically at run time. We will also help you identify occupancy-limited workloads using tools like the [Radeon™ GPU Profiler](#) and offer potential leads to alleviate the issues. Finally, the last section summarizes the concepts touched upon in this post and offers practical solutions to practical problems.

This article assumes, however, that you have a basic understanding of how to work with a GPU. Mainly, we expect you to know how to use the GPU from a graphics API perspective (draws, dispatches, barriers, etc.) and to know that workloads are executed in groups of threads on the GPU. We also expect you to know about the basic resources a shader uses, such as scalar registers, vector registers, and shared memory.

# OCCUPANCY?



# LOGICAL GRAPHICS PIPELINE




# HARDWARE GRAPHICS PIPELINE



# HIGH LEVEL OVERVIEW



# SHADER ENGINE




# WGP

Workgroup Processor (WGP)




# COMPUTE & THREADGROUPS

**dispatch(a, b, c)**




**numthreads(d, e, f)**



# WAVEFRONTS

**dispatch(a, b, c)**



**numthreads(d, e, f)**
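The relationship between `dispatch(a, b, c)`, `numthreads(d, e, f)`, and the number of wavefronts can be sketched with a little arithmetic. This is a minimal illustration, not part of the original slides: the function name `wavefronts_for_dispatch` is hypothetical, and the default wave size of 32 is an assumption (RDNA also supports wave64).

```python
import math

def wavefronts_for_dispatch(dispatch, numthreads, wave_size=32):
    """Return (thread groups, wavefronts per group, total wavefronts).

    dispatch   -- (a, b, c), the thread group counts passed to the dispatch call
    numthreads -- (d, e, f), the thread group dimensions declared in the shader
    wave_size  -- threads per wavefront (assumed 32 here; 64 is also possible)
    """
    a, b, c = dispatch
    d, e, f = numthreads
    groups = a * b * c
    threads_per_group = d * e * f
    # A partially filled wavefront still occupies a whole wavefront slot.
    waves_per_group = math.ceil(threads_per_group / wave_size)
    return groups, waves_per_group, groups * waves_per_group

# 4 x 2 x 1 groups of 8 x 8 x 1 threads each
print(wavefronts_for_dispatch((4, 2, 1), (8, 8, 1)))  # (8, 2, 16)
```

Note how a group of 33 threads would need two wavefronts in wave32 mode, with the second one almost empty.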



# LOCKSTEP EXECUTION

```
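RWBuffer<int> data : register(u0); // assumed declaration; the slide omits it

[numthreads(32, 1, 1)]
void CSMain( uint threadIndex : SV_DispatchThreadID )
{
    int sum = 0;
    if(threadIndex < 16)    // lanes 0-15 of the first wavefront take this side
    {
        sum += 1;
    }
    else                    // every other lane takes this side
    {
        sum += 2;
    }

    data[threadIndex] = sum;
}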
```
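One way to picture lockstep execution of the shader above is as a conceptual model in which the whole wavefront walks through both sides of the branch while a per-lane mask decides which lanes actually write. The sketch below is an illustration under that assumption: the function name `simulate_divergent_wave` is hypothetical, and a real compiler is free to avoid the branch altogether (for example by emitting a per-lane select).

```python
WAVE_SIZE = 32  # assumed wave32

def simulate_divergent_wave(first_thread_index):
    """Run the branchy shader for one wavefront, one masked pass per branch side.

    Returns the per-lane value of `sum` and the number of passes executed.
    """
    thread_index = [first_thread_index + lane for lane in range(WAVE_SIZE)]
    total = [0] * WAVE_SIZE
    # The condition is evaluated for every lane at once and becomes a mask.
    mask = [t < 16 for t in thread_index]
    passes = 0
    if any(mask):
        # "if" side: only lanes whose mask bit is set are updated.
        total = [s + 1 if m else s for s, m in zip(total, mask)]
        passes += 1
    if not all(mask):
        # "else" side: only the remaining lanes are updated.
        total = [s if m else s + 2 for s, m in zip(total, mask)]
        passes += 1
    return total, passes

values, passes = simulate_divergent_wave(0)
print(values[:2], values[16:18], passes)
```

In this model the first wavefront (thread indices 0 to 31) pays for both sides of the branch, while a wavefront whose lanes all agree on the condition pays for only one.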




# SIMD & LOCKSTEP

```
cbuffer input : register(b0){ int data; };
RWBuffer<int> output : register(u0);

[numthreads(32, 1, 1)]
void CSMain( uint threadIndex : SV_DispatchThreadID )
{
    int sum = 0;
    sum += threadIndex;
    sum += data;
    output[threadIndex] = sum;
}
```

Register contents after each statement. `sum` is held in a VGPR (one value per lane), `data` in an SGPR (one value for the whole wavefront):

| After statement              | VGPR lane 0 | VGPR lane 1 | VGPR lane 2 | VGPR lane 3 | ... | SGPR    |
|------------------------------|-------------|-------------|-------------|-------------|-----|---------|
| `int sum = 0;`               | sum: 0      | sum: 0      | sum: 0      | sum: 0      | ... | data: 5 |
| `sum += threadIndex;`        | sum: 0      | sum: 1      | sum: 2      | sum: 3      | ... | data: 5 |
| `sum += data;`               | sum: 5      | sum: 6      | sum: 7      | sum: 8      | ... | data: 5 |
| `output[threadIndex] = sum;` | sum: 5      | sum: 6      | sum: 7      | sum: 8      | ... | data: 5 |
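The difference between the two register types can be mimicked in a few lines: a vector register is modelled as a list with one entry per lane, a scalar register as a single value shared by all lanes. This is only a sketch of the idea; `simulate_uniform_add` is a hypothetical name and the wave size of 32 follows the `numthreads(32, 1, 1)` declaration above.

```python
WAVE_SIZE = 32  # one wavefront, matching numthreads(32, 1, 1)

def simulate_uniform_add(data, first_thread_index=0):
    """Return the per-lane contents of `sum` after each statement of the shader."""
    sgpr_data = data                 # scalar register: one value per wavefront
    vgpr_sum = [0] * WAVE_SIZE       # vector register: one value per lane
    states = [list(vgpr_sum)]        # after `int sum = 0;`

    # sum += threadIndex;  (every lane adds its own index in the same step)
    vgpr_sum = [s + first_thread_index + lane for lane, s in enumerate(vgpr_sum)]
    states.append(list(vgpr_sum))

    # sum += data;  (every lane adds the same scalar value)
    vgpr_sum = [s + sgpr_data for s in vgpr_sum]
    states.append(list(vgpr_sum))
    return states

for row in simulate_uniform_add(5):
    print(row[:4])
```

With `data` set to 5, the first four lanes go through the same values as the first three table rows.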

# ASSIGNED WAVEFRONTS



**Wavefronts don't have to be executed in order**

**Wavefront execution can be interrupted and resumed at any time**

# WAVEFRONT SCHEDULING & LATENCY HIDING
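A toy scheduler makes the latency hiding idea concrete. The sketch below is a deliberately simplified model and not a description of the real hardware: it assumes a SIMD that issues at most one instruction per cycle, a fixed memory latency, a wavefront that must wait for its load before continuing, and a scheduler that simply picks the first ready wavefront. All names and default values are hypothetical.

```python
def simulate_simd(num_waves, alu_cycles=4, mem_latency=20, iterations=3):
    """Return (total cycles, fraction of cycles in which the SIMD issued work).

    Each wavefront repeats `iterations` times: one load, then `alu_cycles`
    ALU instructions that depend on that load.
    """
    program = (["load"] + ["alu"] * alu_cycles) * iterations
    pc = [0] * num_waves          # next instruction of each wavefront
    ready_at = [0] * num_waves    # first cycle at which each wavefront may issue
    cycle = 0
    busy = 0
    while any(p < len(program) for p in pc):
        for w in range(num_waves):
            if pc[w] < len(program) and ready_at[w] <= cycle:
                if program[pc[w]] == "load":
                    # The wavefront stalls until the data comes back.
                    ready_at[w] = cycle + mem_latency
                else:
                    ready_at[w] = cycle + 1
                pc[w] += 1
                busy += 1
                break             # at most one instruction issued per cycle
        cycle += 1
    return cycle, busy / cycle

for waves in (1, 2, 4, 8):
    cycles, utilization = simulate_simd(waves)
    print(f"{waves} wavefront(s): {cycles} cycles, {utilization:.0%} busy")
```

In this model a single wavefront leaves the SIMD idle for most of the memory latency, while additional assigned wavefronts fill those idle cycles. The model ignores caches entirely, so it cannot show the cases where more wavefronts make things worse.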




# LATENCY HIDING IN RGP



**Occupancy is the ratio of assigned wavefronts to the maximum available slots**
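Written out as code, the definition is a single division. The helper below is a hypothetical illustration; the default of 16 wave slots per SIMD is taken from the Radeon RX 7900 XTX device configuration quoted later in this post and will differ on other GPUs.

```python
def occupancy(assigned_wavefronts, max_wave_slots=16):
    """Ratio of assigned wavefronts to available wave slots on one SIMD."""
    if not 0 <= assigned_wavefronts <= max_wave_slots:
        raise ValueError("assigned wavefronts must be between 0 and the slot count")
    return assigned_wavefronts / max_wave_slots

print(occupancy(10))  # 0.625
```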




# Better occupancy doesn't mean better performance!



# Latency-bound workloads \*might\* benefit from increased occupancy



# In memory-bound scenarios, increasing occupancy might thrash the caches



# THEORETICAL OCCUPANCY – GPRS

```
RWBuffer<int> data : register(u0); // assumed declaration; the slide omits it

[numthreads(32, 1, 1)]
void CSMain( uint threadIndex :
             SV_DispatchThreadID )
{
    int sum = 0;
    if(threadIndex < 16)
    {
        sum += 1;
    }
    else
    {
        sum += 2;
    }

    data[threadIndex] = sum;
}
```

```
shader main
asic(GFX10_3)
type(CS)
sgpr_count(6)
vgpr_count(8)                                       // VGPRs reserved for each wavefront
wave_size(32)
s_version                 UC_VERSION_GFX10 | UC_VERSION_W32_BIT
s_inst_prefetch           0x0003
s_getpc_b64               s[0:1]
s_mov_b32                 s0, s2
s_load_dwordx4            s[4:7], s[0:1], null      // load the buffer descriptor
v_mad_u32_u24             v1, s3, 32, v0            // v1 = s3 * 32 + v0 (thread index)
v_cmp_gt_u32              vcc_lo, 16, v1            // per-lane mask: 16 > thread index
v_cndmask_b32             v2, 2, 1, vcc_lo          // v2 = mask ? 1 : 2 (a select, no branch)
v_mov_b32                 v3, v2
v_mov_b32                 v4, v2
v_mov_b32                 v5, v2
s_waitcnt                 lgkmcnt(0)                // wait for the descriptor load
buffer_store_format_xyzw  v[2:5], v1, s[4:7], 0 idxen glc
s_endpgm
```


# RADEON™ GPU PROFILER TO THE RESCUE



## For max occupancy

- **Wave32: $1536 / 16 = 96$ VGPRs per wave**
- **Wave64: $1536 / 2 / 16 = 48$ VGPRs per wave**
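The same arithmetic can be turned around to estimate how many wavefronts fit on a SIMD for a given register count. The sketch below is a simplified estimate under stated assumptions: the function name is hypothetical, the 1536 registers and 16 wave slots are the figures used on this slide, and the hardware's register allocation granularity is ignored, so real tools may report a lower number.

```python
def max_waves_by_vgprs(vgprs_per_wave, wave_size=32,
                       vgprs_per_simd=1536, wave_slots=16):
    """Estimate how many wavefronts fit on one SIMD given their VGPR count.

    The per-SIMD register budget is halved for wave64, matching the
    1536 / 2 / 16 calculation above.
    """
    if vgprs_per_wave <= 0:
        raise ValueError("a wavefront needs at least one VGPR")
    budget = vgprs_per_simd * 32 // wave_size
    return min(wave_slots, budget // vgprs_per_wave)

print(max_waves_by_vgprs(96))                 # wave32 at the limit for full occupancy
print(max_waves_by_vgprs(97))                 # one register too many
print(max_waves_by_vgprs(48, wave_size=64))   # wave64 at the limit
```

A shader that needs only a handful of registers is capped by the 16 wave slots rather than by the register file.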

Figure: Radeon™ GPU Profiler, Overview tab, Device configuration pane for an AMD Radeon RX 7900 XTX. The pane lists system information, GPU information, shader core properties, and memory properties.

Figure: Radeon™ GPU Profiler, pipeline state pane. The highlighted "Theoretical wavefront occupancy" section reads: "The occupancy of this shader is limited by its vector register usage. This shader could potentially run 10 wavefronts out of 16 wavefronts per SIMD. However, if you reduce vector register usage by 12 you could run another wavefront."

# RADEON™ GPU ANALYZER TO THE RESCUE




gfx1100 (RDNA3)

| Address  | Opcode        | Operands                                                      | VGPR pressure (used:61, allocated:72/256) |
|----------|---------------|---------------------------------------------------------------|-------------------------------------------|
| 0x001230 | v_add_f32_e32 | v46, v46, v36                                                 | 58                                        |
| 0x001234 | s_delay_alu   | instid0(VALU_DEP_4)   instskip(NEXT)   instid1(VALU_DEP_4)    | 58                                        |
| 0x001238 | v_add_f32_e32 | v50, v50, v37                                                 | 58                                        |
| 0x00123C | v_mul_f32_e32 | v36, v15, v44                                                 | 58                                        |
| 0x001240 | v_mul_f32_e32 | v37, v15, v14                                                 | 58                                        |
| 0x001244 | v_mul_f32_e64 | v41, 0.15915494, s43                                          | 59                                        |
| 0x00124C | s_delay_alu   | instid0(VALU_DEP_3)   instskip(NEXT)   instid1(VALU_DEP_3)    | 59                                        |
| 0x001250 | v_add_f32_e32 | v36, -1.0, v36                                                | 59                                        |
| 0x001254 | v_mul_f32_e32 | v44, s72, v37                                                 | 60                                        |
| 0x001258 | v_mul_f32_e32 | v45, s74, v37                                                 | 61                                        |
| 0x00125C | v_mul_f32_e32 | v37, s73, v37                                                 | 61                                        |
| 0x001260 | v_cos_f32_e32 | v41, v41                                                      | 61                                        |
| 0x001264 | s_delay_alu   | instid0(VALU_DEP_3)   instskip(SKIP_1)   instid1(VALU_DEP_3)  | 61                                        |
| 0x001268 | v_add_f32_e32 | v44, v46, v44                                                 | 61                                        |
| 0x00126C | v_mul_f32_e32 | v46, v14, v14                                                 | 61                                        |
| 0x001270 | v_add_f32_e32 | v50, v50, v37                                                 | 61                                        |
| 0x001274 | v_mul_f32_e32 | v37, s77, v36                                                 | 61                                        |
| 0x001278 | v_add_f32_e32 | v21, v21, v45                                                 | 61                                        |
| 0x00127C | s_delay_alu   | instid0(VALU_DEP_4)   instskip(SKIP_4)   instid1(VALU_DEP_4)  | 60                                        |
| 0x001280 | v_sub_f32_e32 | v45, v60, v46                                                 | 61                                        |
| 0x001284 | v_mul_f32_e32 | v46, s78, v36                                                 | 60                                        |
| 0x001288 | v_mul_f32_e32 | v36, s76, v36                                                 | 60                                        |
| 0x00128C | v_add_f32_e32 | v50, v50, v37                                                 | 60                                        |
| 0x001290 | v_mul_f32_e32 | v37, v15, v13                                                 | 60                                        |
| 0x001294 | v_add_f32_e32 | v21, v21, v46                                                 | 60                                        |
| 0x001298 | s_delay_alu   | instid0(VALU_DEP_4)   instskip(NEXT)   instid1(TRANS32_DEP_1) | 59                                        |
| 0x00129C | v_add_f32_e32 | v46, v44, v36                                                 | 60                                        |

# THEORETICAL OCCUPANCY – LDS & THREADGROUP\_SIZE

Workgroup Processor (WGP)
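Since all wavefronts of a thread group run on the same WGP and share its LDS, the LDS allocation and the thread group size together bound how many wavefronts can be resident. The sketch below is a rough estimate only: the function name is hypothetical, the 128 KB of LDS, 4 SIMDs per WGP, and 16 wave slots per SIMD are the Radeon RX 7900 XTX figures quoted later in this post, and it assumes the whole LDS is available to a single dispatch, which may not hold in practice.

```python
import math

def max_waves_by_lds(lds_bytes_per_group, threads_per_group, wave_size=32,
                     lds_per_wgp=128 * 1024, simds_per_wgp=4,
                     wave_slots_per_simd=16):
    """Estimate how many wavefronts can be resident on one WGP."""
    waves_per_group = math.ceil(threads_per_group / wave_size)
    slots_per_wgp = simds_per_wgp * wave_slots_per_simd
    # Whole thread groups must fit: first by wave slots ...
    groups_by_slots = slots_per_wgp // waves_per_group
    # ... then by LDS, when the shader uses any.
    if lds_bytes_per_group > 0:
        groups_by_lds = lds_per_wgp // lds_bytes_per_group
    else:
        groups_by_lds = groups_by_slots
    return min(groups_by_slots, groups_by_lds) * waves_per_group

# 64 threads per group, 32 KB of LDS per group
waves = max_waves_by_lds(32 * 1024, 64)
print(waves, "wavefronts out of", 4 * 16)
```

In this example only four thread groups fit in LDS at once, so the WGP holds 8 wavefronts out of 64 slots regardless of register usage.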



# MEASURED OCCUPANCY




# LACK OF WORK LIMITED OCCUPANCY

| System information   |          |
|----------------------|----------|
| Processor name:      | AMD      |
| Clock speed:         | 4691 MHz |
| Physical cores:      | 12       |
| Logical cores:       | 24       |
| System memory (RAM): | 64 GB    |

| GPU information           |                        |
|---------------------------|------------------------|
| Device name:              | AMD Radeon RX 7900 XTX |
| Device ID (and revision): | 744CC8                 |

| Shader core                              |                          |
|------------------------------------------|--------------------------|
| Shader core clock frequency:             | 2304 MHz (2304 MHz peak) |
| Shader engines:                          | 6                        |
| Work group processors per shader engine: | 8                        |
| SIMD per work group processor:           | 4                        |
| Wavefronts per SIMD:                     | 16                       |
| Vector registers per SIMD:               | 1536                     |
| Scalar registers per SIMD:               | 2048                     |

| Memory                                   |                          |
|------------------------------------------|--------------------------|
| Video memory clock frequency:            | 1250 MHz (1250 MHz peak) |
| Video memory bandwidth:                  | 960.0 GB/s               |
| Video memory size:                       | 24 GB                    |
| Video memory type:                       | GDDR6                    |
| L0 vector cache size per compute unit:   | 16 KB                    |
| L1 cache size per shader array:          | 256 KB                   |
| L2 cache size:                           | 4 MB                     |
| Infinity cache size:                     | 96 MB                    |
| Instruction cache size per compute unit: | 32 KB                    |
| Scalar cache size per compute unit       | 16 KB                    |
| LDS size per work group processor:       | 128 KB                   |

**Shader Engines (SE): 6**  
**WorkGroup Processors (WGP) / SE: 8**  
**Total WGP:  $6 * 8 = 48$**   
**SIMD per WGP: 4**  
**Total SIMD:  $4 * 48 = 192$**   
**Wave slots per SIMD: 16**

# LACK OF WORK LIMITED OCCUPANCY

The image shows a screenshot of the AMD GPU Open interface. At the top, there are navigation buttons for 'START' and 'OVERVIEW' (which is underlined), and 'EVENTS'. On the left, a sidebar lists 'Frame summary', 'Barriers', 'Context rolls', 'Most expensive events', 'Render/depth targets', 'Pipelines', and 'Device configuration' (which is selected). The main area displays the AMD logo and 'System information' for an AMD Radeon RX 7900 XTX. The information includes:

| System Information   | Value    |
|----------------------|----------|
| Processor name:      | AMD      |
| Clock speed:         | 4691 MHz |
| Physical cores:      | 12       |
| Logical cores:       | 24       |
| System memory (RAM): | 64 GB    |

**GPU information**

| GPU Information           | Value                  |
|---------------------------|------------------------|
| Device name:              | AMD Radeon RX 7900 XTX |
| Device ID (and revision): | 744CC8                 |

**Shader core**

| Shader Core Information                  | Value                    |
|------------------------------------------|--------------------------|
| Shader core clock frequency:             | 2304 MHz (2304 MHz peak) |
| Shader engines:                          | 6                        |
| Work group processors per shader engine: | 8                        |
| SIMD per work group processor:           | 4                        |
| Wavefronts per SIMD:                     | 16                       |
| Vector registers per SIMD:               | 1536                     |
| Scalar registers per SIMD:               | 2048                     |

**Memory**

| Memory Information                       | Value                    |
|------------------------------------------|--------------------------|
| Video memory clock frequency:            | 1250 MHz (1250 MHz peak) |
| Video memory bandwidth:                  | 960.0 GB/s               |
| Video memory size:                       | 24 GB                    |
| Video memory type:                       | GDDR6                    |
| L0 vector cache size per compute unit:   | 16 KB                    |
| L1 cache size per shader array:          | 256 KB                   |
| L2 cache size:                           | 4 MB                     |
| Infinity cache size:                     | 96 MB                    |
| Instruction cache size per compute unit: | 32 KB                    |
| Scalar cache size per compute unit:      | 16 KB                    |
| LDS size per work group processor:       | 128 KB                   |

**Shader Engines (SE): 6**  
**WorkGroup Processors (WGP) / SE: 8**  
**Total WGP:  $6 * 8 = 48$**   
**SIMD per WGP: 4**  
**Total SIMD:  $4 * 48 = 192$**   
**Wave slots per SIMD: 16**  
**Wave slots:  $16 * 192 = 3072$**

# LACK OF WORK LIMITED OCCUPANCY

The image shows a screenshot of the Radeon™ GPU Profiler. At the top are the 'START', 'OVERVIEW' (selected) and 'EVENTS' tabs. The sidebar on the left lists 'Frame summary', 'Barriers', 'Context rolls', 'Most expensive events', 'Render/depth targets', 'Pipelines', and 'Device configuration' (selected). The main area shows the system information for an AMD Radeon RX 7900 XTX:

| System information   |          |
|----------------------|----------|
| Processor name:      | AMD      |
| Clock speed:         | 4691 MHz |
| Physical cores:      | 12       |
| Logical cores:       | 24       |
| System memory (RAM): | 64 GB    |

| GPU information           |                        |
|---------------------------|------------------------|
| Device name:              | AMD Radeon RX 7900 XTX |
| Device ID (and revision): | 744CC8                 |

| Shader core                              |                          |
|------------------------------------------|--------------------------|
| Shader core clock frequency:             | 2304 MHz (2304 MHz peak) |
| Shader engines:                          | 6                        |
| Work group processors per shader engine: | 8                        |
| SIMD per work group processor:           | 4                        |
| Wavefronts per SIMD:                     | 16                       |
| Vector registers per SIMD:               | 1536                     |
| Scalar registers per SIMD:               | 2048                     |

| Memory                                   |                          |
|------------------------------------------|--------------------------|
| Video memory clock frequency:            | 1250 MHz (1250 MHz peak) |
| Video memory bandwidth:                  | 960.0 GB/s               |
| Video memory size:                       | 24 GB                    |
| Video memory type:                       | GDDR6                    |
| L0 vector cache size per compute unit:   | 16 KB                    |
| L1 cache size per shader array:          | 256 KB                   |
| L2 cache size:                           | 4 MB                     |
| Infinity cache size:                     | 96 MB                    |
| Instruction cache size per compute unit: | 32 KB                    |
| Scalar cache size per compute unit:      | 16 KB                    |
| LDS size per work group processor:       | 128 KB                   |

**Shader Engines (SE): 6**  
**WorkGroup Processors (WGP) / SE: 8**  
**Total WGP:  $6 * 8 = 48$**   
**SIMD per WGP: 4**  
**Total SIMD:  $4 * 48 = 192$**   
**Wave slots per SIMD: 16**  
**Wave slots:  $16 * 192 = 3072$**

**Max occupancy**  
**510 / 3072 = 16.6%**
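The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the function and constant names are made up for illustration, the hardware figures are copied from the device configuration above, and the 510 wavefronts are taken from the slide as given.

```python
# Hardware figures copied from the device configuration above (RX 7900 XTX).
SHADER_ENGINES = 6
WGP_PER_SE = 8
SIMD_PER_WGP = 4
WAVE_SLOTS_PER_SIMD = 16


def total_wave_slots(shader_engines=SHADER_ENGINES,
                     wgp_per_se=WGP_PER_SE,
                     simd_per_wgp=SIMD_PER_WGP,
                     wave_slots_per_simd=WAVE_SLOTS_PER_SIMD):
    """Number of wavefronts the whole GPU can hold at once."""
    total_wgp = shader_engines * wgp_per_se      # 6 * 8 = 48
    total_simd = total_wgp * simd_per_wgp        # 4 * 48 = 192
    return total_simd * wave_slots_per_simd      # 16 * 192 = 3072


def max_occupancy(wavefronts, wave_slots):
    """Upper bound on occupancy when only `wavefronts` waves exist.

    A workload cannot fill more slots than it has wavefronts, and it
    cannot fill more slots than the GPU provides.
    """
    return min(wavefronts, wave_slots) / wave_slots


slots = total_wave_slots()
occupancy = max_occupancy(510, slots)
print(f"{slots} wave slots, max occupancy {occupancy:.1%}")
# prints "3072 wave slots, max occupancy 16.6%"
```

Plugging in a larger wavefront count shows how quickly the bound stops mattering: any workload with at least 3072 wavefronts is no longer limited by lack of work, although other limiters may still apply.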

# FILL THE GPU WITH ENOUGH WORK



# FEED DEM GPUs



# OCCUPANCY GAP



# LET YOUR WORKLOADS OVERLAP



# LAUNCH RATE LIMITED WORKLOAD



# GEOMETRY WORKLOADS



## Mesh shaders on AMD RDNA™ graphics cards

Despite the flexibility and performance mesh shading can add to the geometry stage, we find that the technology has not been widely adopted in rendering engines so far. The purpose of this article series is to revisit mesh shading five years after its initial rollout between 2018 and 2019.

As a result, this blog series aims to demystify mesh shading by providing more detailed explanations, analysis, use-case examples, tutorials, and general advice.

- Part 1: From vertex shader to mesh shader
- Part 2: Optimization and best practices
- Part 3: Font- and vector-art rendering with mesh shaders
- Part 4: Procedural grass rendering



**Max Oberberger**

*Max is part of AMD's GPU Architecture and Software Technologies Team. His current focus is GPU work graphs and mesh shader research.*



**Bastian Kuth**

*Bastian is a PhD candidate at Coburg University and University of Erlangen-Nuremberg. His research focuses on real-time geometry processing on GPUs.*



**Quirin Meyer**

*Before becoming a computer graphics professor at Coburg University, Quirin Meyer obtained a Ph.D. in graphics and worked as a software engineer in the industry. His research focuses on real-time geometry processing primarily on GPUs.*

# OCCUPANCY LIMITERS



# Q: Does better occupancy necessarily mean better performance?


# Q: When should I care about occupancy?


## A: When the GPU needs help hiding latency



# Q: Does maximum occupancy mean that all the memory access latency from my shader is hidden?



# Q: Is lower theoretical occupancy always bad for performance?

**NO!**  
but also?  
yes?

# REGISTER SPILLING



The diagram shows a sequence of nine stages: IA, VS, HS, DS, GS, RS, PS, OM, and CS. The first eight stages (IA through OM) are represented by grey rectangles, while the ninth stage (CS) is represented by a blue rectangle.

**Information** **ISA**

**Dispatch properties**

- Total thread groups: {480, 270, 1}
- Thread group dimensions: {8, 8, 1}
- Ordered append: OFF
- Strict shader processor interpolator (SPI) ordering: OFF

**Wavefronts and threads**

- Total wavefronts: 129,600
- Total threads: 8,294,400
- Average wavefront duration: 0.026 ms
- Average threads per wavefront: 64

**Per-wavefront resources**

- Vector registers: 135 (144 allocated)
- Scalar registers: 88 (128 allocated)
- Local data share per thread group: -

**Theoretical wavefront occupancy**

The occupancy of this shader is limited by its vector register usage. This shader could potentially run 5 wavefronts out of 16 wavefronts per SIMD.

However, if you reduce vector register usage by 16 you could run another wavefront.

- Registers spilled to scratch memory: ON
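The figures reported by the profiler here can be cross-checked by hand. The sketch below uses hypothetical helper names and one explicit assumption: the 1536 vector registers per SIMD are counted as 32-lane registers, so a wave64 wavefront consumes two of them per allocated register. That assumption reproduces the numbers shown above, but the sketch ignores allocation granularity and every other limiter (scalar registers, LDS, wave slots taken by other work).

```python
import math


def wavefront_count(thread_groups, group_dims, wave_size):
    """Wavefronts launched by a dispatch of `thread_groups` groups."""
    threads_per_group = math.prod(group_dims)
    # A partially filled wavefront still occupies a whole wave slot.
    waves_per_group = math.ceil(threads_per_group / wave_size)
    return math.prod(thread_groups) * waves_per_group


def vgpr_limited_waves(vgprs_per_simd, allocated_vgprs, wave_size,
                       wave_slots_per_simd=16):
    """Wavefronts per SIMD that fit in the vector register file.

    ASSUMPTION: `vgprs_per_simd` counts 32-lane registers, so a wave64
    wavefront needs wave_size // 32 = 2 of them per allocated register.
    """
    physical_per_register = wave_size // 32
    fit = vgprs_per_simd // (allocated_vgprs * physical_per_register)
    return min(wave_slots_per_simd, fit)


waves = wavefront_count((480, 270, 1), (8, 8, 1), wave_size=64)
print(waves, waves * 64)                    # prints "129600 8294400"
print(vgpr_limited_waves(1536, 144, 64))    # prints "5"
print(vgpr_limited_waves(1536, 128, 64))    # prints "6"
```

The last two lines mirror the profiler's message: with 144 allocated registers only 5 wavefronts fit per SIMD, and bringing the allocation down by 16 to 128 makes room for a sixth.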

# COPYRIGHT AND DISCLAIMER

©2024 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS.' AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

# THANK YOU !



- [francois.guthmann@amd.com](mailto:francois.guthmann@amd.com)



- X: [@frguthmann](https://twitter.com/@frguthmann)

