



Core C++ 2025

19 Oct. 2025 :: Tel-Aviv

# When the Structs Align ... And When They Don't

Tomer Vromen





# Musical Notation

Photo by Mike Castro Demaria on [Unsplash](#)











# Frère Jacques



Note values shown on the score: $\frac{1}{2}$ note, $\frac{1}{4}$ note, $\frac{1}{8}$ note, $\frac{1}{16}$ note



$\frac{1}{8}$     $\frac{1}{8}$     $\frac{1}{4}$     $\frac{1}{8}$     $\frac{1}{8}$     $\frac{1}{8}$     $\frac{1}{4}$



$$\frac{1}{8} + \frac{1}{8} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{4} = 1\frac{1}{8}$$





syncopation








A  $\frac{1}{N}$  note is ***beat-aligned*** if it starts at a whole multiple of  $\frac{1}{N}$  from the start of the bar.

Syncopated = not beat-aligned



An object  $x$  is ***N-byte-aligned*** if

its memory address is  $kN$  for some integer  $k$

where  $N = 2^n$

Equivalently:

$$(\text{uintptr\_t})\&x \% (1 \ll n) == 0$$
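The definition above can be sketched as a small helper. The name `is_aligned` is hypothetical (it is not from the talk), and the sketch assumes the alignment passed in is a power of two, as in the definition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if pointer `p` is aligned to `alignment` bytes.
// `alignment` must be a power of two (N = 2^n).
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const auto address = reinterpret_cast<std::uintptr_t>(p);
    // For a power of two N, `address % N` equals `address & (N - 1)`.
    return (address & (alignment - 1)) == 0;
}
```

For example, the start of a buffer declared with `alignas(16)` passes `is_aligned(buf, 16)`, while `buf + 8` is only 8-byte-aligned.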

# תומר פרומן – Tomer Vromen

Working @ **DELL** Technologies

C++, Python

PowerFlex Ultra

We're hiring

Haifa/Glil Yam/Be'er Sheva

→ [Tomer.Vromen@dell.com](mailto:Tomer.Vromen@dell.com)



# C++ Alignment Rules



Photo by [Tom Wilson](#) on Unsplash

Object types have *alignment requirements* which place restrictions on the addresses at which an object of that type may be allocated.

[basic.align]

An ***alignment*** is an **implementation-defined** integer value representing the number of bytes between successive addresses at which a given object can be allocated. [...]

Attempting to create an object in storage that does not meet the alignment requirements of the object's type is **undefined behavior**.

[basic.align]

An **alignof** expression yields the alignment requirement of its operand type.

[expr.alignof]

In a declaration, an **alignas(...)** attribute can be used to **increase** the default alignment requirement.

[dcl.align], paraphrased
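A minimal sketch of `alignof` and `alignas` on types. The struct names `Plain` and `Strict` are made up for illustration; the checks rely only on what the quoted rules promise.

```cpp
#include <cstddef>

// A type with the default alignment chosen by the implementation.
struct Plain
{
    char c;
};

// The same member, but with a stricter requirement requested via alignas.
struct alignas(16) Strict
{
    char c;
};

static_assert(alignof(Strict) == 16, "alignas raised the requirement to 16");
static_assert(alignof(Strict) >= alignof(Plain), "alignas never weakens alignment");
// The size is always a multiple of the alignment, so arrays stay aligned.
static_assert(sizeof(Strict) % alignof(Strict) == 0, "size is padded up");
```

Asking for less than the default, for example `alignas(1) int x;`, is rejected by the compiler rather than silently weakening the requirement.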

# Demo

<https://godbolt.org/z/cM6exnMvo>

# Keeping Things Aligned

- Compiler ensures that all created objects are aligned according to C++ rules
- ABI = Application Binary Interface
  - Each platform has a different ABI
- ABI defines proper alignment
  - Constraints & invariants
- The x86\_64 Stack Frame: “The end of the input argument area shall be aligned on a 16 byte boundary” (x86\_64 ABI)

# Keeping Things Aligned

- Global variables:
  - Compiler puts them in aligned position
- Stack-allocated (local) objects
  - **ABI promises** that stack is 16-byte aligned when control is transferred to the function entry point.
  - Higher alignment achieved by bitwise **ANDing** the stack register.
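As a rough illustration of the stack case, the sketch below declares a local buffer with a 64-byte requirement, which is stricter than the 16 bytes the ABI promises at function entry. The function name is hypothetical; how the compiler realigns the stack (for example by ANDing the stack pointer) is up to the implementation.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if an over-aligned local variable lands on a 64-byte boundary.
inline bool local_is_64_byte_aligned()
{
    alignas(64) char buffer[64];
    const bool ok = reinterpret_cast<std::uintptr_t>(buffer) % 64 == 0;
    assert(ok && "the compiler must honor alignas for local objects");
    return ok;
}
```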

# Keeping Things Aligned: Heap-Allocated

```
MyClass *p = new MyClass{"hello", 42};
```

1. Call **operator new(sizeof(MyClass))**
2. Call c'tor with arguments
  - The address (`this`) is the value returned by operator new

The storage returned by **operator new(std::size\_t)** is guaranteed to be aligned to at least  
**\_\_STDCPP\_DEFAULT\_NEW\_ALIGNMENT\_\_**

For types with stricter alignment requirements,  
**operator new(std::size\_t, std::align\_val\_t)** is called instead. (since C++17)
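A small sketch of the over-aligned heap case, assuming a C++17 compiler and a default new alignment below 64 (typical, but implementation-defined). `CacheLineBlock` and `heap_block_is_aligned` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <new>

// A type whose alignment exceeds what plain operator new promises.
struct alignas(64) CacheLineBlock
{
    char bytes[64];
};

static_assert(alignof(CacheLineBlock) > __STDCPP_DEFAULT_NEW_ALIGNMENT__,
              "this sketch assumes the default new alignment is below 64");

// Allocates one block on the heap and reports whether its address is a
// multiple of the type's alignment.
inline bool heap_block_is_aligned()
{
    // In C++17 the new-expression inside make_unique selects
    // operator new(std::size_t, std::align_val_t) for over-aligned types.
    const auto block = std::make_unique<CacheLineBlock>();
    assert(block != nullptr);
    const auto address = reinterpret_cast<std::uintptr_t>(block.get());
    return address % alignof(CacheLineBlock) == 0;
}
```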

# Breaking the Rules



Photo by [Tom Wilson](#) on [Unsplash](#)

Attempting to create an object in storage that does not meet the alignment requirements of the object's type is **undefined behavior**.

[basic.align]
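One common way to stay within the rules when data sits at an arbitrary address is to copy the bytes instead of dereferencing a misaligned pointer. The helper below is a sketch with a hypothetical name; it is not taken from the linked demo.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from an address that may not be 4-byte aligned.
// Writing `*reinterpret_cast<const std::uint32_t*>(p)` instead would be
// undefined behavior whenever `p` is misaligned.
inline std::uint32_t load_u32(const unsigned char* p)
{
    assert(p != nullptr);
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);  // byte-wise copy, no alignment needed
    return value;
}
```

Compilers usually turn the `memcpy` into a single load on targets that allow unaligned access, so this tends to cost nothing there.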

<https://godbolt.org/z/KWd8qa5qb>

# Alignment in Practice



Photo by [Tom Wilson](#) on [Unsplash](#)

# Alignment In Practice

| CPU                             | Allowed? | Performance |
|---------------------------------|----------|-------------|
| Recent x86, x86_64 (Intel, AMD) | Yes      | Good        |
| ARMv8+                          | Yes      | Good        |
| POWER9+ (IBM)                   | Yes      | Good        |
| x86, x86_64, Ivy Bridge & older | Yes      | Depends     |
| POWER8                          | No       | ---         |
| SPARC                           |          |             |
| MIPS                            |          |             |
| ARM M-series                    |          |             |
| RISC-V                          |          |             |

Breaks atomicity!

Modern architectures don't mind unaligned memory access!

Still relevant for older/embedded architectures

`int prctl(PR_SET_UNALIGN, signed long flag);`

Pass **PR\_UNALIGN\_NOPRINT** to silently fix up unaligned user accesses, or **PR\_UNALIGN\_SIGBUS** to generate SIGBUS on unaligned user access.

# Alignment In Practice

Fundamental types:

`alignof(T) == sizeof(T)`

*Natural alignment*

ABI for x86\_64 --->

\* ABI-defined

| Type           | C                                              | sizeof | Alignment<br>(bytes) |
|----------------|------------------------------------------------|--------|----------------------|
| Integral       | _Bool <sup>†</sup>                             | 1      | 1                    |
|                | char<br>signed char                            | 1      | 1                    |
|                | unsigned char                                  | 1      | 1                    |
|                | short<br>signed short                          | 2      | 2                    |
|                | unsigned short                                 | 2      | 2                    |
|                | int<br>signed int<br>enum <sup>††</sup>        | 4      | 4                    |
|                | unsigned int                                   | 4      | 4                    |
|                | long<br>signed long<br>long long<br>signed long long | 8 | 8                   |
|                | unsigned long                                  | 8      | 8                    |
|                | unsigned long long                             | 8      | 8                    |
|                | __int128 <sup>††</sup>                         | 16     | 16                   |
|                | signed __int128 <sup>††</sup>                  | 16     | 16                   |
|                | unsigned __int128 <sup>††</sup>                | 16     | 16                   |
| Pointer        | any-type *<br>any-type (*)()                   | 8      | 8                    |
| Floating-point | float                                          | 4      | 4                    |
|                | double                                         | 8      | 8                    |
|                | long double                                    | 16     | 16                   |
|                | __float128 <sup>††</sup>                       | 16     | 16                   |
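The "natural alignment" rule can be checked at compile time. The variable template `is_naturally_aligned` is a hypothetical helper, and the guarded assertions assume the x86_64 System V ABI on Linux with GCC or Clang; other ABIs may give different answers (that is the point of "ABI-defined").

```cpp
#include <cstddef>

// True when T is "naturally aligned": its alignment equals its size.
template <typename T>
inline constexpr bool is_naturally_aligned = alignof(T) == sizeof(T);

// Holds everywhere: sizeof(char) and alignof(char) are both 1.
static_assert(is_naturally_aligned<char>);

#if defined(__x86_64__) && defined(__linux__)
// Assumption: x86_64 System V ABI, matching the table above.
static_assert(is_naturally_aligned<short>);
static_assert(is_naturally_aligned<int>);
static_assert(is_naturally_aligned<long>);
static_assert(is_naturally_aligned<void*>);
static_assert(is_naturally_aligned<double>);
static_assert(is_naturally_aligned<long double>);  // 16 and 16
#endif
```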

# Alignment In Practice $\star$

**Fundamental types:**

*Natural alignment*

**Compound types** (struct, class, union):

The alignment is that of the most strictly aligned non-static data member

$\star$  ABI-defined
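A compile-time sketch of the compound-type rule. The struct `Mixed` is invented for illustration, and the equality below is what mainstream ABIs (such as x86_64 System V) do rather than something the C++ standard itself guarantees.

```cpp
#include <algorithm>
#include <cstddef>

struct Mixed
{
    char tag;
    double value;
    short count;
};

// Assumption: the ABI gives a struct the strictest alignment of its members.
static_assert(alignof(Mixed) ==
                  std::max({alignof(char), alignof(double), alignof(short)}),
              "struct alignment follows its most strictly aligned member");

// Guaranteed by the language: the size is a multiple of the alignment.
static_assert(sizeof(Mixed) % alignof(Mixed) == 0, "size is padded up");
```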

# Struct Alignment

*The whole is greater than the sum of its parts*

```
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
```

`sizeof(S) == 25 ???`



☆ ABI-defined

# Struct Alignment

*The whole is greater than the sum of its parts*

```
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
```

`sizeof(S) == 32`
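Where the 32 bytes go can be spelled out with `offsetof`. The numbers below are a sketch that assumes the x86_64 System V ABI (4-byte `int`, 8-byte `double`, natural alignment); on another ABI the assertions may not hold.

```cpp
#include <cstddef>

struct S
{
    char a;    // offset 0, then 3 bytes of padding so that b is 4-byte aligned
    int b;     // offset 4
    short c;   // offset 8, then 6 bytes of padding so that d is 8-byte aligned
    double d;  // offset 16
    char e;    // offset 24, then 7 bytes of tail padding
};

static_assert(offsetof(S, a) == 0);
static_assert(offsetof(S, b) == 4);
static_assert(offsetof(S, c) == 8);
static_assert(offsetof(S, d) == 16);
static_assert(offsetof(S, e) == 24);
static_assert(alignof(S) == 8);   // alignment of the double member
static_assert(sizeof(S) == 32);   // 25 rounded up to a multiple of 8
```

The members themselves add up to 16 bytes; the other 16 are padding.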



# Struct Alignment

*The whole is greater than the sum of its parts*

```
struct S
{
    char a;
    char e;
    short c;
    int b;
    double d;
};
```

`sizeof(S) == 16`



★ ABI-defined
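The effect of reordering can be verified the same way. This sketch again assumes the x86_64 System V ABI; sorting the members from smallest to largest alignment leaves no gaps here.

```cpp
#include <cstddef>

struct S
{
    char a;    // offset 0
    char e;    // offset 1
    short c;   // offset 2
    int b;     // offset 4
    double d;  // offset 8
};

// Sum of the member sizes, i.e. the struct size if there were no padding.
constexpr std::size_t payload =
    2 * sizeof(char) + sizeof(short) + sizeof(int) + sizeof(double);

static_assert(offsetof(S, c) == 2);
static_assert(offsetof(S, b) == 4);
static_assert(offsetof(S, d) == 8);
static_assert(sizeof(S) == 16);
static_assert(sizeof(S) == payload);  // no padding bytes at all
static_assert(alignof(S) == 8);       // the alignment itself is unchanged
```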

# Struct Alignment

*The whole is greater than the sum of its parts*

```
#pragma pack(push, 1)
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
#pragma pack(pop)
```

`sizeof(S) == 16`



★ ABI + compiler extension
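A sketch of what packing changes, assuming a compiler that supports `#pragma pack` (GCC, Clang and MSVC do) and the usual 4-byte `int` and 8-byte `double`. The struct is named `Packed` here only to keep it apart from the unpacked `S`; `read_b` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstring>

#pragma pack(push, 1)
struct Packed
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
#pragma pack(pop)

static_assert(sizeof(Packed) == 16);        // no padding anywhere
static_assert(alignof(Packed) == 1);        // the whole struct is byte-aligned
static_assert(offsetof(Packed, b) == 1);    // b is no longer 4-byte aligned
static_assert(offsetof(Packed, d) == 7);    // d is no longer 8-byte aligned

// Accessing `p.b` directly is fine because the compiler knows the struct is
// packed. Handing out an `int*` to it is not, so copy the bytes instead.
inline int read_b(const Packed& p)
{
    int value;
    const auto* raw = reinterpret_cast<const unsigned char*>(&p);
    std::memcpy(&value, raw + offsetof(Packed, b), sizeof value);
    return value;
}
```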

# Struct Alignment

*The whole is greater than the sum of its parts*

```
#pragma pack(push, 1)
struct S
{
    char a;
    int b;
    short c;
    double d;
    char e;
};
#pragma pack(pop)
```

```
s.b = 42;
```

arm32 disassembly (the 4-byte store is split into four single-byte `strb` stores):

```
movs r3, #0
orr  r3, r3, #42
strb r3, [r7, #1]
movs r3, #0
strb r3, [r7, #2]
movs r3, #0
strb r3, [r7, #3]
movs r3, #0
strb r3, [r7, #4]
```

★ ABI + compiler extension

# SIMD

## Single Instruction Multiple Data

Intel's documentation:

### MOVAPD—Move Aligned Packed Double Precision Floating-Point Values

| Opcode/<br>Instruction                                     | Op/En | 64/32 bit<br>Mode<br>Support | CPUID Feature<br>Flag             | Description                                                                                           |
|------------------------------------------------------------|-------|------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------|
| 66 0F 28 /r<br>MOVAPD xmm1, xmm2/m128                      | A     | V/V                          | SSE2                              | Move aligned packed double precision floating-point values from xmm2/mem to xmm1.                     |
| 66 0F 29 /r<br>MOVAPD xmm2/m128, xmm1                      | B     | V/V                          | SSE2                              | Move aligned packed double precision floating-point values from xmm1 to xmm2/mem.                     |
| EVEX.128.66.0F.W1 28 /r<br>VMOVAPD xmm1 {k1}{z}, xmm2/m128 | C     | V/V                          | (AVX512VL AND AVX512F) OR AVX10.1 | Move aligned packed double precision floating-point values from xmm2/m128 to xmm1 using writemask k1. |
| EVEX.256.66.0F.W1 28 /r<br>VMOVAPD ymm1 {k1}{z}, ymm2/m256 | C     | V/V                          | (AVX512VL AND AVX512F) OR AVX10.1 | Move aligned packed double precision floating-point values from ymm2/m256 to ymm1 using writemask k1. |
| EVEX.512.66.0F.W1 28 /r<br>VMOVAPD zmm1 {k1}{z}, zmm2/m512 | C     | V/V                          | AVX512F OR AVX10.1                | Move aligned packed double precision floating-point values from zmm2/m512 to zmm1 using writemask k1. |
| EVEX.128.66.0F.W1 29 /r<br>VMOVAPD xmm2/m128 {k1}{z}, xmm1 | D     | V/V                          | (AVX512VL AND AVX512F) OR AVX10.1 | Move aligned packed double precision floating-point values from xmm1 to xmm2/m128 using writemask k1. |
| EVEX.256.66.0F.W1 29 /r<br>VMOVAPD ymm2/m256 {k1}{z}, ymm1 | D     | V/V                          | (AVX512VL AND AVX512F) OR AVX10.1 | Move aligned packed double precision floating-point values from ymm1 to ymm2/m256 using writemask k1. |
| EVEX.512.66.0F.W1 29 /r<br>VMOVAPD zmm2/m512 {k1}{z}, zmm1 | D     | V/V                          | AVX512F OR AVX10.1                | Move aligned packed double precision floating-point values from zmm1 to zmm2/m512 using writemask k1. |

“When the source or destination operand is a memory operand,  
the operand must be aligned”

[...]

“To move double precision floating-point values to and from  
unaligned memory locations, use the (V)MOVUPD instruction.”

# SIMD

## Single Instruction Multiple Data

### Floating point XMM and YMM instructions

| Instruction                     | Operands | μops fused domain | μops unfused domain | μops each port | Latency | Reciprocal throughput | Comments      |
|---------------------------------|----------|-------------------|---------------------|----------------|---------|-----------------------|---------------|
| <b>Move instructions</b>        |          |                   |                     |                |         |                       |               |
| MOVAPS/D                        | x,x      | 1                 | 1                   | p015           | 0-1     | 0.25                  | may eliminate |
| VMOVAPS/D                       | y,y      | 1                 | 1                   | p015           | 0-1     | 0.25                  | may eliminate |
| <b>MOVAPS/D<br>MOVUPS/D</b>     | x,m128   | 1                 | 1                   | p23            | 2       | 0.5                   |               |
| VMOVAPS/D<br>VMOVUPS/D          | y,m256   | 1                 | 1                   | p23            | 3       | 0.5                   | AVX           |
| MOVAPS/D<br>MOVUPS/D            | m128,x   | 1                 | 2                   | p237 p4        | 3       |                       |               |

Source: Agner Fog

The unaligned version  
must be slower...  
right?

NO!
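In C++ the aligned/unaligned choice usually surfaces through intrinsics. The sketch below uses the SSE2 intrinsics from `<emmintrin.h>` when available and falls back to scalar code otherwise; `sum2` is a hypothetical helper, and which instruction the compiler finally emits is its own decision.

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Adds the two doubles starting at `p`. The pointer only needs the normal
// alignment of double, not the 16 bytes that an aligned vector load requires.
inline double sum2(const double* p)
{
    assert(p != nullptr);
#if defined(__SSE2__)
    // Unaligned load (typically MOVUPD). Using _mm_load_pd(p) instead would
    // request an aligned load (typically MOVAPD), which faults on a memory
    // operand that is not 16-byte aligned.
    const __m128d v = _mm_loadu_pd(p);
    const __m128d high = _mm_unpackhi_pd(v, v);   // move upper lane down
    return _mm_cvtsd_f64(_mm_add_sd(v, high));    // low + high
#else
    return p[0] + p[1];
#endif
}
```

Given the table above, defaulting to the unaligned load costs nothing on recent x86 cores when the data happens to be aligned.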

# Alignment is Still Relevant!

(Even on Modern Platforms)



Photo by [Matteo Vistocco](#) on [Unsplash](#)

# Cache Lines



# Cache Lines & Locking



- \* To be precise, this is handled by the cache coherency mechanism.

# Benchmark

```
struct StructAligned
{
    int a = 42;
    char b = '\0';
};
```



```
#pragma pack(push, 1)
struct StructUnaligned
{
    int a = 42;
    char b = '\0';
};

#pragma pack(pop)
```



# Benchmark

```
struct AtomicAligned
{
    atomic<int> a = 42;
    char b = '\0';
};
```



```
#pragma pack(push, 1)
struct AtomicUnaligned
{
    atomic<int> a = 42;
    char b = '\0';
};

#pragma pack(pop)
```



# Benchmark

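```
#include <benchmark/benchmark.h>
using namespace benchmark;

// Increments member `a` of N consecutive elements; T selects the layout under test.
template <typename T>
static void Runner(State& state)
{
    constexpr size_t N = 100;
    T s[N];
    for (auto _ : state) {
        for (size_t i = 0; i < N; ++i) {
            int t = ++s[i].a;
            DoNotOptimize(t);  // keep the compiler from discarding the result
        }
    }
}
```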

```
BENCHMARK(Runner<StructAligned>);
BENCHMARK(Runner<StructUnaligned>);
BENCHMARK(Runner<AtomicAligned>);
BENCHMARK(Runner<AtomicUnaligned>);
```

# Benchmark

| Benchmark                 | Time       | CPU        |
|---------------------------|------------|------------|
| `Runner<StructAligned>`   | 39.8 ns    | 39.7 ns    |
| `Runner<StructUnaligned>` | 70.8 ns    | 70.6 ns    |
| `Runner<AtomicAligned>`   | 669 ns     |            |
| `Runner<AtomicUnaligned>` | 3443049 ns | 3434979 ns |

- Cache line split: **78%** slower
- Atomic write: **9.5x** slower
- Atomic write with cache line split: **5000x** slower!
- **Split lock:** locks the whole memory bus!
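To get a feel for how often a packed element straddles a cache line, the layout arithmetic can be sketched on its own. This assumes 64-byte cache lines and an array that starts on a cache-line boundary, neither of which the benchmark guarantees; `count_line_splits` is a hypothetical helper.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Counts how many of `n` array elements have a `width`-byte field (located at
// offset 0 of each element) that crosses a cache-line boundary. Elements are
// `stride` bytes apart and the array is assumed to start on a line boundary.
constexpr std::size_t count_line_splits(std::size_t n, std::size_t stride,
                                        std::size_t width)
{
    std::size_t splits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t first = i * stride;        // first byte of the field
        const std::size_t last = first + width - 1;  // last byte of the field
        if (first / kCacheLine != last / kCacheLine) {
            ++splits;
        }
    }
    return splits;
}

// 8-byte stride (the padded struct): the 4-byte int never crosses a line.
static_assert(count_line_splits(100, 8, 4) == 0);
```

With the 5-byte stride of the packed struct, `count_line_splits(100, 5, 4)` reports how many of the 100 ints are split under these assumptions.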

# Cache Lines & Locking & Multithread



# Benchmark: False Sharing

```
struct AtomicAligned4
{
    atomic<int> a = 42;
};
```

sizeof(AtomicAligned4) == 4

```
struct alignas(64) AtomicAligned64
{
    atomic<int> a = 42;
};
```

sizeof(AtomicAligned64) == 64
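Why `alignas(64)` separates the counters can be seen from the layout alone. The sketch assumes 64-byte cache lines and a 4-byte `std::atomic<int>`; `PaddedCounter` and `PerThreadCounters` are illustrative names, not the benchmark's types.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// One counter per cache line: alignas pads the struct up to 64 bytes.
struct alignas(kCacheLine) PaddedCounter
{
    std::atomic<int> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine);
static_assert(sizeof(PaddedCounter) == kCacheLine);

// Four threads, four counters, four different cache lines.
struct PerThreadCounters
{
    PaddedCounter counters[4];
};

static_assert(sizeof(PerThreadCounters) == 4 * kCacheLine);

// Without the padding, all four counters would fit in a single line and
// every write would invalidate that line for the other threads.
static_assert(4 * sizeof(std::atomic<int>) <= kCacheLine);
```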

# Benchmark: False Sharing

| Benchmark               | Time          | CPU       |
|-------------------------|---------------|-----------|
| Runner<AtomicAligned4>  | 1208372885 ns | 260510 ns |
| Runner<AtomicAligned64> | 802320730 ns  | 221603 ns |

Avoiding false sharing: 33.6% faster

# Benchmark: False Sharing, No Locks

```
struct Aligned4
{
    int a = 42;
};
```

```
sizeof(Aligned4) == 4
```

```
struct alignas(64) Aligned64
{
    int a = 42;
};
```

```
sizeof(Aligned64) == 64
```

# Benchmark: False Sharing, No Locks

| Benchmark         | Time      | CPU      |
|-------------------|-----------|----------|
| Runner<Aligned4>  | 726761 ns | 76867 ns |
| Runner<Aligned64> | 634758 ns | 73379 ns |

Avoiding false sharing: **12.5% faster**

# Summary



# Alignment – Yes or No?

- C++ alignment rules are simplistic, and maybe outdated
  - Undefined behavior → Implementation-defined?
- Only *really* needed for embedded
- Modern CPUs don't mind unaligned data *too much*
- C++ will pad structs to enforce alignment
  - Good if you need it, but wasteful otherwise
  - Reorder members to reduce padding
  - Use #pragma pack to decrease alignment, *carefully*
- Cache alignment *does* matter for performance!
- Multi-threaded: use `alignas` to avoid false sharing

# Thank you.

Thanks to Amir Kirsh

Tomer.Vromen@dell.com

