NEON intrinsics guide

Makes ARM NEON documentation accessible (with examples). Born from frustration with ARM documentation and general lack of examples.

Update: earlier this year (2020) ARM released new docs

Intro

When you convert your iOS code to NEON, the gains usually come from loops whose iterations are independent, so several elements can be processed per instruction. Keep in mind that loads and stores are expensive: the more load/store operations you have, the slower your code will be.
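As a minimal sketch of such a loop (the helper name add_arrays is hypothetical, not part of this guide), the following adds two arrays four floats at a time, with one load per input and one store per four results. The scalar loop at the end handles leftover elements, and doubles as a fallback when NEON is not available at build time.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const float *a, const float *b, float *dst, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Main loop: processes 4 elements per iteration.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail: leftover elements, or everything when NEON is unavailable.
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

With n = 6, the first four elements go through the vector loop and the last two through the scalar tail.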

Assumptions

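This guide is about NEON intrinsics called from C code, which should work on both 32-bit and 64-bit architectures. Vectors are assumed to be 128 bits wide (4 lanes of 32-bit values), but you can generally just remove the letter q from the intrinsic name to use 64-bit vectors (2 lanes of 32-bit values). Note that the _f64 variants mentioned below operate on float64x2_t and are only available on 64-bit ARM.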
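To illustrate dropping the q, here is a small sketch that adds two pairs of floats with the 64-bit variants vld1_f32, vadd_f32 and vst1_f32. The helper name add_pairs is made up for this example, and the scalar branch is only there so the snippet also builds where NEON is unavailable.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i = 0, 1.
void add_pairs(const float a[2], const float b[2], float out[2]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t va = vld1_f32(a);     // vld1q_f32 without the q
    float32x2_t vb = vld1_f32(b);
    vst1_f32(out, vadd_f32(va, vb));  // vaddq_f32 / vst1q_f32 without the q
#else
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
#endif
}
```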

Supported types

While you can see all available types in Apple's source code, there are mainly these scalar types contained in vectors:

uint8 (really fast, but can only be used for masks or small value ranges)
uint16 (typical fast values)
uint32 (full precision)
uint64 (I personally have never seen this used)
int8
int16
int32
int64
float16 (doesn't seem supported by Apple CPU intrinsics, but it is supported in Metal for example)
float32 (full precision and slowest type)
poly (used for carryless multiplication and useful for cryptography)

These are then composed into vector types like float32x4_t or int8x16_t, which make full use of the available registers (128 bits in total). There are also 64-bit vector types such as int8x8_t. Arrays of vectors (such as float32x4x4_t) exist as well, but they don't provide any speed-up over the standard 128-bit ones.

Syntax

Header

Include this header in iOS to include supported ARM intrinsics and types:

#include <arm_neon.h>

Source here.

Detecting support at build time

To detect support for NEON at build time, check whether __ARM_NEON__ is defined. This is useful for conditional compilation or pragmas, for example when you want to exclude ARM instructions from builds that run on the Simulator.
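A minimal sketch of such a build branch follows. It also checks __ARM_NEON, the spelling used by the ARM C Language Extensions. Checking both spellings, the HAVE_NEON macro and the scale4 helper are assumptions made for this example, not something this guide prescribes.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1  // hypothetical project-local macro
#else
#define HAVE_NEON 0
#endif

// Hypothetical helper: multiplies 4 floats in place by s.
void scale4(float v[4], float s) {
#if HAVE_NEON
    vst1q_f32(v, vmulq_n_f32(vld1q_f32(v), s));
#else
    // Plain C fallback, e.g. for the Simulator on a non-ARM host.
    for (int i = 0; i < 4; ++i) {
        v[i] *= s;
    }
#endif
}
```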

Float

Arithmetic

add: vaddq_f32 or vaddq_f64

float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32x4_t sum = vaddq_f32(v1, v2);
// => sum = { 2.0, 3.0, 4.0, 5.0 }

multiply: vmulq_f32 or vmulq_f64

float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32x4_t prod = vmulq_f32(v1, v2);
// => prod = { 1.0, 2.0, 3.0, 4.0 }

multiply and accumulate: vmlaq_f32

float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 2.0, 2.0, 2.0, 2.0 }, v3 = { 3.0, 3.0, 3.0, 3.0 };
float32x4_t acc = vmlaq_f32(v3, v1, v2);  // acc = v3 + v1 * v2
// => acc = { 5.0, 7.0, 9.0, 11.0 }

multiply by a scalar: vmulq_n_f32 or vmulq_n_f64

float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32_t s = 3.0;
float32x4_t prod = vmulq_n_f32(v, s);
// => prod = { 3.0, 6.0, 9.0, 12.0 }

multiply by a scalar and accumulate: vmlaq_n_f32 or vmlaq_n_f64

float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32_t s = 3.0;
float32x4_t acc = vmlaq_n_f32(v1, v2, s);
// => acc = { 4.0, 5.0, 6.0, 7.0 }

invert (a rough reciprocal estimate, needed for division): vrecpeq_f32 or vrecpeq_f64

float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4_t reciprocal = vrecpeq_f32(v);
// => reciprocal = { 0.998046875, 0.499023438, 0.333007813, 0.249511719 }

invert (more accurately): use a Newton-Raphson iteration to refine the estimate

float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4_t reciprocal = vrecpeq_f32(v);
float32x4_t inverse = vmulq_f32(vrecpsq_f32(v, reciprocal), reciprocal);
// => inverse = { 0.999996185, 0.499998093, 0.333333015, 0.249999046 }
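Putting the pieces together, a division can be sketched as a multiplication by the refined reciprocal. The helper divide4 below is hypothetical, and the choice of two refinement steps is an assumption: each step roughly doubles the number of correct bits, so use fewer or more depending on the accuracy you need. The result is still an approximation of num / den, and the scalar branch is only a fallback for builds without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] ~= num[i] / den[i] for 4 lanes.
void divide4(const float num[4], const float den[4], float out[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t d = vld1q_f32(den);
    float32x4_t r = vrecpeq_f32(d);       // rough estimate of 1 / d
    r = vmulq_f32(vrecpsq_f32(d, r), r);  // r = r * (2 - d * r)
    r = vmulq_f32(vrecpsq_f32(d, r), r);  // second refinement step
    vst1q_f32(out, vmulq_f32(vld1q_f32(num), r));
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = num[i] / den[i];
    }
#endif
}
```

On 64-bit ARM there is also a direct vdivq_f32 intrinsic, so this pattern mostly matters for 32-bit targets or when an approximate result is good enough.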

Load

load vector: vld1q_f32 or vld1q_f64

float values[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
float32x4_t v = vld1q_f32(values);
// => v = { 1.0, 2.0, 3.0, 4.0 }

load same value for all lanes: vld1q_dup_f32 or vld1q_dup_f64

float val = 3.0;
float32x4_t v = vld1q_dup_f32(&val);
// => v = { 3.0, 3.0, 3.0, 3.0 }

set all lanes to a hardcoded value: vmovq_n_f16 or vmovq_n_f32 or vmovq_n_f64

float32x4_t v = vmovq_n_f32(1.5);
// => v = { 1.5, 1.5, 1.5, 1.5 }

Store

store vector: vst1q_f32 or vst1q_f64

float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float values[5];
vst1q_f32(values, v);
// => values = { 1.0, 2.0, 3.0, 4.0, (untouched) }; values[4] is not written and stays uninitialized

store one lane from each vector of an array of vectors: vst4q_lane_f16 or vst4q_lane_f32 or vst4q_lane_f64 (change to vst1... / vst2... / vst3... for other array lengths)

float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 };
float32x4x4_t u = { v0, v1, v2, v3 };
float buff[4];
vst4q_lane_f32(buff, u, 0);
// => buff = { 1.0, 5.0, 9.0, 13.0 }

Arrays

access to values: val[n]

float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 };
float32x4x4_t ary = { v0, v1, v2, v3 };
float32x4_t v = ary.val[2];
// => v = { 9.0, 10.0, 11.0, 12.0 }

Max and min

max of two vectors, element by element:

float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 };
float32x4_t v2 = vmaxq_f32(v0, v1);
// => v2 = { 5.0, 6.0, 7.0, 8.0 }

max of vector elements, using folding maximum:

float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 };
float32x2_t maxOfHalfs = vpmax_f32(vget_low_f32(v0), vget_high_f32(v0));
float32x2_t maxOfMaxOfHalfs = vpmax_f32(maxOfHalfs, maxOfHalfs);
float maxValue = vget_lane_f32(maxOfMaxOfHalfs, 0);
// => maxValue = 4.0

min of two vectors, element by element:

float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 };
float32x4_t v2 = vminq_f32(v0, v1);
// => v2 = { 1.0, 2.0, 3.0, 4.0 }

min of vector elements, using folding minimum:

float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 };
float32x2_t minOfHalfs = vpmin_f32(vget_low_f32(v0), vget_high_f32(v0));
float32x2_t minOfMinOfHalfs = vpmin_f32(minOfHalfs, minOfHalfs);
float minValue = vget_lane_f32(minOfMinOfHalfs, 0);
// => minValue = 1.0
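The two kinds of max combine naturally when reducing a whole array: keep an element-by-element running maximum in a vector, then fold it once at the end. This is only a sketch; the helper name max_of_array and the requirement that n is at least 1 are assumptions of this example. The scalar loop handles the leftover elements and builds without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: largest value in values[0..n), requires n >= 1.
float max_of_array(const float *values, size_t n) {
    float best = values[0];
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    if (n >= 4) {
        // Running maximum, element by element, 4 lanes at a time.
        float32x4_t vmax = vld1q_f32(values);
        for (i = 4; i + 4 <= n; i += 4) {
            vmax = vmaxq_f32(vmax, vld1q_f32(values + i));
        }
        // Folding maximum: 4 lanes -> 2 lanes -> 1 value.
        float32x2_t half = vpmax_f32(vget_low_f32(vmax), vget_high_f32(vmax));
        half = vpmax_f32(half, half);
        best = vget_lane_f32(half, 0);
    }
#endif
    // Scalar tail: leftover elements, or everything when NEON is unavailable.
    for (; i < n; ++i) {
        if (values[i] > best) {
            best = values[i];
        }
    }
    return best;
}
```

Swapping vmaxq_f32 / vpmax_f32 for vminq_f32 / vpmin_f32 (and the comparison in the tail) gives the minimum instead.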

Conditionals

ternary operator: use vector comparison (for example vcltq_f32 for less than comparison)

float32x4_t v1 = { 1.0, 0.0, 1.0, 0.0 }, v2 = { 0.0, 1.0, 1.0, 0.0 };
uint32x4_t mask = vcltq_f32(v1, v2);  // v1 < v2
float32x4_t ifTrue = vmovq_n_f32(10.0), ifFalse = vmovq_n_f32(20.0);  // the two branches: 10.0 where the condition holds, 20.0 elsewhere
float32x4_t v3 = vbslq_f32(mask, ifTrue, ifFalse);  // bitwise select: takes ifTrue where the mask bit is 1, ifFalse where it is 0
// => v3 = { 20.0, 10.0, 20.0, 20.0 }

Conditional branches are a poor fit for NEON code: all lanes of a vector execute the same instruction, so a single lane cannot take a different path. In general we need eager execution (calculating both branches first, and then deciding which results to actually use in which lanes) and it's not possible to skip any steps.
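As a sketch of this eager style, the hypothetical helper below implements v[i] = (v[i] < 0) ? 0 : v[i] * 2 without any branch in the NEON path: the doubled value is computed for every lane, and the comparison mask then picks which result each lane keeps. The scalar branch is only a fallback for builds without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: v[i] = (v[i] < 0) ? 0 : v[i] * 2, for 4 lanes.
void double_or_zero4(float v[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t x = vld1q_f32(v);
    float32x4_t zero = vmovq_n_f32(0.0f);
    uint32x4_t negative = vcltq_f32(x, zero);    // all bits set in lanes where x < 0
    float32x4_t doubled = vmulq_n_f32(x, 2.0f);  // computed for every lane
    vst1q_f32(v, vbslq_f32(negative, zero, doubled));
#else
    for (int i = 0; i < 4; ++i) {
        v[i] = (v[i] < 0.0f) ? 0.0f : v[i] * 2.0f;
    }
#endif
}
```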

Links

Contributing

Change README.md and send a pull request.

Author

This has been provided as part of the development that happens at Nifty.

With Nifty, the automated measurement app for easy and confident shopping, online shopping is a unique experience tailored to each shopper allowing them to buy garments with the perfect fit even on the go.
