# Highly Optimized Low Precision Kernels

Our kernels are based on the x64 template library BesTLA.

## Support Matrix

Limited by the graph framework, we only add kernels that accept float tensors as input and output.

| input dtype  | output dtype | compute type | compute ISA |
| ------------ | ------------ | ------------ | ----------- |
| float32      | float32      | float32      | AVX2        |
| float32      | float32      | float32      | AVX512F     |
| float32¹     | float32²     | int8         | AVX512_VNNI |
| float32¹     | float32²     | int8         | AVX512BW    |
| float32¹     | float32²     | int8         | AVX_VNNI    |
| float32¹     | float32²     | int8         | AMX_INT8    |
| float32¹     | float32²     | int8         | AVX2        |
| float32/bf16 | float32/bf16 | bf16         | AMX_BF16    |
| float32/fp16 | float32/fp16 | fp16         | AVX512_FP16 |

¹ The input tensor is dynamically quantized per batch and per-K group, where the group size follows the quantization group size of the weight tensor; both symmetric and asymmetric quantization are supported.
² The output tensor is dynamically dequantized per batch.
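To make footnote ¹ concrete, here is a minimal NumPy sketch of symmetric per-batch, per-K group-wise dynamic quantization. The function and variable names are ours, not BesTLA's API:

```python
import numpy as np

def dynamic_quant_per_group(x: np.ndarray, group_size: int = 128):
    """Symmetric per-batch, per-K group-wise dynamic quantization of an
    activation tensor x of shape (batch, K) to int8. Illustrative only."""
    batch, k = x.shape
    assert k % group_size == 0
    groups = x.reshape(batch, k // group_size, group_size)
    # One scale per (row, K-group); symmetric quantization: zero point is 0.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)  # guard against all-zero groups
    q = np.clip(np.rint(groups / scales), -127, 127).astype(np.int8)
    return q.reshape(batch, k), scales.squeeze(-1)

x = np.random.randn(2, 256).astype(np.float32)
q, s = dynamic_quant_per_group(x, group_size=128)  # s has shape (2, 2)
```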

## Weight-only Quantization Support

| dtype | algo                     | group size         |
| ----- | ------------------------ | ------------------ |
| int4  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int3  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int2  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int5  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int6  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int7  | symmetric or asymmetric² | multiple of 8, -1¹ |
| int1  | symmetric or asymmetric  | multiple of 8, -1¹ |
| int8³ | symmetric                | multiple of 8, -1¹ |
| fp4   |                          | multiple of 8      |
| nf4   |                          | multiple of 8      |

¹ group size = -1 means per-channel quantization along the output channel (i.e., the group size equals the input channel size).
² int7 with asymmetric quantization may overflow numerically if the device only has AVX2 (without AVX_VNNI) or computes with AVX512BW.
³ int8 may overflow numerically under the same conditions: AVX2 without AVX_VNNI, or AVX512BW compute.
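Our reading of these overflow footnotes (an assumption on our part, not a statement from the kernel authors): without VNNI, x86 int8 dot products are typically built on `pmaddubsw`-style instructions, which multiply unsigned 8-bit inputs by signed 8-bit weights and sum pairs into a saturating 16-bit lane, whereas VNNI and AMX accumulate into 32 bits. A quick bound check under that assumption:

```python
# Worst-case pairwise sum in a pmaddubsw-style u8 x s8 multiply-add;
# an int16 lane holds at most 32767. The value ranges below are assumptions.
U8_MAX = 255
cases = {
    "int7 symmetric": 64,                  # weights in [-64, 63]
    "int7 asymmetric (zp-shifted)": 127,   # effective range can reach full s8
    "int8 symmetric": 127,                 # weights clamped to [-127, 127]
}
for name, w_max in cases.items():
    worst = U8_MAX * w_max * 2             # two products summed into one lane
    verdict = "overflows" if worst > 32767 else "fits"
    print(f"{name}: worst pair sum = {worst} ({verdict} int16)")
```

Under these ranges, int7 symmetric just fits in int16, while int7 asymmetric and int8 can exceed it, which matches footnotes ² and ³.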

NOTE:

1. AMX_INT8 requires the group size to be aligned to 128 for the best hardware efficiency.
2. int1, int2, and int3 suffer accuracy loss with RTN quantization (sketched below).
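For reference, a minimal NumPy sketch of symmetric RTN (round-to-nearest) group-wise weight quantization for int4; the names and the [-7, 7] clamp are our choices, not BesTLA's:

```python
import numpy as np

def rtn_quant_int4_sym(w: np.ndarray, group_size: int = 128):
    """Symmetric RTN int4 quantization of a weight matrix w of shape (K, N),
    grouped along K. group_size=-1 means one scale per output channel."""
    k, n = w.shape
    gs = k if group_size == -1 else group_size
    assert k % gs == 0
    groups = w.reshape(k // gs, gs, n)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 in [-7, 7]
    scales = np.maximum(scales, 1e-12)
    q = np.clip(np.rint(groups / scales), -7, 7).astype(np.int8)
    dq = (q * scales).reshape(k, n)  # dequantized weights the kernel computes with
    return q.reshape(k, n), scales, dq
```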

## Hybrid Quantization Support

Hybrid quantization combinations are supported, e.g. int4 x int2 mixed quantization. Each model can have its own quantization configuration, which tells the engine which quantization parameters to apply to each weight, so different layers can use different quantization bits, algorithms, and group sizes. Refer to the llama int2 & int4 mixed example (L252); a sketch of such a configuration follows.
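A minimal sketch of what such a per-layer configuration could look like; the schema and every field name below are hypothetical, not the engine's actual format:

```python
# Hypothetical per-layer quantization configuration: maps each weight to its
# own bits / algorithm / group size. Field names are illustrative only.
hybrid_quant_config = {
    "model.layers.0.self_attn.q_proj": {"bits": 4, "alg": "sym", "group_size": 128},
    "model.layers.0.self_attn.k_proj": {"bits": 4, "alg": "sym", "group_size": 128},
    "model.layers.0.mlp.down_proj":    {"bits": 2, "alg": "asym", "group_size": 32},
    # ... one entry per quantized weight; layers may mix bits, algos, group sizes
}
```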

## Fusion Support

We support three kinds of kernel fusion for transformer models: QKV, MHA (multi-head attention), and FFN (feed-forward network) fusion.

| fusion type | models | runtime ISA |
| ----------- | ------ | ----------- |
| QKV | GPT-J, LLaMA | AMX_INT8, AVX512_VNNI, AVX512BW, AVX512F, AMX_BF16, AVX_VNNI, AVX2 |
| FFN | GPT-J, LLaMA, BLOOM, ChatGLM, Falcon, MPT | AMX_INT8, AVX512_VNNI, AVX512BW, AVX512F, AMX_BF16, AVX_VNNI, AVX2 |
| MHA | see the fused-attention doc for details | |
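As an aside on what QKV fusion buys: the three projection GEMMs share the same activation, so their weights can be concatenated and computed in one larger GEMM. A generic NumPy sketch of the idea, not the fused kernel itself:

```python
import numpy as np

# QKV fusion idea: instead of three separate GEMMs against Wq, Wk and Wv,
# concatenate the weights once at load time and run a single, larger GEMM.
hidden = 256
x = np.random.randn(8, hidden).astype(np.float32)  # 8 tokens
wq, wk, wv = (np.random.randn(hidden, hidden).astype(np.float32) for _ in range(3))

w_qkv = np.concatenate([wq, wk, wv], axis=1)  # done once, offline
q, k, v = np.split(x @ w_qkv, 3, axis=1)      # one GEMM instead of three

assert np.allclose(q, x @ wq, atol=1e-3)      # same result, fewer kernel launches
```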

## Recommended Configuration for CPUs

| codename | weight config | runtime ISA |
| -------- | ------------- | ----------- |
| Sapphire Rapids, Emerald Rapids | symmetric int4, group size = 128, compute type = int8 | AMX_INT8 |
| Ice Lake, Cascade Lake, Cooper Lake, Tiger Lake, Rocket Lake | symmetric int4, group size = 128, compute type = int8 | AVX512_VNNI |
| Skylake, Cannon Lake | symmetric int4, group size = 128, compute type = int8 | AVX512BW |
| Alder Lake (12th Gen), Raptor Lake (13th and 14th Gen) | symmetric int4, group size = 128, compute type = int8 | AVX_VNNI |
| Older architectures (before 12th Gen) | symmetric int4, group size = 128, compute type = int8 | AVX2 |

- `sym int4 group=128 comp_dtype=int8` has almost the same accuracy as `group=32` but is much faster (validated with LLaMa2-7B).
- `sym int5 group=-1 comp_dtype=int8` is the fastest configuration for the first token with good accuracy (validated with LLaMa2-7B).
- `sym int3 group=128 comp_dtype=int8` is the fastest configuration for the next token with good accuracy (validated with LLaMa2-7B).

NOTE:

1. `group_size=-1` gives the smallest model size and the best performance, but it requires a model finetuned with INC; otherwise it may have lower accuracy than smaller group sizes.
2. `group_size=128` is a good balance of accuracy and speed if you only use RTN quantization.
3. `group_size=32, scale_dtype=bf16, compute_dtype=int8, alg=sym` is equivalent to llama.cpp's Q4_0.
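To see the group-size tradeoff from these notes empirically, one can measure the RTN round-trip error at several group sizes, reusing the `rtn_quant_int4_sym` sketch from the weight-only section above (exact numbers depend on the weights):

```python
import numpy as np

# Requires rtn_quant_int4_sym from the weight-only quantization sketch above.
w = np.random.randn(4096, 4096).astype(np.float32)
for gs in (32, 128, -1):
    _, _, dq = rtn_quant_int4_sym(w, group_size=gs)
    err = np.abs(dq - w).mean()
    print(f"group_size={gs:>4}: mean abs round-trip error = {err:.5f}")
# Smaller groups give lower error but more scales to store and load;
# group_size=-1 keeps one scale per output channel (smallest, fastest).
```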