Contemporary M1 / M2 / M3 machines from Apple have (at least) four different ways for low-level programmers to perform heavy computations:

1. Standard ARMv8 SIMD/NEON vector instructions on CPU cores (128 bits wide, issue up to four per cycle on Firestorm)
2. Apple's undocumented AMX instructions, issued from CPU, executed on a special accelerator execution unit
3. The Neural Engine (called ANE or NPU)
4. The GPU (e.g. Metal Compute Shaders)

This repository is all about the second of those: Apple's AMX instructions. Note that these instructions are neither documented nor supported by Apple. A common source of confusion is the name: Apple's AMX instructions are completely distinct from Intel's AMX instructions, though both are intended for issuing matrix multiply operations from a CPU.

The research was done on an Apple M1 Max (2021), with follow-up work on an M2 (2023), and additional follow-up work on an M3 (2023). Older or newer chips might have different AMX instructions. Some sources report that the M1 contains version 2 of the AMX instructions, which seems plausible (possibly everything using 7-bit writemasks comes from version 1, and everything using 9-bit writemasks is new in version 2). The M1 to M2 transition adds bf16 support, along with a few other tweaks. The M2 to M3 transition adds one extra mode to each of ldx, ldy, and matint.

A good one-image summary of AMX is the following figure from abandoned patent US20180074824A1. Consider a 32x32 grid of compute units, where each unit can perform 16-bit multiply-accumulate, or a 2x2 subgrid of units can perform 32-bit multiply-accumulate, or a 4x4 subgrid can perform 64-bit multiply-accumulate. To feed this grid, there is a pool of X registers each containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements) and a pool of Y registers similarly containing 32 16-bit elements (or 16 32-bit elements, or 8 64-bit elements). A single instruction can perform a full outer product: multiply every element of an X register with every element of a Y register, and accumulate with the Z element in the corresponding position.
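The outer product described above can be sketched in plain C. This is a minimal reference model for the 32-bit case only, not the repository's emulation code and not an AMX instruction sequence. The function name `amx_model_outer_f32`, the `z[j][i]` index orientation, and the choice of f32 are assumptions made for illustration. The lane count follows from the text: 16 32-bit elements per register, which is 64 bytes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>

/* 16 f32 lanes per X or Y register, as stated in the text. */
enum { AMX_MODEL_F32_LANES = 16 };

/* 16 lanes * 4 bytes = 64 bytes, the same size as 32 16-bit or 8 64-bit lanes. */
static_assert(AMX_MODEL_F32_LANES * sizeof(float) == 64,
              "one X or Y register is modelled as 64 bytes");

/* Hypothetical reference model (name and layout are illustrative):
 * every element of x is multiplied with every element of y, and the
 * product is accumulated onto the Z element in the matching position. */
void amx_model_outer_f32(float z[AMX_MODEL_F32_LANES][AMX_MODEL_F32_LANES],
                         const float x[AMX_MODEL_F32_LANES],
                         const float y[AMX_MODEL_F32_LANES])
{
    for (size_t j = 0; j < AMX_MODEL_F32_LANES; ++j) {
        for (size_t i = 0; i < AMX_MODEL_F32_LANES; ++i) {
            z[j][i] += x[i] * y[j];   /* 16 * 16 = 256 multiply-accumulates */
        }
    }
}
```

Calling the function twice with the same inputs doubles every Z element, which is the accumulate part of multiply-accumulate. The hardware performs all of these multiply-accumulates with a single instruction, whereas this model needs a double loop.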

A single row of the 32x32 grid can also be used to perform vector operations (rather than matrix operations) between X and Y^T.
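For contrast with the matrix case, vector mode can be modelled as a lane-wise multiply-accumulate into one row of Z. The sketch below is again a plain C model under assumptions: the name `amx_model_vec_fma_f32`, the 16x16 f32 view of Z, and the explicit `row` parameter are illustrative and do not reflect the real instruction encoding.

```c
#include <assert.h>
#include <stddef.h>

enum { AMX_VEC_MODEL_F32_LANES = 16 };

/* Hypothetical model of vector mode: only one row of Z is touched, and
 * lane i of X is paired with lane i of Y instead of with every lane. */
void amx_model_vec_fma_f32(float z[AMX_VEC_MODEL_F32_LANES][AMX_VEC_MODEL_F32_LANES],
                           size_t row,
                           const float x[AMX_VEC_MODEL_F32_LANES],
                           const float y[AMX_VEC_MODEL_F32_LANES])
{
    assert(row < AMX_VEC_MODEL_F32_LANES);   /* row must name an existing Z row */
    for (size_t i = 0; i < AMX_VEC_MODEL_F32_LANES; ++i) {
        z[row][i] += x[i] * y[i];
    }
}
```

Compared with the outer product, this performs 16 multiply-accumulates instead of 256 and leaves every other row of Z unchanged.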

In terms of available data types, the general pattern is:

IEEE754 f16 or f32 or f64 (same width for all three fused-multiply-add operands)
IEEE754 f16 multiplicands, accumulating onto f32
On M2 hardware, bf16 multiplicands, accumulating onto bf16 or IEEE754 f32
Integer 8-bit or 16-bit multiplicands, accumulating onto 16 or 32 bits (in various signednesses)
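To make the "narrow multiplicands, wider accumulator" pattern concrete, here is a small C sketch of 16-bit integer multiplicands accumulating onto 32 bits. It assumes signed multiplicands and a signed accumulator, which is only one of the signedness combinations mentioned above. The function name `amx_model_vec_mac_i16_i32` is hypothetical, and wrap-around on overflow is a choice made for this sketch, not a documented hardware property.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* 32 16-bit lanes per X or Y register, as stated in the text. */
enum { AMX_MODEL_I16_LANES = 32 };

static_assert(AMX_MODEL_I16_LANES * sizeof(int16_t) == 64,
              "one X or Y register is modelled as 64 bytes");

/* Hypothetical lane-wise model: widen both 16-bit multiplicands to 32 bits
 * before multiplying, then accumulate onto a 32-bit lane of z. */
void amx_model_vec_mac_i16_i32(int32_t z[AMX_MODEL_I16_LANES],
                               const int16_t x[AMX_MODEL_I16_LANES],
                               const int16_t y[AMX_MODEL_I16_LANES])
{
    for (size_t i = 0; i < AMX_MODEL_I16_LANES; ++i) {
        /* The product of two int16_t values always fits in int32_t. */
        int32_t product = (int32_t)x[i] * (int32_t)y[i];
        /* Add in unsigned arithmetic so that overflow wraps instead of being
         * undefined behaviour in C (wrapping is an assumption of this sketch). */
        z[i] = (int32_t)((uint32_t)z[i] + (uint32_t)product);
    }
}
```

With both multiplicands equal to 300, each product is 90000, which does not fit in 16 bits but is held exactly by the 32-bit accumulator.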

This repository provides:

A tiny header for accessing AMX instructions (use at your own risk)
A description of the register file
A full description of every instruction
C code matching the behaviour of every instruction (using inline ARMv8 assembly to express certain things)
References for further reading

corsix/amx

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Languages