How to optimize DGEMM on x86 CPU platforms

General matrix-matrix multiplication (GEMM) is a core routine of many popular algorithms. On modern computing platforms with hierarchical memory architectures, GEMM can typically reach near-optimal performance. For example, on most x86 CPUs, Intel MKL and other well-known BLAS implementations, including OpenBLAS and BLIS, deliver more than 90% of peak performance for GEMM. On the GPU side, NVIDIA's cuBLAS also delivers near-optimal GEMM performance. Although optimizing a serial GEMM implementation on x86 platforms is not a new topic, a tutorial on optimizing GEMM on x86 platforms with AVX512 instructions is missing from the learning resources available online. Additionally, as the vector width has grown from SSE4, AVX, and AVX2 to AVX512, the gap between peak computational capability and memory bandwidth keeps widening. This in turn requires programmers to design more careful prefetching schemes to hide memory latency. Compared with existing tutorials, ours is the first that not only covers an implementation leveraging AVX512 instructions but also provides step-wise optimization with prefetching strategies. The final DGEMM implementation reaches performance comparable to Intel MKL.

Q & A

I warmly welcome questions and discussion via pull requests or my personal email, yujiazhai94@gmail.com.

Hardware platforms and software configurations

All test cases in this tutorial require a CPU that reports the avx512f flag. You can check this in a terminal with the command: lscpu | grep "avx512f".
The experimental data shown were collected on an Intel Xeon W-2255 CPU (2 AVX512 units, base frequency 3.7 GHz, turbo boost frequency when running AVX512 instructions on a single core: 4.0 GHz). This workstation is equipped with 4 x 8 GB = 32 GB of DRAM at 2933 MHz. The theoretical peak performance on a single core is: 2 (FMA) * 2 (AVX512 units) * 512 (data width in bits) / 64 (bits per fp64 number) * 4 GHz = 128 GFLOPS.
We compiled the program with gcc 7.3.0 under Ubuntu 18.04.5 LTS.
Intel MKL version: oneMKL 2021.1 beta.
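
The flag check and the peak-performance arithmetic above can also be expressed in C. The sketch below is illustrative and not part of the repository: `has_avx512f` and `peak_gflops` are hypothetical helper names, and the check relies on the GCC/Clang builtin `__builtin_cpu_supports`, so it assumes one of those compilers on an x86 target (it simply reports 0 elsewhere).

```c
#include <assert.h>

/* Hypothetical helper: 1 if the running CPU reports AVX512F, else 0.
 * Uses GCC/Clang builtins; returns 0 on other compilers or targets. */
int has_avx512f(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
    return 0;
#endif
}

/* Hypothetical helper: theoretical single-core peak in GFLOPS.
 * 2 flops per FMA * FMA units * lanes per vector * clock in GHz. */
double peak_gflops(int fma_units, int vector_bits, int element_bits, double ghz)
{
    assert(fma_units > 0 && vector_bits > 0 && element_bits > 0);
    assert(vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return 2.0 * fma_units * lanes * ghz;
}
```

With the figures quoted above, `peak_gflops(2, 512, 64, 4.0)` evaluates to 128.0, matching the 128 GFLOPS stated for a single core.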

How to run

Just three steps.

First, set the path to MKL in the Makefile.
Second, run make to compile. This generates the binary executable dgemm_x86.
Third, run the binary with ./dgemm_x86 [kernel_number], where kernel_number selects the kernel to benchmark. 0 represents Intel MKL and 1-19 represent the 19 kernels that demonstrate the optimization strategies. Kernel 18 is the best serial version and kernel 19 is the best parallel version. Both reach performance comparable to, or faster than, Intel MKL on the latest Intel CPUs.

Step-wise Optimizations

Here we use a column-major implementation of DGEMM.
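
In column-major storage, element (i, j) of a matrix with leading dimension ld lives at offset i + j * ld. A minimal sketch of a straightforward triple-loop DGEMM on that layout follows. The names `CM` and `dgemm_colmajor_ref` are illustrative rather than the repository's, and the argument order only mimics the usual BLAS convention, without transposition options.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) with leading dimension ld. */
#define CM(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Reference sketch: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all stored column-major. */
void dgemm_colmajor_ref(int m, int n, int k, double alpha,
                        const double *a, int lda,
                        const double *b, int ldb,
                        double beta, double *c, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; j++) {          /* columns of C */
        for (int i = 0; i < m; i++) {      /* rows of C */
            double acc = 0.0;
            for (int p = 0; p < k; p++)    /* dot product of row i and column j */
                acc += a[CM(i, p, lda)] * b[CM(p, j, ldb)];
            c[CM(i, j, ldc)] = alpha * acc + beta * c[CM(i, j, ldc)];
        }
    }
}
```

For example, the 2 x 2 matrix with rows (1, 2) and (3, 4) is passed as the array {1, 3, 2, 4}, because each column is stored contiguously.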




Kernel 1 (naive version)

Kernel 2 (register re-use)

Kernel 3 (2x2 register blocking)

Kernel 4 (4x4 register blocking)

Kernel 5 (Kernel 4 + AVX2)

Kernel 6 (Kernel 5 + loop unrolling x 4)

Kernel 7 (8x4 kernel + AVX2 + loop unrolling x 4)

Kernel 8 (Kernel 7 + cache blocking)

Kernel 9 (Kernel 8 + packing)

Kernel 10 (24x8 kernel + AVX512 + blocking + packing)

Kernel 11 (Kernel 10 + discontinuous packing on B)

Kernel 12 (Kernel 11: from intrinsics to inline ASM)

Kernel 13 (Kernel 12 + changing the whole macro kernel into inline ASM)

Kernel 14 (Kernel 13 + software prefetching on A)

Kernel 15 (Kernel 14 + software prefetching on B)

Kernel 16 (Kernel 15 + software prefetching on C)

Kernel 17 (Kernel 16 + fine-tuned matrix scaling routine on C)

Kernel 18 (Kernel 17 + fine-grained packing for B to benefit from CPU frequency boosting)

Kernel 18 comparison against Intel MKL
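
To make the register-blocking idea behind Kernels 2 to 4 concrete, here is a minimal 2x2 sketch in plain C. It is an illustration under simplifying assumptions, not the repository's kernel: `dgemm_2x2_sketch` and `AT` are hypothetical names, it computes C += A * B only (no alpha or beta scaling), and it requires even m and n so that no edge handling is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major offset of element (i, j) with leading dimension ld. */
#define AT(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Sketch: C += A * B using a 2x2 register block.
 * A is m x k, B is k x n, C is m x n, column-major; m and n must be even. */
void dgemm_2x2_sketch(int m, int n, int k,
                      const double *a, int lda,
                      const double *b, int ldb,
                      double *c, int ldc)
{
    assert(m % 2 == 0 && n % 2 == 0);
    for (int j = 0; j < n; j += 2) {
        for (int i = 0; i < m; i += 2) {
            /* Four accumulators stay in registers for the whole k loop. */
            double c00 = 0.0, c10 = 0.0, c01 = 0.0, c11 = 0.0;
            for (int p = 0; p < k; p++) {
                /* Two loads from A and two from B feed four multiply-adds. */
                double a0 = a[AT(i, p, lda)];
                double a1 = a[AT(i + 1, p, lda)];
                double b0 = b[AT(p, j, ldb)];
                double b1 = b[AT(p, j + 1, ldb)];
                c00 += a0 * b0;
                c10 += a1 * b0;
                c01 += a0 * b1;
                c11 += a1 * b1;
            }
            /* Write the finished 2x2 block back to C once. */
            c[AT(i, j, ldc)] += c00;
            c[AT(i + 1, j, ldc)] += c10;
            c[AT(i, j + 1, ldc)] += c01;
            c[AT(i + 1, j + 1, ldc)] += c11;
        }
    }
}
```

Each value loaded from A or B is used twice per iteration, so the loop issues one load per multiply-add instead of two. Larger blocks such as 4x4 extend the same idea for as long as the accumulators fit in registers.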
