Observing peak performance on KNL

Introduction

This repository contains some very simple code for observing peak performance on Intel's Knights Landing (KNL) processors.

The files fma.c and fma_24x.c are based on code from the Colfax Research report, Capabilities of Intel® AVX-512 in Intel® Xeon® Scalable Processors (Skylake).

The work here was used in a lightning talk at the IXPUG Annual Fall Conference 2018 for which you can see the slides or recorded video.

Compiling and modifying

To recreate the results presented (actual peak is 6/7 of theoretical peak), use version 18 of the Intel compiler suite. You will also need John McCalpin's low overhead timers (it's just two files):

$ wget https://raw.githubusercontent.com/jdmccalpin/low-overhead-timers/master/low_overhead_timers.c
$ wget https://raw.githubusercontent.com/jdmccalpin/low-overhead-timers/master/low_overhead_timers.h
$ icc -sox -O3 -xMIC-AVX512 -DVECTOR_WIDTH=8 -DUNROLLFACTOR=1 fma_24x.c low_overhead_timers.c -o fma_24x.exe

The VECTOR_WIDTH option should be 8 for AVX-512 vector registers, since each 512-bit register holds 8 doubles. If you don't specify it, the default value is 8.

The UNROLL_FACTOR option tells the compiler how much to unroll the FMA loop in the code. If you don't specify it, the default value is 1, which pipelines 24 FMAs in this example. A value of 2 would attempt to pipeline 48, and so on.
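
As a back-of-the-envelope check on these options, the following C sketch works out how many FMAs and FLOPs one loop iteration performs, and what a theoretical peak would be under stated assumptions. The 24 FMAs per unit of unrolling and the vector width of 8 come from this README. The number of vector units per core, the core count and the clock frequency are parameters you must supply for your own part. The function names are hypothetical and do not appear in fma.c or fma_24x.c.

```c
#include <stdio.h>

/* FMA instructions per loop iteration: 24 per unit of unrolling (from this README). */
long fmas_per_iteration(long unroll_factor)
{
    return 24 * unroll_factor;
}

/* Double-precision FLOPs per loop iteration.
 * Each FMA does one multiply and one add on every vector lane. */
long flops_per_iteration(long vector_width, long unroll_factor)
{
    return 2 * vector_width * fmas_per_iteration(unroll_factor);
}

/* Theoretical peak in GFLOP/s for a whole chip.
 * ASSUMPTION: every vector unit retires one FMA per cycle at the given clock. */
double peak_gflops(int cores, double ghz, int vpus_per_core, int vector_width)
{
    return cores * ghz * vpus_per_core * vector_width * 2.0;
}

/* Achieved GFLOP/s for one thread, from an iteration count and elapsed seconds. */
double measured_gflops(long iterations, long flops_per_iter, double seconds)
{
    return (double)iterations * (double)flops_per_iter / seconds / 1e9;
}
```

For example, assuming a 68-core part at 1.4 GHz with two vector units per core (illustrative values, not taken from this repository), peak_gflops(68, 1.4, 2, 8) evaluates 68 × 1.4 × 2 × 8 × 2. Multiplying that by 6.0 / 7.0 gives the fraction of peak quoted above.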

To generate assembly, use the -S flag:

$ icc -sox -O3 -xMIC-AVX512 -DVECTOR_WIDTH=8 -DUNROLLFACTOR=1 -S fma_24x.c -o fma_24x.s

Open up the assembly and look for the vfmadd instructions:

..B1.15:                        # Preds ..B1.15 ..B1.14
                                # Execution count [1.00e+06]
        addl      $1, %eax                                      #69.5 c1
        vfmadd213pd %zmm0, %zmm1, %zmm25                        #71.67 c1
        vfmadd213pd %zmm0, %zmm1, %zmm24                        #72.67 c1
        vfmadd213pd %zmm0, %zmm1, %zmm23                        #73.67 c7 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm22                        #74.67 c7
        vfmadd213pd %zmm0, %zmm1, %zmm21                        #75.67 c13 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm20                        #76.67 c13
        vfmadd213pd %zmm0, %zmm1, %zmm19                        #77.67 c19 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm18                        #78.67 c19
        vfmadd213pd %zmm0, %zmm1, %zmm17                        #79.67 c25 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm16                        #80.67 c25
        vfmadd213pd %zmm0, %zmm1, %zmm15                        #81.67 c31 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm14                        #82.67 c31
        vfmadd213pd %zmm0, %zmm1, %zmm13                        #83.67 c37 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm12                        #84.67 c37
        vfmadd213pd %zmm0, %zmm1, %zmm11                        #85.67 c43 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm10                        #86.67 c43
        vfmadd213pd %zmm0, %zmm1, %zmm9                         #87.67 c49 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm8                         #88.67 c49
        vfmadd213pd %zmm0, %zmm1, %zmm7                         #89.67 c55 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm6                         #90.67 c55
        vfmadd213pd %zmm0, %zmm1, %zmm5                         #91.67 c61 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm4                         #92.67 c61
        vfmadd213pd %zmm0, %zmm1, %zmm3                         #93.67 c67 stall 2
        vfmadd213pd %zmm0, %zmm1, %zmm2                         #94.67 c67
        cmpl      $1000000000, %eax                             #69.5 c67
        jb        ..B1.15       # Prob 99%                      #69.5 c69
                                # LOE rcx rbx r13 r14 r15 eax r12d zmm0 zmm1 zmm2 zmm3 zmm4 zmm5 zmm6 zmm7 zmm8 zmm9 zmm10 zmm11 zmm12 zmm13 zmm14 zmm15 zmm16 zmm17 zmm18 zmm19 zmm20 zmm21 zmm22 zmm23 zmm24 zmm25

The loop overhead here is three instructions: addl, cmpl, and jb. You can replace these with two instructions instead for lower overhead:

Put movl $1000000000, %eax at the end of the block that precedes ..B1.15, so the counter starts at the iteration count.
At the end of ..B1.15, put the new loop controls: subl $1, %eax and jne ..B1.15.

We use subl rather than an increment or decrement instruction because those are slow on KNL.

Putting the loop overhead at the end of the block, rather than straddled around it, gives the hardware the opportunity to issue perfect pairs of vfmadd instructions to the allocation unit, rather than a mixed subl/vfmadd pair.
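
One rough way to reason about the pairing argument is a toy issue-slot model, sketched below in C. It assumes (this is an assumption of the model, not something measured here) that instructions issue strictly in order, at most two per cycle, and that nothing else limits throughput. Latencies, dependencies and the rest of the pipeline are ignored. The function names are hypothetical.

```c
#include <stdio.h>

/* Cycles needed to issue n instructions in order, at most `width` per cycle.
 * ASSUMPTION: issue width is the only bottleneck. */
int issue_cycles(int n_instructions, int width)
{
    return (n_instructions + width - 1) / width; /* ceiling division */
}

/* Fraction of peak FMA throughput under the toy model:
 * cycles for the FMAs alone divided by cycles for FMAs plus loop overhead. */
double fma_efficiency(int n_fma, int n_overhead, int width)
{
    int ideal  = issue_cycles(n_fma, width);
    int actual = issue_cycles(n_fma + n_overhead, width);
    return (double)ideal / (double)actual;
}
```

With 24 FMAs and the three original overhead instructions, fma_efficiency(24, 3, 2) gives 12 / 14, which matches the 6/7 figure quoted earlier. With two overhead instructions, fma_efficiency(24, 2, 2) gives 12 / 13. That second figure is only what the toy model predicts, not a measured result, and the agreement with 6/7 does not by itself show that issue pairing is the real mechanism.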

dmcdougall/peak_knl_gflops
