
Dirac ITT Benchmarks


Benchmark_ITT

Build instructions for Grid are available at https://github.com/paboyle/Grid

The key Grid benchmark is located in branch:

release/dirac-ITT

under

benchmarks/Benchmark_ITT

and in the corresponding release:

https://github.com/paboyle/Grid/releases

It should be run as:

  1. A single node run

  2. A 128 node run

  3. Optionally, 2, 4, 8, 16, 32 and 64 node runs

The code is hybrid OpenMP + MPI with NUMA socket-aware optimisations. The relevant options can make a large difference to delivered performance.

Log files should be collected after the compile options, run parameters, and threading parameters have been optimised.

Some example configurations, invocation commands, and expected results are given.

The best options will vary from system to system and compiler to compiler. Our guidance documents the best currently known approaches, but you will have to tweak and run whatever configuration and invocation gives the best performance.

Information (compile instructions and our own results) is provided for

  1. Intel Knights Landing processors, with Intel Omnipath interconnect

  2. Intel Skylake processors, single node, dual socket

  3. AMD EPYC processors, single node, dual socket

  4. Compile instructions for ARM Neon nodes. We have not benchmarked specific nodes

  5. Other processor technologies will need to use the "generic" vectorisation target; a configure sketch is given below this list
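For such systems a configuration along the following lines should work (a minimal sketch; GEN is the generic SIMD target and mpicxx stands in for whatever MPI compiler wrapper the system provides):

../configure --enable-precision=single\
             --enable-simd=GEN        \
             --enable-comms=mpi3      \
             CXX=mpicxx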

The benchmark uses two strategies: overlapping communication with computation, and performing communication then computation sequentially. The best result is taken.

Network interface options.

We used hybrid OpenMP and MPI. We recommend one MPI rank per NUMA domain in a multi-socket or multi-die context. We recommend compiling with

--enable-comms=mpi3

or

--enable-comms=mpit

comms targets, using the (runtime) command line option:

--comms-threads <N>

to control how many threads try to enter MPI concurrently.
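For example, one possible combination (a sketch only; the SIMD target, node layout and thread count are illustrative and are taken from the KNL examples below):

../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --comms-threads 8 --shm 1024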

A global comms buffer is allocated with either MMAP (default) or SHMGET (--enable-comms=mpi3). If

--shm-hugepages

is specified then the software requests that Linux provide 2MB huge pages. This requires system administrator assistance to reserve the pages and (for mpi3) to enable the user to map them.

The following advice for the Intel Omnipath interconnect will probably carry over to Mellanox EDR, HDR and Cray Aries interconnects. However, other interconnects may not require devoting as many threads to communication as is recommended for OPA below.

  • For best performance with Intel Omnipath interconnects it is essential that 512 huge 2MB pages be preallocated by the system administrator using

    echo 512 > /proc/sys/vm/nr_hugepages
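To make the reservation persistent across reboots the equivalent sysctl can be used (a sketch; vm.nr_hugepages is the standard Linux control, but check local policy with the system administrator):

sysctl -w vm.nr_hugepages=512                      # same effect as the echo above
echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf   # applied at boot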

NUMA related options.

In a system with multiple sockets or NUMA domains we find the following works best:

  • One MPI rank per NUMA domain.

  • OpenMP within each NUMA domain, with the threads bound to that domain.

  • MPI3 comms

    --enable-comms=mpi3

so that shared memory is used for intranode communication between NUMA domains on the same node. You will need a sysadmin to set up a Unix group to use huge pages for this region; a sketch of the required steps is given after this list.

  • MPI is then used only for the internode communications.
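A minimal sketch of the sysadmin steps, assuming a hypothetical Unix group hugeshm with gid 1001; vm.hugetlb_shm_group is the standard Linux control that grants SHMGET access to huge pages:

# reserve 2MB huge pages (as for OPA above)
echo 512 > /proc/sys/vm/nr_hugepages
# create a group and add the benchmark user to it (group name and gid are examples)
groupadd -g 1001 hugeshm
usermod -a -G hugeshm <username>
# allow that group to allocate SysV shared memory backed by huge pages
echo 1001 > /proc/sys/vm/hugetlb_shm_group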

Intel Knights Landing 7230 CPU, Intel ICPC 17.0.4, single node

Configuration:

`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc `

Invocation:

env KMP_HW_SUBSET=1T ./Benchmark_ITT --shm 1024 --shm-hugetlb

Results (key section of output, at the end of the run, for a single node):

==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
L  		bytes			GB/s		Gflop/s		 seconds		GB/s / node
----------------------------------------------------------
8		393216.000   		30.966		5.161		1.646		30.966
12		1990656.000   		129.703		21.617		0.393		129.703
16		6291456.000   		256.614		42.769		0.199		256.614
20		15360000.000   		345.245		57.541		0.148		345.245
24		31850496.000   		390.747		65.124		0.130		390.747
28		59006976.000   		293.532		48.922		0.173		293.532
32		100663296.000   		280.259		46.710		0.182		280.259
36		161243136.000   		278.244		46.374		0.183		278.244
40		245760000.000   		293.138		48.856		0.174		293.138
44		359817216.000   		296.198		49.366		0.171		296.198
48		509607936.000   		300.027		50.005		0.170		300.027
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
L 		 Wilson		 DWF4  		 DWF5
8 		 100474.977 	 584630.187 	 703124.323
12 		 366189.052 	 306451.774 	 497540.725
16 		 340541.524 	 368178.044 	 659709.592
24 		 309891.310 	 484440.886 	 745885.026
==================================================================================
==================================================================================
 Comparison point     result: 337315 Mflop/s per node
 Comparison point robustness: 0.556
==================================================================================

Intel Knights Landing 7230 CPU, dual rail Omnipath interconnect, Intel ICPC 17.0.4 (Brookhaven)

Configuration:

`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc `

Invocation:

Example run on 16 nodes

export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61]
mpirun  -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages

or

# either 8 comms cores; 1 HT +   (1 or 2)HT x 54 cores = (62 or  116) threads
# empirically, leave a tile free for O/S, daemons etc...
# export OMP_NUM_THREADS=116
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export PSM2_MULTI_EP=1
export I_MPI_FABRICS=ofi
export I_MPI_THREAD_MAX=8
export I_MPI_PIN_DOMAIN=256
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125]
mpirun  -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages

Results (key section of output, at the end of the run, for multinode runs):

==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L  		bytes		GB/s		Gflop/s		 seconds	GB/s / node
----------------------------------------------------------
8		6291456.000   		495.862		82.644		1.644		30.991
12		31850496.000   		2111.790	351.965		0.386		131.987
16		100663296.000   	4078.515	679.753		0.200		254.907
20		245760000.000   	5311.453	885.242		0.153		331.966
24		509607936.000   	6186.956	1031.159	0.132		386.685
28		944111616.000   	4528.025	754.671		0.180		283.002
32		1610612736.000   	4487.177	747.863		0.182		280.449
36		2579890176.000   	4624.351	770.725		0.176		289.022
40		3932160000.000   	4698.816	783.136		0.173		293.676
44		5757075456.000   	4668.114	778.019		0.174		291.757
48		8153726976.000   	4667.113	777.852		0.175		291.695
==================================================================================
 Communications benchmark
==================================================================================
===================================================================================
Benchmarking threaded STENCIL halo exchange in 4 dimensions
==================================================================================
L  	 Ls  	bytes      MB/s uni (err/min/max)		MB/s bidi (err/min/max)
4   	8	49152       2739.5  79.3 437.9 4183.1	  5479.0  158.6 875.8 8366.3
8   	8	393216     13591.7  164.0 4256.7 15051.3 27183.4  328.1 8513.5 30102.7
12  	8	1327104    18694.4  208.4 8876.9 19956.5 37388.8  416.8 17753.9 39912.9
16  	8	3145728    20960.3  45.9 15006.5 21490.9 41920.5  91.9 30012.9 42981.8
20  	8	6144000    20944.4  42.0 15723.6 21652.9 41888.8  83.9 31447.2 43305.7
24  	8	10616832   20637.2  264.8 5903.6 21733.5 41274.4  529.7 11807.1 43467.1
28  	8	16859136   21219.5  40.4 18327.6 22005.7 42439.0  80.7 36655.3 44011.4
32  	8	25165824   21146.0  74.2 14737.3 22126.2 42292.1  148.5 29474.6 44252.5
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
 L 		 Wilson		 DWF4  		 DWF5
8 		 10561.8 	 59548.3 	 95783.9
12 		 40422.0 	 129713.9 	 204889.2
16 		 60763.9 	 210209.7 	 322914.2
24 		 141442.3 	 293747.1 	 440702.5
==================================================================================
==================================================================================
 Comparison point     result: 169962 Mflop/s per node
 Comparison point robustness: 0.595
==================================================================================

Intel Knights Landing 7210 CPU, single rail Omnipath interconnect, Intel ICPC 2017 (Cambridge)

Configuration:

as above

Invocation:

export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125,136-191,200-255]
export COMMS_THREADS=8
export OMP_NUM_THREADS=62
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export I_MPI_FABRICS=ofi
export I_MPI_PIN_DOMAIN=256
export I_MPI_THREAD_MAX=8
export PSM2_MULTI_EP=1
export FI_PSM2_LOCK_LEVEL=0
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --shm 1024 --comms-threads $COMMS_THREADS

Results:

    ==================================================================================
    Memory benchmark
    ==================================================================================
    = Benchmarking a*x + y bandwidth
    ==================================================================================
    L  		bytes			GB/s		Gflop/s		 seconds		GB/s / node
    ----------------------------------------------------------
    8		6291456.000   		471.722		78.620		1.729		29.483
    12		31850496.000   		2234.161	372.360		0.365		139.635
    16		100663296.000   	4916.119	819.353		0.166		307.257
    20		245760000.000   	7531.977	1255.330	0.108		470.749
    24		509607936.000   	6649.536	1108.256	0.123		415.596
    28		944111616.000   	6119.038	1019.840	0.133		382.440
    32		1610612736.000   	5558.231	926.372		0.147		347.389
    36		2579890176.000   	5172.548	862.091		0.158		323.284
    40		3932160000.000   	6004.183	1000.697	0.136		375.261
    44		5757075456.000   	6139.139	1023.190	0.132		383.696
    48		8153726976.000   	6160.498	1026.750	0.132		385.031
    ==================================================================================
     Communications benchmark
    ==================================================================================
    ====================================================================================================
    = Benchmarking threaded STENCIL halo exchange in 4 dimensions
    ====================================================================================================
    L  	 Ls  	bytes      MB/s uni (err/min/max)		MB/s bidi (err/min/max)
    4   	8	49152       2508.7  30.2 1374.9 3817.6		 5017.4  60.5 2749.8 7635.3
    8   	8	393216      6596.4  1128.2 188.3 8548.2		13192.8  2256.4 376.6 17096.3
    12  	8	1327104     8812.4  330.9 1042.3 9669.2		17624.9  661.8 2084.6 19338.5
    16  	8	3145728     9312.5  247.3 1483.7 9807.4		18625.1  494.6 2967.3 19614.8
    20  	8	6144000     8897.3  207.3 2741.6 9891.7		17794.5  414.5 5483.3 19783.5
    24  	8	10616832    8784.3  167.9 3405.8 10149.9	17568.7  335.7 6811.7 20299.9
    28  	8	16859136    8880.7  127.9 4390.8 9932.5		17761.3  255.8 8781.7 19865.0
    32  	8	25165824    8787.4  96.8 5748.6 10122.0		17574.8  193.6 11497.1 20244.0
    ==================================================================================
     Per Node Summary table Ls=16
    ==================================================================================
     L 		 Wilson		 DWF4  		 DWF5
    8 		 9042.3 	 51265.4 	 82154.6
    12 		 32063.5 	 125126.9 	 195947.9
    16 		 52761.9 	 199410.9 	 308859.9
    24 		 131042.1 	 264027.0 	 418236.2
    ==================================================================================
    ==================================================================================
    Comparison point result: 162269 Mflop/s per node
    Comparison point robustness:  0.606
    ==================================================================================

So far, the above data suggests that the second rail does not deliver much additional application performance, despite substantial effort to exploit it.

Interpretation of results

There are several metrics that we can extract from the 7210 log above.

  • The memory bandwidth we obtain is 385 GB/s

  • The large packet bidirectional bandwidth is 18 GB/s

  • The code execution metric was 199 Gflop/s per node

  • The performance robustness measure was 0.606

The last is derived from the worst case / best case ratio on 16^4 for the 4D vectorised DWF kernels.
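These numbers can be read straight from the end of the log; for example (itt.log is a hypothetical file name for the saved run output):

grep "Comparison point" itt.log
#  Comparison point     result: 162269 Mflop/s per node
#  Comparison point robustness:  0.606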

Intel Skylake processor

The following configuration is recommended for the Intel Skylake platform:

../configure --enable-precision=single\
         --enable-simd=AVX512     \
         --enable-comms=mpi3      \
         --enable-mkl             \
         CXX=mpiicpc

In some cases AVX2 will perform better than AVX512:

../configure --enable-precision=single\
         --enable-simd=AVX2     \
         --enable-comms=mpi3      \
         --enable-mkl             \
         CXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.

If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

../configure --enable-precision=single\
         --enable-simd=AVX512     \
         --enable-comms=mpi3      \
         --enable-mkl             \
         CXX=CC CC=cc

Since dual socket nodes are commonplace, we recommend MPI-3 as the default with the use of one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using

export I_MPI_PIN=1

This is the default.
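A dual socket run with one rank per socket might then look as follows (a sketch only; I_MPI_PIN_DOMAIN=socket and KMP_AFFINITY=compact are standard Intel MPI and OpenMP controls, but the best settings are system dependent):

export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=socket    # one pinning domain per socket
export OMP_NUM_THREADS=26         # one thread per core on a 26-core socket
export KMP_AFFINITY=compact       # keep each rank's threads within its socket
mpirun -n 2 benchmarks/Benchmark_ITT --mpi 2.1.1.1 --shm 1024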

  • Expected Skylake Platinum 8170 dual socket (single prec, single node, 26+26 cores) performance using NUMA MPI mapping:

export KMP_HW_SUBSET=48c1t
mpirun -n 2 benchmarks/Benchmark_ITT --mpi 2.1.1.1 --shm 1024

==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L  		bytes			GB/s		Gflop/s		 seconds		GB/s / node
----------------------------------------------------------
8		786432.000   		86.924		14.487		1.173		86.924
12		3981312.000   		268.574		44.762		0.379		268.574
16		12582912.000   		376.029		62.672		0.271		376.029
20		30720000.000   		330.720		55.120		0.308		330.720
24		63700992.000   		384.694		64.116		0.265		384.694
28		118013952.000   		389.749		64.958		0.261		389.749
32		201326592.000   		263.108		43.851		0.387		263.108
36		322486272.000   		247.029		41.172		0.413		247.029
40		491520000.000   		228.836		38.139		0.445		228.836
44		719634432.000   		217.325		36.221		0.467		217.325
48		1019215872.000   		211.450		35.242		0.482		211.450
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
 L 		 Wilson		 DWF4  		 DWF5
8 		 120793.084 	 661006.094 	 606747.933
12 		 436396.981 	 896169.271 	 833171.916
16 		 759287.473 	 980449.360 	 941381.756
24 		 558486.691 	 520478.501 	 641010.858
==================================================================================
==================================================================================
 Comparison point result: 938309 Mflop/s per node
==================================================================================
  • Expected Skylake Platinum 8170 dual socket (single prec, multinode, 26+26 cores) performance

Using NUMA MPI mapping and a single rail OPA network. On smaller volumes the performance is communication bound, and for higher core count parts it is likely wise to also investigate dual rail configurations.

As with KNL, Skylake nodes using OPA in multinode simulations appear to require huge pages to be reserved by the system administrator in order to obtain the best performance from OPA. We have not been able to access a Skylake system with huge pages reserved.

The core count, and price, of Skylake parts spans a large range (unlike KNL).

As a (poor) proxy for varying the Skylake part number to investigate the optimum, we have run on a single rail OPA system with the 8170 part while varying the number of active cores per socket.

12 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                32253.9         154101.5        183347.2
12               109684.1        262507.3        292317.9
16               194774.2        249036.2        281591.4
24               230182.8        254606.6        291424.2
==================================================================================
 Comparison point     result: 249036.2 Mflop/s per node
 Comparison point robustness: 0.690
==================================================================================

14 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                32079.1         152026.3        182516.0
12               108190.3        269583.6        306923.0
16               212425.1        275198.8        342756.1
24               238965.6        248059.5        321454.7
==================================================================================
 Comparison point     result: 275198.8 Mflop/s per node
 Comparison point robustness: 0.703
==================================================================================

16 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                30231.8         154613.2        188445.1
12               104854.1        273350.6        330746.8
16               223157.8        308796.0        379889.6
24               249450.3        299750.2        374744.2
==================================================================================
 Comparison point     result: 308796.0 Mflop/s per node
 Comparison point robustness: 0.715
==================================================================================

18 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                30362.8         155402.7        186385.9
12               106136.0        281990.1        328575.3
16               227842.8        320035.4        395040.0
24               268774.4        281899.5        384608.6
==================================================================================
 Comparison point     result: 320035.4 Mflop/s per node
 Comparison point robustness: 0.736
==================================================================================

20 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                28321.5         152194.9        187719.0
12               101859.6        282493.1        335522.2
16               218783.2        322475.3        403770.4
24               265264.5        287054.6        391544.9
==================================================================================
 Comparison point     result: 322475.3 Mflop/s per node
 Comparison point robustness: 0.738
==================================================================================

24 cores / socket

==================================================================================
 L               Wilson          DWF4            DWF5
8                20621.0         145140.7        185795.4
12               90771.4         287326.0        335079.4
16               209516.7        337189.0        413882.4
24               285756.5        298828.3        415698.2
==================================================================================
 Comparison point     result: 337189.0 Mflop/s per node
 Comparison point robustness: 0.744
==================================================================================

AMD EPYC processors

We have not run the Benchmark_ITT program on EPYC, as we do not have continuous access to nodes. However we have run the similar Benchmark_memory_bandwidth and Benchmark_dwf codes on a single dual socket EPYC node.

The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips each with 8 cores. So, even with a single socket node there is a quad-chip module. Dual socket nodes with 64 cores total are common. Each chip within the module exposes a separate NUMA domain. There are four NUMA domains per socket and we recommend one MPI rank per NUMA domain. MPI-3 is recommended with the use of four ranks per socket, and 8 threads per rank.
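The NUMA layout can be checked before choosing the rank decomposition (a sketch; numactl and lscpu are standard Linux tools and may need installing):

numactl --hardware    # a dual socket EPYC should report 8 NUMA nodes
lscpu | grep NUMA     # NUMA node count and the CPU list for each domain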

The best advice we have is as follows.

  • Configuration:

../configure --enable-precision=single\
             --enable-simd=AVX2       \
             --enable-comms=mpi3 \
             CXX=mpicxx 
  • Invocation:

Using MPICH and g++ v4.9.2, the best performance can be obtained using explicit GOMP_CPU_AFFINITY settings for each MPI rank. This can be done by launching each rank through a wrapper script, omp_bind.sh.

It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and shared memory to communicate within this node:

mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4

Where omp_bind.sh does the following:

#!/bin/bash
# Bind the OpenMP threads of each MPI rank to its own NUMA domain (8 domains on a dual socket EPYC).
numanode=`expr $PMI_RANK % 8`
# Each domain spans 16 logical CPUs; use every second one (offsets 0,2,...,14 within the domain).
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0`
core1=`expr $basecore + 2`
core2=`expr $basecore + 4`
core3=`expr $basecore + 6`
core4=`expr $basecore + 8`
core5=`expr $basecore + 10`
core6=`expr $basecore + 12`
core7=`expr $basecore + 14`
export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY
"$@"

Since the optimal cache blocking is not the default behaviour, the blocking in Benchmark_ITT.cc must be modified prior to compiling.

  • Results: Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) with NUMA MPI:

    Average mflops/s per call per node (full): 420235 : 4d vec
    Average mflops/s per call per node (full): 437617 : 4d vec, fp16 comms
    Average mflops/s per call per node (full): 522988 : 5d vec
    Average mflops/s per call per node (full): 588984 : 5d vec, red black
    Average mflops/s per call per node (full): 508423 : 4d vec, red black

Memory test:

mpirun -np  8 ./omp_bind.sh ./Benchmark_memory_bandwidth --threads 8 --mpi 1.2.2.2

Results:

====================================================================================================
  L  		bytes			GB/s		Gflop/s		 seconds
----------------------------------------------------------
8		3.15e+06   		516		86.1		0.158
16		5.03e+07   		886		148		0.0921
24		2.55e+08   		332		55.3		0.246
32		8.05e+08   		254		42.3		0.321
40		1.97e+09   		254		42.3		0.317
48		4.08e+09   		254		42.3		0.321
56		7.55e+09   		255		42.5		0.297
64		1.29e+10   		254		42.3		0.304
72		2.06e+10   		254		42.4		0.244
80		3.15e+10   		255		42.5		0.247
88 		4.61e+10   		254		42.4		0.181

Read bandwidth with two streams exceeded 290 GB/s using Benchmark_memory_bandwidth.

  • Performance was somewhat brittle; the above NUMA optimisation was required to obtain good results.

Intel Haswell and Broadwell

The following configuration is recommended for the Intel Haswell platform:

../configure --enable-precision=double\
             --enable-simd=AVX2       \
             --enable-comms=mpi3-auto \
             --enable-mkl             \
             CXX=icpc MPICXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernels Library.

ARM Neon nodes

These nodes are supported courtesy of work by Nils Meyer and Guido Cossu.

ARM is part of our TeamCity continuous integration structure, thanks to assistance and cycle provisioning by the University of Regensburg. The code is thus expected to work on multi-core ARM servers, but performance results are presently absent.
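A configuration along the following lines is expected to work (a sketch only, not benchmarked; NEONv8 is Grid's ARMv8 NEON SIMD target and mpicxx stands in for the local MPI compiler wrapper):

../configure --enable-precision=single\
             --enable-simd=NEONv8     \
             --enable-comms=mpi3      \
             CXX=mpicxx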