
Dirac ITT Benchmarks


Benchmark_ITT

The key Grid benchmark is located in branch:

release/dirac-ITT

under

benchmarks/Benchmark_ITT
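
A minimal sketch of fetching and building the benchmark, assuming the standard Grid autotools workflow and the GitHub location of the repository (the KNL configure line from the sections below is used as an example; substitute the configuration appropriate to your platform):

git clone https://github.com/paboyle/Grid.git
cd Grid
git checkout release/dirac-ITT
./bootstrap.sh                        # autotools bootstrap
mkdir build && cd build
../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc
make
# the benchmark binary is then under build/benchmarks/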

It should be run as:

  1. A single node run

  2. A 128 node run

  3. (Optionally, 2, 4, 8, 16, 32 and 64 node runs.)

The code is hybrid OpenMP + MPI, with NUMA/socket-aware optimisations. The relevant options can make big changes to delivered performance.

Log files should be collected after the compile options, run parameters and threading parameters have been optimised.

Some example configurations, invocation commands, and expected results are given.

The best options will vary from system to system and compiler to compiler. Our guidance documents the best currently known approaches, but you will have to tweak and run whatever configuration and invocation gives the best performance.

Information (compile instructions and our own results) is provided for

  1. Intel Knights Landing processors, with Intel Omnipath interconnect

  2. Intel Skylake processors, single node, dual socket

  3. AMD EPYC processors, single node, dual socket

  4. Compile instructions for ARM NEON nodes (we have not benchmarked specific nodes)

  5. Other processor technologies, which will need to use the "generic" vectorisation target

The benchmark uses two strategies: overlapping communication and computation, and performing communication then computation sequentially. The best result is taken.

Network interface options.

We use hybrid OpenMP and MPI, and recommend one MPI rank per NUMA domain in a multi-socket or multi-die context. We recommend compiling with either of the

--enable-comms=mpi3

or

--enable-comms=mpit

comms targets, using the (runtime) command line option:

--comms-threads <N>

to control how many threads try to enter MPI concurrently.
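
For example, a sketch only (the node count, lattice decomposition and thread count below are placeholders; tuned invocations are given in the platform sections that follow):

../configure --enable-comms=mpit --enable-simd=KNL --enable-precision=single CXX=mpiicpc
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --comms-threads 8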

A global comms buffer is allocated with either MMAP or SHMGET. Its size (in MB) is controlled at runtime via the command line argument

--shm=1024

If

--shm-hugepages

is specified then the software requests Linux 2MB huge pages. This requires system administrator assistance.

Advice for the Intel Omnipath interconnect will probably carry over to Mellanox EDR, HDR and Cray Aries interconnects. However, other interconnects may not require devoting as many threads to communication as is recommended for OPA below.

  • For best performance with Intel Omnipath interconnects it is essential that 512 huge 2MB pages be preallocated by the system administrator, for example using (a fuller setup sketch is given below)

    echo 512 > /proc/sys/vm/nr_hugepages
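
A slightly fuller huge-page setup sketch, using standard Linux procedure (the persistent-configuration step is one common approach and varies between distributions and sites):

# Check the current 2MB huge page pool
grep Huge /proc/meminfo
# Preallocate 512 x 2MB pages for the current boot (as root)
echo 512 > /proc/sys/vm/nr_hugepages
# One common way to make the setting persist across reboots (as root)
echo "vm.nr_hugepages = 512" >> /etc/sysctl.conf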

NUMA-related options.

In a system with multiple sockets or NUMA domains we find that:

  • One MPI rank per NUMA domain works best.

  • OpenMP should be used within each NUMA domain, with the threads bound to that domain (a pinning sketch is given after this list).

  • MPI3 comms should be used,

    --enable-comms=mpi3

so that shared memory is used for intranode communication between NUMA domains on the same node. You will need a sysadmin to set up a Unix group in order to use huge pages for this region.

  • MPI itself is then used only for internode communications.
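
A minimal pinning sketch for such a dual-socket node, assuming an Intel MPI stack and 20 cores per socket (the thread count, lattice decomposition and shared memory size below are placeholders, not a tuned recipe):

# One MPI rank per socket; OpenMP threads bound to cores within that socket
export OMP_NUM_THREADS=20          # assumed cores per socket
export OMP_PLACES=cores            # standard OpenMP binding controls
export OMP_PROC_BIND=close
export I_MPI_PIN=1                 # Intel MPI: pin each rank ...
export I_MPI_PIN_DOMAIN=socket     # ... to its own socket / NUMA domain
mpirun -np 2 ./Benchmark_ITT --mpi 2.1.1.1 --shm 1024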

Intel Knights Landing 7230 CPU, Intel ICPC 17.0.4, single node

Configuration:

../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc

Invocation:

env KMP_HW_SUBSET=1T ./Benchmark_ITT --shm 1024 --shm-hugepages

Results (key section of the output, at the end, for the single node run):

==================================================================================
Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
L  		bytes			GB/s		Gflop/s		 seconds		GB/s / node
----------------------------------------------------------
8		393216.000   		32.585		5.431		1.564		32.585
12		1990656.000   		134.283		22.380		0.380		134.283
16		6291456.000   		261.692		43.615		0.195		261.692
20		15360000.000   		350.491		58.415		0.145		350.491
24		31850496.000   		394.898		65.816		0.129		394.898
28		59006976.000   		312.450		52.075		0.163		312.450
32		100663296.000   	284.628		47.438		0.179		284.628
36		161243136.000   	258.811		43.135		0.197		258.811
40		245760000.000   	289.912		48.319		0.175		289.912
44		359817216.000   	292.181		48.697		0.174		292.181
48		509607936.000   	291.781		48.630		0.175		291.781
==================================================================================
Per Node Summary table Ls=16
==================================================================================
L 		 Wilson		 DWF4  		 DWF5
8 		 106649.3 	 609628.8 	 685557.2
12 		 392833.6 	 323613.4 	 587197.1
16 		 352872.0 	 377815.6 	 688062.8
24 		 312523.3 	 487883.3 	 766128.7
==================================================================================
==================================================================================
Comparison point result: 377815.6
==================================================================================

Intel Knights Landing 7230 CPU, dual rail Omnipath interconnect, Intel ICPC 17.0.4 (Brookhaven)

Configuration:

../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc

Invocation:

Example run on 16 nodes

export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61]
mpirun  -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages

or

# either 8 comms cores; 1 HT +   (1 or 2)HT x 54 cores = (62 or  116) threads
# empirically, leave a tile free for O/S, daemons etc...
# export OMP_NUM_THREADS=116
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export PSM2_MULTI_EP=1
export I_MPI_FABRICS=ofi
export I_MPI_THREAD_MAX=8
export I_MPI_PIN_DOMAIN=256
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125]
mpirun  -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages

Results (key section of the output, at the end, for multinode runs):

==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L            bytes                   GB/s            Gflop/s          seconds
----------------------------------------------------------
 8             6.29e+06                493             82.1            1.65
 12            3.19e+07                2.07e+03        346             0.393
 16            1.01e+08                4e+03           667             0.204
 20            2.46e+08                5.42e+03        903             0.15
 24            5.1e+08                 6.18e+03        1.03e+03        0.132
 28            9.44e+08                4.89e+03        814             0.167
 32            1.61e+09                4.62e+03        771             0.176
 36            2.58e+09                4.58e+03        763             0.178
 40            3.93e+09                4.62e+03        770             0.176
 44            5.76e+09                4.76e+03        793             0.171
 48            8.15e+09                4.79e+03        798             0.17
==================================================================================
 Communications benchmark
==================================================================================
====================================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
====================================================================================================
 L     Ls     bytes      MB/s uni (err/min/max)               MB/s bidi (err/min/max)
 4     8       49152       1649.5  620.8 21.8 4626.1            3299.0  1241.6 43.6 9252.1
 8     8       393216     11909.5  1438.8 481.6 14979.7         23819.1  2877.5 963.2 29959.3
 12    8       1327104    18764.1  223.1 7982.6 19956.5         37528.2  446.2 15965.2 39912.9
 16    8       3145728    20819.9  77.6 13086.8 21546.1         41639.8  155.3 26173.5 43092.2
 20    8       6144000    20897.5  25.3 17757.2 21719.8         41794.9  50.7 35514.5 43439.7
 24    8       10616832   21060.5  33.1 17687.4 21722.4         42121.0  66.2 35374.7 43444.8
 28    8       16859136   21309.5  36.3 17273.7 21973.5         42619.0  72.6 34547.4 43946.9
 32    8       25165824   21290.0  104.1 11080.8 22034.2        42580.1  208.3 22161.5 44068.4
 ==================================================================================
 ==================================================================================
  Per Node Summary table Ls=16
 ==================================================================================
   L           Wilson          DWF4            DWF5
   8            10568.9         58932.6         96280.6
   12           39713.7         133451.0        212301.3
   16           60111.1         209252.9        322348.7
   24           141385.1        290914.6        440702.5
 ==================================================================================
 ==================================================================================
  Comparison point result: 209252.9
 ==================================================================================

Intel Knights Landing 7210 CPU, single rail Omnipath interconnect, Intel ICPC 2017 (Cambridge)

Configuration:

as above

Invocation:

export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125,136-191,200-255]
export COMMS_THREADS=8
export OMP_NUM_THREADS=62
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export I_MPI_FABRICS=ofi
export I_MPI_PIN_DOMAIN=256
export I_MPI_THREAD_MAX=8
export PSM2_MULTI_EP=1
export FI_PSM2_LOCK_LEVEL=0
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --shm 1024 --comms-threads $COMMS_THREADS

Results:

    ==================================================================================
    Memory benchmark
    ==================================================================================
    = Benchmarking a*x + y bandwidth
    ==================================================================================
    L  		bytes			GB/s		Gflop/s		 seconds		GB/s / node
    ----------------------------------------------------------
    8		6291456.000   		471.722		78.620		1.729		29.483
    12		31850496.000   		2234.161	372.360		0.365		139.635
    16		100663296.000   	4916.119	819.353		0.166		307.257
    20		245760000.000   	7531.977	1255.330	0.108		470.749
    24		509607936.000   	6649.536	1108.256	0.123		415.596
    28		944111616.000   	6119.038	1019.840	0.133		382.440
    32		1610612736.000   	5558.231	926.372		0.147		347.389
    36		2579890176.000   	5172.548	862.091		0.158		323.284
    40		3932160000.000   	6004.183	1000.697	0.136		375.261
    44		5757075456.000   	6139.139	1023.190	0.132		383.696
    48		8153726976.000   	6160.498	1026.750	0.132		385.031
    ==================================================================================
     Communications benchmark
    ==================================================================================
    ====================================================================================================
    = Benchmarking threaded STENCIL halo exchange in 4 dimensions
    ====================================================================================================
    L  	 Ls  	bytes      MB/s uni (err/min/max)		MB/s bidi (err/min/max)
    4   	8	49152       2508.7  30.2 1374.9 3817.6		 5017.4  60.5 2749.8 7635.3
    8   	8	393216      6596.4  1128.2 188.3 8548.2		13192.8  2256.4 376.6 17096.3
    12  	8	1327104     8812.4  330.9 1042.3 9669.2		17624.9  661.8 2084.6 19338.5
    16  	8	3145728     9312.5  247.3 1483.7 9807.4		18625.1  494.6 2967.3 19614.8
    20  	8	6144000     8897.3  207.3 2741.6 9891.7		17794.5  414.5 5483.3 19783.5
    24  	8	10616832    8784.3  167.9 3405.8 10149.9	17568.7  335.7 6811.7 20299.9
    28  	8	16859136    8880.7  127.9 4390.8 9932.5		17761.3  255.8 8781.7 19865.0
    32  	8	25165824    8787.4  96.8 5748.6 10122.0		17574.8  193.6 11497.1 20244.0
    ==================================================================================
     Per Node Summary table Ls=16
    ==================================================================================
    L 		 Wilson		 DWF4  		 DWF5
    8 		 9042.3 	 51265.4 	 82154.6
    12 		 32063.5 	 125126.9 	 195947.9
    16 		 52761.9 	 199410.9 	 308859.9
    24 		 131042.1 	 264027.0 	 418236.2
    ==================================================================================
    ==================================================================================
    Comparison point result: 199410.9
    ==================================================================================

So far, the above data suggest that the second rail does not deliver much additional application performance, despite substantial effort to exploit it.

Intel Skylake processor

The following configuration is recommended for the Intel Skylake platform:

../configure --enable-precision=single\
         --enable-simd=AVX512     \
         --enable-comms=mpi3      \
         --enable-mkl             \
         CXX=mpiicpc

The MKL flag enables use of BLAS and FFTW from the Intel Math Kernel Library (MKL).

If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:

../configure --enable-precision=single\
         --enable-simd=AVX512     \
         --enable-comms=mpi3      \
         --enable-mkl             \
         CXX=CC CC=cc

Since dual socket nodes are commonplace, we recommend MPI-3 as the default, with one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using

export I_MPI_PIN=1

This is the default.

  • Expected Skylake Gold 6148 dual socket (single prec, single node 20+20 cores) performance, using NUMA MPI mapping:

    mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18
    Average mflops/s per call per node (full):  498739 : 4d vec
    Average mflops/s per call per node (full):  457786 : 4d vec, fp16 comms
    Average mflops/s per call per node (full):  572645 : 5d vec
    Average mflops/s per call per node (full):  721206 : 5d vec, red black
    Average mflops/s per call per node (full):  634542 : 4d vec, red black

AMD EPYC processors

We have not run the Benchmark_ITT programme on EPYC, as we do not have continuous access to nodes. However we have run the similar Benchmark_memory_bandwidth and Benchmark_dwf codes on a single dual EPYC node.

The AMD EPYC is a multichip module comprising 32 cores spread over four distinct chips, each with 8 cores. So even a single socket node contains a quad-chip module. Dual socket nodes with 64 cores in total are common. Each chip within the module exposes a separate NUMA domain, giving four NUMA domains per socket, and we recommend one MPI rank per NUMA domain. MPI-3 is recommended, with four ranks per socket and 8 threads per rank.
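
Before choosing a decomposition it is worth confirming the NUMA layout of the node. A minimal sketch using standard Linux tools (the exact output depends on the BIOS/firmware NUMA configuration; a dual socket EPYC 7601 is expected to report 8 NUMA nodes):

numactl --hardware       # lists the NUMA nodes and the CPUs/memory attached to each
lscpu | grep -i numa     # quick summary of the NUMA node count and CPU lists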

The best advice we have is as follows.

  • Configuration:

../configure --enable-precision=single\
             --enable-simd=AVX2       \
             --enable-comms=mpi3 \
             CXX=mpicxx 
  • Invocation:

Using MPICH and g++ v4.9.2, the best performance can be obtained using explicit GOMP_CPU_AFFINITY flags for each MPI rank. This can be done by invoking mpirun on a wrapper script, omp_bind.sh, which sets the affinity for each rank.

It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and shared memory to communicate within this node:

mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4

Where omp_bind.sh does the following:

#!/bin/bash
# Map this MPI rank to one of the 8 NUMA domains (EPYC dies) on the node,
# then pin its 8 OpenMP (GOMP) threads to alternate logical CPUs within that domain.
numanode=`expr $PMI_RANK % 8`
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0`
core1=`expr $basecore + 2`
core2=`expr $basecore + 4`
core3=`expr $basecore + 6`
core4=`expr $basecore + 8`
core5=`expr $basecore + 10`
core6=`expr $basecore + 12`
core7=`expr $basecore + 14`
export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY
$@
  • Results: Expected AMD EPYC 7601 dual socket (single prec, single node 32+32 cores) with NUMA MPI:

    Average mflops/s per call per node (full): 420235 : 4d vec
    Average mflops/s per call per node (full): 437617 : 4d vec, fp16 comms
    Average mflops/s per call per node (full): 522988 : 5d vec
    Average mflops/s per call per node (full): 588984 : 5d vec, red black
    Average mflops/s per call per node (full): 508423 : 4d vec, red black

Memory test:

mpirun -np  8 ./omp_bind.sh ./Benchmark_memory_bandwidth --threads 8 --mpi 1.2.2.2

Results:

====================================================================================================
  L  		bytes			GB/s		Gflop/s		 seconds
----------------------------------------------------------
8		3.15e+06   		516		86.1		0.158
16		5.03e+07   		886		148		0.0921
24		2.55e+08   		332		55.3		0.246
32		8.05e+08   		254		42.3		0.321
40		1.97e+09   		254		42.3		0.317
48		4.08e+09   		254		42.3		0.321
56		7.55e+09   		255		42.5		0.297
64		1.29e+10   		254		42.3		0.304
72		2.06e+10   		254		42.4		0.244
80		3.15e+10   		255		42.5		0.247
88 		4.61e+10   		254		42.4		0.181

A two-stream read bandwidth exceeding 290 GB/s was measured using Benchmark_memory_bandwidth.

  • Performance was somewhat brittle; the above NUMA optimisation was required to obtain good performance.