Dirac ITT Benchmarks
The key Grid benchmark is located in branch:
release/dirac-ITT
under
benchmarks/Benchmark_ITT
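For orientation, a minimal build sketch is shown below. It assumes the upstream Grid repository and the usual autotools flow (bootstrap.sh followed by an out-of-tree configure); pick the configure options from the platform-specific sections later on this page.

```
# Minimal build sketch (assumes the upstream paboyle/Grid repository and GNU autotools):
git clone https://github.com/paboyle/Grid.git
cd Grid
git checkout release/dirac-ITT
./bootstrap.sh                                # fetch dependencies and generate ./configure
mkdir build && cd build
../configure --enable-comms=mpi3 CXX=mpicxx   # placeholder options; see the platform sections below
make -j                                       # builds the library and benchmarks, including Benchmark_ITT
```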
It should be run on:

- a single node
- 128 nodes (an example multinode rank geometry is sketched below)
- optionally 2, 4, 8, 16, 32 and 64 nodes
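As a worked example of the rank geometry (illustrative only; the actual layout should follow the NUMA advice below), a 128-node run with one MPI rank per node could use a 2x4x4x4 decomposition, since 2*4*4*4 = 128:

```
# Hypothetical 128-node geometry, one rank per node:
mpirun -np 128 -ppn 1 ./Benchmark_ITT --mpi 2.4.4.4 --shm 1024
```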
The code is hybrid OpenMP + MPI with NUMA/socket-aware optimisations, and the relevant options can make big changes to delivered performance.
Log files should be collected once the compile options and the run and threading parameters have been optimised.
Some example configurations, invocation commands, and expected results are given.
The best options vary from system to system and from compiler to compiler. Our guidance documents the best currently known approaches, but you will have to tweak the configuration and invocation and run whatever gives the best performance.
Information (compile instructions and our own results) is provided for:

- Intel Knights Landing processors, with Intel Omnipath interconnect
- Intel Skylake processors, single node, dual socket
- AMD EPYC processors, single node, dual socket
- ARM Neon nodes (compile instructions only; we have not benchmarked specific nodes)
- Other processor technologies, which will need to use the "generic" vectorisation target (see the sketch below)
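For architectures not covered above, a hedged configure sketch for the generic target might look like the following; the SIMD target names (GEN for generic, NEONv8 for ARM Neon) should be checked against the configure help in your source tree:

```
# Sketch: portable vectorisation target for unlisted processors (check ./configure --help):
../configure --enable-simd=GEN --enable-precision=single --enable-comms=mpi3 CXX=mpicxx
# ARM Neon nodes would instead use something like --enable-simd=NEONv8
```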
The benchmark uses two strategies: overlapping communication with computation, and performing communication and computation sequentially. The best result is taken.
We used hybrid OpenMP and MPI. We recommend one MPI rank per NUMA domain in a multi-socket or multi-die context. We recommend compiling with
--enable-comms=mpi3
or
--enable-comms=mpit
comms targets, using the (runtime) command line option:
--comms-threads <N>
to control how many threads try to enter MPI concurrently.
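A minimal sketch combining these options (the rank count and --mpi geometry here are illustrative, not a recommendation):

```
# Threaded comms target at configure time, four communication threads per rank at run time.
# Add the appropriate --enable-simd=... for your platform (see the sections below).
../configure --enable-comms=mpit --enable-precision=single CXX=mpiicpc
mpirun -np 4 ./Benchmark_ITT --mpi 2.2.1.1 --comms-threads 4 --shm 1024
```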
A global comms buffer is allocated with either MMAP or SHMGET. Its size (in MB) is controlled at runtime via the command line argument
--shm=1024
If
--shm-hugepages
is specified then the software requests Linux 2MB huge pages. This requires system administrator assistance.
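A short sketch of checking the huge page pool and requesting the huge-page backed buffer (standard Linux paths; the pool itself must be raised by an administrator as described below):

```
# How many 2MB huge pages does the kernel currently reserve?
cat /proc/sys/vm/nr_hugepages
# Request a 1024 MB shared comms buffer backed by huge pages:
mpirun -np 1 ./Benchmark_ITT --shm 1024 --shm-hugepages
```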
Advice for the Intel Omnipath interconnect will probably carry over to Mellanox EDR, HDR and Cray Aries interconnects. However, other interconnects may not need as many threads devoted to communication as is recommended for OPA below.
- For best performance with Intel Omnipath interconnects it is essential that 512 huge 2MB pages be preallocated by the system administrator using
echo 20 > /proc/sys/vm/nr_hugepages
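One way to confirm the pages were actually reserved (a suggested check on a standard Linux system, not part of the benchmark):

```
# Inspect the huge page pool after the administrator has raised nr_hugepages:
grep Huge /proc/meminfo
```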
In a system with multiple sockets or NUMA domains we find that:

- One MPI rank per NUMA domain works best (a minimal binding sketch is given below).
- Use OpenMP within each NUMA domain and bind these threads to that NUMA domain.
- Use MPI3 comms (--enable-comms=mpi3) so that shared memory is used for the intranode communication between NUMA domains on the same node. You will need a sysadmin to set up a Unix group to use huge pages for this region.
- MPI itself is then used only for the internode communications.
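A minimal binding sketch along these lines, assuming a dual-socket node and the Intel MPI library (the environment variable names differ for other MPI implementations):

```
# One rank per NUMA domain (here one per socket), OpenMP threads pinned inside each domain.
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=socket     # Intel MPI: confine each rank to its own socket
export OMP_NUM_THREADS=20          # assumed 20 cores per socket; set to your core count
mpirun -np 2 ./Benchmark_ITT --mpi 2.1.1.1 --shm 1024
```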
Configuration:
`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc `
Invocation:
```
env KMP_HW_SUBSET=1T ./Benchmark_ITT --shm 1024 --shm-hugetlb
```
Results (key section of output, at the end of the log, for a single node run):
```
==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L       bytes            GB/s        Gflop/s     seconds     GB/s / node
----------------------------------------------------------
  8       393216.000       32.585      5.431       1.564       32.585
  12      1990656.000      134.283     22.380      0.380       134.283
  16      6291456.000      261.692     43.615      0.195       261.692
  20      15360000.000     350.491     58.415      0.145       350.491
  24      31850496.000     394.898     65.816      0.129       394.898
  28      59006976.000     312.450     52.075      0.163       312.450
  32      100663296.000    284.628     47.438      0.179       284.628
  36      161243136.000    258.811     43.135      0.197       258.811
  40      245760000.000    289.912     48.319      0.175       289.912
  44      359817216.000    292.181     48.697      0.174       292.181
  48      509607936.000    291.781     48.630      0.175       291.781
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L       Wilson        DWF4          DWF5
  8       106649.3      609628.8      685557.2
  12      392833.6      323613.4      587197.1
  16      352872.0      377815.6      688062.8
  24      312523.3      487883.3      766128.7
==================================================================================
==================================================================================
 Comparison point result: 377815.6
==================================================================================
```
Configuration:
`../configure --enable-simd=KNL --enable-precision=single --enable-comms=mpit CXX=mpiicpc `
Invocation (example run on 16 nodes):
```
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61]
mpirun -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages
```
or
```
# either 8 comms cores; 1 HT + (1 or 2)HT x 54 cores = (62 or 116) threads
# empirically, leave a tile free for O/S, daemons etc...
# export OMP_NUM_THREADS=116
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export PSM2_MULTI_EP=1
export I_MPI_FABRICS=ofi
export I_MPI_THREAD_MAX=8
export I_MPI_PIN_DOMAIN=256
export MPI=2.2.2.2
export NODES=16
export OMP_NUM_THREADS=62
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125]
mpirun -np $NODES -ppn 1 ./Benchmark_ITT --mpi $MPI --comms-threads 8 --shm 1024 --shm-hugepages
```
Results (key section of output, at the end of the log, for multinode runs):
```
==================================================================================
 Memory benchmark
==================================================================================
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L       bytes         GB/s        Gflop/s     seconds
----------------------------------------------------------
  8       6.29e+06      493         82.1        1.65
  12      3.19e+07      2.07e+03    346         0.393
  16      1.01e+08      4e+03       667         0.204
  20      2.46e+08      5.42e+03    903         0.15
  24      5.1e+08       6.18e+03    1.03e+03    0.132
  28      9.44e+08      4.89e+03    814         0.167
  32      1.61e+09      4.62e+03    771         0.176
  36      2.58e+09      4.58e+03    763         0.178
  40      3.93e+09      4.62e+03    770         0.176
  44      5.76e+09      4.76e+03    793         0.171
  48      8.15e+09      4.79e+03    798         0.17
==================================================================================
 Communications benchmark
==================================================================================
====================================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
====================================================================================================
  L   Ls  bytes       MB/s uni (err/min/max)                 MB/s bidi (err/min/max)
  4   8   49152       1649.5   620.8   21.8     4626.1       3299.0   1241.6   43.6     9252.1
  8   8   393216      11909.5  1438.8  481.6    14979.7      23819.1  2877.5   963.2    29959.3
  12  8   1327104     18764.1  223.1   7982.6   19956.5      37528.2  446.2    15965.2  39912.9
  16  8   3145728     20819.9  77.6    13086.8  21546.1      41639.8  155.3    26173.5  43092.2
  20  8   6144000     20897.5  25.3    17757.2  21719.8      41794.9  50.7     35514.5  43439.7
  24  8   10616832    21060.5  33.1    17687.4  21722.4      42121.0  66.2     35374.7  43444.8
  28  8   16859136    21309.5  36.3    17273.7  21973.5      42619.0  72.6     34547.4  43946.9
  32  8   25165824    21290.0  104.1   11080.8  22034.2      42580.1  208.3    22161.5  44068.4
==================================================================================
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L       Wilson      DWF4        DWF5
  8       10568.9     58932.6     96280.6
  12      39713.7     133451.0    212301.3
  16      60111.1     209252.9    322348.7
  24      141385.1    290914.6    440702.5
==================================================================================
==================================================================================
 Comparison point result: 209252.9
==================================================================================
```
Configuration:
as above
Invocation:
```
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8-61,72-125,136-191,200-255]
export COMMS_THREADS=8
export OMP_NUM_THREADS=62
export I_MPI_THREAD_SPLIT=1
export I_MPI_THREAD_RUNTIME=openmp
export I_MPI_FABRICS=ofi
export I_MPI_PIN_DOMAIN=256
export I_MPI_THREAD_MAX=8
export PSM2_MULTI_EP=1
export FI_PSM2_LOCK_LEVEL=0
mpirun -np 16 -ppn 1 ./Benchmark_ITT --mpi 2.2.2.2 --shm 1024 --comms-threads $COMMS_THREADS
```
Results:
```
==================================================================================
 Memory benchmark
==================================================================================
= Benchmarking a*x + y bandwidth
==================================================================================
  L       bytes             GB/s        Gflop/s     seconds    GB/s / node
----------------------------------------------------------
  8       6291456.000       471.722     78.620      1.729      29.483
  12      31850496.000      2234.161    372.360     0.365      139.635
  16      100663296.000     4916.119    819.353     0.166      307.257
  20      245760000.000     7531.977    1255.330    0.108      470.749
  24      509607936.000     6649.536    1108.256    0.123      415.596
  28      944111616.000     6119.038    1019.840    0.133      382.440
  32      1610612736.000    5558.231    926.372     0.147      347.389
  36      2579890176.000    5172.548    862.091     0.158      323.284
  40      3932160000.000    6004.183    1000.697    0.136      375.261
  44      5757075456.000    6139.139    1023.190    0.132      383.696
  48      8153726976.000    6160.498    1026.750    0.132      385.031
==================================================================================
 Communications benchmark
==================================================================================
====================================================================================================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
====================================================================================================
  L   Ls  bytes       MB/s uni (err/min/max)                 MB/s bidi (err/min/max)
  4   8   49152       2508.7   30.2     1374.9   3817.6      5017.4   60.5     2749.8   7635.3
  8   8   393216      6596.4   1128.2   188.3    8548.2      13192.8  2256.4   376.6    17096.3
  12  8   1327104     8812.4   330.9    1042.3   9669.2      17624.9  661.8    2084.6   19338.5
  16  8   3145728     9312.5   247.3    1483.7   9807.4      18625.1  494.6    2967.3   19614.8
  20  8   6144000     8897.3   207.3    2741.6   9891.7      17794.5  414.5    5483.3   19783.5
  24  8   10616832    8784.3   167.9    3405.8   10149.9     17568.7  335.7    6811.7   20299.9
  28  8   16859136    8880.7   127.9    4390.8   9932.5      17761.3  255.8    8781.7   19865.0
  32  8   25165824    8787.4   96.8     5748.6   10122.0     17574.8  193.6    11497.1  20244.0
==================================================================================
 Per Node Summary table Ls=16
==================================================================================
  L       Wilson      DWF4        DWF5
  8       9042.3      51265.4     82154.6
  12      32063.5     125126.9    195947.9
  16      52761.9     199410.9    308859.9
  24      131042.1    264027.0    418236.2
==================================================================================
==================================================================================
 Comparison point result: 199410.9
==================================================================================
```
So far, the above data suggests that the second rail does not deliver much additional application performance, despite substantial effort to exploit it.
# Intel Skylake processor
The following configuration is recommended for the Intel Skylake platform:
```
../configure --enable-precision=single \
             --enable-simd=AVX512 \
             --enable-comms=mpi3 \
             --enable-mkl \
             CXX=mpiicpc
```
The MKL flag enables use of BLAS and FFTW from the Intel Math Kernel Library.
If you are working on a Cray machine that does not use the mpiicpc wrapper, please use:
```
../configure --enable-precision=single \
             --enable-simd=AVX512 \
             --enable-comms=mpi3 \
             --enable-mkl \
             CXX=CC CC=cc
```
Since dual-socket nodes are commonplace, we recommend MPI-3 as the default, with one rank per socket. If using the Intel MPI library, threads should be pinned to NUMA domains using
export I_MPI_PIN=1
This is the default.
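To confirm what pinning is actually applied, Intel MPI prints its pin map at debug level 4 or higher; this is a suggested check rather than part of the benchmark:

```
# Print the Intel MPI process pinning map alongside the benchmark output:
export I_MPI_PIN=1
export I_MPI_DEBUG=4
mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --threads 18 --shm 1024
```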
- Expected Skylake Gold 6148 dual socket (single precision, single node, 20+20 cores) performance using NUMA MPI mapping:
```
mpirun -n 2 benchmarks/Benchmark_dwf --grid 16.16.16.16 --mpi 2.1.1.1 --cacheblocking 2.2.2.2 --dslash-asm --shm 1024 --threads 18
```
```
Average mflops/s per call per node (full): 498739 : 4d vec
Average mflops/s per call per node (full): 457786 : 4d vec, fp16 comms
Average mflops/s per call per node (full): 572645 : 5d vec
Average mflops/s per call per node (full): 721206 : 5d vec, red black
Average mflops/s per call per node (full): 634542 : 4d vec, red black
```
# AMD EPYC processor

We have not run the Benchmark_ITT programme on EPYC, as we do not have continuous access to nodes. However, we have run the similar Benchmark_memory_bandwidth and Benchmark_dwf codes on a single dual-socket EPYC node.
The AMD EPYC is a multi-chip module comprising 32 cores spread over four distinct chips, each with 8 cores, so even a single-socket node contains a quad-chip module. Dual-socket nodes with 64 cores in total are common. Each chip within the module exposes a separate NUMA domain, giving four NUMA domains per socket. We therefore recommend one MPI rank per NUMA domain: MPI-3 comms with four ranks per socket and 8 threads per rank.
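Before fixing the rank layout it is worth confirming the NUMA topology the operating system exposes; on a dual-socket node of this generation one expects eight NUMA domains (a suggested check using standard Linux tools):

```
# Expect 8 NUMA nodes (4 per socket) on a dual-socket first-generation EPYC system:
numactl --hardware
lscpu | grep -i "numa node"
```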
The best advice we have is as follows.
Configuration:

```
../configure --enable-precision=single \
             --enable-simd=AVX2 \
             --enable-comms=mpi3 \
             CXX=mpicxx
```
Invocation:
Using MPICH and g++ v4.9.2, the best performance can be obtained by setting explicit GOMP_CPU_AFFINITY flags for each MPI rank. This can be done by launching the benchmark through a wrapper script, omp_bind.sh, which computes and exports the affinity list for each rank.
It is recommended to run 8 MPI ranks on a single dual socket AMD EPYC, with 8 threads per rank using MPI3 and shared memory to communicate within this node:
```
mpirun -np 8 ./omp_bind.sh ./Benchmark_dwf --mpi 2.2.2.1 --dslash-unroll --threads 8 --grid 16.16.16.16 --cacheblocking 4.4.4.4
```
Where omp_bind.sh does the following:
```
#!/bin/bash
# Map this MPI rank to one of the 8 NUMA domains and pin its 8 OpenMP threads
# to every second core within that domain.
numanode=`expr $PMI_RANK % 8`
basecore=`expr $numanode \* 16`
core0=`expr $basecore + 0`
core1=`expr $basecore + 2`
core2=`expr $basecore + 4`
core3=`expr $basecore + 6`
core4=`expr $basecore + 8`
core5=`expr $basecore + 10`
core6=`expr $basecore + 12`
core7=`expr $basecore + 14`
export GOMP_CPU_AFFINITY="$core0 $core1 $core2 $core3 $core4 $core5 $core6 $core7"
echo GOMP_CPU_AFFINITY $GOMP_CPU_AFFINITY
$@
```
Results: Expected AMD EPYC 7601 dual socket (single precision, single node, 32+32 cores) performance with NUMA MPI mapping:
```
Average mflops/s per call per node (full): 420235 : 4d vec
Average mflops/s per call per node (full): 437617 : 4d vec, fp16 comms
Average mflops/s per call per node (full): 522988 : 5d vec
Average mflops/s per call per node (full): 588984 : 5d vec, red black
Average mflops/s per call per node (full): 508423 : 4d vec, red black
```
Memory test:
```
mpirun -np 8 ./omp_bind.sh ./Benchmark_memory_bandwidth --threads 8 --mpi 1.2.2.2
```
Results:
```
====================================================================================================
  L       bytes         GB/s      Gflop/s     seconds
----------------------------------------------------------
  8       3.15e+06      516       86.1        0.158
  16      5.03e+07      886       148         0.0921
  24      2.55e+08      332       55.3        0.246
  32      8.05e+08      254       42.3        0.321
  40      1.97e+09      254       42.3        0.317
  48      4.08e+09      254       42.3        0.321
  56      7.55e+09      255       42.5        0.297
  64      1.29e+10      254       42.3        0.304
  72      2.06e+10      254       42.4        0.244
  80      3.15e+10      255       42.5        0.247
  88      4.61e+10      254       42.4        0.181
```
Two STREAMS read bandwidth exceeded 290GB/s using Benchmark_memory_bandwidth.
Performance was somewhat brittle: the above NUMA optimisation was required to obtain good performance.