
Benchmark_ITT: Example output


Interpreting output

The important score is at the end of the output of Benchmark_ITT:

Grid : Message : 153.294206 s : ==================================================================================
Grid : Message : 153.294221 s :  Per Node Summary table Ls=12
Grid : Message : 153.294235 s : ==================================================================================
Grid : Message : 153.294247 s :  L 		 Wilson		 DWF4		 Staggered 
Grid : Message : 153.294258 s : 8 		 78414.022 	 801369.528 	 15600.473
Grid : Message : 153.294278 s : 12 		 365420.689 	 2042350.152 	 125976.526
Grid : Message : 153.294297 s : 16 		 936905.225 	 3940917.595 	 146384.177
Grid : Message : 153.294316 s : 24 		 2456508.219 	 4861140.649 	 265750.399
Grid : Message : 153.294335 s : 32 		 2257740.031 	 5776951.106 	 285434.293
Grid : Message : 153.294354 s : ==================================================================================
Grid : Message : 153.294366 s : ==================================================================================
Grid : Message : 153.294378 s :  Comparison point     result: 5319045.877 Mflop/s per node
Grid : Message : 153.294393 s :  Comparison point is 0.5*(5776951.106+4861140.649) 
Grid : Message : 153.294410 s : ==================================================================================

The result for this single node is 5.319 TF/s (the comparison point is the average of the DWF4 figures at L=24 and L=32 in the table above):

Comparison point     result: 5319045.877 Mflop/s per node
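To pull this headline figure out of a saved run log, a one-line filter is enough; the log file name here is just an example, not a file produced by the benchmark itself:

grep "Comparison point" Benchmark_ITT.log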

Example output (single node of Summit):

This system has 6 V100 GPUs per node

grid.configure.summary

Invocation:

jsrun --smpiargs=-gpu --nrs 6 --rs_per_host 6 --tasks_per_rs 1 --cpu_per_rs 6 --gpu_per_rs 1 ./Benchmark_ITT --mpi 1.1.1.6 --shm 2048  
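For reference, a minimal LSF batch-script sketch that wraps this invocation; the project code, walltime, and module names are placeholders rather than values from the actual run:

#!/bin/bash
#BSUB -P PROJECT123              # placeholder project allocation
#BSUB -nnodes 1                  # single Summit node (6 V100s)
#BSUB -W 0:30                    # placeholder walltime
#BSUB -J Benchmark_ITT
#BSUB -o Benchmark_ITT.%J.log

# Placeholder modules; load whatever toolchain Grid was built against.
module load gcc cuda

jsrun --smpiargs=-gpu --nrs 6 --rs_per_host 6 --tasks_per_rs 1 \
      --cpu_per_rs 6 --gpu_per_rs 1 ./Benchmark_ITT --mpi 1.1.1.6 --shm 2048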

Single node, 6 GPU, run log (5.3 TF/s result).

This log gave the above result. The per-node performance drops when more than one node is used: Summit has only dual-rail EDR InfiniBand and becomes network-bandwidth limited, giving around 1.2 TF/s per node on 8 or more nodes. Increased interconnect provision would make sense.

A100 is expected to perform best with a 1:1 ratio of GPUs to 200 Gbit/s network interfaces.

Example output (single node of ICE-XA, dual Skylake Silver 4116, 12+12 cores):

Configuration

../configure --enable-comms=mpi-auto --enable-simd=AVX2 --prefix /home/dp008/dp008/paboyle/prefix-cpu \
	     CXX=clang++ MPICXX=mpiicpc \
	     LDFLAGS=-L/home/dp008/dp008/paboyle/prefix/lib/ \
	     CXXFLAGS="-I/home/dp008/dp008/paboyle/prefix/include/ -std=c++11 -fpermissive" 

Invocation

mpirun -np 2 ./Benchmark_ITT --mpi 1.1.1.2 --threads 12

grid.configure.summary

231 Gflop/s per node result:

Single node, 2 CPU, run log.

Rome results from July 2020 DIRAC AMD Hackathon

We only had temporary access to dual-socket Rome CPUs, during the Summer 2020 AMD Hackathon and the following weeks. Unfortunately this predated the freezing of our Benchmark_ITT.

We benchmarked the nodes using Benchmark_dwf, on a different volume, and the following slides were produced at the time.

Rome, 2 CPU, 64+64 cores Benchmark_dwf Slides (PDF)

The results (up to 1.9 TF/s on a carefully chosen volume) are likely an overestimate of the Benchmark_ITT figure, as the ITT volume is larger and will likely spill out of cache residency. We have been unable to access appropriate nodes to verify this.

The cache edge is visible in a synthetic benchmark on page 2 of the slides above.

Booster results

Results from a single node of the Juelich Booster system, which has 4 x A100 GPUs per node.

Configuration

MPICXX=mpicxx ../configure \
       --enable-unified=yes \
       --enable-accelerator=cuda \
       --enable-comms=mpi-auto \
       --enable-simd=GPU \
       CXX=nvcc \
       CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14" \
       LIBS="-lrt -lmpi "

Invocation, wrapped in a NUMA-aware script carefully matched to the lstopo output:

srun -n 4 ./rungrid.sh

Wrapper script rungrid.sh
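For illustration only, a wrapper of this kind typically looks something like the sketch below; the CPU/GPU binding and the --mpi geometry are placeholders and must be matched to the actual lstopo output rather than copied verbatim:

#!/bin/bash
# Hypothetical sketch of a NUMA-aware per-rank wrapper (the real rungrid.sh is linked above).
lrank=$SLURM_LOCALID                    # local MPI rank on the node, 0..3
export CUDA_VISIBLE_DEVICES=$lrank      # one A100 per rank; real mapping taken from lstopo
numa=$lrank                             # placeholder NUMA domain for this rank
exec numactl --cpunodebind=$numa --membind=$numa \
     ./Benchmark_ITT --mpi 1.1.1.4 --shm 2048   # placeholder process geometry and SHM size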

Single node, 4 GPU, run log.

Multinode on Booster and UCX Bugs

We have been informed that Grid hits issues for multinode GPU running on at least some systems and software versions. We reproduced this on the Booster system at Juelich with UCX v1.8.1 and OpenMPI 4.0.1rc1. At the time of writing we have good reason to believe these are issues with the UCX software rather than with Grid, but the problem has not been conclusively resolved.

Option 1) (ideal solution)

We have been told that

LDFLAGS="--cudart shared"
CXXFLAGS="--cudart shared"

addresses the issue with UCX: UCX intercepts CUDA calls at runtime to track memory regions, but can only do so when the CUDA runtime is dynamically linked. We have not been able to verify this due to machine availability; this page will be updated when we are able to check this solution.

**18 Nov update: confirmed this works on Booster at Juelich.**
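As an illustration of where these flags go, the Booster configure line shown above would be amended roughly as follows (a sketch only; all other options are unchanged from that configuration):

MPICXX=mpicxx ../configure \
       --enable-unified=yes \
       --enable-accelerator=cuda \
       --enable-comms=mpi-auto \
       --enable-simd=GPU \
       CXX=nvcc \
       CXXFLAGS="-ccbin g++ -gencode arch=compute_80,code=sm_80 -std=c++14 --cudart shared" \
       LDFLAGS="--cudart shared" \
       LIBS="-lrt -lmpi "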

Option 2) (short-term hack, non-ideal, and not for acceptance)

This workaround is now deprecated since it is no longer needed.