Initial GPU port based on CUDA. #22

Open · wants to merge 1 commit into master

Conversation

@sfantao commented Oct 31, 2018

This patch introduces acceleration code in Code_Saturne for NVIDIA GPUs. It is a partial port in the sense that only a limited set of test cases is supported.

This has been tested on OpenPOWER platforms, but it should also work on other platforms that support CUDA. We tested it on both Power8 + P100 and Power9 + V100 machines. On the former you should expect over 2x speedup at scale when there are more than 100k cells per GPU. On the latter the speedup rises to at least 3x while also providing better strong scaling; we ran the code successfully on the Summit supercomputer at Oak Ridge National Laboratory on up to 512 nodes.

The overall idea is to reduce the impact of latencies in the various vector and matrix-vector operations. We employ a template packing technique to statically bundle multiple operations into the same CUDA kernel. We also create data environments to keep data resident on the GPU for longer.

The GPU acceleration port itself lives in /src/cuda; its entry points are invoked from various places throughout the code.

This code is prepared to be launched with the NVIDIA Multi-Process Service (MPS), so that multiple ranks can share the same GPU. We tested this successfully with up to 5 ranks per GPU. For this to work, GPU visibility must be set so that each rank sees only the GPU it is meant to use.

The patch introduces a way to determine the number of local ranks, which expects an Open MPI-compatible environment, e.g. IBM Spectrum MPI.

The patch also changes the build system so that the code can easily be built with GPU support. Building without GPU support is equivalent to running Code_Saturne in its current, CPU-only form.

To build the code you should use a C/C++ compiler that supports C++11, since the CUDA code requires it. Here's an example of how to build the code (note the --enable-cuda-offload flag):

module load openmpi gcc/6.4.0 cuda

git clone https://github.com/sfantao/code_saturne src

cd src && ./sbin/bootstrap
cd -

mkdir obj && cd obj
../src/configure \
--disable-shared \
--enable-static \
--enable-openmp \
--enable-cuda-offload \
--enable-long-gnum \
--host=ppc64le \
--build=ppc64le \
--without-modules \
--disable-gui \
--without-libxml2 \
--without-hdf5 \
--without-salome-kernel \
--without-salome-gui \
--prefix=`pwd`/../install    \
CC=mpicc CFLAGS="-g -O3" \
CXX=mpic++ CXXFLAGS="-g -O3" \
FC=mpifort FCFLAGS="-g -O3" && make install

There are multiple ways to run with MPS support; we used both IBM Spectrum LSF and LSF+CSM. Here is an example LSF script for submitting a job:

#!/bin/bash
#
#BSUB -J code_saturne_gpu_templ  # job name
#BSUB -W 01:30                   # wall-clock time (hrs:mins)
#BSUB -q normal                  # queue
#BSUB -e errors.%J.log           # error file name in which %J is replaced by the job ID
#BSUB -oo output.%J.log          # output file name in which %J is replaced by the job ID
# #BSUB -x                         # exclusive mode
#BSUB -n 20                      # number of tasks in the job - should be a multiple of the 4 GPUs per node
#BSUB -R "span[ptile=20]"        # make sure we have the same number of ranks in each node 
#BSUB -gpu "num=4:mode=shared"   # activate the 4 GPUs
#---------------------------------------

ulimit -s 10240
export NUM_PROCS=`echo "$LSB_HOSTS" | wc -w`

unset OPAL_OUTPUT_REDIRECT
export BIND_THREADS=yes

# Change the number of OpenMP threads as required.
export OMP_NUM_THREADS=8

# Run the solver, making sure ranks are distributed by socket.
mpirun --report-bindings --map-by socket --bind-to core --rank-by core -np $NUM_PROCS ../../cs_solver_gpu &> myout.log

Here, ../../cs_solver_gpu is a proxy script that starts MPS servers (one per GPU) and launches the cs_solver application. Here are its contents:

#!/bin/bash

if [ -z "$OMPI_COMM_WORLD_LOCAL_SIZE" ]; then
  let OMPI_COMM_WORLD_LOCAL_SIZE=1
  let OMPI_COMM_WORLD_LOCAL_RANK=0
fi

Devices=`nvidia-smi | grep Tesla | wc -l`
Sockets=`lscpu | grep Socket | sed 's/[^0-9]*//g'`

# The code is prepared to read the number of devices in 
# the system from this variable. 
export CS_NUMBER_OF_GPUS_IN_THE_SYSTEM=$Devices

# We assume that ranks are distributed by socket.

# Ranks per device is the ceiling of #Ranks / #Devices
let RanksPerDevice=(OMPI_COMM_WORLD_LOCAL_SIZE+Devices-1)/Devices
let DeviceID=OMPI_COMM_WORLD_LOCAL_RANK/RanksPerDevice

# We select one rank to start the MPS server for a given device.
let NotDeviceMaster=OMPI_COMM_WORLD_LOCAL_RANK%RanksPerDevice

#---------------------------------------------
# start MPS
#---------------------------------------------
if [ $NotDeviceMaster = 0 ]; then
  if [ $OMPI_COMM_WORLD_RANK = 0 ]; then
    echo starting mps ...
  fi

  rm -rf /dev/shm/${USER}/mps_$DeviceID
  rm -rf /dev/shm/${USER}/mps_log_$DeviceID
  mkdir -p /dev/shm/${USER}/mps_$DeviceID
  mkdir -p /dev/shm/${USER}/mps_log_$DeviceID
  export CUDA_VISIBLE_DEVICES=$DeviceID
  export CUDA_MPS_PIPE_DIRECTORY=/dev/shm/${USER}/mps_$DeviceID
  export CUDA_MPS_LOG_DIRECTORY=/dev/shm/${USER}/mps_log_$DeviceID
  /usr/bin/nvidia-cuda-mps-control -d
fi

# Make sure all ranks proceed only after the MPS servers have started.
sleep 5

#---------------------------------------------
# set CUDA_MPS_PIPE_DIRECTORY per MPI rank
#---------------------------------------------
printf -v myfile "/dev/shm/${USER}/mps_%d" $DeviceID

echo "Rank $OMPI_COMM_WORLD_LOCAL_RANK is using device $DeviceID"

export CUDA_MPS_PIPE_DIRECTORY=$myfile
unset CUDA_VISIBLE_DEVICES

#---------------------------------------------
# run the program
#---------------------------------------------
./cs_solver

#---------------------------------------------
# stop  MPS
#---------------------------------------------
if [ $NotDeviceMaster = 0 ]; then
  if [ $OMPI_COMM_WORLD_RANK = 0 ]; then
    echo stopping mps ...
  fi

  export CUDA_MPS_PIPE_DIRECTORY=/dev/shm/${USER}/mps_$DeviceID
  echo "quit" | /usr/bin/nvidia-cuda-mps-control
  sleep 1
  rm -rf /dev/shm/${USER}/mps_$DeviceID
  rm -rf /dev/shm/${USER}/mps_log_$DeviceID

fi

One MPS server per GPU may be overkill; 2 per GPU is in most cases sufficient.

We tested the code with a cavity flow case. Here is an example using a 13M-cell mesh:

https://ibm.box.com/s/2rhbavxqgxhvrfi4ws98w36h74i7aqat

To run it, download the test case from this link and launch the job from cs_test/SRC as in the LSF script above.
