Cuda error out of memory with Wilson Fermions on Volta V100 GPUs #362

LupoA commented Jun 30, 2021

Hi, I am having trouble running some applications with Grid (develop branch) on Marconi100 at CINECA (IBM Power AC922 nodes, 2 x POWER9 CPUs and 4 NVIDIA Volta V100 GPUs per node, NVLink 2.0).

I am using

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Thu_Oct_24_17:58:26_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

gcc (GCC) 8.4.0

mpirun (IBM Spectrum MPI) 10.3.1.02rtm0

The error appears in several tests involving Wilson fermions. The application we ultimately need to run is Test_hmc_WCMixedRepFG_Production with Nc=4, but the error can be reproduced simply by running Test_wilson_force with Nc=3, see below. The error message is:

accelerator_barrier(): Cuda error out of memory 
File /m100/home/userexternal/alupo000/production_src/Grid/Grid/lattice/Lattice_local.h Line 83
Test_wilson_force: /m100/home/userexternal/alupo000/production_src/Grid/Grid/lattice/Lattice_local.h:83: Grid::Lattice<decltype (Grid::outerProduct(ll(), rr()))> Grid::outerProduct(const Grid::Lattice<vobj>&, const Grid::Lattice<obj2>&) [with ll = Grid::iScalar<Grid::iVector<Grid::iVector<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<double2> > >, 3>, 4> >; rr = Grid::iScalar<Grid::iVector<Grid::iVector<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<double2> > >, 3>, 4> >; decltype (Grid::outerProduct(ll(), rr())) = Grid::iScalar<Grid::iMatrix<Grid::iMatrix<Grid::Grid_simd<thrust::complex<double>, Grid::GpuVector<4, Grid::GpuComplex<double2> > >, 3>, 4> >]: Assertion `err==cudaSuccess' failed.
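For orientation: the assertion fires inside Grid's outerProduct() applied to two lattice fermion fields (the template arguments in the trace are SpinColourVectors). Schematically, and only as my own minimal illustration (this is not code taken from Test_wilson_force), the kind of call that trips it is:

#include <Grid/Grid.h>

using namespace Grid;

int main(int argc, char **argv)
{
  Grid_init(&argc, &argv);

  // Default 4d grid built from the --grid / --mpi command-line options.
  GridCartesian UGrid(GridDefaultLatt(),
                      GridDefaultSimd(Nd, vComplexD::Nsimd()),
                      GridDefaultMpi());

  GridParallelRNG RNG(&UGrid);
  RNG.SeedFixedIntegers(std::vector<int>({1, 2, 3, 4}));

  // Two random fermion fields; outerProduct() launches the GPU kernel whose
  // post-kernel error check is the assertion in Lattice_local.h above.
  LatticeFermionD left(&UGrid), right(&UGrid);
  gaussian(RNG, left);
  gaussian(RNG, right);

  auto colour_spin_matrix = outerProduct(left, right);

  Grid_finalize();
  return 0;
}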

Let me list a few cases in which the error appears. In all the following examples I am using 4^4 local lattices.

Test_wilson_force, Nc=3, 1 node and 2 GPUs: works (e.g. mpirun -np 2 Test_wilson_force --grid 4.4.4.8 --mpi 1.1.1.2)
Test_wilson_force, Nc=3, 1 node and 4 GPUs: fails (e.g. mpirun -np 4 Test_wilson_force --grid 4.4.8.8 --mpi 1.1.2.2)
Test_wilson_force, Nc=4, 1 node and 2 GPUs: fails (commands for the Nc=4 cases are spelled out below)
Test_wilson_force, Nc=4, 1 node and 4 GPUs: fails
Test_hmc_WCMixedRepFG_Production: always fails when running on GPUs.
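
For concreteness, the failing Nc=4 runs use the same 4^4 local volumes and node/GPU counts as the Nc=3 examples, just with a build configured with --enable-Nc=4, i.e. something like:

mpirun -np 2 Test_wilson_force --grid 4.4.4.8 --mpi 1.1.1.2
mpirun -np 4 Test_wilson_force --grid 4.4.8.8 --mpi 1.1.2.2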

Other information:
- Benchmark_dwf, Benchmark_ITT and Benchmark_comms_host_device work fine.
- The error does not appear on Jureca.
- I ran Test_wilson_force with the --mem-debug option, and the allocated-memory profile is the same on Jureca and Marconi100 (up to the point where the latter dies).
- Reducing --enable-gen-simd-width allows Test_wilson_force to run, but Test_hmc_WCMixedRepFG_Production still fails eventually, especially with Nc=4 (see the sketch after this list).

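To be explicit about what "reducing --enable-gen-simd-width" means: the trace above shows GpuVector<4, GpuComplex<double2> >, i.e. 64-byte SIMD vectors, and the workaround is to pass a smaller width explicitly at configure time, for example

--enable-gen-simd-width=32

(32 is just an example value) added to the configure line below.
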
The configure line I am using is based on the instructions for Summit on the Grid wiki.

../configure \
  --enable-comms=mpi \
  --enable-simd=GPU \
  CXX=nvcc \
  CXXFLAGS="-ccbin mpicxx -gencode arch=compute_70,code=sm_70 -I$HOME/prefix/include/ -std=c++11 -Xcompiler -mno-float128" \
  --disable-gparity \
  --enable-accelerator=cuda \
  --enable-shm=nvlink \
  --enable-accelerator-cshift \
  --enable-Nc=4

The CXXFLAGS entry "-Xcompiler -mno-float128" seems to be necessary when using CUDA 10 with gcc. The most recent CUDA version available on Marconi100 is 11.0, which I also tried (after disabling the macro error in CompilerCompatible.h), but the error persists.
I have tried adding and removing several configure options with no luck, e.g. the ones suggested in #346.

I am attaching config.log, grid.configure.summary and the output of make V=1.

thanks for the help,
Alessandro

config.log
grid.configure.summary.log
makeV1.log
