GPU Benchmark_ITT segfaults with MPI and ranks > 1 #393

Open
james-simone opened this issue Apr 14, 2022 · 9 comments
Comments

@james-simone

james-simone commented Apr 14, 2022

Hi,

Benchmark_ITT segfaults in an MPI run when nranks > 1, just after printing "Initialised RNGs". I have observed the same segfault on Perlmutter as well as on a local cluster.
$ mpirun -np 2 ./Benchmark_ITT --mpi 1.1.1.2 --shm 2048

Git tag: 605cf40, although I found the same issue with earlier working trees as far back as mid-February.

Environment (loaded modules): gnu10/10.2.0, cuda11/11.6.0, openmpi3/3.1.4
A similar environment was used on Perlmutter: gnu10 and CUDA 11.5, as I recall.

$ ../configure --enable-simd=GPU --enable-accelerator=cuda --enable-comms=mpi3-auto --enable-gen-simd-width=32 --enable-openmp CXX=nvcc CXXFLAGS="-ccbin mpicxx -gencode arch=compute_70,code=sm_70 -std=c++14"
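
As a side note on triage: whether the MPI underneath is actually CUDA-aware can be probed directly when the MPI is OpenMPI. The sketch below is a hypothetical helper, not part of Grid; the MPIX extension it uses is OpenMPI-specific and does not exist in MPICH.

// cuda_aware_probe.cpp -- hypothetical helper, not part of Grid: reports
// whether the OpenMPI being linked advertises CUDA-aware support.
// Build and run: mpicxx cuda_aware_probe.cpp -o cuda_aware_probe && ./cuda_aware_probe
#include <cstdio>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   // OpenMPI-only header; defines MPIX_CUDA_AWARE_SUPPORT
#endif

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
  // Compile-time support exists; the runtime can still have it disabled.
  printf("CUDA-aware: compile-time yes, runtime %s\n",
         MPIX_Query_cuda_support() ? "yes" : "no");
#else
  printf("No compile-time CUDA-aware support advertised by this MPI.\n");
#endif
  MPI_Finalize();
  return 0;
}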

@james-simone
Author

Adding a note about segfaults from other benchmarks run on our local cluster:
Benchmark_comms: faults after "Benchmarking concurrent STENCIL halo exchange in 1 dimensions"
Benchmark_comms_host_device: faults after "Benchmarking sequential halo exchange from GPU memory"
Benchmark_memory_asynch: runs OK
Benchmark_dwf: faults after "Setting up Cshift based reference"

@james-simone
Author

On Perlmutter, I built Grid at commit c0d56a1 according to the recipe in ./systems/Perlmutter. Benchmark_ITT generates a segfault for parallel runs, while the code runs correctly on a single GPU.

$ srun -n4 ~/grid/bench/perlmutter/bind_gpu4.sh ./benchmarks/Benchmark_ITT --mpi 1.1.2.2 --threads 8 --shm 2048 --debug-mem --debug-signals
... edited ...
Grid : Message : 3.494448 s : Benchmark DWF on 32^4 local volume
Grid : Message : 3.494451 s : * Nc             : 3
Grid : Message : 3.494453 s : * Global volume  : 32 32 64 64
Grid : Message : 3.494462 s : * Ls             : 1
Grid : Message : 3.494464 s : * ranks          : 4
Grid : Message : 3.494466 s : * nodes          : 1
Grid : Message : 3.494469 s : * ranks/node     : 4
Grid : Message : 3.494471 s : * ranks geom     : 1 1 2 2
Grid : Message : 3.494474 s : * Using 8 threads
Grid : Message : 3.494476 s : ==================================================================================
Grid : Message : 3.708264 s : Initialised RNGs
(GTL DEBUG: 1) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 3) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
(GTL DEBUG: 2) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
MPICH ERROR [Rank 1] [job id 2148669.5] [Tue May 10 07:44:32 2022] [nid001520] - Abort(70900482) (rank 1 in comm 0): Fatal error in PMPI_Sendrecv: Invalid count, error stack:
PMPI_Sendrecv(249)........................: MPI_Sendrecv(sbuf=0x7f3edfc00000, scount=589824, MPI_CHAR, dest=3, stag=1, rbuf=0x7f3d29520000, rcount=589824, MPI_CHAR, src=3, rtag=3, comm=0xc4000199, status=0x1) failed
... edited...
(unknown)(): Invalid count
BackTrace Strings: 0 ./benchmarks/Benchmark_ITT() [0x450419]
BackTrace Strings: 1 /lib64/libc.so.6(+0x4db09) [0x7fcf57d33b09]
BackTrace Strings: 2 /lib64/libc.so.6(+0x4dc9a) [0x7fcf57d33c9a]
BackTrace Strings: 3 /opt/cray/pe/lib64/libpmi2.so.0(PMI_Get_base_rank_in_app+0) [0x7fcf57408dd2]
BackTrace Strings: 4 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x202cca2) [0x7fcf5b0e9ca2]
BackTrace Strings: 5 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(+0x1ebfa5c) [0x7fcf5af7ca5c]
BackTrace Strings: 6 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPIR_Err_return_comm+0x11b) [0x7fcf5af7cb8b]
BackTrace Strings: 7 /opt/cray/pe/lib64/libmpi_gnu_91.so.12(MPI_Sendrecv+0x3b9) [0x7fcf59aa2729]
BackTrace Strings: 8 /global/common/software/nersc/pm-2021q4/sw/darshan/3.3.1/lib/libdarshan.so(MPI_Sendrecv+0x84) [0x7fcf5bd1ddd4]
BackTrace Strings: 9 ./benchmarks/Benchmark_ITT() [0x46fcd5]
BackTrace Strings: 10 ./benchmarks/Benchmark_ITT() [0x57168d]
BackTrace Strings: 11 ./benchmarks/Benchmark_ITT() [0x572039]
BackTrace Strings: 12 ./benchmarks/Benchmark_ITT() [0x572dd6]
BackTrace Strings: 13 ./benchmarks/Benchmark_ITT() [0x573cee]
BackTrace Strings: 14 ./benchmarks/Benchmark_ITT() [0x574baf]
BackTrace Strings: 15 ./benchmarks/Benchmark_ITT() [0x484b65]
BackTrace Strings: 16 ./benchmarks/Benchmark_ITT() [0x4445fc]
BackTrace Strings: 17 ./benchmarks/Benchmark_ITT() [0x411b51]
BackTrace Strings: 18 /lib64/libc.so.6(__libc_start_main+0xef) [0x7fcf57d1b2bd]
BackTrace Strings: 19 ./benchmarks/Benchmark_ITT() [0x4164ca]
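
For context on the three GTL DEBUG lines above: they come from CUDA's IPC machinery, which refuses to map the same exported allocation into a process twice. The sketch below is a hypothetical illustration, not Grid code; it assumes two ranks sharing device 0 on one node, and should reproduce that error in isolation.

// ipc_double_open.cpp -- minimal illustration of the CUDA IPC behaviour
// behind the CUDA_ERROR_ALREADY_MAPPED GTL debug lines; not Grid code.
// Run with two ranks on one node, e.g.: srun -n2 ./ipc_double_open
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaSetDevice(0);  // both ranks on device 0 of one node, for illustration

  cudaIpcMemHandle_t handle;
  void *buf = nullptr;
  if (rank == 0) {
    cudaMalloc(&buf, 1 << 20);          // exporter allocates a device buffer
    cudaIpcGetMemHandle(&handle, buf);  // and exports an IPC handle for it
  }
  MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, MPI_COMM_WORLD);

  if (rank == 1) {
    void *peer1 = nullptr, *peer2 = nullptr;
    cudaError_t e1 = cudaIpcOpenMemHandle(&peer1, handle,
                                          cudaIpcMemLazyEnablePeerAccess);
    // Re-opening the same handle in the same process is the failure mode:
    // the second call returns cudaErrorAlreadyMapped, the runtime-API
    // spelling of the CUDA_ERROR_ALREADY_MAPPED seen in the GTL output.
    cudaError_t e2 = cudaIpcOpenMemHandle(&peer2, handle,
                                          cudaIpcMemLazyEnablePeerAccess);
    printf("first open: %s, second open: %s\n",
           cudaGetErrorString(e1), cudaGetErrorString(e2));
    if (e1 == cudaSuccess) cudaIpcCloseMemHandle(peer1);
  }

  MPI_Barrier(MPI_COMM_WORLD);  // keep the exporter alive until the import is done
  if (rank == 0) cudaFree(buf);
  MPI_Finalize();
  return 0;
}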

@james-simone
Author

Unfortunately, this problem still persists on the develop branch at 042ab1a, dated Mon Jun 27, 2022.

@lcebaman

Any updates on this?

@james-simone
Author

Unfortunately, no updates. I see similar segfaults on systems other than Perlmutter. I suspect it is more of a problem with the MPICH family of MPI implementations and later versions of Grid, though OpenMPI has also shown segfaults.

@knepley

knepley commented Jun 28, 2023

Are you using GPU-aware MPI? We have seen several unexplained segfaults with it that vanish when using the normal build of MPI. So far, the implementors have not been motivated to fix these.

@lcebaman

I see the same segfaults using CUDA-aware OpenMPI; I cannot confirm whether this is the case with normal MPI. Do you suggest using normal OpenMPI instead?

@knepley

knepley commented Jun 28, 2023

Yes
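
For reference, the usual switches for taking the GPU-aware path out of the picture look like the following. This is a sketch, assuming Cray MPICH on Perlmutter and an smcuda- or UCX-backed OpenMPI on the local cluster, not a confirmed fix for this issue:

# Cray MPICH (Perlmutter): disable GPU-aware transfers at runtime
$ export MPICH_GPU_SUPPORT_ENABLED=0

# OpenMPI with the smcuda BTL: keep CUDA support but switch off CUDA IPC
$ mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 2 ./Benchmark_ITT --mpi 1.1.1.2 --shm 2048

# OpenMPI over UCX: exclude the CUDA IPC transport
$ export UCX_TLS=^cuda_ipc

Note that once GPU-aware transfers are off, the application has to stage message buffers through host memory; handing device pointers to a plain MPI will fail in its own way.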

@lcebaman

There must be something else going on:

 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000000cfcc3 __memmove_avx_unaligned_erms()  :0
 2 0x000000000004c194 ucp_dt_pack()  ???:0
 3 0x00000000000853e4 ucp_tag_offload_unexp_eager()  ???:0
 4 0x000000000001b962 uct_mm_ep_am_bcopy()  ???:0
 5 0x0000000000085a14 ucp_tag_offload_unexp_eager()  ???:0
 6 0x0000000000090897 ucp_tag_send_nbx()  ???:0
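
Those frames are all host-side: ucp_dt_pack() ending in __memmove_avx_unaligned_erms() means UCX is packing the message with a plain CPU memcpy, which would segfault if the buffer is actually device memory. If that reading is right, two documented UCX environment variables may be worth trying; this is an assumption on my part, not a confirmed fix:

# Disable UCX's pointer-type (memtype) cache, a known source of CUDA
# buffers being misclassified as host memory:
$ export UCX_MEMTYPE_CACHE=n

# Verbose UCX logging, to see which transport/pack path is chosen:
$ export UCX_LOG_LEVEL=info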
