
MPI2 romio321 library fails when reading >= 2GB per rank #381

Open
mmphys opened this issue Jan 25, 2022 · 2 comments

mmphys commented Jan 25, 2022

Git commit

develop HEAD 135808d

Target Platform

University of Edinburgh Extreme Scaling system “Tursa”
Each node: 2 x AMD ROME EPYC 32, Nvidia A100 (40GB), 1TB RAM
Linux tursa-login1 4.18.0-305.10.2.el8_4.x86_64 #1 SMP Mon Jul 12 04:43:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Configure

../configure --enable-comms=mpi --enable-simd=GPU --enable-shm=nvlink --enable-gen-simd-width=64 --enable-accelerator=cuda --enable-accelerator-cshift --enable-unified \
--with-gmp=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/gmp-6.2.1-4qzl4yfdllwmf42zewg44gb4y54bgy2d \
--with-mpfr=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/mpfr-4.1.0-agsa52nljiqbbrzrpln5ebgclzxesm7a \
--with-fftw=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/fftw-3.3.10-bdpumbnknoewgtzgirxrvy3weveminw3 \
--with-hdf5=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/hdf5-1.10.7-qld75yuu7gpncparpqq46hvuqzz4s6zx \
--with-lime=/mnt/lustre/tursafs1/home/dp207/dp207/shared/env/spack/opt/spack/linux-rhel8-zen2/gcc-9.4.0/c-lime-2-3-9-ie76iwlrgadc24aniq57wz5rv7dmt4b4 \
CXX=nvcc \
CXXFLAGS='-ccbin mpicxx -gencode arch=compute_80,code=sm_80 -std=c++14 -cudart shared -I/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/include' \
LDFLAGS='-cudart shared -L/mnt/lustre/tursafs1/apps/basestack/cuda-11.4/openmpi/4.1.1-cuda11.4/lib' \
LIBS='-lrt -lmpi' \
--prefix=/mnt/lustre/tursafs1/home/dp207/dp207/shared/runs/semilep/code/3/Prefix

Attachments

  • config.log
  • grid.configure.summary
  • GridMakeV1.txt Output from make V=1
  • MPIRead32.cpp Minimal reproducer available https://github.com/mmphys/MPIRead32
    • Bad.log Minimal reproducer output showing issue
    • Good.log Minimal reproducer output showing workaround
  • GaugeLoad.cpp Reproducer using Grid to load gauge field
    • Bad.log Grid reproducer output showing issue
    • Good.log Grid reproducer output showing workaround

Issue Description

When MPI2 is configured to use the romio321 library for I/O, MPI_File_read_all() fails when reading >=2GB into a single MPI rank.

Issue Workaround

Other MPI2 I/O libraries do not have this limit/bug; switching to ompio, for example, resolves the issue on Tursa.

Note: romio321 is currently the recommended MPI2 I/O library on Tursa, and the commissioning performance tests were carried out using it. On a single node I see a performance hit when using ompio (~5 GB/s) instead of romio321 (~10 GB/s), but I have not tested how this scales.
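If passing `--mca io ompio` on every `mpirun` command line is inconvenient, Open MPI MCA parameters can also be set through the environment using the standard `OMPI_MCA_` prefix; a sketch of the equivalent setting:

```shell
# Select the ompio component for all subsequent mpirun invocations
# (equivalent to passing --mca io ompio on the command line).
export OMPI_MCA_io=ompio
```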

Minimal reproducer -- MPIRead32.cpp

MPIRead32.cpp (https://github.com/mmphys/MPIRead32) is the minimal code needed to reproduce the issue. Note that it is independent of Grid.

To demonstrate the issue we run the following command on Tursa:

mpirun --mca io romio321 -np 2 MPIRead32 a.out 0 2.1 2304.4608 &> Bad.log

Re-running the same command, but this time choosing the ompio I/O library works around the issue:

mpirun --mca io    ompio -np 2 MPIRead32 a.out 0 2.1 2304.4608 > Good.log

Grid reproducer -- GaugeLoad.cpp

The issue was first noticed on Tursa when using Grid to load a Gauge field.

To demonstrate the issue we run the following command on Tursa:

mpirun --mca io romio321 -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1 &> GridBad.log

Re-running the same command, but this time choosing the ompio I/O library works around the issue:

mpirun --mca io ompio    -np 2 GaugeLoad /mnt/lustre/tursafs1/home/dp207/dp207/shared/dwf_2+1f/F1M/ckpoint_EODWF_lat.200 --grid 48.48.48.96 --mpi 2.1.1.1  > GridGood.log


@roblatham00

Sorry to hear you are running into problems with ROMIO from MPICH 3.2.1.

The patch that promotes the offending datatype to a 64-bit value is this one: pmodels/mpich@3a479ab0, though it might not be worth backporting to whichever version of Open MPI you are running: Open MPI has updated its ROMIO to 3.4.1, which should contain the fix.


mmphys commented Feb 16, 2022

Thanks for the pointer to the fix. Will ask whether we can update Tursa to Open MPI's ROMIO 3.4.1.
