
free(): double free detected in tcache 2 Error message from parallel HDF5 MPI-IO using one-sided ROMIO aggregation #6983

Open
pkcoff opened this issue Apr 18, 2024 · 0 comments


Using the mpich build 'mpich/20231026/icc-all-pmix-gpu' on sunspot, I am seeing the following error:
free(): double free detected in tcache 2
I am using the HDF5 h5bench exerciser benchmark, which uses collective MPI-IO for the backend. To get this error I need to use one-sided aggregation, which requires forcing the Lustre file system driver via the following env vars:

ROMIO_FSTYPE_FORCE=lustre:
ROMIO_WRITE_AGGMETHOD=2
ROMIO_READ_AGGMETHOD=2

This will do ROMIO one-sided aggregation using a derived type to transfer the data to the collective buffer. If I additionally specify this env var:

ROMIO_ONESIDED_ALWAYS_RMW=1

The error goes away. This additional setting tells ROMIO to do a read-modify-write for every collective buffer aggregation. HDF5 does a lot of read-modify-write anyway, but maybe not for every call, so this setting probably results in more reads; looking at the one-sided code, though, I can't see anything that explains why that would matter, so maybe it is a timing issue? Also, if I set:

ROMIO_WRITE_AGGMETHOD=1
ROMIO_READ_AGGMETHOD=1

so that no derived type is used for the one-sided aggregation and multiple MPI_Put / MPI_Get calls are instead issued for each contiguous chunk of data, this error goes away, but instead I get data corruption in the HDF5 file. I have seen data corruption with this benchmark before using just the regular GEN aggregation in previous MPICH builds, and it went away with this build, so I suspect a broader issue in the messaging layer this ROMIO code uses rather than an issue with the one-sided aggregation code itself. To reproduce on sunspot:
Start interactive job: qsub -lwalltime=60:00 -lselect=1 -A Aurora_deployment -q workq -I

cd /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir
module unload mpich/icc-all-pmix-gpu/52.2
module use /soft/preview-modulefiles/24.086.0
module load mpich/20231026/icc-all-pmix-gpu
export ROMIO_FSTYPE_FORCE=lustre:
export ROMIO_WRITE_AGGMETHOD=2
export ROMIO_READ_AGGMETHOD=2
export ROMIO_HINTS=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/romio_hints
export MPIR_CVAR_ENABLE_GPU=1
export MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=mpir
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
export LD_LIBRARY_PATH=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir:/soft/datascience/aurora_nre_models_frameworks-2024.0/lib/
export FI_PROVIDER=cxi
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_CQ_FILL_PERCENT=20
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_OVFLOW_BUF_SIZE=8388608

LD_PRELOAD=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libdarshan.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libhdf5.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libpnetcdf.so mpiexec -np 16 -ppn 16 --cpu-bind=verbose,list:4:56:5:57:6:58:7:59:8:60:9:61:10:62:11:63 --no-vni -envall -genvall /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./hdf5Exerciser --numdims 3 --minels 128 128 128 --nsizes 1 --bufmult 2 2 2 --metacoll --addattr --usechunked --maxcheck 100000 --fileblocks 128 128 128 --filestrides 128 128 128 --memstride 128 --memblock 128

You should see this:

free(): double free detected in tcache 2
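For context on why ROMIO_ONESIDED_ALWAYS_RMW=1 could mask the failure: the setting makes the aggregator read the file region into the collective buffer before overlaying incoming data, so any holes in the written ranges keep the file's previous contents instead of whatever the buffer happened to hold. Below is a toy Python model of that difference; it is my own sketch for illustration (the function name and the zero-filled buffer are inventions, not ROMIO source):

```python
def aggregate_write(file_bytes, offset, cb_len, incoming, always_rmw):
    """Simulate one aggregator flush of a collective buffer.

    incoming: list of (start, data) spans destined for the buffer;
    gaps between spans are 'holes' not covered by any writer.
    """
    if always_rmw:
        # ROMIO_ONESIDED_ALWAYS_RMW=1: pre-fill the buffer by reading
        # the file region, so holes keep the file's old contents.
        cb = bytearray(file_bytes[offset:offset + cb_len])
    else:
        # No RMW: buffer starts with whatever it holds (modeled as zeros).
        cb = bytearray(cb_len)
    for start, data in incoming:
        cb[start:start + len(data)] = data  # stands in for MPI_Put
    # The aggregator writes the whole buffer region back in one shot.
    file_bytes[offset:offset + cb_len] = cb

f = bytearray(b"\xaa" * 8)
aggregate_write(f, 0, 8, [(2, b"\x11\x11")], always_rmw=True)
# holes (bytes 0-1, 4-7) retain the file's 0xaa contents
f2 = bytearray(b"\xaa" * 8)
aggregate_write(f2, 0, 8, [(2, b"\x11\x11")], always_rmw=False)
# holes are clobbered with the buffer's stale (zero) contents
```

This is only a model of the write-path semantics, not a claim about where the double free originates; it just shows why forcing RMW changes which bytes the aggregator touches on every round.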