Using the mpich build 'mpich/20231026/icc-all-pmix-gpu' on sunspot I am seeing the following error:
free(): double free detected in tcache 2
I am using the HDF5 h5bench exerciser benchmark, which uses collective MPI-IO for the backend. To get this error I need to do one-sided aggregation, which requires using the Lustre file system, specified with the following env vars:
This will do ROMIO one-sided aggregation, using a derived type to transfer the data to the collective buffer (see the sketch below this discussion). If I additionally specify this env var:
ROMIO_ONESIDED_ALWAYS_RMW=1
the error goes away. This additional setting tells ROMIO to do a read-modify-write for every collective buffer aggregation. HDF5 does a lot of read-modify-write anyway, but maybe not for every call, so this setting probably results in more reads; looking at the one-sided code, though, I can't see anything that would explain why this helps. Maybe a timing issue? Also, if I set:
ROMIO_WRITE_AGGMETHOD=1
ROMIO_READ_AGGMETHOD=1
so that no derived type is used for the one-sided aggregation and instead multiple MPI_Put / MPI_Get calls are made for each contiguous chunk of data, the double-free error goes away, but I get data corruption in the HDF5 file instead. I have seen data corruption with this benchmark before, using just the regular GEN aggregation in previous MPICH builds, and it went away with this build, so I suspect a broader issue in the messaging layer that this ROMIO code uses rather than a problem in the one-sided aggregation code itself.
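For concreteness, here is a minimal standalone sketch of the two transfer strategies being toggled. This is not the actual ROMIO internals; the names, sizes, window setup, and the `use_derived_type` switch are all made up for illustration. The default path describes all of a rank's noncontiguous chunks with one derived datatype and issues a single MPI_Put into the aggregator's collective-buffer window, while the ROMIO_WRITE_AGGMETHOD=1 / ROMIO_READ_AGGMETHOD=1 path issues one MPI_Put per contiguous chunk:

```c
/* Illustrative only -- not the actual ROMIO one-sided aggregation code.
 * Run with at least 2 ranks: rank 0 acts as the aggregator, rank 1
 * pushes noncontiguous chunks into rank 0's collective buffer. */
#include <mpi.h>
#include <stdlib.h>

#define NCHUNKS  4
#define CHUNKLEN 256   /* doubles per contiguous chunk */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The aggregator exposes its collective buffer through an RMA window. */
    double *cb = NULL;
    MPI_Win win;
    MPI_Aint winsize = (rank == 0) ? NCHUNKS * CHUNKLEN * sizeof(double) : 0;
    MPI_Win_allocate(winsize, (int) sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &cb, &win);

    if (rank == 1) {
        /* Source data: NCHUNKS chunks separated by gaps of equal size. */
        double *src = malloc(NCHUNKS * 2 * CHUNKLEN * sizeof(double));
        for (int i = 0; i < NCHUNKS * 2 * CHUNKLEN; i++)
            src[i] = (double) i;

        int use_derived_type = 1;   /* 0 mimics ROMIO_*_AGGMETHOD=1 */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        if (use_derived_type) {
            /* Default path: one hindexed type covering every chunk,
             * moved with a single MPI_Put. */
            int blocklens[NCHUNKS];
            MPI_Aint displs[NCHUNKS];
            for (int i = 0; i < NCHUNKS; i++) {
                blocklens[i] = CHUNKLEN;
                displs[i] = (MPI_Aint) (i * 2 * CHUNKLEN) * sizeof(double);
            }
            MPI_Datatype srctype;
            MPI_Type_create_hindexed(NCHUNKS, blocklens, displs,
                                     MPI_DOUBLE, &srctype);
            MPI_Type_commit(&srctype);
            MPI_Put(src, 1, srctype, 0, 0,
                    NCHUNKS * CHUNKLEN, MPI_DOUBLE, win);
            MPI_Type_free(&srctype);
        } else {
            /* AGGMETHOD=1 path: one MPI_Put per contiguous chunk. */
            for (int i = 0; i < NCHUNKS; i++)
                MPI_Put(src + i * 2 * CHUNKLEN, CHUNKLEN, MPI_DOUBLE,
                        0, (MPI_Aint) i * CHUNKLEN, CHUNKLEN, MPI_DOUBLE, win);
        }
        MPI_Win_unlock(0, win);
        free(src);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Flipping `use_derived_type` in a toy like this is roughly the experiment the AGGMETHOD env vars perform inside ROMIO, which is why the two failure modes (double free vs. corruption) pointing at different paths is suggestive.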
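And for the ROMIO_ONESIDED_ALWAYS_RMW=1 behavior, a rough sketch of what read-modify-write means on the aggregator side; again the function and its arguments are hypothetical, not ROMIO's actual code:

```c
/* Illustrative only -- not ROMIO's actual code. With "always RMW" the
 * aggregator pre-reads the file range backing its collective buffer, so
 * any bytes the incoming MPI_Puts do not cover keep their on-disk
 * contents when the buffer is flushed back out. */
#include <mpi.h>

void flush_collective_buffer(MPI_File fh, MPI_Offset file_off,
                             char *cb, int cb_len, int always_rmw)
{
    MPI_Status status;
    if (always_rmw) {
        /* ROMIO_ONESIDED_ALWAYS_RMW=1: read before the RMA epoch
         * deposits the remote ranks' data into cb. */
        MPI_File_read_at(fh, file_off, cb, cb_len, MPI_BYTE, &status);
    }
    /* ... RMA exposure epoch: other ranks MPI_Put their pieces into cb ... */
    MPI_File_write_at(fh, file_off, cb, cb_len, MPI_BYTE, &status);
}
```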
So to reproduce on sunspot:
Start interactive job:
qsub -lwalltime=60:00 -lselect=1 -A Aurora_deployment -q workq -I
You should see this: