
Injection of hot electron beam raises error double free or corruption (out) #349

Open
weipengyao opened this issue Dec 29, 2020 · 10 comments

@weipengyao
Contributor

Description

I am using the injection module for hot electron transport in a solid target.

When the temperature of the injected electron beam is high, e.g. Te = 100 keV, the code runs for a few hundred steps and then crashes with the error double free or corruption (out): 0x0000000003b5f6f0 ***.
When the temperature is reduced, e.g. Te = 50 eV, the code runs fine (at least within the simulation time).

Please find the related output files here:

a.out.txt
test.py.txt
a.err.txt

Steps to reproduce the problem

To reproduce the problem, just use the namelist above and compare the two cases with different temperatures.

This information about iterator validity might also be helpful.

Parameters

make env gives:

SMILEICXX : mpicxx
PYTHONEXE : python3
MPIVERSION :
VERSION : b'v4.4-784-gc3f8cc8'-b'work'
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python3.6/site-packages
PY_CXXFLAGS : -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"b'v4.4-784-gc3f8cc8'-b'work'\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/lib   -lhdf5 -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib -lm -fopenmp -D_OMP
weipengyao added the bug label Dec 29, 2020
xxirii self-assigned this Dec 30, 2020
@xxirii
Contributor

xxirii commented Jan 12, 2021

Dear @weipengyao,

I am studying your issue and so far I have not been able to reproduce your problem.

If you use a supercomputer, can you show me your launch script (the one you use to launch the simulation) or the exact configuration you use (number of MPI tasks, OpenMP threads, ...)?

Thank you

@weipengyao
Contributor Author

Dear @xxirii,

Thanks for your time and reply.

I checked again with the attached namelist and found that this error occurs (at timestep 200) with 160 cores, but not with 40 cores (where it might simply happen later).

I am running this on the Niagara supercomputer, launching with smilei.sh 160 test.py on the debug cluster, with 4 nodes (40 cores per node).
From a.out.txt, you may notice that I use:

...
Initializing MPI
 --------------------------------------------------------------------------------
	 MPI_THREAD_MULTIPLE enabled
	 Number of MPI process : 160
	 Number of patches : 
		 dimension 0 - number_of_patches : 128
		 dimension 1 - number_of_patches : 128
	 Patch size :
		 dimension 0 - n_space : 20 cells.
		 dimension 1 - n_space : 20 cells.
	 Dynamic load balancing: never
 
 OpenMP
 --------------------------------------------------------------------------------
	 Number of thread per MPI process : 1
...
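For reference, this pure-MPI launch corresponds roughly to the following Slurm sketch (only an illustration: smilei.sh is the site wrapper and its contents are not shown here, the executable is assumed to be ./smilei, and the exact sbatch options may differ):

#!/bin/bash
#SBATCH --nodes=4                 # 4 Niagara nodes, 40 cores each
#SBATCH --ntasks-per-node=40      # 160 MPI tasks in total
#SBATCH --cpus-per-task=1         # pure MPI: 1 OpenMP thread per task
#SBATCH --time=01:00:00

module load NiaEnv/2019b intel/2019u3 intelmpi/2019u3 hdf5/1.10.5 python/3.6.8
export OMP_NUM_THREADS=1

srun ./smilei test.py             # namelist attached above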

Let me know if you need anything else.

Best,
Yao

@xxirii
Contributor

xxirii commented Jan 12, 2021

Thank you. Do you use any particular OpenMP environment variables, such as a specific schedule (OMP_SCHEDULE) or thread placement?

@weipengyao
Contributor Author

I don't think I do.

Here is the script I use to compile Smilei on Niagara (I hope it can help anyone else using Smilei there).
compile_smilei_niagara.sh.txt

To save you from downloading it, it reads:

module purge
module load NiaEnv/2019b
module load intel/2019u3
module load intelmpi/2019u3
module load hdf5/1.10.5
module load python/3.6.8
export HDF5_ROOT_DIR=/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
export PYTHONEXE=python3

export OMP_NUM_THREADS=1
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true
export OMPI_MCA_btl_portals4_use_rdma=0

# For MPI-tags:
export MPIR_CVAR_CH4_OFI_TAG_BITS=26
export MPIR_CVAR_CH4_OFI_RANK_BITS=13

The only 'special' settings I have are for the MPI-tag related issue (#307).

I checked my ~/.bashrc, and I don't have anything related there.
Do you think I need to check any other possible places?

Thanks!
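(As a generic sanity check, not something specific to Smilei, one can also print whatever OpenMP variables the job environment actually contains:

env | grep -E '^OMP_|^KMP_' || echo "no OMP_/KMP_ variables set"

which, with the script above, should only show the three OMP_ exports.)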

@xxirii
Contributor

xxirii commented Jan 12, 2021

Thank you,

I have managed to reproduce the bug using exactly your configuration. It does not appear when using a hybrid mode with more than 1 OpenMP thread per MPI task. I will investigate, but you should be able to run your case in hybrid mode if you need the results soon for your science.

Moreover, in my case I get an HDF5 issue when I use the debug_every variable in the Collisions block, so if you see the same thing you can comment it out.

@xxirii
Contributor

xxirii commented Jan 12, 2021

For instance, using 16 MPI tasks and 10 OpenMP threads per task, I am at iteration 3700 after 8 minutes.
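In Slurm terms, such a hybrid run is roughly the following sketch (again only an assumption about the launch, to contrast with the pure-MPI script earlier in the thread):

#!/bin/bash
#SBATCH --nodes=4                 # still 4 x 40 = 160 cores in total
#SBATCH --ntasks-per-node=4       # 16 MPI tasks in total
#SBATCH --cpus-per-task=10        # 10 OpenMP threads per task

export OMP_NUM_THREADS=10
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true

srun ./smilei test.py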

@weipengyao
Contributor Author

Dear @xxirii,

Thanks for the timely reply.

In my case, I need to use ten times more cores, i.e. 1600, with more particles per cell (ppc = 256) in order to suppress the noise.
It seems that the crash appears once the number of cores exceeds a certain value (which would explain why the 16x10 scheme works).

As for the debug_every related HDF5 issue, I do not see it in my case, for now. But I remember that when I tried to use multiple species in the Collisions block a while ago, there was a problem (see #307).

I hope it helps.

@xxirii
Contributor

xxirii commented Jan 12, 2021

Right, it's surprising to see that it works with 159 MPI tasks and segfaults with 160. Very strange.

@xxirii
Contributor

xxirii commented Jan 12, 2021

Note that the bug only occurs when I use exactly 160 cores. When I use more, it seems to work. Have you tried a case with more ppc and more MPI tasks that crashes?

@weipengyao
Contributor Author

Yes, I have. Please see this output file for example.

HEB2D_dep2_Inj128_Z10_T100_np1_Th1k_FixIon_SBC_Collee.py-4673320.out.txt
