
Injection of hot electron beam raises error double free or corruption (out) #349

Open
weipengyao opened this issue Dec 29, 2020 · 10 comments

@weipengyao
Contributor

Description

I am using the injection module for hot electron transport in a solid target.

When the temperature of the injected electron beam is high, e.g. Te = 100 keV, the code runs for a few hundred steps and then crashes with the error double free or corruption (out): 0x0000000003b5f6f0 ***.
When the temperature is reduced, e.g. Te = 50 eV, the code runs fine (at least within the simulation time).

Please find the related output files here:

a.out.txt
test.py.txt
a.err.txt

Steps to reproduce the problem

To reproduce the problem, just use the namelist above and compare the two cases with different temperatures.

This information about iterator validity might also be helpful.

Parameters

make env gives:

SMILEICXX : mpicxx
PYTHONEXE : python3
MPIVERSION :
VERSION : b'v4.4-784-gc3f8cc8'-b'work'
OPENMP_FLAG : -fopenmp -D_OMP
HDF5_ROOT_DIR : /scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
SITEDIR : /home/a/anticipa/weipeng/.local/lib/python3.6/site-packages
PY_CXXFLAGS : -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
PY_LDFLAGS : -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic
CXXFLAGS : -D__VERSION=\"b'v4.4-784-gc3f8cc8'-b'work'\" -D_VECTO -std=c++11 -Wall  -I/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/include -Isrc -Isrc/Params -Isrc/ElectroMagnSolver -Isrc/ElectroMagn -Isrc/ElectroMagnBC -Isrc/Particles -Isrc/Radiation -Isrc/Ionization -Isrc/Interpolator -Isrc/Collisions -Isrc/Merging -Isrc/Tools -Isrc/Python -Isrc/Projector -Isrc/DomainDecomposition -Isrc/MovWindow -Isrc/Profiles -Isrc/picsar_interface -Isrc/Checkpoint -Isrc/Pusher -Isrc/Field -Isrc/MultiphotonBreitWheeler -Isrc/SmileiMPI -Isrc/Species -Isrc/Diagnostic -Isrc/ParticleInjector -Isrc/Patch -Ibuild/src/Python -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/include/python3.6m -I/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib/python3.6/site-packages/numpy/core/include -DSMILEI_USE_NUMPY -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -O3 -g  -fopenmp -D_OMP
LDFLAGS : -L/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5/lib   -lhdf5 -lpython3.6m -lpthread -ldl -lutil -lm -Xlinker -export-dynamic -L/scinet/niagara/software/2019b/opt/base/python/3.6.8/lib -lm -fopenmp -D_OMP
weipengyao added the bug label Dec 29, 2020
xxirii self-assigned this Dec 30, 2020
@xxirii
Contributor

xxirii commented Jan 12, 2021

Dear @weipengyao,

I am studying your issue and so far I have not been able to reproduce your problem.

If you use a supercomputer, can you show me your launch script (the one you use to launch the simulation) or the exact configuration you use (number of MPI tasks, OpenMP threads, ...)?

Thank you

@weipengyao
Contributor Author

Dear @xxirii,

Thanks for your time and reply.

I checked again with the attached namelist and found that this error occurs (at timestep 200) with 160 cores, but not with 40 cores (where it might simply happen later).

I am running this on the Niagara supercomputer, launching with smilei.sh 160 test.py on the debug cluster, with 4 nodes (40 cores per node).
From a.out.txt, you may notice that I use:

...
Initializing MPI
 --------------------------------------------------------------------------------
	 MPI_THREAD_MULTIPLE enabled
	 Number of MPI process : 160
	 Number of patches : 
		 dimension 0 - number_of_patches : 128
		 dimension 1 - number_of_patches : 128
	 Patch size :
		 dimension 0 - n_space : 20 cells.
		 dimension 1 - n_space : 20 cells.
	 Dynamic load balancing: never
 
 OpenMP
 --------------------------------------------------------------------------------
	 Number of thread per MPI process : 1
...
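For reference, this pure-MPI launch corresponds roughly to the following Slurm sketch (only an illustration: smilei.sh is the site wrapper and its contents are not shown here, the executable is assumed to be ./smilei, and the exact sbatch options may differ):

#!/bin/bash
#SBATCH --nodes=4                 # 4 Niagara nodes, 40 cores each
#SBATCH --ntasks-per-node=40      # 160 MPI tasks in total
#SBATCH --cpus-per-task=1         # pure MPI: 1 OpenMP thread per task
#SBATCH --time=01:00:00

module load NiaEnv/2019b intel/2019u3 intelmpi/2019u3 hdf5/1.10.5 python/3.6.8
export OMP_NUM_THREADS=1

srun ./smilei test.py             # namelist attached above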

Let me know if you need anything else.

Best,
Yao

@xxirii
Contributor

xxirii commented Jan 12, 2021

Thank you. Do you use any particular OpenMP environment variables, such as a specific schedule (OMP_SCHEDULE) or thread placement?

@weipengyao
Contributor Author

I don't think I do.

Here is the script I use to compile Smilei on Niagara (I hope it can help anyone else using Smilei there).
compile_smilei_niagara.sh.txt

To save you from downloading it, it reads:

module purge
module load NiaEnv/2019b
module load intel/2019u3
module load intelmpi/2019u3
module load hdf5/1.10.5
module load python/3.6.8
export HDF5_ROOT_DIR=/scinet/niagara/software/2019b/modules/intel-2019u3-intelmpi-2019u3/hdf5/1.10.5
export PYTHONEXE=python3

export OMP_NUM_THREADS=1
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true
export OMPI_MCA_btl_portals4_use_rdma=0

# For MPI-tags:
export MPIR_CVAR_CH4_OFI_TAG_BITS=26
export MPIR_CVAR_CH4_OFI_RANK_BITS=13

The only 'special' settings I have are for the MPI-tag related issue (#307).

I checked my ~/.bashrc, and I don't have anything related there.
Do you think I need to check any other possible places?

Thanks!
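(As a generic sanity check, not something specific to Smilei, one can also print whatever OpenMP variables the job environment actually contains:

env | grep -E '^OMP_|^KMP_' || echo "no OMP_/KMP_ variables set"

which, with the script above, should only show the three OMP_ exports.)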

@xxirii
Contributor

xxirii commented Jan 12, 2021

Thank you,

I have managed to reproduce the bug using exactly your configuration. It does not appear when using a hybrid mode with more than 1 OpenMP thread per MPI task. I will investigate, but you should be able to run your case in hybrid mode if you need the results soon for your science.

Moreover, in my case I get an HDF5 issue when I use the debug_every variable in the Collisions block, so if you see the same thing you can comment it out.

@xxirii
Contributor

xxirii commented Jan 12, 2021

For instance, using 16 MPI tasks and 10 OpenMP threads per task, I am at iteration 3700 after 8 minutes.
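In Slurm terms, such a hybrid run is roughly the following sketch (again only an assumption about the launch, to contrast with the pure-MPI script earlier in the thread):

#!/bin/bash
#SBATCH --nodes=4                 # still 4 x 40 = 160 cores in total
#SBATCH --ntasks-per-node=4       # 16 MPI tasks in total
#SBATCH --cpus-per-task=10        # 10 OpenMP threads per task

export OMP_NUM_THREADS=10
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true

srun ./smilei test.py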

@weipengyao
Contributor Author

Dear @xxirii,

Thanks for the timely reply.

In my case, I need to use ten times more cores, i.e. 1600, with more particles per cell (ppc = 256) in order to suppress the noise.
It seems that the crash appears once the number of cores exceeds a certain value (which would explain why the 16x10 scheme works).

As for the debug_every related HDF5 issue, I do not see it in my case, for now. But I remember that when I tried to use multiple species in the Collisions block a while ago, there was a problem (see #307).

I hope it helps.

@xxirii
Contributor

xxirii commented Jan 12, 2021

Right, it's surprising to see that it works with 159 MPI tasks and segfaults with 160. Very strange.

@xxirii
Contributor

xxirii commented Jan 12, 2021

Note that the bug only occurs when I use exactly 160 cores. When I use more, it seems to work. Have you tried a case with more ppc and more MPI tasks that crashes?

@weipengyao
Contributor Author

Yes, I have. Please see this output file for example.

HEB2D_dep2_Inj128_Z10_T100_np1_Th1k_FixIon_SBC_Collee.py-4673320.out.txt
