Moving window crashes in large simulations #405

Open
DoubleAgentDave opened this issue May 18, 2021 · 10 comments
Labels: performances (computing time, memory consumption, load balancing, vectorization)

Comments

@DoubleAgentDave
Contributor

Hi, I'm having a lot of fairly critical crashes when using a moving window in large simulation runs (requiring >64 KNL nodes). When I run a test simulation at half resolution there are no problems, but I frequently get crashes in a number of scenarios. The basic outline of the simulation is that a laser hits a thin(ish) target and the window then follows the laser after it penetrates the target. The crashes occur in the following situations:

- the moving window alone: after roughly 80 timesteps it just crashes; this usually only occurs when there are >24+12+12 ppc across the different species
- the moving window plus file output, though not reliably
- the moving window plus load balancing: this reliably causes crashes in large simulations, so I've resorted to the particular scheme outlined below

The most reliable way to trigger a crash is to run load balancing while the bulk target density is moving out of the box. So I have basically resorted to a scheme where the code only load balances before the moving window starts and again after the target has moved out of the simulation box (a rough sketch follows this paragraph). This means the code performs extremely poorly while the target is exiting the box, and then a huge jump in performance occurs once it has left and the box is rebalanced. Occasionally reducing the number of patches in the window-movement direction avoids a crash, but this is not always reliable and can again result in poor performance.
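Roughly, the restricted balancing looks like the sketch below. This is not my actual namelist: it assumes that the every parameter of LoadBalancing accepts a Smilei time selection of the form [start, end, period] (in timesteps), and window_start_step is only a placeholder for the timestep at which the window starts moving.

window_start_step = 24000  # placeholder value, not taken from my run

LoadBalancing(
    initial_balance = True,
    # placeholder scheme: rebalance every 1400 steps, but only before the window starts moving
    every = [0, window_start_step, 1400],
    cell_load = 1.,
    frozen_particle_load = 1.0
)

The phase after the target has left the box is not covered by this single selection (as far as I know one time selection cannot describe two disjoint intervals), so that part needs a separate mechanism, e.g. a restart with a different every.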

I will gather as much information for this as possible, but I'd prefer to share the exact namelist file more privately by email, as it's for a pending publication.

I have used the multiple-decomposition facility; however, the crash has also happened without it in the past. There have also been occasions where the code has simply stalled without outputting any error at all.

ELI_film_low_LG_e7743507.txt
ELI_film_low_LG_o7743507.txt

I've plotted the CPU usage, memory usage and networking; nothing seems particularly out of the ordinary. Disk access looks totally normal as well, though that plot is not really comprehensible without some context, so I've not included it here.

[Attached remora plots: cpu_usage_knl, net_usage_knl, mem_usage_knl]

The machine file I made for compiling on the KNL nodes of the TACC Stampede2 cluster is here:
stampede2_knl.txt
The compile instructions:
module load python3
module load phdf5
module load boost
make clean
export BUILD_DIR=build_knl_intel
make -j machine=stampede2_knl config=no_mpi_tm

The run_script goes something like:
#SBATCH --nodes=64
#SBATCH --ntasks-per-node 32
export OMP_NUM_THREADS=2
export OMP_SCHEDULE=dynamic
export OMP_PROC_BIND=true
module load phdf5
module load python3
module load remora
export TACC_IBRUN_DEBUG=1
remora ibrun mem_affinity ./smilei expanded_target smilei_helper_funcs.py laser_profiles.py angled_target.py

By the way, remora is the monitoring program I used to create the graphs. The debug option just prints extra information at the beginning of the code output describing the MPI environment the code runs in.

The bits of the namelist that might be relevant are:

Main(
    geometry = "3Dcartesian",
    interpolation_order = 4,
    number_of_cells = box_shape_cells,
    cell_length = cell_lens,
    # number_of_timesteps = 2,
    simulation_time = sim_time,
    timestep_over_CFL = 0.95,
    solve_poisson = True,
    number_of_patches = [16,64,64],
    maxwell_solver = 'Yee',
    EM_boundary_conditions = [
        ["silver-muller", "silver-muller"],
        ["silver-muller", "silver-muller"],
        ["silver-muller", "silver-muller"]
    ],
    print_expected_disk_usage = True,
    print_every = 10,
    random_seed = 0,
)

MultipleDecomposition(
    region_ghost_cells = 4
)

LoadBalancing(
    initial_balance = True,
    every = 1400,
    cell_load = 1.,
    frozen_particle_load = 1.0
)

Vectorization(
    mode = "adaptive",
    reconfigure_every = 20,
    initial_mode = "on"
)

MovingWindow(
    time_start = move_window_time,
    velocity_x = 1.0,
    number_of_additional_shifts = 0.,
    additional_shifts_time = 0.,
)

@DoubleAgentDave
Contributor Author

It's worth noting that in the scenario above, the code crashed just 100 timesteps after the moving window started; no load balancing or other periodic operation, except possibly adaptive vectorisation reconfiguration, had been run. There were four species in the simulation: one with 24 ppc in the target region and three with 12 ppc.

@mccoys
Contributor

mccoys commented May 19, 2021

The reported crash happens during a memset within HDF5 functions. If this is always the case, it might point to an issue with HDF5. It could also be a lack-of-memory issue, as the moving window does need significant extra memory; or it might be a memory leak related to the moving window. Would it be possible to test the same simulation on more CPUs, so that more memory is available? And could you test another version of HDF5?
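As a side note, a quick sanity check of which HDF5 library a given Python environment sees is sketched below. This is only an illustration: the version that matters is the one Smilei was linked against, i.e. the one provided by the phdf5 module, which the HDF5 compiler wrappers usually report via h5cc -showconfig (or h5pcc -showconfig for the parallel wrapper).

# Hedged sketch: report the HDF5 library version that h5py was built against.
# This is not necessarily the library Smilei links to; check the phdf5 module for that.
import h5py
print(h5py.version.hdf5_version)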

@DoubleAgentDave
Contributor Author

OK, I hadn't thought of an HDF5 problem like that. It'll take some time to do what you're suggesting, so I probably won't update this soon. I had initially suspected a memory problem, but looking at the actual vmem usage, only one node is even close to 50% usage, so I'd be surprised if that were the issue. That said, I did try reducing the number of particles rather than increasing the number of nodes, and this stopped the crashes. I haven't tried re-enabling the load balancing yet, as I think I'll need a cheaper way to test this problem.

@mccoys
Contributor

mccoys commented Jun 2, 2021

We have not had much time to work on this yet, but it really looks like a memory issue. Note that memory diagnostics may not account for temporary buffers, which can be very significant.

@DoubleAgentDave
Contributor Author

OK, thanks for letting me know. This is difficult to test: if the memory is going into a temporary buffer, there isn't much I can do to monitor it. The only thing I can ask at the moment is: how are MPI tasks that are about to create new patches for the moving window accounted for in the load balance? If they aren't, would it be possible to give the patches nearest the relevant edge (i.e. the edge where new patches are created) a larger load value?
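For context, my understanding from the documentation (a rough sketch, not the actual implementation) is that the per-patch load is estimated roughly as below, so accounting for the moving window would presumably mean adding an extra term for patches near the window's leading edge.

# Rough sketch of the per-patch load implied by the LoadBalancing parameters,
# based on my reading of the Smilei docs rather than the source code.
def patch_load(n_particles, n_frozen_particles, n_cells,
               cell_load=1.0, frozen_particle_load=1.0):
    return (n_particles
            + frozen_particle_load * n_frozen_particles
            + cell_load * n_cells)

# e.g. a 1000-cell patch with 24 ppc of one active species and no frozen particles:
# patch_load(24 * 1000, 0, 1000)  ->  25000.0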

@beck-llr
Contributor

beck-llr commented Jun 4, 2021

Thank you very much for the very detailed feedback. As @mccoys said, we have little time to investigate this problem, but it is something we have faced before and want to understand and improve. This will help us a lot.

You notice a huge performance improvement once load balancing resumes after the target exits the domain. Isn't that simply because there are far fewer particles in total in your box? Or are you positive that it is a balancing effect?

To answer your last question: for the moment, load balancing and the moving window are completely decoupled. When the window moves, patches are passed to their left (-x direction) neighbour, and an MPI communication occurs if that makes a patch change MPI domain.
The load balance, however, is applied only at the times defined by the user.
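To illustrate the mechanism (a conceptual sketch only, not the actual implementation): when the window shifts by one patch along x, each patch is handed to the rank owning the slot just to its left, and only the hand-offs that cross a rank boundary require MPI.

# Illustration only: list which patch hand-offs cross an MPI-rank boundary when
# the window shifts by one patch along x. owner(ix) is a hypothetical map from
# patch index along x to the owning rank.
def crossings_per_shift(n_patches_x, owner):
    moves = []
    for ix in range(1, n_patches_x):
        src, dst = owner(ix), owner(ix - 1)
        if src != dst:
            moves.append((ix, src, dst))  # patch ix is sent from rank src to rank dst
    return moves

# e.g. 16 patches along x split evenly over 4 ranks:
print(crossings_per_shift(16, lambda ix: ix // 4))  # -> [(4, 1, 0), (8, 2, 1), (12, 3, 2)]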

A first quick test I could do is to force load balancing every time the moving window is applied.

In my opinion this is not enough, and some operations to prevent memory-usage spikes will be needed as well. But for that, advanced memory-analysis tools should be used. We hope we can do that soon.

@mccoys added the performances (computing time, memory consumption, load balancing, vectorization) label on Jun 4, 2021
@DoubleAgentDave
Contributor Author

The performance gain occurs specifically after the code is rebalanced, not at the exact moment the target leaves the box, so yes, I'm certain it is a balancing effect. The simulation slows down massively when the moving window is applied without load balancing; I'll plot a graph of some of this at some point to show exactly what's going on. One thing I am noticing is that the slowdown gets particularly bad (in a relative sense) when I increase the number of nodes these simulations run across.

@DoubleAgentDave
Contributor Author

Also, I would expect the performance to improve as bits of the target exit the box, but the gain from that is relatively modest compared to the rebalanced simulation. If you do force load balancing every time the window is applied, could you leave an option in to turn this off? Memory spikes are particularly murderous when you're trying to run this on a limited number of nodes.

@DoubleAgentDave
Contributor Author

I'm still wondering whether the balancing can take the moving window itself into account. If the target is at one end of the box and the moving edge at the other, the MPI task that has to create the new patch(es) will likely also have a lot of patches on its end; how much memory and time does it take to cycle through a patch with no particles? Then again, I've still noticed this effect when using the multiple-decomposition scheme, so maybe not.

Also, I totally understand that you've not got a lot of time on your hands.

@beck-llr
Contributor

beck-llr commented Jun 4, 2021

If you do force load balancing every time the window is applied, could you leave an option in to turn this off?

Of course!
