P4est MPI simulation gets stuck when not all ranks have BC faces #1878

Open
benegee opened this issue Mar 19, 2024 · 4 comments · May be fixed by #1879
Labels
bug (Something isn't working) · parallelization (Related to MPI, threading, tasks etc.)

Comments

@benegee (Contributor)

benegee commented Mar 19, 2024

I first encountered this issue for the baroclinic instability test case. Then I found a reduced example based on elixir_advection_basic but with nonperiodic boundary conditions:

  • 3 cells
  • 3 MPI ranks
  • periodic boundary conditions in x-direction
  • Dirichlet boundary conditions in y-direction
  • the important point is that the middle rank does not own any faces with a Dirichlet boundary condition

[Screenshot of the 1×3 mesh partitioned across 3 ranks (Screenshot_20240319_120453)]

Running this with system MPI on 3 ranks makes the simulation hang. Running with tmpi and looking at the backtraces when aborting shows that two ranks have called init_boundaries! and eventually p4est_iterate, while the third (the middle one) is already somewhere in rhs!.

It seems to be caused by the check

if n_boundaries > 0
    init_boundaries!(boundaries, mesh)
end

Consequently, only two ranks eventually call p4est_iterate. On the p4est side, p4est_iterate first calls p4est_is_valid, which calls MPI_Allreduce.

This would explain the blocking. What I do not understand is why it works when not using system MPI.
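
This is the classic deadlock pattern of a collective call guarded by a rank-dependent condition. A standalone sketch of the pattern in MPI.jl (illustrative only, not Trixi.jl code; the guard merely mimics a middle rank that owns no boundary faces) hangs in the same way when run on 3 ranks:

# deadlock_sketch.jl -- run e.g. with `mpiexecjl -n 3 julia deadlock_sketch.jl`
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Pretend only the outer ranks own boundary faces, as in the MWE below.
n_boundaries = rank == 1 ? 0 : 2

if n_boundaries > 0
    # Stands in for p4est_iterate -> p4est_is_valid -> MPI_Allreduce:
    # a collective that every rank of the communicator must enter.
    total = MPI.Allreduce(n_boundaries, +, comm)
    println("rank $rank: total number of boundaries = $total")
end
# Ranks 0 and 2 block in the Allreduce forever, because rank 1 never joins it.

MPI.Finalize()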

MWE
using OrdinaryDiffEq
using Trixi

###############################################################################
# semidiscretization of the linear advection equation

advection_velocity = (1.0, 0.0)
equations = LinearScalarAdvectionEquation2D(advection_velocity)

initial_condition = initial_condition_gauss

boundary_conditions = Dict(:y_neg => BoundaryConditionDirichlet(initial_condition),
                           :y_pos => BoundaryConditionDirichlet(initial_condition))

# Create DG solver with polynomial degree = 3 and (local) Lax-Friedrichs/Rusanov flux as surface flux
solver = DGSEM(polydeg = 3, surface_flux = flux_lax_friedrichs)

coordinates_min = (-5.0, -5.0)
coordinates_max = (5.0, 5.0)

trees_per_dimension = (1, 3)

mesh = P4estMesh(trees_per_dimension, polydeg = 3,
                 coordinates_min = coordinates_min, coordinates_max = coordinates_max,
                 initial_refinement_level = 0,
                 periodicity = (true, false))

# A semidiscretization collects data structures and functions for the spatial discretization
semi = SemidiscretizationHyperbolic(mesh, equations,
                                    initial_condition, solver,
                                    boundary_conditions = boundary_conditions)

###############################################################################
# ODE solvers, callbacks etc.

# Create ODE problem with time span `tspan`
tspan = (0.0, 1.0)
ode = semidiscretize(semi, tspan);

# At the beginning of the main loop, the SummaryCallback prints a summary of the simulation setup
# and resets the timers
summary_callback = SummaryCallback()

callbacks = CallbackSet(summary_callback)

###############################################################################
# run the simulation

# OrdinaryDiffEq's `solve` method evolves the solution in time and executes the passed callbacks
sol = solve(ode, CarpenterKennedy2N54(williamson_condition = false),
            dt = 0.01,
            save_everystep = false, callback = callbacks);

# Print the timer summary
summary_callback()
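
For reference, one way to launch this MWE on 3 ranks (assuming the script is saved as mwe.jl and the mpiexecjl launcher wrapper from MPI.jl is installed; a plain mpiexec/mpirun with a matching Julia environment should work as well):

mpiexecjl -n 3 julia --project mwe.jl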
benegee added the bug and parallelization labels on Mar 19, 2024
@sloede (Member)

sloede commented Mar 21, 2024

> This would explain the blocking. What I do not understand is why it works when not using system MPI.

I share your conclusion. It seems to be a good explanation for the blocking, but it does not explain why it works with the out-of-the-box MPI. Maybe you can verify (using a "wololo" statement or just plain output to stderr) that the init_boundaries! function is indeed not called on the middle rank with the Julia-provided MPI but does get called with the system MPI?
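
A minimal way to do that check (a sketch only, not the actual Trixi.jl code; it assumes the lines are placed right next to the n_boundaries guard and that MPI.jl is available as MPI):

# Hypothetical debug output next to the guard; variable names as in the snippet above.
rank = MPI.Comm_rank(MPI.COMM_WORLD)
println(stderr, "rank $rank: n_boundaries = $n_boundaries, calling init_boundaries!: $(n_boundaries > 0)")
flush(stderr)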

@benegee (Contributor, Author)

benegee commented Mar 22, 2024

I double-checked this: init_boundaries! is not called on the middle rank, neither with the Julia-provided MPI nor with system MPI. For both variants there are 26 calls to p4est_iterate in total, whereas the number should be a multiple of 3. However, the simulation just continues with the Julia-provided MPI but hangs with system MPI.

@sloede (Member)

sloede commented Mar 23, 2024

This becomes more and more baffling... Alas, the joys of MPI-based parallelism and debugging 😱

What happens if you put an MPI_Barrier after the if n_boundaries > 0 ... end check, then print something from all ranks, then add another MPI_Barrier? Will all ranks reach that statement?

I do not see how a simulation could continue if a collective communication is not issued from all ranks, unless it is mixed with later collective calls that somehow "match" in one MPI implementation but do not match in others.

Incidentally, are the system MPI and the Julia MPI the same MPI implementation? If not, can you check what happens if you use the same one? That is, maybe it's not a Julia thing but really an MPI thing...
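
A sketch of the suggested barrier-and-print instrumentation (the placement inside Trixi.jl and the use of MPI.COMM_WORLD are assumptions; Trixi.jl's own communicator could be used instead):

# Hypothetical instrumentation around the guard.
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

if n_boundaries > 0
    init_boundaries!(boundaries, mesh)   # only ranks with boundary faces enter this branch
end

MPI.Barrier(comm)                        # every rank should arrive here
println(stderr, "rank $rank passed the n_boundaries check")
flush(stderr)
MPI.Barrier(comm)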

@benegee (Contributor, Author)

benegee commented Mar 25, 2024

Thanks for the suggestions! Here is a small update:

When I put MPI_Barriers after if n_boundaries > 0 ... end, they are passed by all ranks: I see the corresponding printlns from all ranks and the program does not hang.

When I put the barriers within if n_boundaries > 0 ... end, only two ranks print the corresponding messages and the program does hang, also with the Julia-provided MPI.

I also tried changing the MPI implementation: MPItrampoline_jll showed the same behavior. Unfortunately, I could not get OpenMPI_jll to work so far; it led to errors while precompiling P4est.jll and T8code.jll.
