How to run Veros via MPI on a cluster with slurm? #412

Open
HuangLianghong opened this issue Dec 17, 2022 · 1 comment

Comments

@HuangLianghong

Hi, I am trying to run Veros with MPI on a Slurm cluster, but I cannot get it to work.
Here is my batch script:

#!/bin/bash -l
#
#SBATCH -p work
#SBATCH --job-name=veros_hlh
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --cpus-per-task=1
#SBATCH --exclusive


# load module dependencies
# module load petsc4py mpi4py h5py ...

export OMP_NUM_THREADS=1

# adapt srun command to your available scheduler / MPI implementation
veros resubmit -i my_run -n 8 -l 7776000 \
    -c "srun --mpi=none -- veros run global_flexible/global_flexible.py -b numpy -n 4 4" \
    --callback "sbatch veros_batch.sh"

Error output:

Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1394): generic failure with errno = 872598799
getConnInfoKVS(956)...................: PMI_KVS_Get failed
[unset]: readline failed
srun: error: cpn256: task 2: Exited with exit code 15
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1394): generic failure with errno = 872598799
getConnInfoKVS(956)...................: PMI_KVS_Get failed
[unset]: readline failed
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(805).................: fail failed
MPID_Init(1859).......................: channel initialization failed
MPIDI_CH3_Init(147)...................: fail failed
dapl_rc_setup_all_connections_20(1394): generic failure with errno = 872598799
getConnInfoKVS(956)...................: PMI_KVS_Get failed
[unset]: readline failed
srun: First task exited 60s ago
srun: step:2344490.0 task 3: running
srun: step:2344490.0 tasks 0-2: exited abnormally
srun: Terminating job step 2344490.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: got SIGCONT
slurmstepd: error: *** JOB 2344490 ON cpn34 CANCELLED AT 2022-12-17T18:31:56 ***
srun: error: cpn273: task 3: Killed

Do you think I installed mpi4py and h5py incorrectly, or do you have any other advice?
I used Intel MPI (IMPI) for the installation.

@dionhaefner
Collaborator

Do you have any guidance from your cluster provider on how to execute jobs? It looks like everything is correct on the Veros side of things, but you probably need different settings in your batch file or in the srun call. (A good candidate is --mpi=none; try removing it.)
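
To make the suggestion concrete, here is a minimal sketch of how the launch line could be built without --mpi=none. The helper name build_launch_cmd and the plugin name pmi2 are illustrative assumptions, not something confirmed for this cluster; `srun --mpi=list` should show which plugins are actually available. The sketch also passes the task count explicitly, on the assumption that the number of MPI processes should match the product of the `-n 4 4` process grid (16, whereas the batch script requests 64 tasks).

```shell
#!/bin/bash
# Sketch only: build the srun launch line with or without an explicit PMI plugin.
# The plugin name (e.g. pmi2) is an assumption; check `srun --mpi=list` on your cluster.

# Process grid passed to Veros via -n; the MPI task count is assumed to be NX * NY.
NX=4
NY=4
NTASKS=$((NX * NY))

build_launch_cmd() {
    # $1: optional PMI plugin name; empty means "use the cluster default"
    mpi_flag=""
    if [ -n "${1:-}" ]; then
        mpi_flag=" --mpi=$1"
    fi
    echo "srun -n ${NTASKS}${mpi_flag} -- veros run global_flexible/global_flexible.py -b numpy -n ${NX} ${NY}"
}

# List the PMI plugins this Slurm installation supports (only if srun is available).
if command -v srun >/dev/null 2>&1; then
    srun --mpi=list
fi

# First attempt: the cluster default, i.e. the original command without --mpi=none.
build_launch_cmd
# Alternative to try if the default fails: an explicit plugin.
build_launch_cmd pmi2
```

The resulting string would replace the `-c "..."` argument of `veros resubmit` in the batch script, for example `-c "$(build_launch_cmd)"`. With Intel MPI, srun may additionally need the `I_MPI_PMI_LIBRARY` environment variable to point at Slurm's PMI library; the correct path is site-specific, so it is best taken from the cluster documentation.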

If the problem persists, I recommend you talk to your cluster support.
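
Before contacting support, it may help to check whether mpi4py can initialize MPI under srun at all, independent of Veros. This also addresses the question of whether mpi4py was installed correctly. The sketch below assumes a reasonably recent mpi4py that ships the `mpi4py.bench helloworld` command; the variable names and the process count of 4 are arbitrary choices for illustration.

```shell
#!/bin/bash
# Sketch only: run a minimal mpi4py job under srun, without involving Veros.
# Assumes the same modules / Python environment as the Veros job are loaded.

NPROCS=4
# Small test program bundled with mpi4py; each rank prints one greeting line.
MPI_HELLO="python -m mpi4py.bench helloworld"

if command -v srun >/dev/null 2>&1; then
    # Word splitting of $MPI_HELLO is intentional here.
    srun -n "$NPROCS" -- $MPI_HELLO
else
    echo "srun not found; run this inside a Slurm job or allocation"
fi
```

If this fails with the same PMPI_Init_thread error, the problem most likely lies in how srun and the MPI library are wired together rather than in Veros, and the output is a compact reproduction to pass on to cluster support. If every rank prints its greeting, the mpi4py installation itself is probably fine.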
