Job sometimes crashes with Open-MPI-related message #6

Open
calebwin opened this issue Aug 4, 2021 · 7 comments
Labels
banyan-jl Concerning Banyan.jl bug Something isn't working jl-executor Concerning the Julia code executor

Comments

@calebwin
Contributor

calebwin commented Aug 4, 2021

Here is the output (when the job is run with return_logs=true):

slurmstepd: error: *** JOB 3737 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***
slurmstepd: error: *** STEP 3737.0 ON compute-dy-t3large-2 CANCELLED AT 2021-08-03T16:28:28 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
mca_btl_vader_fbox_read_header at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:72 [inlined]
mca_btl_vader_check_fboxes at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:195 [inlined]
mca_btl_vader_component_progress at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_component.c:765

@cailinw has also experienced this. The job hangs for about 10 minutes after executing some, but not all, of its code, and then prints the above message.

@calebwin calebwin added bug Something isn't working banyan-jl Concerning Banyan.jl jl-executor Concerning the Julia code executor labels Aug 4, 2021
@calebwin
Contributor Author

calebwin commented Aug 4, 2021

A similar error message:

slurmstepd: error: *** JOB 3757 ON compute-dy-t3large-1 CANCELLED AT 2021-08-03T18:56:59 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3757.0 ON compute-dy-t3large-1 CANCELLED AT 2021-08-03T18:56:59 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:309
uv_run at /workspace/srcdir/libuv/src/unix/core.c:379
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:472

@calebwin
Contributor Author

calebwin commented Aug 4, 2021

This happens rarely and has been hard to reproduce.

@calebwin
Contributor Author

This issue happened again:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 94 ON compute-dy-t3large-1 CANCELLED AT 2021-08-15T01:02:13 ***
slurmstepd: error: *** STEP 94.0 ON compute-dy-t3large-1 CANCELLED AT 2021-08-15T01:02:13 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)
uv__io_poll at /workspace/srcdir/libuv/src/unix/linux-core.c:309
uv_run at /workspace/srcdir/libuv/src/unix/core.c:379

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
mca_btl_vader_fbox_read_header at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:72 [inlined]
mca_btl_vader_check_fboxes at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_fbox.h:195 [inlined]
mca_btl_vader_component_progress at /codebuild/output/src084091651/src/ompi_build/BUILD/openmpi-4.1.0/opal/mca/btl/vader/btl_vader_component.c:765
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:472

This does seem to correlate with a large amount of output being printed.

@calebwin
Contributor Author

Sometimes there is a similar message referencing epoll_wait but not mentioning Open MPI:

Start write
In Write on worker 1 on batch 1
slurmstepd: error: *** JOB 862 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***
slurmstepd: error: *** STEP 862.0 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

@calebwin
Contributor Author

Start write
In Write on worker 1 on batch 1
slurmstepd: error: *** JOB 862 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***
slurmstepd: error: *** STEP 862.0 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)

Actually, the above is definitely not an issue with Open MPI. It happened because the job was actually canceled.
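
When triaging logs like these, it can help to check mechanically whether a slurmstepd cancellation line is present and whether any Open MPI frames (the mca_btl_vader_* symbols from the earlier traces) appear at all. Below is a minimal Python sketch, assuming the log is available as a string. The function name classify_log and the result keys are made up for illustration and are not part of Banyan. Note that Open MPI frames in a backtrace only show where the process was when it received SIGTERM, not that Open MPI caused the termination.

```python
import re

# Patterns taken from the log excerpts in this thread.
CANCELLED = re.compile(r"slurmstepd: error: \*\*\* (JOB|STEP) \S+ ON \S+ CANCELLED AT")
SIGTERM = re.compile(r"signal \(15\): Terminated")
# Assumption: Open MPI involvement shows up as vader BTL symbols or an
# openmpi-<version> build path, as in the traces above.
OPENMPI_FRAME = re.compile(r"\bmca_btl_vader_\w+|openmpi-\d")


def classify_log(log: str) -> dict:
    """Report which of the signatures seen in this thread appear in a log."""
    return {
        "slurm_cancelled": bool(CANCELLED.search(log)),
        "sigterm": bool(SIGTERM.search(log)),
        "openmpi_frames": bool(OPENMPI_FRAME.search(log)),
    }


if __name__ == "__main__":
    sample = """\
slurmstepd: error: *** JOB 862 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***
slurmstepd: error: *** STEP 862.0 ON compute-dy-t3large-1 CANCELLED AT 2021-08-28T22:12:51 ***

signal (15): Terminated
in expression starting at /home/ec2-user/executor.jl:52
epoll_wait at /lib64/libc.so.6 (unknown line)
"""
    print(classify_log(sample))
```

A log with slurm_cancelled and sigterm set but no openmpi_frames matches the canceled-job case described here.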

@calebwin
Contributor Author

calebwin commented Nov 8, 2021

Related issue:

Going to write to efs/job_2021-11-08-015625cb28a658f39f12ed0de8bedbfc341a65_val_19/part2_nrows=5453429.arrow
Going to write to efs/job_2021-11-08-015625cb28a658f39f12ed0de8bedbfc341a65_val_19/part1_nrows=5453429.arrow
srun: error: compute-dy-t3large-2: task 1: Exited with exit code 1
slurmstepd: error: compute-dy-t3large-2 [0] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2259.0 ON compute-dy-t3large-2 CANCELLED AT 2021-11-08T02:37:41 ***
srun: error: compute-dy-t3large-2: task 0: Killed

@calebwin
Contributor Author

This is typically because the job ran out of memory.
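
For logs like the one in the previous comment, summarizing the per-task srun lines gives a quick first check. The following is a rough Python sketch; the helper names task_outcomes and killed_tasks are hypothetical. It assumes that a task reported as "Killed" was terminated by SIGKILL, which is consistent with an out-of-memory kill but does not prove it, so the result should be confirmed against the node's kernel log or Slurm accounting.

```python
import re

# Matches srun lines such as:
#   srun: error: compute-dy-t3large-2: task 1: Exited with exit code 1
#   srun: error: compute-dy-t3large-2: task 0: Killed
# Lines that report a range of tasks at once are not handled by this sketch.
TASK_LINE = re.compile(
    r"^srun: error: (?P<node>\S+): task (?P<task>\d+): (?P<outcome>.+)$",
    re.MULTILINE,
)


def task_outcomes(log: str) -> dict:
    """Map (node, task id) to the outcome text srun reported for it."""
    return {
        (m["node"], int(m["task"])): m["outcome"].strip()
        for m in TASK_LINE.finditer(log)
    }


def killed_tasks(log: str) -> list:
    """Return the (node, task id) pairs that srun reported as "Killed"."""
    return sorted(
        key for key, outcome in task_outcomes(log).items() if outcome == "Killed"
    )


if __name__ == "__main__":
    sample = """\
srun: error: compute-dy-t3large-2: task 1: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2259.0 ON compute-dy-t3large-2 CANCELLED AT 2021-11-08T02:37:41 ***
srun: error: compute-dy-t3large-2: task 0: Killed
"""
    print(task_outcomes(sample))
    print(killed_tasks(sample))
```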
