While debugging MPI abort handling, I noticed what appears to be an unnecessary delay in job termination when exit-on-error triggers:
bash-4.4$ flux run -vvv -N4 -o exit-on-error t/mpi/abort 3
jobid: ƒEwG5HGST
0.000s: job.submit {"userid":1000,"urgency":16,"flags":0,"version":1}
0.014s: job.validate
0.026s: job.depend
0.026s: job.priority {"priority":16}
0.028s: job.alloc {"annotations":{"sched":{"resource_summary":"rank[0-3]/core[0-3]"}}}
0.037s: job.start
0.029s: exec.init
0.032s: exec.starting
0.103s: exec.shell.init {"service":"1000-shell-fEwG5HGST","leader-rank":0,"size":4}
0.116s: exec.shell.start {"taskmap":{"version":1,"map":[[0,4,1,1]]}}
0.132s: exec.shell.task-exit {"localid":0,"rank":3,"state":"Exited","pid":173362,"wait_status":10752,"signaled":0,"exitcode":42}
0.133s: flux-shell[0]: FATAL: doom: rank 3 failed and exit-on-error is set
Apr 26 20:57:40.355708 job-exec.err[0]: ƒEwG5HGST: exec_kill: eel (rank 3): No such process
0.136s: job.exception type=exec severity=0 rank 3 failed and exit-on-error is set
25.159s: exec.complete {"status":36608}
25.159s: exec.done
flux-job: task(s) Terminated
25.159s: job.finish {"status":36608}
Note that the exception is raised at 0.136s, but exec.complete is not posted until 25.159s, so it takes about 25s before the job is fully terminated.
This doesn't happen with -o exit-timeout=1s, but it does occur with -o exit-timeout=0s, so there must be some kind of race when raising an exception immediately.
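To compare runs with different exit-timeout values, the gap between job.exception and job.finish can be pulled out of the `-vvv` output. The following is only a rough sketch, not part of Flux: the helper names `event_times` and `termination_delay` are made up for illustration, and it assumes the timestamp format matches the log above. The sample text is a short excerpt of that log.

```python
import re

# Matches eventlog lines such as "25.159s: job.finish {...}" as printed
# by `flux run -vvv` in the log above (format assumed from that output).
EVENT_RE = re.compile(r"^\s*(\d+\.\d+)s: (\S+)")


def event_times(log_text):
    """Return {event_name: first timestamp in seconds} for matching lines."""
    times = {}
    for line in log_text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            # Keep only the first occurrence of each event name.
            times.setdefault(m.group(2), float(m.group(1)))
    return times


def termination_delay(log_text):
    """Seconds from job.exception to job.finish, or None if either is missing."""
    t = event_times(log_text)
    if "job.exception" not in t or "job.finish" not in t:
        return None
    return t["job.finish"] - t["job.exception"]


# Excerpt of the log from this report.
SAMPLE = """\
0.133s: flux-shell[0]: FATAL: doom: rank 3 failed and exit-on-error is set
0.136s: job.exception type=exec severity=0 rank 3 failed and exit-on-error is set
25.159s: exec.complete {"status":36608}
25.159s: exec.done
25.159s: job.finish {"status":36608}
"""

if __name__ == "__main__":
    print(f"delay: {termination_delay(SAMPLE):.3f}s")
```

Running the same parser over the output of a run with -o exit-timeout=1s and one with -o exit-timeout=0s would make the difference in termination delay easy to compare.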