Error handling #130
I think we're drifting into ULFM land here a bit with these concerns. Making this code "work" in the presence of faults or other error returns from collective operations (item 1 above) has been the source of numerous papers, PhD dissertations, and code (in C). Putting on my ULFM hat, I'd write the code like this:

```rust
fn f(comm) -> Result<(), CollectiveError> {
    let rc = comm.barrier();
    let mut flag = 0;
    if rc == MPI_SUCCESS {
        flag = 1;
    }
    // ULFM consensus on the barrier's outcome across all ranks.
    comm.mpix_comm_agree(&mut flag);
    if flag == 0 {
        // Uh oh, something happened to one or more of my partners. Revoking
        // the communicator causes other processes that may have gotten success
        // on the barrier, and are charging ahead to the allreduce, to get an
        // error there, so no deadlock.
        comm.mpix_comm_revoke();
        return Err(MPIX_ERR_PROC_FAILED);
    }
    let rc = comm.all_reduce_into(/* ... */);
    // Same check as above (rc == MPI_SUCCESS) using the mpix_comm_agree
    // consensus; in addition:
    if rc == MPIX_ERR_REVOKED {
        // We failed out because some process died between the barrier call
        // and this collective, so return an error.
        return Err(MPIX_ERR_REVOKED);
    }
    Ok(())
}
```

I think the most common error that wouldn't involve process failure would be invalid parameters, which means a buggy program, in which case we could have a default rsmpi error handler just do the Rusty thing of aborting the program via `MPI_Abort` on `MPI_COMM_WORLD`. Note that by default, with Open MPI 5.0.0 and later, the ULFM MPI extensions are available.
(Commenting to follow)
There's a really nice post by Andrew Gallant (author of ripgrep) on Rust error handling and when to use which strategy. It would be useful to outline what kinds of errors could be raised by MPI that a user could plausibly expect to recover from (at least enough to clean up, say closing local file handles). If the answer is none, then we can stick with the current strategy.
This is probably a place to start: https://github.com/ICLDisco/ulfm-testing/ (note that for Open MPI this stuff is only in the 5.0.0 pre-release and main). Yes, to your point above, we would need to do a taxonomy of possible errors returned from collective calls (including non-blocking), decide which ones the application might be able to handle (`MPI_ERR_ARG`, maybe), and for now flame out for `MPI_ERR_PROC_FAILED`, etc. I mainly brought up ULFM here because of the consensus function it introduced, not so much for all the shrinking/recovery functionality.
After talking with Howard about this, we think there are a couple of different ways that this could be solved. I agree, like you said, there's probably some way that we can utilize the type system here to avoid deadlock.

First of all, when a process fails for some reason or another, most current MPI runners are simply going to kill everything. Howard mentioned that you might be able to add some parameters to the runner to avoid this, but this is not the default. So in the case where everything is killed, the errors that will be returned are either going to be argument-related errors or some kind of internal error that you probably wouldn't be able to recover from. If we were only looking at catching these types of errors, then we might be able to use some MPI-only calls to avoid a deadlock; for instance, after a collective operation, we could do an allreduce on the local error status.

On the other hand, if we wanted to also support recovery from process failure, then we would probably need to use the ULFM extensions. Being an extension, RSMPI would probably need a new user-specified feature or have to detect it at build time.

I'm going to experiment with some of these methods and see what might work. Whether or not we attempt to implement this type of error handling, I think it's a good idea to have most calls return `Result`.
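To make the "allreduce on the error status" idea concrete, here is a minimal sketch in plain Rust with the MPI collective simulated locally. All names here are assumptions for illustration; in real code, `agree` would be an `MPI_Allreduce` with `MPI_MIN` (or `MPI_LAND`) over one integer flag per rank.

```rust
/// Per-rank outcome of a collective call: 1 = success, 0 = failure.
fn local_flag(result: &Result<(), &str>) -> i32 {
    if result.is_ok() { 1 } else { 0 }
}

/// Stand-in for MPI_Allreduce(MPI_MIN): every rank learns whether *all*
/// ranks succeeded. With real MPI this is a collective call, not a local loop.
fn agree(flags: &[i32]) -> i32 {
    flags.iter().copied().min().unwrap_or(0)
}

fn main() {
    // Simulate three ranks where rank 1 got an argument error from a collective.
    let outcomes: Vec<Result<(), &str>> = vec![Ok(()), Err("MPI_ERR_ARG"), Ok(())];
    let flags: Vec<i32> = outcomes.iter().map(local_flag).collect();
    let agreed = agree(&flags);
    // Every rank sees agreed == 0, so all ranks take the error path together
    // instead of some proceeding into the next collective and deadlocking.
    assert_eq!(agreed, 0);
    println!("agreed flag = {}", agreed);
}
```

The point of the min-reduction is that it is itself a collective with the same participants, so either every rank learns of the failure or the consensus call itself fails everywhere.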
Label under harebrained ideas.
MPI operations are generally fallible, but rsmpi generally ignores any such errors and proceeds as though the operation was successful. The natural way to handle fallibility would be to return `Result<T, E>`, but the question is what `E` is and how this interacts with `drop`. Suppose we write a function `f` whose body is a `barrier()` followed by an `all_reduce_into()`. If the first line errors on one rank, it'll return from `f` while the other ranks deadlock in the `all_reduce_into`. I think we need to avoid this and ensure that collectives are only called in functions that return `CollectiveError<'c>`, where `'c` is a lifetime associated with the communicator. MPI primitives would return `Result<T, NoncollectiveError>` and the caller would be responsible for making them collective in order to safely proceed to the next line. Now I'm just spitballing, but I think there are two ways to do this:

1. Create a `result.consensus(comm)?;` that promotes `Result<T, NoncollectiveError>` to `Result<T, CollectiveError>`. This sort of thing requires polling on a nonblocking collective, possibly with a timeout to panic if a rank has vanished (in which case it's hard to even determine which ranks are alive, much less return with a consistent state). If any ranks are `Err(NoncollectiveError(e))` on input, then all ranks will `Err(CollectiveError(e))`.
2. A variant of 1 in which `NoncollectiveError` "knows" which communicator it's associated with, so one can use `result.consensus()?;`. You'd `impl From<NoncollectiveError<'c1>> for NoncollectiveError<'c2>`, with constraints on the relationship between `'c1` and `'c2` (these might need to be runtime checks rather than static lifetime checks).

Note that in both cases, one will probably want an execution mode that is unsafe or panics, to avoid the run-time cost of the nonblocking collectives necessary for consensus; such a mode could call `MPI_Abort` and set `MPI_ERRORS_ARE_FATAL`.
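A minimal sketch of option 1, with the agreement collective simulated by a local fold over all ranks' results. The type and function names (`NoncollectiveError`, `CollectiveError`, `consensus`) are the ones proposed above, but everything else here is an assumption; a real implementation would run a (nonblocking) agreement collective instead of inspecting a slice.

```rust
#[derive(Debug, Clone, PartialEq)]
struct NoncollectiveError(String);

#[derive(Debug, Clone, PartialEq)]
struct CollectiveError(String);

// Deliberately no `impl From<NoncollectiveError> for CollectiveError`,
// so `?` cannot silently promote an error that hasn't been agreed upon.

/// Simulated consensus over all ranks' local results: if any rank is Err,
/// every rank comes out with the same CollectiveError.
fn consensus<T: Clone>(
    locals: &[Result<T, NoncollectiveError>],
) -> Vec<Result<T, CollectiveError>> {
    let first_err = locals.iter().find_map(|r| r.as_ref().err().cloned());
    locals
        .iter()
        .map(|r| match (&first_err, r) {
            // Some rank failed: all ranks agree on that error.
            (Some(e), _) => Err(CollectiveError(e.0.clone())),
            (None, Ok(v)) => Ok(v.clone()),
            (None, Err(_)) => unreachable!(),
        })
        .collect()
}

fn main() {
    // Rank 1's barrier failed; after consensus, every rank sees an error.
    let locals = vec![Ok(1), Err(NoncollectiveError("barrier failed".into())), Ok(1)];
    let agreed = consensus(&locals);
    assert!(agreed.iter().all(|r| r.is_err()));
    println!("all ranks agreed on the error");
}
```

The key design point is visible in the types: the only way to get a `CollectiveError` is through `consensus`, so a function whose signature demands a collective error cannot compile without performing the agreement step.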
@hppritcha @jtronge I think this is a summary of my half-baked ideas for handling errors. I think with objects like an `MPI_Win` that must be dropped collectively, you'd want any scope that can take ownership of the `Win` to return a result type `E` that non-collective errors cannot be converted into using `from` or other functions. That way you could only call in a function returning suitable collective errors. I haven't thought deliberately about whether lifetimes can be made to enforce all the constraints statically.
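One way to make the "must be dropped collectively" constraint concrete is a drop bomb: the window can only be consumed by an explicit collective free, and letting it fall out of scope otherwise panics. This is a sketch under assumed names; `Win::free` stands in for a collective `MPI_Win_free`, which with real MPI every rank of the communicator would have to call.

```rust
struct Win {
    freed: bool,
}

impl Win {
    fn new() -> Self {
        Win { freed: false }
    }

    /// Collective free; consumes the window. With real MPI this would call
    /// MPI_Win_free, which all ranks must enter together.
    fn free(mut self) {
        self.freed = true;
        // Drop runs here, but the flag is set, so no panic.
    }
}

impl Drop for Win {
    fn drop(&mut self) {
        // Dropping a Win without the collective free is a program bug:
        // other ranks would block in MPI_Win_free forever.
        if !self.freed && !std::thread::panicking() {
            panic!("Win dropped without collective free");
        }
    }
}

fn main() {
    let w = Win::new();
    w.free(); // ok: explicit collective free
    println!("freed cleanly");
}
```

This only catches the mistake at run time; the static version discussed above would additionally require that `free` be callable only from functions returning a collective error type.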