Error handling #130

Open
jedbrown opened this issue Aug 17, 2022 · 5 comments

@jedbrown
Contributor

Label under harebrained ideas.

MPI operations are generally fallible, but rsmpi currently ignores any such errors and proceeds as though the operation had succeeded. The natural way to handle fallibility would be to return Result<T, E>, but the questions are what E should be and how it interacts with Drop. Suppose we write the following:

fn f(comm) -> Result<(), NoncollectiveError> {
  comm.barrier()?;
  comm.all_reduce_into(...)?;
  Ok(())
}

If the first line errors on one rank, it'll return from f while the other ranks deadlock in the all_reduce_into. I think we need to avoid this and ensure that collectives are only called in functions that return CollectiveError<'c> where 'c is a lifetime associated with the communicator. MPI primitives would return Result<T, NoncollectiveError> and the caller would need to be responsible for making them collective in order to safely proceed to the next line. Now I'm just spitballing, but I think there are two ways to do this:

  1. Create a result.consensus(comm)? that promotes Result<T, NoncollectiveError> to Result<T, CollectiveError> (see the sketch after this list). This sort of thing requires polling on a nonblocking collective, possibly with a timeout to panic if a rank has vanished (in which case it's hard to even determine which ranks are alive, much less return with a consistent state). If any rank is Err(NoncollectiveError(e)) on input, then all ranks will return Err(CollectiveError(e)).

  2. Variant of 1 in which NoncollectiveError "knows" which communicator it's associated with so one can use result.consensus()?;. You'd impl From<NoncollectiveError<'c1>> for NoncollectiveError<'c2>, with constraints on the relationship between 'c1 and 'c2 (these might need to be runtime checks rather than static lifetime checks).

Note that in both cases, one will probably want an execution mode to be unsafe or panic to avoid the run-time cost of the nonblocking collectives necessary for consensus.

  3. Give up on safely propagating errors and instead get panics to use MPI_Abort and set MPI_ERRORS_ARE_FATAL.
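To make option 1 concrete, here's a rough sketch of a consensus-promoting extension trait. Everything here is hypothetical (NoncollectiveError, CollectiveError, the Consensus trait), and the agreement is written as a blocking allreduce over rsmpi's existing all_reduce_into for brevity; a real implementation would poll a nonblocking collective with a timeout as described above. Error codes are assumed to be positive integers, with 0 meaning success.

use mpi::collective::SystemOperation;
use mpi::traits::*;

// Hypothetical error types, for illustration only.  Codes are > 0; 0 = success.
pub struct NoncollectiveError(pub i32);
pub struct CollectiveError(pub i32);

pub trait Consensus<T> {
    /// Promote a per-rank result to a rank-consistent one: if any rank holds
    /// an Err, every rank ends up with Err(CollectiveError).
    fn consensus<C: CommunicatorCollectives>(self, comm: &C) -> Result<T, CollectiveError>;
}

impl<T> Consensus<T> for Result<T, NoncollectiveError> {
    fn consensus<C: CommunicatorCollectives>(self, comm: &C) -> Result<T, CollectiveError> {
        let local: i32 = match &self {
            Ok(_) => 0,
            Err(NoncollectiveError(code)) => *code,
        };
        // Agree on the worst local error code across the communicator.
        // (A real implementation would use a nonblocking collective plus a
        // timeout so a vanished rank cannot hang everyone here.)
        let mut global: i32 = 0;
        comm.all_reduce_into(&local, &mut global, SystemOperation::max());
        match (self, global) {
            (Ok(v), 0) => Ok(v),
            (_, code) => Err(CollectiveError(code)),
        }
    }
}

With something like that, each line of f above would become comm.barrier_result().consensus(&comm)? on every rank (barrier_result being a hypothetical fallible wrapper around MPI_Barrier).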

@hppritcha @jtronge I think this is a summary of my half-baked ideas for handling errors. With objects like an MPI_Win that must be dropped collectively, you'd want any scope that can take ownership of the Win to return a Result whose error type E cannot be obtained from a non-collective error via From or other conversions. That way such a scope could only be called from a function returning suitable collective errors. I haven't thought carefully about whether lifetimes can be made to enforce all the constraints statically.
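To sketch the Win idea with stand-in types (none of this is rsmpi API; it only demonstrates the "no From escape hatch" constraint at the type level):

use std::marker::PhantomData;

// Stand-ins for illustration only.
struct Comm;
struct NoncollectiveError;
struct CollectiveError<'c>(PhantomData<&'c Comm>);
struct Window<'c>(PhantomData<&'c Comm>);

impl Comm {
    fn create_window(&self) -> Result<Window<'_>, CollectiveError<'_>> {
        Ok(Window(PhantomData))
    }
    fn barrier(&self) -> Result<(), NoncollectiveError> {
        Ok(())
    }
}

// Because CollectiveError<'c> deliberately has no From<NoncollectiveError>
// impl, using `?` on a noncollective result inside a scope that owns a Window
// does not compile; the caller is forced to reach consensus first.
fn use_window<'c>(comm: &'c Comm) -> Result<(), CollectiveError<'c>> {
    let _win: Window<'c> = comm.create_window()?;
    // comm.barrier()?;   // error[E0277]: `?` couldn't convert the error
    Ok(())                // the Window is dropped (collectively) here
}

fn main() {
    let comm = Comm;
    let _ = use_window(&comm);
}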

@hppritcha

hppritcha commented Aug 18, 2022

I think we're drifting into ULFM land here a bit with these concerns, i.e., how to make this code "work" in the presence of faults or other error returns from collective operations. Item 1 above has been the source of numerous papers, PhD dissertations, and code (in C). Putting on my ULFM hat, I'd write the code like this:

fn f(comm) -> Result<(), CollectiveError> {
  let rc = comm.barrier();
  // Agree across the communicator on whether the barrier succeeded locally.
  let mut flag = if rc == MPI_SUCCESS { 1 } else { 0 };
  comm.mpix_comm_agree(&mut flag);
  if flag == 0 {
    // Uh oh, something happened to one or more of my partners.  Revoke the
    // communicator: processes that got success from the barrier and are
    // charging ahead to the allreduce will get an error there instead of
    // deadlocking.
    comm.mpix_comm_revoke();
    return Err(MPIX_ERR_PROC_FAILED);
  }
  let rc = comm.all_reduce_into(...);
  // Same mpix_comm_agree consensus check on rc == MPI_SUCCESS as above.
  if rc == MPIX_ERR_REVOKED {
    // Some process died between the barrier and this collective and another
    // rank revoked the communicator, so return the error.
    return Err(MPIX_ERR_REVOKED);
  }
  Ok(())
}

I think the most common error that wouldn't involve process failure would be invalid parameters, which means a buggy program. In that case we could have a default rsmpi error handler do the Rusty thing and abort the program with an MPI_Abort on MPI_COMM_WORLD.
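A minimal sketch of that default, assuming the library runs with MPI_ERRORS_RETURN and checks each raw return code itself; the ffi paths here (mpi::ffi, RSMPI_COMM_WORLD) are assumptions about how the raw bindings are exposed, not settled API:

use std::os::raw::c_int;

use mpi::ffi;

// Sketch of an "errors are fatal by default" policy: any nonzero return code
// from a raw MPI call is treated as a bug and takes the whole job down,
// roughly what MPI_ERRORS_ARE_FATAL would do, but routed through our own
// reporting.  RSMPI_COMM_WORLD is assumed to be the world handle exposed by
// the raw bindings.
fn check_fatal(rc: c_int) {
    if rc != ffi::MPI_SUCCESS as c_int {
        eprintln!("MPI call failed with error code {}; aborting", rc);
        unsafe {
            ffi::MPI_Abort(ffi::RSMPI_COMM_WORLD, rc);
        }
    }
}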

Note that by default, with Open MPI 5.0.0 and later, ULFM MPI extensions are available.

@mhoemmen

(Commenting to follow)

@jedbrown
Contributor Author

I see those MPIX_ functions are available in MPICH-4.1a1 as well. Is there a good place to look for complete examples that can handle vanishing processes? I don't have a sense for whether that's a realistic mode for rsmpi, though if we can make it safer using the type system, maybe it'll be a place people will seek out for writing this sort of fault-tolerant MPI.

There's a really nice post by Andrew Gallant (author of ripgrep) on Rust error handling, and when unwrap (panic) is okay.
https://blog.burntsushi.net/unwrap/

It would be useful to outline what kinds of errors MPI can raise that a user could plausibly expect to recover from (at least enough to clean up, say closing local file handles). If the answer is none, then we can stick with the current strategy, though I think we should still work with MPI_ERRORS_RETURN (raising our own panic/MPI_Abort). If there are good use cases, we'll want to work out how much we can get the type system to do.

@hppritcha

this is probably a place to start:

https://github.com/ICLDisco/ulfm-testing/

note for Open MPI this stuff is only in the 5.0.0 pre-release and main.

Yes to your point above, we would need to do a taxonomy of possible errors returned from collective calls (including non-blocking), decide which ones the application might be able to handle (MPI_ERR_ARG maybe), and for now flame out for MPI_ERR_PROC_FAILED, etc.
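As a strawman for that taxonomy (the grouping below is illustrative, not a decision about which classes rsmpi would actually treat as recoverable):

// Strawman classification of MPI error classes for a future rsmpi error type.
// Which classes land in which bucket is exactly the taxonomy question above.
#[derive(Debug)]
pub enum MpiError {
    /// Buggy program: MPI_ERR_ARG, MPI_ERR_COUNT, MPI_ERR_TYPE, ...
    /// A plausible default is to panic / MPI_Abort, since the program state
    /// is suspect.
    InvalidArgument(i32),
    /// ULFM-style failures: MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED.
    /// Only meaningful when the implementation provides the ULFM extensions;
    /// for now the plan above is to flame out on these.
    ProcessFailure(i32),
    /// Everything else (MPI_ERR_INTERN, MPI_ERR_OTHER, resource exhaustion, ...),
    /// which an application is unlikely to recover from.
    Other(i32),
}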

I mainly brought up ULFM here because of the consensus function it introduced, not so much for all the shrinking/recovery functionality.

@jtronge
Collaborator

jtronge commented Aug 19, 2022

After talking with Howard about this, we think there are a couple of different ways this could be solved. I agree that, as you said, there's probably some way we can use the type system here to avoid deadlock.

First of all, when a process fails for some reason or another, most current MPI runners will simply kill everything. Howard mentioned that you might be able to pass some parameters to the runner to avoid this, but it's not the default. So, given that everything gets killed on process failure, the errors that will actually be returned to the program are either argument-related errors or some kind of internal error that you probably wouldn't be able to recover from.

If we were only looking at catching these types of errors, then we might be able to use some MPI-only calls to avoid a deadlock. For instance, after a collective operation, we could do an allreduce on either the error code or a flag indicating failure (see the sketch below). If a failure did occur, then we would need some way of determining what error occurred and on what rank. Of course there could have been multiple errors that occurred on different ranks, and at that point it becomes difficult to determine which error should actually be returned to the user.
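A sketch of that with rsmpi's existing collectives: gather every rank's local return code so each rank can see both whether something failed and where. The integer codes are placeholders since rsmpi doesn't currently surface return codes, and an allgather is used instead of an allreduce so the failing ranks can be identified.

use mpi::traits::*;

// Collect every rank's local return code (0 = success, placeholder values
// otherwise) so all ranks agree on whether anything failed and can see which
// ranks reported which errors.
fn gather_error_codes<C>(comm: &C, local_rc: i32) -> Vec<(usize, i32)>
where
    C: Communicator + CommunicatorCollectives,
{
    let mut all_rcs = vec![0i32; comm.size() as usize];
    comm.all_gather_into(&local_rc, &mut all_rcs[..]);
    all_rcs
        .iter()
        .enumerate()
        .filter(|&(_, &rc)| rc != 0)
        .map(|(rank, &rc)| (rank, rc))
        .collect()
}

Even with the per-rank codes in hand, deciding which of several errors to surface to the user is still a policy question, as noted above.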

On the other hand, if we wanted to also support recovery from process failure, then we would probably need to use the ULFM extensions. Being an extension, RSMPI would probably need a new user-specified feature or have to do something in a build.rs that checks for ULFM support. Otherwise this would break compatibility with older MPI implementations and those that don't support ULFM yet.
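One possible shape for that gating, with a hypothetical ulfm Cargo feature (the feature name and the stub functions are made up for illustration):

// Hypothetical gating: compile the ULFM-based recovery path only when a
// (not yet existing) `ulfm` feature is enabled, so builds against MPI
// implementations without ULFM keep working.

#[cfg(feature = "ulfm")]
fn on_collective_failure() {
    // With ULFM: agree on the failure (MPIX_Comm_agree) and revoke the
    // communicator (MPIX_Comm_revoke) so surviving ranks cannot deadlock.
}

#[cfg(not(feature = "ulfm"))]
fn on_collective_failure() {
    // Without ULFM there is nothing safe to do but take the job down.
    std::process::abort();
}

fn main() {
    // Both configurations compile; which body runs depends on the feature.
    on_collective_failure();
}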

I'm going to experiment with some of these methods and see what might work. Whether or not we attempt to implement this type of error handling, I think it's a good idea to have most calls return Result.
