Error handling #130

Open
jedbrown opened this issue Aug 17, 2022 · 5 comments

@jedbrown
Contributor

Label under harebrained ideas.

MPI operations are generally fallible, but rsmpi currently ignores any such errors and proceeds as though the operation had succeeded. The natural way to handle fallibility would be to return Result<T, E>, but the questions are what E should be and how it interacts with Drop. Suppose we write the following:

fn f(comm) -> Result<(), NoncollectiveError> {
  comm.barrier()?;
  comm.all_reduce_into(...)?;
  Ok(())
}

If the first line errors on one rank, it'll return from f while the other ranks deadlock in the all_reduce_into. I think we need to avoid this and ensure that collectives are only called in functions that return CollectiveError<'c> where 'c is a lifetime associated with the communicator. MPI primitives would return Result<T, NoncollectiveError> and the caller would need to be responsible for making them collective in order to safely proceed to the next line. Now I'm just spitballing, but I think there are two ways to do this:

  1. Create a result.consensus(comm)? that promotes Result<T, NoncollectiveError> to Result<T, CollectiveError> (see the sketch after this list). This sort of thing requires polling on a nonblocking collective, possibly with a timeout to panic if a rank has vanished (in which case it's hard to even determine which ranks are alive, much less return with a consistent state). If any rank is Err(NoncollectiveError(e)) on input, then all ranks will return Err(CollectiveError(e)).

  2. Variant of 1 in which NoncollectiveError "knows" which communicator it's associated with so one can use result.consensus()?;. You'd impl From<NoncollectiveError<'c1>> for NoncollectiveError<'c2>, with constraints on the relationship between 'c1 and 'c2 (these might need to be runtime checks rather than static lifetime checks).

Note that in both cases, one will probably want an execution mode to be unsafe or panic to avoid the run-time cost of the nonblocking collectives necessary for consensus.

  3. Give up on safely propagating errors and instead get panics to use MPI_Abort and set MPI_ERRORS_ARE_FATAL.
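To make option 1 concrete, here's a rough sketch of a consensus-promoting extension trait. Everything here is hypothetical (NoncollectiveError, CollectiveError, the Consensus trait), and the agreement is written as a blocking allreduce over rsmpi's existing all_reduce_into for brevity; a real implementation would poll a nonblocking collective with a timeout as described above. Error codes are assumed to be positive integers, with 0 meaning success.

use mpi::collective::SystemOperation;
use mpi::traits::*;

// Hypothetical error types, for illustration only.  Codes are > 0; 0 = success.
pub struct NoncollectiveError(pub i32);
pub struct CollectiveError(pub i32);

pub trait Consensus<T> {
    /// Promote a per-rank result to a rank-consistent one: if any rank holds
    /// an Err, every rank ends up with Err(CollectiveError).
    fn consensus<C: CommunicatorCollectives>(self, comm: &C) -> Result<T, CollectiveError>;
}

impl<T> Consensus<T> for Result<T, NoncollectiveError> {
    fn consensus<C: CommunicatorCollectives>(self, comm: &C) -> Result<T, CollectiveError> {
        let local: i32 = match &self {
            Ok(_) => 0,
            Err(NoncollectiveError(code)) => *code,
        };
        // Agree on the worst local error code across the communicator.
        // (A real implementation would use a nonblocking collective plus a
        // timeout so a vanished rank cannot hang everyone here.)
        let mut global: i32 = 0;
        comm.all_reduce_into(&local, &mut global, SystemOperation::max());
        match (self, global) {
            (Ok(v), 0) => Ok(v),
            (_, code) => Err(CollectiveError(code)),
        }
    }
}

With something like that, each line of f above would become comm.barrier_result().consensus(&comm)? on every rank (barrier_result being a hypothetical fallible wrapper around MPI_Barrier).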

@hppritcha @jtronge I think this is a summary of my half-baked ideas for handling errors. With objects like an MPI_Win that must be dropped collectively, you'd want any scope that can take ownership of the Win to return a Result whose error type E cannot be obtained from a non-collective error via From or other conversions. That way such a scope could only be called from a function returning suitable collective errors. I haven't thought carefully about whether lifetimes can be made to enforce all the constraints statically.
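To sketch the Win idea with stand-in types (none of this is rsmpi API; it only demonstrates the "no From escape hatch" constraint at the type level):

use std::marker::PhantomData;

// Stand-ins for illustration only.
struct Comm;
struct NoncollectiveError;
struct CollectiveError<'c>(PhantomData<&'c Comm>);
struct Window<'c>(PhantomData<&'c Comm>);

impl Comm {
    fn create_window(&self) -> Result<Window<'_>, CollectiveError<'_>> {
        Ok(Window(PhantomData))
    }
    fn barrier(&self) -> Result<(), NoncollectiveError> {
        Ok(())
    }
}

// Because CollectiveError<'c> deliberately has no From<NoncollectiveError>
// impl, using `?` on a noncollective result inside a scope that owns a Window
// does not compile; the caller is forced to reach consensus first.
fn use_window<'c>(comm: &'c Comm) -> Result<(), CollectiveError<'c>> {
    let _win: Window<'c> = comm.create_window()?;
    // comm.barrier()?;   // error[E0277]: `?` couldn't convert the error
    Ok(())                // the Window is dropped (collectively) here
}

fn main() {
    let comm = Comm;
    let _ = use_window(&comm);
}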

@hppritcha

hppritcha commented Aug 18, 2022

I think we're drifting into ULFM land here a bit with these concerns, i.e., how to make this code "work" in the presence of faults or other error returns from collective operations. Item 1 above has been the source of numerous papers, PhD dissertations, and code (in C). Putting on my ULFM hat, I'd write the code like this:

fn f(comm) -> Result<(), CollectiveError> {
  let rc = comm.barrier();
  // Agree across the communicator on whether the barrier succeeded locally.
  let mut flag = if rc == MPI_SUCCESS { 1 } else { 0 };
  comm.mpix_comm_agree(&mut flag);
  if flag == 0 {
    // Uh oh, something happened to one or more of my partners.  Revoke the
    // communicator: processes that got success from the barrier and are
    // charging ahead to the allreduce will get an error there instead of
    // deadlocking.
    comm.mpix_comm_revoke();
    return Err(MPIX_ERR_PROC_FAILED);
  }
  let rc = comm.all_reduce_into(...);
  // Same mpix_comm_agree consensus check on rc == MPI_SUCCESS as above.
  if rc == MPIX_ERR_REVOKED {
    // Some process died between the barrier and this collective and another
    // rank revoked the communicator, so return the error.
    return Err(MPIX_ERR_REVOKED);
  }
  Ok(())
}

I think the most common error that wouldn't involve process failure would be invalid parameters, which means a buggy program. In that case we could have a default rsmpi error handler do the Rusty thing and abort the program with an MPI_Abort on MPI_COMM_WORLD.
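A minimal sketch of that default, assuming the library runs with MPI_ERRORS_RETURN and checks each raw return code itself; the ffi paths here (mpi::ffi, RSMPI_COMM_WORLD) are assumptions about how the raw bindings are exposed, not settled API:

use std::os::raw::c_int;

use mpi::ffi;

// Sketch of an "errors are fatal by default" policy: any nonzero return code
// from a raw MPI call is treated as a bug and takes the whole job down,
// roughly what MPI_ERRORS_ARE_FATAL would do, but routed through our own
// reporting.  RSMPI_COMM_WORLD is assumed to be the world handle exposed by
// the raw bindings.
fn check_fatal(rc: c_int) {
    if rc != ffi::MPI_SUCCESS as c_int {
        eprintln!("MPI call failed with error code {}; aborting", rc);
        unsafe {
            ffi::MPI_Abort(ffi::RSMPI_COMM_WORLD, rc);
        }
    }
}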

Note that by default, with Open MPI 5.0.0 and later, ULFM MPI extensions are available.

@mhoemmen

(Commenting to follow)

@jedbrown
Contributor Author

I see those MPIX_ functions are available in MPICH-4.1a1 as well. Is there a good place to look for complete examples that can handle vanishing processes? I don't have a sense for whether that's a realistic mode for rsmpi, though if we can make it safer using the type system, maybe it'll be a place people will seek out for writing this sort of fault-tolerant MPI.

There's a really nice post by Andrew Gallant (author of ripgrep) on Rust error handling, and when unwrap (panic) is okay.
https://blog.burntsushi.net/unwrap/

It would be useful to outline what kinds of errors MPI can raise that a user could plausibly expect to recover from (at least enough to clean up, say closing local file handles). If the answer is none, then we can stick with the current strategy, though I think we should still work with MPI_ERRORS_RETURN (raising our own panic/MPI_Abort). If there are good use cases, we'll want to work out how much we can get the type system to do.

@hppritcha

this is probably a place to start:

https://github.com/ICLDisco/ulfm-testing/

note for Open MPI this stuff is only in the 5.0.0 pre-release and main.

Yes to your point above, we would need to do a taxonomy of possible errors returned from collective calls (including non-blocking), decide which ones the application might be able to handle (MPI_ERR_ARG maybe), and for now flame out for MPI_ERR_PROC_FAILED, etc.
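As a strawman for that taxonomy (the grouping below is illustrative, not a decision about which classes rsmpi would actually treat as recoverable):

// Strawman classification of MPI error classes for a future rsmpi error type.
// Which classes land in which bucket is exactly the taxonomy question above.
#[derive(Debug)]
pub enum MpiError {
    /// Buggy program: MPI_ERR_ARG, MPI_ERR_COUNT, MPI_ERR_TYPE, ...
    /// A plausible default is to panic / MPI_Abort, since the program state
    /// is suspect.
    InvalidArgument(i32),
    /// ULFM-style failures: MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED.
    /// Only meaningful when the implementation provides the ULFM extensions;
    /// for now the plan above is to flame out on these.
    ProcessFailure(i32),
    /// Everything else (MPI_ERR_INTERN, MPI_ERR_OTHER, resource exhaustion, ...),
    /// which an application is unlikely to recover from.
    Other(i32),
}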

I mainly brought up ULFM here because of the consensus function it introduced, not so much for all the shrinking/recovery functionality.

@jtronge
Collaborator

jtronge commented Aug 19, 2022

After talking with Howard about this, we think there are a couple of different ways this could be solved. I agree that, as you said, there's probably some way we can use the type system here to avoid deadlock.

First of all, when a process fails for some reason or another, most current MPI runners will simply kill everything. Howard mentioned that you might be able to pass some parameters to the runner to avoid this, but it's not the default. So, given that everything gets killed on process failure, the errors that will actually be returned to the program are either argument-related errors or some kind of internal error that you probably wouldn't be able to recover from.

If we were only looking at catching these types of errors, then we might be able to use some MPI-only calls to avoid a deadlock. For instance, after a collective operation, we could do an allreduce on either the error code or a flag indicating failure (see the sketch below). If a failure did occur, then we would need some way of determining what error occurred and on what rank. Of course there could have been multiple errors that occurred on different ranks, and at that point it becomes difficult to determine which error should actually be returned to the user.
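A sketch of that with rsmpi's existing collectives: gather every rank's local return code so each rank can see both whether something failed and where. The integer codes are placeholders since rsmpi doesn't currently surface return codes, and an allgather is used instead of an allreduce so the failing ranks can be identified.

use mpi::traits::*;

// Collect every rank's local return code (0 = success, placeholder values
// otherwise) so all ranks agree on whether anything failed and can see which
// ranks reported which errors.
fn gather_error_codes<C>(comm: &C, local_rc: i32) -> Vec<(usize, i32)>
where
    C: Communicator + CommunicatorCollectives,
{
    let mut all_rcs = vec![0i32; comm.size() as usize];
    comm.all_gather_into(&local_rc, &mut all_rcs[..]);
    all_rcs
        .iter()
        .enumerate()
        .filter(|&(_, &rc)| rc != 0)
        .map(|(rank, &rc)| (rank, rc))
        .collect()
}

Even with the per-rank codes in hand, deciding which of several errors to surface to the user is still a policy question, as noted above.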

On the other hand, if we wanted to also support recovery from process failure, then we would probably need to use the ULFM extensions. Being an extension, RSMPI would probably need a new user-specified feature or have to do something in a build.rs that checks for ULFM support. Otherwise this would break compatibility with older MPI implementations and those that don't support ULFM yet.
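One possible shape for that gating, with a hypothetical ulfm Cargo feature (the feature name and the stub functions are made up for illustration):

// Hypothetical gating: compile the ULFM-based recovery path only when a
// (not yet existing) `ulfm` feature is enabled, so builds against MPI
// implementations without ULFM keep working.

#[cfg(feature = "ulfm")]
fn on_collective_failure() {
    // With ULFM: agree on the failure (MPIX_Comm_agree) and revoke the
    // communicator (MPIX_Comm_revoke) so surviving ranks cannot deadlock.
}

#[cfg(not(feature = "ulfm"))]
fn on_collective_failure() {
    // Without ULFM there is nothing safe to do but take the job down.
    std::process::abort();
}

fn main() {
    // Both configurations compile; which body runs depends on the feature.
    on_collective_failure();
}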

I'm going to experiment with some of these methods and see what might work. Whether or not we attempt to implement this type of error handling, I think it's a good idea to have most calls return Result.
