
Fix error handling for generalized requests #12392

Open · wants to merge 4 commits into main

Conversation

@dalcinl (Contributor) commented Mar 5, 2024:

This is a follow-up to #11683 from @bosilca, addressing outstanding issues related to error handling with generalized requests. Some of these issues were reported in #11681 (comment). I did my best to produce separate commits setting the stage for the fixes specific to generalized requests. I believe this PR is easier to review if you go commit by commit.

Thorough testing is available in this mpi4py branch, to be merged once this PR lands in v5.0.x.
A test run of that mpi4py branch against this PR was submitted here.

@dalcinl (Contributor, Author) commented Mar 6, 2024:

@jsquyres Given that you reviewed, approved, and merged #11683, may I ask for your review of this one? I will also need help with putting it into v5.0.x; I'm not really sure how that works (just a cherry-pick of the commits from main?).

@jsquyres (Member) left a comment:

Sorry for the delay in responding; I had to go refresh myself on all these details.

This is unfortunately harping on some grey areas in the MPI standard. ☹️ I think there are three areas of question:

  1. Why does Open MPI require 2 calls to actually free the resources? MPICH apparently does not.
    • I think that this is an area of ambiguity in the standard. Which of Open MPI or MPICH is right? I'm not sure.
    • For better or worse, Open MPI's current intended usage is like this:
    // Or the equivalent in sessions
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
    
    int err = MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (MPI_SUCCESS != err && req_is_generalized_request) {
        // Try again, because we might be in the case 
        // where the user-defined free function failed
        err = MPI_Request_free(&req);
    }
    if (MPI_SUCCESS != err) {
        // handle error
    }
  2. In the above code sample, why return an error from MPI_REQUEST_FREE if there was no error in freeing the internal MPI data structures?
    • Yeah, you're right: that's not intended. If MPI_REQUEST_FREE is able to free everything successfully, it should return MPI_SUCCESS. It shouldn't be affected by the previous error returned by the user's free function.
  3. In the above code sample, why doesn't MPI_REQUEST_FREE set the request to MPI_REQUEST_NULL?
    • Yeah, you're right: that's not intended. If MPI_REQUEST_FREE is able to free everything successfully, it should set the INOUT request parameter to MPI_REQUEST_NULL.

I.e., a fix should probably ensure that when the user's free function returns an error (see the sketch after this list):

  • We set the user's free function pointer to NULL
  • We return that error back up out of the initial MPI function that was invoked (e.g., MPI_WAIT, in the sample code above). Depending on the first MPI function that was invoked, this may mean returning the value from the function or setting the value on the appropriate status.MPI_ERROR (if not MPI_STATUS_IGNORE).
  • When MPI_REQUEST_FREE is invoked, free the rest of the internal OMPI data structures. Assuming that happens properly, return MPI_SUCCESS. This might be a little complicated if we set the initial error -- from the user's free function -- on status.MPI_ERROR. I think in this case, we should reset status.MPI_ERROR back to MPI_SUCCESS. Or, at least, the non-SUCCESS value on status.MPI_ERROR shouldn't cause us to go into error handling code in the internals of MPI_REQUEST_FREE.
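For concreteness, here is a minimal self-contained sketch of the calling pattern implied by the points above. This is illustrative code for the review discussion, not part of the PR: the free callback fails on purpose, and the comments describe the intended (not necessarily current) behavior.

#include <mpi.h>

static int query_fn (void *ctx, MPI_Status *s) { return MPI_SUCCESS;   }
static int free_fn  (void *ctx)                { return MPI_ERR_OTHER; } /* fails on purpose */
static int cancel_fn(void *ctx, int c)         { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  MPI_Request req;
  MPI_Init(&argc, &argv);
  MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);
  MPI_Grequest_complete(req);

  /* Intended: MPI_Wait surfaces free_fn's error, and the failed
     free_fn is not invoked again afterwards. */
  int err = MPI_Wait(&req, MPI_STATUS_IGNORE);
  if (MPI_SUCCESS != err && MPI_REQUEST_NULL != req) {
    /* Intended: this call frees the remaining internal structures,
       returns MPI_SUCCESS, and sets req to MPI_REQUEST_NULL. */
    err = MPI_Request_free(&req);
  }

  MPI_Finalize();
  return 0;
}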

Finally, I'll note that the intended behavior didn't make it into documentation anywhere. Sigh. We definitely need to add this to the docs somewhere -- the individual TEST/WAIT man pages are probably the right place to document this behavior.

@@ -532,7 +532,7 @@ static inline int ompi_request_complete(ompi_request_t* request, bool with_signal)

     ompi_wait_sync_t *tmp_sync = (ompi_wait_sync_t *) OPAL_ATOMIC_SWAP_PTR(&request->req_complete,
                                                                            REQUEST_COMPLETED);
-    if( REQUEST_PENDING != tmp_sync ) {
+    if (REQUEST_PENDING != tmp_sync && REQUEST_COMPLETED != tmp_sync) {
Review comment from a Member:
I think that a core premise of ompi_request_complete() is that it should only be invoked once for a given request. It wasn't intended to be safe to be called multiple times -- e.g., the first time, it actually completes the request, and subsequent times it effectively does nothing.

@dalcinl (Contributor, Author) replied Mar 13, 2024:

ompi_request_complete() called a second time should either fail cleanly, returning an error, or happily continue as a no-op because the request has already completed; but it should NEVER segfault, as currently happens if the user calls MPI_Grequest_complete() twice. The MPI standard does not say whether the routine can be called multiple times on the same request handle. With this fix, I'm just playing it safe and allowing multiple calls to MPI_Grequest_complete(). For reference, IIRC, MPICH errors not on the second but on the third call; this is something I was planning to report and/or eventually fix.

IMHO, it is probably easier to just allow multiple MPI_Grequest_complete() calls on a request, just as I'm doing here. If you do not like those semantics and prefer to raise an error, I believe the fix is trivial.
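To make the double-complete scenario concrete, here is a minimal illustrative program (not part of the PR) with trivial no-op callbacks:

#include <mpi.h>

static int q_fn(void *ctx, MPI_Status *s) { return MPI_SUCCESS; }
static int f_fn(void *ctx)                { return MPI_SUCCESS; }
static int c_fn(void *ctx, int c)         { return MPI_SUCCESS; }

int main(int argc, char *argv[])
{
  MPI_Request request;
  MPI_Init(&argc, &argv);
  MPI_Grequest_start(q_fn, f_fn, c_fn, NULL, &request);
  MPI_Grequest_complete(request); /* completes the request */
  MPI_Grequest_complete(request); /* with this fix: a harmless no-op;
                                     previously this could segfault */
  MPI_Wait(&request, MPI_STATUS_IGNORE);
  MPI_Finalize();
  return 0;
}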

@@ -57,6 +57,10 @@ int MPI_Request_free(MPI_Request *request)
     }

     rc = ompi_request_free(request);
+    if (OPAL_UNLIKELY(OMPI_SUCCESS != rc && MPI_REQUEST_NULL != *request)) {
+        (*request)->req_status.MPI_ERROR = rc;
Review comment from a Member:

Did ompi_request_free(request) already set this?

@dalcinl (Contributor, Author) replied:

This is the generic user-level MPI_Request_free() for all kinds of requests, not only generalized requests. I'm not expert enough in the Open MPI implementation to be 100% sure that ompi_request_free() will set the status error field on errors. Without that guarantee, my use of ompi_errhandler_request_invoke() may not work as I intended. Note that ompi_errhandler_request_invoke() has the side effect of cleaning up the request and deallocating the object. Sorry, it's a bit complicated, but I had to work within the constraints of what I was given.

For example, look at ompi_comm_request_free()... I don't see that routine setting the status error field. Therefore, I think being overzealous here is justified. IMHO, this line is a good one and should stay. Otherwise, all internal request types used in Open MPI (and any future ones) would have to be audited and eventually fixed to make sure a failing free sets the error field.

@dalcinl (Contributor, Author) commented Mar 13, 2024:

@jsquyres I think I need to clarify what this PR is actually doing, which IMHO is an improvement over the previous behavior.

When MPI_Wait/MPI_Test fail on a generalized request (because of free_fn), the following now happens:

  1. The MPI_Wait/MPI_Test/MPI_Request_free call returns an error code.
  2. All the internal resources of the request object are released; no memory leaks.
  3. The wait/test/request_free call sets the request to MPI_REQUEST_NULL.

Therefore, from (3), there is no need at all to call MPI_Request_free() again to dispose of the failed request. All disposal has already occurred internally, and the request handle is handed back to the user as MPI_REQUEST_NULL. What's the point of requiring users to deallocate a failed request if OMPI can do it properly after the changes in this PR?
From the user's point of view, all they should care about is that the request failed and that the error has to be handled somehow; releasing resources is no longer a concern.

You can play with the following simple reproducer. I used a much more convoluted example (with multiple calls to MPI_Grequest_complete, calls to MPI_Request_get_status, etc.) to make sure everything worked as expected. I ran all of that against MPICH to double-check that the behavior is identical (modulo the MPICH issue with multiple complete calls I mentioned before).

#include <assert.h>
#include <stdio.h>
#include <mpi.h>

static int query_fn  (void *ctx, MPI_Status *s) { return MPI_SUCCESS;   }
static int free_fn   (void *ctx)                { return MPI_ERR_OTHER; }
static int cancel_fn (void *ctx, int c)         { return MPI_SUCCESS;   }

int main(int argc, char *argv[])
{
  int ierr;
  MPI_Request request = MPI_REQUEST_NULL;
  MPI_Init(&argc, &argv);

  MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &request);

  ierr = MPI_Grequest_complete(request);
  assert (MPI_SUCCESS == ierr);

  //int flag = 0;
  //ierr = MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
  ierr = MPI_Wait(&request, MPI_STATUS_IGNORE);
  //ierr = MPI_Request_free(&request);
  assert (MPI_SUCCESS != ierr); // check that the call above actually failed
  assert (MPI_REQUEST_NULL == request); // check that the request handle was deallocated.
  
  MPI_Finalize();
  return 0;
}
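For reference, the reproducer builds and runs on a single process with the standard wrappers, e.g. mpicc reproducer.c -o reproducer followed by mpiexec -n 1 ./reproducer (file name chosen here for illustration); with this PR applied, both assertions should pass.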

@dalcinl (Contributor, Author) commented Mar 20, 2024:

Gentle request to keep this PR moving forward.

FYI, mpi4py has very good testing of error handling for generalized requests, albeit currently disabled while waiting for the fixes from this PR. See the following link:
https://github.com/mpi4py/mpi4py/blob/master/test/test_grequest.py#L49
I have a commit waiting for this PR to be merged, adding even more testing:
mpi4py/mpi4py@88f6064
BTW, in that test code, lines like

if greq:
    greq.Free()

are no longer required, as the previous greq.Wait() call already freed the request (while properly reporting the MPI error, which mpi4py converts to a Python exception).

@dalcinl (Contributor, Author) commented Apr 3, 2024:

@jsquyres Gentle reminder to move this PR forward and update #12253.

Commits

A second call to ompi_request_complete() segfaults. Make it mostly a
no-op. Alternatively, the second invocation could generate an error, but
let's play it safe for now.

Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
If ompi_request_free() fails, use the ompi_errhandler_request_invoke()
machinery used in the wait/test routines. This will have the side-effect
of attempting to free the request a second time.

Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
@dalcinl (Contributor, Author) commented May 8, 2024:

@jsquyres @wenduwan One month later, another reminder to get this PR moving forward and eventually backport everything to v5.0.x. If Jeff is not able to do the review, maybe you guys can assign an alternate reviewer?

@wenduwan (Contributor) commented May 8, 2024:

Apologies for the delay.

This change appears to touch a common critical path. I would prefer to get more experienced eyes on this.

@devreal @bosilca I noticed that you were engaged in #12525. I'm not sure if it's related, but could you kindly review this change too?
