
detail MPI_ERR_PROC_FAILED cause #7

Open
abouteiller opened this issue Dec 11, 2015 · 2 comments

@abouteiller

Why

Feedback received from SC'15 ULFM BoF

Motivation

The MPI_ERR_PROC_FAILED error does not detail what caused the failure.

Implication when debugging an FT app

  • In an FT application, programmatic errors are absorbed and “recovered from” semi-silently (see the error-handler sketch below).
  • When debugging the FT path of the application, we need to tolerate only purposefully injected failures, not real bugs.

Possible solution: mpirun parameters (non-standard) to launch in debug mode (nothing to standardize).
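For context, here is a minimal sketch of why the two failure sources get conflated, assuming the Open MPI ULFM extensions from `<mpi-ext.h>` (MPIX_ERR_PROC_FAILED, MPIX_ERR_REVOKED, MPIX_Comm_revoke); the recovery steps are elided:

```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>  /* Open MPI ULFM extensions */

/* Error handler that recovers from any process failure. Because
 * MPI_ERR_PROC_FAILED carries no cause, a failure injected on purpose
 * during FT testing and a crash caused by a real application bug look
 * identical here, and both are "recovered" semi-silently. */
static void ft_errhandler(MPI_Comm *comm, int *errcode, ...)
{
    int eclass;
    MPI_Error_class(*errcode, &eclass);
    if (MPIX_ERR_PROC_FAILED == eclass || MPIX_ERR_REVOKED == eclass) {
        MPIX_Comm_revoke(*comm);  /* interrupt pending communication */
        /* ... shrink the communicator, respawn, roll back to checkpoint ... */
        return;
    }
    MPI_Abort(*comm, *errcode);   /* anything else stays fatal */
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;
    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(ft_errhandler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    /* ... application with its fault-tolerant recovery path ... */
    MPI_Finalize();
    return 0;
}
```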

Implication in production

  • If the program fails due to a hardware failure, we always want to continue.
  • If the program fails due to a software error of some sort (application or system software), the situation is not as clear-cut:
    • some users may want to continue, depending on the type/severity of the failure and the current progress, or trigger a dataset verification (for soft errors that may have propagated from the failed rank), etc.;
    • under other conditions, or for other users, the right action is to stop (the code is bad and needs to be debugged).

Possible solution: standardize (or recommend?) specific error codes within the MPI_ERR_PROC_FAILED*/REVOKE classes (see the sketch after the list below).

  • Possible error codes for each type of signal, different types of hardware failure, etc.
  • Reasons to oppose:
    • So far, every error code that is not also a class is implementation-specific.
    • Some of these new codes would be architecture-specific rather than generic (e.g., a machine without virtual memory never generates SIGSEGV).
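If such finer-grained codes existed, MPI_Error_class would still map each of them back to MPI_ERR_PROC_FAILED, so recovery code that only tests the class keeps working unchanged. A hedged sketch of how an application might consume them; MPIX_ERR_PROC_FAILED_SIGSEGV is a purely hypothetical constant, not part of ULFM or any implementation:

```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>

/* Sketch of consuming finer-grained codes, if they existed. */
void classify_failure(int errcode)
{
    int eclass, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Error_class(errcode, &eclass);         /* portable path: class only */
    if (MPIX_ERR_PROC_FAILED == eclass) {
        MPI_Error_string(errcode, msg, &len);  /* an implementation may already detail the cause here */
        fprintf(stderr, "peer failed: %s\n", msg);
#ifdef MPIX_ERR_PROC_FAILED_SIGSEGV            /* hypothetical finer code */
        if (errcode == MPIX_ERR_PROC_FAILED_SIGSEGV) {
            /* likely an application bug, not a hardware fault: stop and debug */
            MPI_Abort(MPI_COMM_WORLD, errcode);
        }
#endif
        /* otherwise: continue with the normal recovery path */
    }
}
```

Guarding on the constant's existence also speaks to the architecture-specific objection: a code for a signal the machine cannot raise would simply not be defined there.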
@abouteiller (Author)

The issue was discussed during the WG meeting in Dec. 2015, and the reception was mild.

At this point the general feeling is to not standardize this. We will proceed independently in providing this feature to users in the Open MPI ULFM implementation, and see if this becomes a widely used feature or if nobody cares.

@abouteiller (Author)

Also, if somebody can think of a different way of achieving the same purpose, please comment.
