"Failed to modify UD QP to INIT on mlx5_0: Operation not permitted" error with OpenMPI after updating to OFED 23.10
#20233
Comments
We've pinned this down to the libfabric dependency we include with OpenMPI: when libfabric is removed as a dependency, the problem no longer occurs. This requires rebuilding OpenMPI, so that's painful on a production system. As a workaround, you can instruct OpenMPI to not use libfabric by passing the following options to
Or equivalently, you can set the following environment variables:
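For reference, here is a minimal sketch of how the two forms of the workaround relate in general: any OpenMPI MCA parameter given on the command line as `--mca <name> <value>` can also be set through an environment variable named `OMPI_MCA_<name>`. The parameter values below (excluding the `ofi` components of the `mtl` and `btl` frameworks) are illustrative assumptions, not necessarily the exact settings used in this thread, and `./my_app` is a hypothetical program name.

```shell
# Illustrative name=value pairs (assumption, not taken from this thread):
# exclude the OFI (libfabric) components of the mtl and btl frameworks.
mca_params="mtl=^ofi btl=^ofi"

mpirun_opts=""
for kv in $mca_params; do
    name="${kv%%=*}"     # part before the first '='
    value="${kv#*=}"     # part after the first '='
    # Environment-variable form: OMPI_MCA_<name>=<value>
    export "OMPI_MCA_${name}=${value}"
    # Command-line form: --mca <name> <value>
    mpirun_opts="${mpirun_opts} --mca ${name} ${value}"
done

# Print the equivalent command line; ./my_app is a hypothetical binary.
echo "mpirun${mpirun_opts} ./my_app"
```

The environment-variable form tends to be more convenient on a production system, since it can be set once (for example in a module file or job prologue) without users having to change their `mpirun` invocations.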
SURF (@casparvl) also saw a very similar issue; they worked around it by setting
The error there was
I've not seen this, but we added the following in December 2021 because of another issue we'd seen
Yep, we regularly get these as well:
We also solved it by setting those environment variables (I believe; I would need to check exactly what we set).
More detail on what I hit here, btw.
After updating to OFED 23.10, we had several reports of people running into this cryptic error:
The problem does not occur consistently (@hajgato has more details here).
openucx/ucx#9468 suggests that updating to a more recent OFED may help, but it doesn't.