"Failed to modify UD QP to INIT on mlx5_0: Operation not permitted" error with OpenMPI after updating to OFED 23.10 #20233

Open
boegel opened this issue Mar 27, 2024 · 5 comments

@boegel
Member

boegel commented Mar 27, 2024

After updating to OFED 23.10, we had several reports of people running into this cryptic error:

Failed to modify UD QP to INIT on mlx5_0: Operation not permitted

The problem does not occur consistently (@hajgato has more details here).

openucx/ucx#9468 suggests that updating to a more recent OFED may help, but it doesn't.

@boegel boegel added this to the 4.x milestone Mar 27, 2024
@boegel
Member Author

boegel commented Mar 27, 2024

We've pinned this down to the libfabric dependency we include with OpenMPI: when libfabric is removed as a dependency, the problem no longer occurs.

Removing the dependency requires rebuilding OpenMPI, which is painful on a production system.

As a workaround, you can instruct OpenMPI to not use libfabric by passing the following options to mpirun:

mpirun -mca pml ucx -mca btl '^uct,ofi' -mca mtl '^ofi'

Or equivalently, you can set the following environment variables:

export OMPI_MCA_btl='^uct,ofi'
export OMPI_MCA_pml='ucx'
export OMPI_MCA_mtl='^ofi'
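
For job scripts, the same settings can be wrapped in a small shell function. This is only a minimal sketch: the function name `disable_ofi` is hypothetical (it is not part of OpenMPI), and it does nothing beyond exporting the three variables listed above and printing them.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenMPI): apply the workaround in the current shell.
disable_ofi() {
    # Force the UCX PML and exclude the libfabric-based (ofi) components,
    # mirroring the mpirun options shown above.
    export OMPI_MCA_pml='ucx'
    export OMPI_MCA_btl='^uct,ofi'
    export OMPI_MCA_mtl='^ofi'
}

disable_ofi

# Print the resulting OMPI_MCA_* settings so they end up in the job log.
env | grep '^OMPI_MCA_' | sort
```

Calling `disable_ofi` before `mpirun` in a job script should have the same effect as passing the `-mca` options on the command line, since OpenMPI reads MCA parameters from `OMPI_MCA_<name>` environment variables.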

@boegel boegel changed the title "Failed to modify UD QP to INIT on mlx5_0: Operation not permitted" error with OpenMPI after updating to OFED 23.10 "Failed to modify UD QP to INIT on mlx5_0: Operation not permitted" error with OpenMPI after updating to OFED 23.10 Mar 27, 2024
@boegel
Member Author

boegel commented Mar 28, 2024

SURF (@casparvl) also saw a very similar issue; they worked around it by setting $FI_PROVIDER to verbs.

The error there was "Invalid argument" though, which is more similar to what was reported in openucx/ucx#9468.
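
For reference, that workaround amounts to something like the sketch below. `FI_PROVIDER` is the libfabric environment variable that restricts which providers are considered. The `fi_info` check is an optional addition of mine, not something SURF reported doing, and it assumes the libfabric utilities are installed.

```shell
# Restrict libfabric to the verbs provider (the workaround reported by SURF).
export FI_PROVIDER=verbs

# Optional sanity check, only if the libfabric utilities are on PATH:
# ask fi_info whether it reports any interfaces for the verbs provider.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p verbs || echo "fi_info did not report a usable verbs provider" >&2
fi
```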

@branfosj
Member

I've not seen this, but we added the following in December 2021 because of another issue we'd seen:

            # avoid libfabric warning "unknown link width 0x10"
            # see https://github.com/ComputeCanada/software-stack-config/pull/19
            setenv OMPI_MCA_mtl "^ofi"
            setenv OMPI_MCA_btl "^openib,ofi"

@casparvl
Contributor

casparvl commented Mar 28, 2024

Yep, we regularly get these as well:

avoid libfabric warning "unknown link width 0x10"

We also solved it by setting those environment variables (I believe; I would need to check exactly what we set).

@casparvl
Contributor

More detail on what I hit here btw
