libfabric + intel MPI over fi_mlx with multiple IB cards on 4OAM PVC #10010

Open
paboyle opened this issue Apr 29, 2024 · 1 comment
Comments

paboyle commented Apr 29, 2024

I'm running on a cluster (Dawn@Cambridge) with 4OAM PVC nodes and 4x mlx5 cards, which appear as mlx5_0 ... mlx5_3
in ibstat.

Intel MPI runs with performance that is only commensurate with using one of the mlx5 HDR 200 cards:
200 Gbit/s (25 GB/s) per direction, so send + receive = 50 GB/s bidirectional.

I expect nearer 200 GB/s bidirectional out of the node when running 8 MPI tasks per node.

Setting I_MPI_DEBUG=5, the startup output shows the provider info:

MPI startup(): libfabric version: 1.18.1-impi
MPI startup(): libfabric provider:  mlx

This works and is "slow".

I understand that the fi_mlx provider uses UCX underneath.
To get multirail working (one rail per MPI rank), I tried running through a wrapper
script that uses $SLURM_LOCALID to set $UCX_NET_DEVICES:

#!/bin/bash
# Pick one mlx5 device per MPI rank on the node, based on the Slurm local rank.
mellanox_cards=(0 1 2 3)
mellanox=mlx5_${mellanox_cards[$SLURM_LOCALID]}
export UCX_NET_DEVICES=$mellanox
# Run the wrapped command.
exec "$@"
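
For reference, a hypothetical launch of this wrapper, assuming it is saved as wrap.sh, the application binary is ./my_app (both placeholder names), and srun is the launcher that sets SLURM_LOCALID for each task:

# hypothetical invocation; wrap.sh and ./my_app are placeholder names
srun --ntasks-per-node=8 ./wrap.sh ./my_app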

But this results in:

select.c:627 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable

Any ideas about how to get multiple adapters working under Intel MPI and the fi_mlx provider?
Is this the right direction?
Is there something I should do differently?

j-xiong (Contributor) commented May 9, 2024

First, the IB port number needs to be included in the net device specification, e.g., UCX_NET_DEVICES=mlx5_0:1.

Second, this may or may not make a difference, because by default UCX auto-selects the device.

Third, UCX supports multi-rail (2 rails by default); you can try UCX_MAX_RNDV_RAILS=4 to see if that makes any difference.
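
Putting these suggestions together, a minimal revised wrapper is sketched below. It is only an illustration under stated assumptions (port 1 is the active port on each HCA; file names are placeholders), not a verified configuration, and it shows both options: pinning one card per rank with the port suffix, or leaving device selection to UCX and raising the rail count.

#!/bin/bash
# Sketch only -- combines the suggestions above; not a verified configuration.

# Option A: pin one HCA (and its port) per local rank.
# Assumes port 1 is the active port on each mlx5 device.
mellanox_cards=(0 1 2 3)
export UCX_NET_DEVICES=mlx5_${mellanox_cards[$SLURM_LOCALID]}:1

# Option B: alternatively, leave UCX_NET_DEVICES unset so UCX auto-selects
# devices, and allow rendezvous traffic to be striped across up to 4 rails
# (the default is 2).
# export UCX_MAX_RNDV_RAILS=4

exec "$@"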
