prov/cxi performance regression in fi_pingpong #9802

Open
philippfriese opened this issue Feb 9, 2024 · 4 comments

Describe the bug
Using the upstreamed CXI provider (as of commit fc869ae on the main branch) yields reduced throughput in fi_pingpong: ~14 GB/s for ofiwg/libfabric compared to ~20 GB/s for the HPE-internal libfabric.

To Reproduce
Steps to reproduce the behavior:

  • Launch fi_pingpong -p cxi -e rdm on two Slingshot-connected nodes (see the sketch below).
  • Observe the performance deviation between ofiwg/libfabric and HPE-internal libfabric.
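
A minimal sketch of the two-node invocation (hostnames are placeholders; the FI_CXI_LLRING_MODE=never prefix reflects the constraint described under Additional context below):

    # server node
    FI_CXI_LLRING_MODE=never fi_pingpong -p cxi -e rdm
    # client node, pointing at the server's hostname or address
    FI_CXI_LLRING_MODE=never fi_pingpong -p cxi -e rdm node01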

Expected behavior
Equivalent performance between both libfabric variants (~20 GB/s).

Output
Deviating performance:

  • ~14 GB/s for ofiwg/libfabric
  • ~20 GB/s for hpe/libfabric

It is worth noting that the observed throughput of ofiwg/libfabric can be increased by raising the number of iterations from the default of 10 to 100 via -I 100.
Additionally, with osu_bw and osu_latency from the OSU Micro-Benchmarks suite, no performance differences are observed between the two libfabric variants.
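
For example, the higher-iteration run would look like this (node01 again a placeholder for the server):

    fi_pingpong -I 100 -p cxi -e rdm           # server node
    fi_pingpong -I 100 -p cxi -e rdm node01    # client node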

I've attached the raw output of the fi_pingpong and osu_bw/osu_latency runs.

Environment:

  • ofiwg/libfabric at commit fc869ae (main branch)
  • ofiwg/libfabric configure setup: ./configure LDFLAGS=-Wl,--build-id --enable-cxi=yes --enable-only --enable-restricted-dl --enable-tcp --enable-udp --enable-rxm --enable-rxd --enable-hook_debug --enable-hook_hmem --enable-dmabuf_peer_mem --enable-verbs --enable-gdrcopy-dlopen --enable-profile=dl
  • hpe/libfabric version 1.15
  • OpenMPI 4.1.6 configured with --with-ofi=yes
  • OSU Micro-Benchmarks 7.3
  • openSUSE Leap 15.5, kernel 5.14.21
  • aarch64

Additional context
Due to a currently unresolved issue with the local Slingshot deployment on the ARM platform used here, FI_CXI_LLRING_MODE=never must be set for both fi_pingpong and osu_bw.
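
Concretely, assuming OpenMPI's mpirun (whose -x flag exports an environment variable to the launched ranks) and placeholder hostnames, the runs look like:

    export FI_CXI_LLRING_MODE=never
    fi_pingpong -p cxi -e rdm                                         # fabtests
    mpirun -np 2 -host node01,node02 -x FI_CXI_LLRING_MODE ./osu_bw   # OSU suite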

SSSSeb commented Feb 9, 2024

ping @mindstorm38

mindstorm38 commented Feb 9, 2024

I can't reproduce the regression. Here's my environment:

  • HPE/libfabric v1.20.1
  • OFI/libfabric v1.21.0a1 (your commit)
  • Iterations: 100
  • aarch64 (also works with x86_64)
  • With FI_CXI_LLRING_MODE=never
  • Both manually built from source

Not yet tested with MPI.

lflis commented Feb 9, 2024

@mindstorm38
Which Slingshot library versions are you using?
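
For reference, on a package-based install something like the following might show it (fi_info ships with libfabric; the rpm query is only a guess, as package names vary across Slingshot releases):

    fi_info --version                       # libfabric version in use
    rpm -qa | grep -i -e cxi -e cassini     # installed Slingshot host-software RPMs, if RPM-based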

@mindstorm38

I'm using the latest internal sources; I don't know the version number, to be honest. I configure the cxi, Cassini, and UAPI headers to point directly at the sources. Let me know if there's a command that would report a version useful to you, but note that my installation is not standard compared to the official Slingshot packages. I'm working in parallel on a package-based installation, but it's on x86_64, so I guess it won't help in this case (I'll try anyway).
