Non-blocking point-to-point performance issue (MPI_Isend, MPI_Irecv) #12414

Open

LukasBreitwieser opened this issue Mar 18, 2024 · 3 comments

@LukasBreitwieser

Hi,

I am working on a distributed 3D agent-based simulation tool and noticed that sometimes non-blocking point-to-point operations take a very long time to complete.

I created a simplified reproducer that demonstrates the behavior (reproducer.cc). Each rank exchanges messages with its neighbors using MPI_Isend and MPI_Irecv calls. In the reproducer, the number of neighbors and the message size are the same for each rank. I measure the maximum time needed to send/receive all messages across all ranks. I performed the benchmarks on two-node systems with Ethernet and InfiniBand links. Half of the neighbors (13) are on the same node and the other half (13) on the opposite node.
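For reference, the core of the reproducer looks roughly like this (a simplified sketch; names and structure are illustrative, the real code is in the attached reproducer.cc):

#include <mpi.h>
#include <cstdio>
#include <vector>

// Post all receives and sends to the neighbor ranks, then wait for completion.
void Exchange(const std::vector<int>& neighbors, int message_size) {
  std::vector<std::vector<char>> send_bufs(neighbors.size(),
                                           std::vector<char>(message_size));
  std::vector<std::vector<char>> recv_bufs(neighbors.size(),
                                           std::vector<char>(message_size));
  std::vector<MPI_Request> requests(2 * neighbors.size());

  for (std::size_t i = 0; i < neighbors.size(); ++i) {
    MPI_Irecv(recv_bufs[i].data(), message_size, MPI_CHAR, neighbors[i], 0,
              MPI_COMM_WORLD, &requests[i]);
    MPI_Isend(send_bufs[i].data(), message_size, MPI_CHAR, neighbors[i], 0,
              MPI_COMM_WORLD, &requests[neighbors.size() + i]);
  }
  MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
              MPI_STATUSES_IGNORE);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // In the real reproducer each rank has 26 neighbor ranks:
  // 13 on the same node and 13 on the other node.
  std::vector<int> neighbors;  // placeholder
  int message_size = 711;      // bytes; swept over a range in the benchmark

  MPI_Barrier(MPI_COMM_WORLD);
  double start = MPI_Wtime();
  Exchange(neighbors, message_size);
  double local = MPI_Wtime() - start;

  // "max runtime" in the logs below: the slowest rank's send/receive time.
  double max_runtime = 0.0;
  MPI_Reduce(&local, &max_runtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("max runtime %.0f ms\n", max_runtime * 1e3);

  MPI_Finalize();
  return 0;
}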

Benchmark systems

System A:
2 nodes, each with 4 Intel E7-8890 v3 processors (72 physical cores per node), 1 Gbps Ethernet link, Alma Linux 9, OpenMPI 5.0.2 installed via spack install openmpi@5.0.2 schedulers=slurm +legacylaunchers

System B (Dutch National Supercomputer Snellius):
2 nodes, each with 2 AMD Genoa 9654 processors (192 physical cores per node), InfiniBand ConnectX-7 (200 Gbps within a rack, 100 Gbps outside the rack), RHEL 8.6, OpenMPI 4.1.5 (provided by the HPC center)
https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+hardware

Nodes were used exclusively.

Compilation

mpicxx -Wall -std=c++17 -O3 -g reproducer.cc -o reproducer

Results

Sometimes, I observe very different results for the exact same message size.
The total amount of data transferred between the two nodes can be calculated as num_ranks * num_neighbors / 2 * message_size.
For example, on System A with 711-byte messages: 144 * 13 * 711 bytes = 1,330,992 bytes ≈ 1.27 MiB.
Given the small amount of data, I did not expect to see transfer times of 216 ms.
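As a rough sanity check (my own back-of-the-envelope estimate, counting only the inter-node traffic and assuming the 1 Gbps link is the only bottleneck):

// Back-of-the-envelope wire-time estimate for System A: 144 ranks with
// 13 inter-node neighbors each, 711-byte messages, 1 Gbps Ethernet.
#include <cstdio>

int main() {
  const double bytes = 144.0 * 13 * 711;                    // 1,330,992 bytes
  const double link_bps = 1e9;                              // 1 Gbps
  const double wire_time_ms = bytes * 8 / link_bps * 1e3;
  std::printf("minimum wire time: %.1f ms\n", wire_time_ms);  // ~10.6 ms
  return 0;
}

Even if all inter-node bytes had to be serialized over the link, that is only about 10.6 ms, so the 216 ms iterations are roughly 20x above this lower bound (ignoring protocol overhead).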

see systemA_from-512_to-1024_step-+1.log

max runtime 215 ms       711 byte messages, it 0
max runtime 216 ms       711 byte messages, it 1
max runtime 216 ms       711 byte messages, it 2
max runtime 216 ms       711 byte messages, it 3
max runtime 216 ms       711 byte messages, it 4
max runtime 216 ms       711 byte messages, it 5
max runtime 7 ms         711 byte messages, it 6
max runtime 7 ms         711 byte messages, it 7
max runtime 7 ms         711 byte messages, it 8
max runtime 7 ms         711 byte messages, it 9

See systemB_from-1_to-2097152_step-x2.log.
Total amount of data transferred: 384 * 13 * 65536 bytes = 327,155,712 bytes = 312 MiB.
I did not expect such high variance, nor transfer times above 50 ms, given a line rate of 200 Gbps.

max runtime 1424 ms      65536 byte messages, it 0
max runtime 214 ms       65536 byte messages, it 1
max runtime 84 ms        65536 byte messages, it 2
max runtime 149 ms       65536 byte messages, it 3
max runtime 132 ms       65536 byte messages, it 4
max runtime 108 ms       65536 byte messages, it 5
max runtime 91 ms        65536 byte messages, it 6
max runtime 163 ms       65536 byte messages, it 7
max runtime 167 ms       65536 byte messages, it 8
max runtime 51 ms        65536 byte messages, it 9
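Applying the same rough estimate to System B: 327,155,712 bytes * 8 bit/byte / 200 Gbit/s ≈ 13 ms of pure wire time, even if all 312 MiB crossed the link in one direction. So the fastest iteration (51 ms) is about 4x above that and the slowest (1424 ms) more than 100x (again counting only the inter-node messages and ignoring protocol overhead).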

Sometimes, a specific message size is fast for all 10 iterations in a row, while adjacent sizes are consistently slow.
see systemA_from-512_to-1024_step-+1.log

max runtime 215 ms       775 byte messages, it 0
max runtime 219 ms       775 byte messages, it 1
max runtime 213 ms       775 byte messages, it 2
max runtime 216 ms       775 byte messages, it 3
max runtime 216 ms       775 byte messages, it 4
max runtime 216 ms       775 byte messages, it 5
max runtime 216 ms       775 byte messages, it 6
max runtime 219 ms       775 byte messages, it 7
max runtime 213 ms       775 byte messages, it 8
max runtime 216 ms       775 byte messages, it 9

max runtime 8 ms         776 byte messages, it 0
max runtime 7 ms         776 byte messages, it 1
max runtime 8 ms         776 byte messages, it 2
max runtime 7 ms         776 byte messages, it 3
max runtime 8 ms         776 byte messages, it 4
max runtime 7 ms         776 byte messages, it 5
max runtime 8 ms         776 byte messages, it 6
max runtime 7 ms         776 byte messages, it 7
max runtime 8 ms         776 byte messages, it 8
max runtime 7 ms         776 byte messages, it 9

max runtime 211 ms       777 byte messages, it 0
max runtime 216 ms       777 byte messages, it 1
max runtime 216 ms       777 byte messages, it 2
max runtime 216 ms       777 byte messages, it 3
max runtime 216 ms       777 byte messages, it 4
max runtime 216 ms       777 byte messages, it 5
max runtime 216 ms       777 byte messages, it 6
max runtime 216 ms       777 byte messages, it 7
max runtime 219 ms       777 byte messages, it 8
max runtime 216 ms       777 byte messages, it 9

I used Intel Trace Analyzer to get more insight. Below are the histogram and CDF of the message arrival times on System A.

[attached figures: histogram and CDF of message arrival times]

Do you have an idea of what could be causing this behavior, and how the performance could be improved?
I tried to keep the description short, but I am happy to provide any further information that you may need.

Many thanks in advance for your help!

Best,
Lukas

reproducer.log (GitHub disallows attachments with a .cc extension; rename reproducer.log to reproducer.cc)
systemB_from-1_to-2097152_step-x2.log
systemA_from-512_to-1024_step-+1.log
systemA_from-1_to-8192_step-x2.log

@devreal
Contributor

devreal commented Mar 18, 2024

I don't have a good explanation for what is going on here, but here are a few ideas on how to narrow down the problem:

  1. Can you observe the problem if all ranks run on a single node?
  2. Can you observe the problem if all processes run on distinct nodes?
  3. Do you see similar behavior with one of the standard benchmarks (like OSU p2p)?

@LukasBreitwieser
Author

Many thanks for your reply, Joseph!

  1. Can you observe the problem if all ranks run on a single node?

On one node, the results are fine: systemA-1node-72ranks.log

  2. Can you observe the problem if all processes run on distinct nodes?

I modified the reproducer so that there are only two ranks, which exchange multiple messages with each other, and could not reproduce the issue (see reproducer-2-ranks-multiple-messages.cc.log).
The transferred data volume and the number of messages sent between the two nodes are the same as before; only the messages sent to ranks on the same node are absent in this experiment. A sketch of the modified pattern is shown below. Here are the results: systemA-2ranks-multiple-messages.log
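Sketched, the modified 2-rank pattern looks roughly like this (illustrative, not the exact code in the attached file; I use 13 messages per rank to match the inter-node message count of the original pattern):

#include <mpi.h>
#include <vector>

// Each of the two ranks posts 13 receives and 13 sends to the other rank,
// mirroring the inter-node traffic of the original reproducer without the
// intra-node neighbors.
void ExchangeWithPeer(int peer, int num_messages, int message_size) {
  std::vector<std::vector<char>> send_bufs(num_messages,
                                           std::vector<char>(message_size));
  std::vector<std::vector<char>> recv_bufs(num_messages,
                                           std::vector<char>(message_size));
  std::vector<MPI_Request> requests(2 * num_messages);

  for (int i = 0; i < num_messages; ++i) {
    MPI_Irecv(recv_bufs[i].data(), message_size, MPI_CHAR, peer, /*tag=*/i,
              MPI_COMM_WORLD, &requests[2 * i]);
    MPI_Isend(send_bufs[i].data(), message_size, MPI_CHAR, peer, /*tag=*/i,
              MPI_COMM_WORLD, &requests[2 * i + 1]);
  }
  MPI_Waitall(2 * num_messages, requests.data(), MPI_STATUSES_IGNORE);
}

With ranks 0 and 1 placed on different nodes, peer is simply 1 - rank and the message sizes are swept as before.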

  3. Do you see similar behavior with one of the standard benchmarks (like OSU p2p)?

As far as I can tell, the pt2pt OSU benchmark results look ok. I attach the results for runs with different numbers of ranks and nodes (osu-pt2pt-results.zip).
Note: Only the osu_multi_lat and osu_mbw_mr benchmarks support more than one rank. MPI_Isend and MPI_Irecv are used only in osu_mbw_mr, osu_bibw, and osu_bw.

It seems like the issue is related to the specific communication pattern.

I also ran the original reproducer through HPCToolkit (System A, 2 nodes, 144 ranks). In the timeline below, the blue bars show the MPI_Wait calls.
The waits occur mostly on node 2; I do not yet have an explanation for this asymmetry.

[attached screenshot: hpctoolkit-thread0s]

I would also like to see profiling information for MPI itself.
However, HPCToolkit only shows one additional thread besides the application thread, and it appears to sit in epoll_wait the entire time.

[attached screenshot: hpctoolkit-thread1s]

I assume that there must be an MPI thread that handles the non-blocking transfers, which is missing from the HPCToolkit analysis.
Do you have a recommendation on how best to profile MPI internals and view them on a timeline?

@lrbison
Contributor

lrbison commented Mar 26, 2024

I was benchmarking some related isend behavior recently. I wonder what you see if you run these benchmarks:

https://github.com/lrbison/lrbison_benchmarks/tree/main

The documentation is pretty light, but take a look at the output and adjust make_data.sbatch as needed.

To see what you describe, where results are sometimes fast and other times slow, you probably want to collect results from at least these two runs:

# isend, irecv for gather pattern (incast), showing time for each message, and executing 100 trials
mpirun -- ./bench_isend -i -n 1 -t 100 -w 100 -f filename_ii.dat -m $MSG_SIZE -s -r
# isend, irecv for scatter pattern (outcast), showing time for each message, and executing 100 trials
mpirun -- ./bench_isend -o -n 1 -t 100 -w 100 -f filename_ii.dat -m $MSG_SIZE -s -r
