Non-blocking point-to-point performance issue (MPI_Isend, MPI_Irecv) #12414
Comments
I don't have a good explanation for what is going on here, but I have a few ideas on how to narrow down the problem:
Many thanks for your reply Joseph!
On one node, the results are fine: systemA-1node-72ranks.log
I modified the reproducer such that there are only 2 ranks that transfer multiple messages with each other and could not reproduce the issue (see reproducer-2-ranks-multiple-messages.cc.log).
As far as I can tell, the pt2pt OSU benchmark results look OK. I attach the results for runs with different numbers of ranks and nodes (osu-pt2pt-results.zip). It seems the issue is related to the specific communication pattern. I also ran the original reproducer through HPCToolkit (System A, 2 nodes, 144 ranks). The blue bars show the MPI_Waits. I would also like to see profiling information for MPI itself: I assume there must be an MPI thread that handles the non-blocking transfers, which is missing from the HPCToolkit analysis.
I was benchmarking some related isend data recently. I wonder: if you run these, what do you see? https://github.com/lrbison/lrbison_benchmarks/tree/main The documentation is pretty light, but take a look at the output and adjust make_data.sbatch as needed. To see what you describe, where results are sometimes fast and other times slow, you probably want to get results from at least these two runs:
Hi,
I am working on a distributed 3D agent-based simulation tool and noticed that sometimes non-blocking point-to-point operations take a very long time to complete.
I created a simplified reproducer that demonstrates the behavior (reproducer.cc). Each rank exchanges messages with its neighbors using MPI_Isend and MPI_Irecv calls. In the reproducer, the number of neighbors and the message size are the same for each rank. I measure the maximum time needed to send/receive all messages across all ranks. I performed the benchmarks on 2-node systems with Ethernet and InfiniBand links. Half the neighbors (13) are on the same node and half (13) on the opposite node.

Benchmark systems
System A:
2 nodes, each with 4 Intel E7-8890 v3 processors (72 physical cores per node), 1 Gbps Ethernet link, Alma Linux 9, OpenMPI 5.0.2 installed with spack install openmpi@5.0.2 schedulers=slurm +legacylaunchers

System B (Dutch National Supercomputer Snellius):
2 nodes, each with 2 AMD Genoa 9654 processors (192 physical cores per node), InfiniBand ConnectX-7 (200 Gbps within a rack, 100 Gbps outside the rack), RHEL 8.6, OpenMPI 4.1.5 (provided by the HPC center)
https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+hardware
Nodes were used exclusively.
Compilation
Results
Sometimes, I observe very different results for the exact same message size.
The total amount of data transferred between the two nodes can be calculated as num_ranks * num_neighbors / 2 * message_size, e.g. on System A with 711-byte messages: 144 * 13 * 711 bytes = 1,330,992 bytes ≈ 1.27 MiB.
Given the small amount of data, I did not expect to see transfer times of 216 ms.
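The arithmetic above, plus a naive wire-time lower bound at the 1 Gbps Ethernet line rate, can be checked with a short script (a sketch using the numbers quoted in this issue; it assumes ranks are split evenly across the 2 nodes and ignores latency and protocol overhead):

```python
# Inter-node traffic for the reproducer on System A (numbers from the issue;
# assumes ranks are split evenly across the 2 nodes).
def inter_node_bytes(num_ranks, num_neighbors, message_size):
    # Half of each rank's neighbors sit on the other node.
    return num_ranks * (num_neighbors // 2) * message_size

total = inter_node_bytes(144, 26, 711)   # 144 ranks, 26 neighbors, 711-byte msgs
print(total)                             # 1330992 bytes, about 1.27 MiB

# Each rank posts one MPI_Isend/MPI_Irecv pair per neighbor, so the link
# carries 144 * 13 = 1872 small messages per exchange step.
print(144 * 13)                          # 1872

# Naive wire-time lower bound on a 1 Gbps link (latency and protocol
# overhead excluded)
print(round(total * 8 / 1e9 * 1e3, 1))   # ~10.6 ms, vs. the observed 216 ms
```

Even this loose bound puts the observed 216 ms roughly 20x above what the link alone would need.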
See systemA_from-512_to-1024_step-+1.log and systemB_from-1_to-2097152_step-x2.log.
Total amount of data transferred: 384 * 13 * 65536 bytes = 327,155,712 bytes = 312 MiB.
I did not expect such high variance, nor transfer times above 50 ms, given a line rate of 200 Gbps.
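The same check for the System B run (a sketch; 200 Gbps is the intra-rack line rate quoted above, and the bound again ignores latency and protocol overhead):

```python
# System B: 384 ranks, 13 remote neighbors each, 64 KiB messages
total = 384 * 13 * 65536
print(total, total // 2**20)        # 327155712 bytes, 312 MiB

# Wire time alone at the 200 Gbps intra-rack line rate
wire_ms = total * 8 / 200e9 * 1e3
print(round(wire_ms, 1))            # ~13.1 ms, so >50 ms leaves a large gap
```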
Sometimes, a specific message size is fast ten runs in a row.
See systemA_from-512_to-1024_step-+1.log.
I used Intel Trace Analyzer to get more insight. Below are the histogram and CDF of the message arrival times on System A.
Do you have an idea of what could be causing this behavior, and how the performance could be improved?
I tried to keep the description short, but I am happy to provide any further information that you may need.
Many thanks in advance for your help!
Best,
Lukas
reproducer.log (GitHub disallows attachments with a .cc ending -> rename reproducer.log to reproducer.cc)
systemB_from-1_to-2097152_step-x2.log
systemA_from-512_to-1024_step-+1.log
systemA_from-1_to-8192_step-x2.log