
mca/coll: Add any radix k for alltoall bruck algorithm #12453

Open
wants to merge 1 commit into main

Conversation

@jiaxiyan (Contributor) commented Apr 5, 2024

This method extends ompi_coll_base_alltoall_intra_bruck to handle any radix k.
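For context, a minimal illustration (assumed names, not the PR's actual code) of how a radix-k Bruck alltoall selects blocks: in the phase with weight w = k^i, for each digit d (1 <= d < k), a rank exchanges every block whose index has base-k digit d at position i with the peer d*w positions away.

#include <stdio.h>

/* Minimal sketch of radix-k block selection in Bruck's alltoall.
 * For each phase weight w = k^i and digit d, the blocks whose index has
 * base-k digit d at position i travel together; the peer is d*w ranks
 * away (send/recv direction depends on the implementation's convention). */
static void print_phase_blocks(int nprocs, int k)
{
    for (int w = 1; w < nprocs; w *= k) {                 /* phase weight k^i */
        for (int d = 1; d < k && d * w < nprocs; d++) {   /* digit within the phase */
            printf("peer offset %d, blocks:", d * w);
            for (int block = 0; block < nprocs; block++) {
                if ((block / w) % k == d) {               /* base-k digit at position i */
                    printf(" %d", block);
                }
            }
            printf("\n");
        }
    }
}

With k = 2 this reduces to the familiar binary Bruck schedule.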

@jiaxiyan jiaxiyan marked this pull request as draft April 5, 2024 23:05
@jiaxiyan jiaxiyan marked this pull request as ready for review April 8, 2024 18:55
@jiaxiyan jiaxiyan requested a review from lrbison April 8, 2024 18:55
@lrbison (Contributor) left a comment

Looks good @jiaxiyan. A few small suggestions.

Have you checked that the default faninout will be 2 for the dynamic rules, so as not to change performance?

Thanks

Review threads on ompi/mca/coll/base/coll_base_alltoall.c (resolved)
@bosilca (Member) commented Apr 29, 2024

I was planning to review it after they figure out the issue with replacing the blocking ompi_coll_base_sendrecv with non-blocking communications.

@lrbison (Contributor) commented May 7, 2024

@bosilca I don't follow this comment:

when the upper and lower levels of the communication stack both try to generate multi-rail traffic, things can't go well.

Are you suggesting that collectives shouldn't be making PML isend/irecv directly?

@bosilca (Member) commented May 8, 2024

No, I was wondering how a multi-rail-enabled collective component (which will indeed post multiple isend requests) would interact with a multi-rail-enabled low-level communication library (which will split all large messages across multiple lanes)?

@wenduwan (Contributor) commented
@jiaxiyan Please fix the build.

@jiaxiyan (Contributor, Author) commented
We decided to keep the blocking sendrecv because its performance is slightly better than nonblocking (see test data below).
We need to wait for each isend to complete before we can copy the newly received data from recvbuf back to tmpbuf, so nonblocking does not show much improvement.
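For illustration, a rough sketch of the per-phase dependency described above, written against the public MPI API with assumed buffer names (not the PR's code): even with isend/irecv, the copy from recvbuf back into tmpbuf cannot start until the exchange completes, so each phase still serializes on a wait.

#include <string.h>
#include <mpi.h>

/* Hypothetical per-phase exchange: the memcpy back into tmpbuf must wait
 * for both requests, so nonblocking sends gain little over a blocking
 * sendrecv in this pattern. */
static int phase_exchange_nonblocking(char *tmpbuf_blocks, const char *sendbuf,
                                      char *recvbuf, int phase_bytes,
                                      int sendto, int recvfrom, int tag, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, phase_bytes, MPI_BYTE, recvfrom, tag, comm, &reqs[0]);
    MPI_Isend(sendbuf, phase_bytes, MPI_BYTE, sendto, tag, comm, &reqs[1]);
    int err = MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE); /* must complete first */
    if (MPI_SUCCESS != err) return err;
    memcpy(tmpbuf_blocks, recvbuf, phase_bytes);          /* copy back for next phase */
    return MPI_SUCCESS;
}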

OSU MPI All-to-All Latency in microseconds, message sizes in bytes (on 16 hpc7g instances, 1024 ranks)

ompi_coll_base_sendrecv

Message size   k=2   k=4   k=8   k=16   k=32
1 146.11 201.24 290.62 426.96 847.91
2 141.23 201.72 295.76 431.24 844.11
4 146.65 207.2 290.8 438.15 847.54
8 154.93 211.39 299.4 439.36 859.98
16 280.06 222.51 319.33 445.43 860.43
32 348.79 474.72 355.11 500.69 872.44
64 544.38 603.89 779.21 576.34 916.67
128 1059.73 932.32 1054.84 1323.35 968.53
256 21981.76 1822.93 1765.59 1875.43 2212.38
512 24050.4 3830.67 3719 3146.81 3036.46
1024 33873.93 8328.25 7612.07 6758.61 5348.24
2048 59072.64 18545.69 15535.72 14898.76 10101.14
4096 80879.25 37798.9 33115.21 27867.95 19722.63
8192 137532.69 80270.04 63793.19 61073.62 38873.77

Nonblocking isend/irecv

Message size   k=2   k=4   k=8   k=16   k=32
1 138.96 204.74 291.22 428.89 847.15
2 141.24 205.62 290.97 430.1 848.62
4 145 204.63 295.42 438.76 851.13
8 156.82 211.84 298.93 444.47 865.3
16 279.29 223.89 324.6 449.07 862.43
32 350.09 479.04 357.18 502.31 880.38
64 540.79 601.15 799.61 583.52 932.97
128 1032.12 958.54 1087.27 1326.93 975.33
256 21958.61 1850.11 1753.25 1856.64 2210.36
512 23963.49 3837.31 3731 3241.24 3051.9
1024 33949.96 8348.78 7598.88 6802.96 5308.44
2048 59115.35 18520.23 15578.96 14939.42 10035.18
4096 80790.02 37814.46 33030 27908.15 19750.56
8192 137384.46 80003.8 63834.89 61007.52 38989.87

@juntangc (Contributor) commented
Bruck's method is designed to work in an upper_comm containing only the local/node leaders, where each leader can communicate directly with any other rank (in the upper comm) and every pair of ranks is equally distant (details in this paper: https://ieeexplore.ieee.org/document/642949).

Bruck's method optimizes latency at the expense of bandwidth, so you are unlikely to see a performance improvement for mid/large messages by using a larger k and limiting ranks_per_node to 1. It is better used as the internode stage of a node-aware all-to-all: data on each node is first gathered and combined by the local/node leader, the leaders then perform an internode all-to-all using Bruck's method, and finally each leader scatters the data to the other ranks on its node.
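To make the suggested procedure concrete, a rough outline under assumed communicator handling (not existing Open MPI code):

#include <mpi.h>

/* Sketch of the node-aware scheme: gather onto node leaders, run the
 * internode alltoall among leaders only (where a k-ary Bruck fits),
 * then scatter results back within each node. */
static void node_aware_alltoall_outline(MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    int node_rank;

    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    /* One leader per node; the other ranks get MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* 1. Gather and combine each node's data on its leader (node_comm).          */
    /* 2. Leaders only: internode all-to-all on leader_comm (Bruck applies here).  */
    /* 3. Each leader scatters the received data to the ranks on its node.         */

    if (MPI_COMM_NULL != leader_comm) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}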

This method extends ompi_coll_base_alltoall_intra_bruck to handle
any radix k.

Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
@jiaxiyan (Contributor, Author) commented
@bosilca @lrbison Can you give this another review? Thanks!

* [03] [13] [23] [33] [43] [53]
* [04] [14] [24] [34] [44] [54]
* [05] [15] [25] [35] [45] [55]
* After local rotation
Review comment from a Member:

Why do you really need the local rotation? I understand it makes the code easier to maintain, but it has a significant and ultimately unnecessary cost, because in the end you are building the datatype by hand without taking advantage of its contiguity in memory.

My second remark is related to the cost of creating and committing the datatype. I'm almost certain this cost is significant, especially for the mid-range messages where the k-ary Bruck is supposed to behave best. The result is that you pay a high cost to prepare the datatype, and end up with a non-contiguous datatype, which leads to lower-performance communications (because non-contiguous data usually leads to copy-in/out protocols). If instead of building the datatype you copy the data into a contiguous buffer, you avoid the cost of the datatype construction and communicate from contiguous to contiguous memory, with a better outcome. The only potential drawback is the extra local copy before the send (similar to a pack).
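For illustration, a minimal sketch of the packing alternative described above, written against the public MPI API rather than the internal ompi_datatype calls used in the PR (buffer and index names are assumptions):

#include <string.h>
#include <mpi.h>

/* Pack the phase's non-contiguous blocks from tmpbuf into a contiguous
 * staging buffer and exchange it with a single contiguous sendrecv,
 * avoiding datatype creation/commit entirely. */
static int phase_exchange_packed(const char *tmpbuf, char *pack_buf, char *recv_buf,
                                 const int *block_index, int nblocks, size_t block_bytes,
                                 int sendto, int recvfrom, int tag, MPI_Comm comm)
{
    size_t offset = 0;
    for (int b = 0; b < nblocks; b++) {
        memcpy(pack_buf + offset, tmpbuf + block_index[b] * block_bytes, block_bytes);
        offset += block_bytes;
    }
    /* Contiguous-to-contiguous transfer: no datatype cost, fast-path friendly. */
    return MPI_Sendrecv(pack_buf, (int)offset, MPI_BYTE, sendto, tag,
                        recv_buf, (int)offset, MPI_BYTE, recvfrom, tag,
                        comm, MPI_STATUS_IGNORE);
}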

err = ompi_datatype_destroy(&new_ddt);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
/* Commit the new datatype */
err = ompi_datatype_commit(&new_ddt);
Review comment from a Member:

I think the real reason you are not seeing benefits from using nonblocking communications is the cost of the datatype creation. During that time the posted communications cannot progress, which means bandwidth is wasted until all the datatypes are built and you reach the waitall part.
