Add the acoll component #12484
Conversation
How does this compare with #10470?
We can discuss it at the meeting. Part of the goal of filing the PR was to give people the chance to have a look at it ahead of the meeting if they want to and can.
ompi/mca/coll/acoll/Makefile.am
Do you have plans to add alltoall(v) to acoll?
Yes, we are planning to add alltoall to acoll next.
 * chosen, further decides if [ring|lin] allgather is to be used.
 *
 */
static inline void coll_allgather_decision_fixed(int size, size_t total_dsize, int sg_size,
Can you shed some light on how to choose which method for other Intel/AMD architectures? You might also want some utility to let the user adjust the decisions for other systems.
Our testing has mostly focused on Zen architectures; we will soon test on other architectures. We do not yet have a utility/config option to override the decisions, but we plan to add one.
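For illustration, a fixed decision rule of this kind might look like the sketch below. The function name echoes `coll_allgather_decision_fixed` from the diff, but the thresholds (8 KB and 1 MB) and return codes are purely hypothetical assumptions, not acoll's actual cutoffs.

```c
#include <stddef.h>

/* Hypothetical sketch of a fixed allgather decision rule.
 * The thresholds below are illustrative assumptions, not acoll's real
 * values. Returns 0 = recursive doubling, 1 = ring, 2 = linear. */
int allgather_decision_sketch(int size, size_t total_dsize, int sg_size)
{
    if (total_dsize <= 8192 && size <= sg_size) {
        return 0; /* small messages within one subgroup: recursive doubling */
    }
    if (total_dsize > 1048576) {
        return 1; /* large messages: ring keeps per-step buffers bounded */
    }
    return 2; /* medium messages or larger groups: linear allgather */
}
```

An MCA parameter could later override the returned choice, which would address the tuning utility suggested above.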
@@ -0,0 +1,15 @@
Copyright (c) 2023-2024 Advanced Micro Devices, Inc. All rights
Have you thought about what needs to be done to extend this to multiple nodes?
Some of the APIs (like bcast, barrier, and allgather) support the multi-node case. However, they have not been extensively tested for multi-node; we will test them and extend the other APIs to multi-node as well.
/*
 * rd_allgather_sub
 *
 * Function: Uses recursive doubling based allgather for the group.
Have you compared the performance of other methods besides recursive doubling?
Yes, acoll/allgather chooses among recursive doubling, ring and linear based on process count and message sizes.
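As background on the recursive-doubling pattern referenced above: in round s each rank exchanges everything it has gathered so far with peer rank XOR 2^s, so the data each rank holds doubles every round and a power-of-two group finishes in log2(p) rounds. The sketch below shows that generic schedule; it is not acoll's `rd_allgather_sub` implementation.

```c
/* Generic recursive-doubling schedule (illustrative, not acoll's code).
 * rd_peer: partner of `rank` in exchange round `round`.
 * rd_rounds: number of rounds for a group of size nprocs. */
int rd_peer(int rank, int round)
{
    return rank ^ (1 << round); /* flip one address bit per round */
}

int rd_rounds(int nprocs)
{
    int rounds = 0;
    while ((1 << rounds) < nprocs) {
        rounds++; /* smallest r with 2^r >= nprocs */
    }
    return rounds;
}
```

Non-power-of-two group sizes need an extra pre/post step in practice, which is one reason ring and linear remain useful fallbacks.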
}

/* This barrier is needed to prevent random hangs */
err = ompi_coll_base_barrier_intra_tree(comm, module);
Why is the barrier needed here? It will also add cost to small-message allgather.
It is removed now.
if (sbuf != MPI_IN_PLACE)
    memcpy(tmp_rbuf, sbuf, my_count_size * dsize);
} else {
    ompi_3buff_op_reduce(op, (char *) data->xpmem_saddr[0] + chunk * rank * dsize,
Is the 3-operand reduce function used to maintain the order?
I think this was a bit faster than copying the chunks first and then reducing later in the following "for" loop.
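For context on the trade-off described above: a two-buffer reduce (dst op= src) needs a separate copy to seed dst first, while a three-buffer reduce computes dst = src1 op src2 in a single pass. The sketch below hard-codes integer summation; `ompi_3buff_op_reduce` itself dispatches through an `ompi_op_t`, so this is only an illustration of the access pattern, not Open MPI's implementation.

```c
/* Illustrative three-buffer reduction (sum). Both inputs are read once and
 * the destination is written once, with no seeding memcpy into dst
 * beforehand, which is why it can beat copy-then-reduce. */
void reduce_3buff_sum(const int *src1, const int *src2, int *dst, int count)
{
    for (int i = 0; i < count; i++) {
        dst[i] = src1[i] + src2[i];
    }
}
```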
Please rebase to current main to get rid of the
I tested the PR in AWS CI. I'm seeing assertion errors with
You can try
@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?
@hppritcha we did have issues after the rebase. We have fixed them and will update the PR soon. Thanks.
The updated PR (yet to be pushed) will fix this issue. Thanks.
The issue is fixed in the updated PR.
We have updated the PR; it passes the mpi4py tests.
Running AWS CI
@amd-nithyavs I noticed that the PR is currently split into 3 commits. Please squash them before merging.
Passed AWS CI. Note that we don't test with xpmem.
Hello! The Git Commit Checker CI bot found a few problems with this PR: 2f7c5e2: Merge latest of local ompiv5
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
@wenduwan We have rebased to the latest and squashed the commits. |
acoll is a collective component optimized for AMD "Zen"-based processors. It supports Bcast, Allreduce, Reduce, Barrier, Gather and Allgather APIs. Signed-off-by: Nithya V S <Nithya.VS@amd.com>
This PR introduces "acoll", a high-performance collective component that is optimized for communication within a single node of AMD EPYC CPUs. It mainly uses subcommunicators based on L3 cache or NUMA domains to reduce cross-cache or cross-NUMA accesses. The supported collectives include Bcast, Allreduce, Gather, Reduce, Barrier, and Allgather.
OSU micro-benchmarks were run on 2-socket AMD EPYC 9654 96-Core Processor with 4 NUMA domains per socket, with a total of 192 cores per node, on top of commit bb7ecde.
Average percentage latency reduction over "tuned" across 32, 64, 96, 128, 192 ranks over message sizes of 8 bytes to 8 MB (varied in powers of 2):
Sample graphs (latency comparisons, attached to the PR): Allreduce, Bcast, Gather