New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll: add a new bcast composition #6781
Conversation
86267ca
to
762b5a3
Compare
test:mpich/ch4/gpu |
test:mpich/ch4/gpu |
src/mpid/ch4/src/ch4_coll_impl.h
Outdated
coll_ret = | ||
MPIC_Recv(buffer, count, datatype, MPIR_Get_intranode_rank(comm, root), MPIR_BCAST_TAG, | ||
comm->node_comm, MPI_STATUS_IGNORE); | ||
MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it difficult to sort out the branches. How about --
int intra_root = MPIR_Get_intranode_rank(comm, root);
if (intra_root != -1 && intra_root != 0) {
/* root send message to local leader (node_comm rank 0) */
if (comm->rank == root) {
[MPIC_Send]
} else {
[MPIC_Recv]
}
}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will make the code more clear. I copied this part of code from composition alpha, would you like me to change composition alpha as well? Actually, this composition delta is basically the same as composition alpha, it just moves the data swap before the intra-node bcast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We should clean some of the technical debt as we adding new ones. But make separate commits if you do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the code in both composition alpha and delta.
} | ||
#endif | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Group the next two blocks of code under
if (comm->node_roots_comm) {
/* bcast in node_roots_comm */
int inter_root = MPIR_Get_internode_rank(comm, root);
int my_rank = comm->node_roots_comm->rank;
if (my_rank == inter_root) {
...
} else {
...
}
}
But shouldn't MPIDI_NM_mpi_bcast
take care of buffer swap anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean group the buffer allocation and inter-node bcast?
I can try to use MPIDI_NM_mpi_bcast to take care of the buffer swap, the performance should be similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think doing the explicit data swap as the current code is better than using MPIDI_NM_mpi_bcast to take care of the buffer swap as it is easier to see the difference between the composition alpha and delta.
404681e
to
b383049
Compare
test:mpich/ch4/gpu |
1 similar comment
test:mpich/ch4/gpu |
b8275e6
to
2045cc4
Compare
test:mpich/ch4/gpu |
test:mpich/ch4/gpu |
test:mpich/ch4/gpu |
test:mpich/ch4/gpu |
Add composition delta for bcast that can utilize the direct links between the GPUs in the same node. This composition delta is basically the same as composition alpha, it just moves the data swap before the intra-node bcast.
When the root is not the local node leader, the data needs to be sent from the root to the local leader. Rewrite that part of code to make it more clear.
test:mpich/ch4/gpu |
Ready for next round of reviews @hzhou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Will merge after tests
Having issue with CUDA running out of memory. I don't think it is related to this PR. Will merge and deal with any issue afterward |
Pull Request Description
Add composition delta for bcast that can utilize the direct links between the GPUs in the same node.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.