coll: add a new bcast composition #6781

Merged: 2 commits merged into pmodels:main from the inter_coll branch on May 2, 2024
Conversation

@dycz0fx (Contributor) commented Nov 6, 2023

Pull Request Description

Add composition delta for bcast that can utilize the direct links between the GPUs in the same node.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@dycz0fx marked this pull request as ready for review November 15, 2023 19:16
@abrooks98 added this to In Progress in Intel Work via automation Jan 2, 2024
@dycz0fx force-pushed the inter_coll branch 2 times, most recently from 86267ca to 762b5a3, February 27, 2024 21:27
@dycz0fx (Contributor, Author) commented Feb 29, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 1, 2024

test:mpich/ch4/gpu

coll_ret = MPIC_Recv(buffer, count, datatype, MPIR_Get_intranode_rank(comm, root),
                     MPIR_BCAST_TAG, comm->node_comm, MPI_STATUS_IGNORE);
MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);

Contributor (review comment on the diff hunk above):

I find it difficult to sort out the branches. How about --

int intra_root = MPIR_Get_intranode_rank(comm, root);
if (intra_root != -1 && intra_root != 0) {
    /* root send message to local leader (node_comm rank 0) */
    if (comm->rank == root) {
        [MPIC_Send]
    } else {
        [MPIC_Recv]
    }
}

?
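
(For reference, a fleshed-out version of this suggestion might look like the sketch below. It assumes the MPIC_Send arguments mirror the MPIC_Recv call in the diff hunk above, that coll_ret, mpi_errno, and errflag are the locals already in scope in this composition, and that the receive should be posted only by the local leader, node_comm rank 0.)

int intra_root = MPIR_Get_intranode_rank(comm, root);
if (intra_root != -1 && intra_root != 0) {
    /* root is on this node but is not the local leader (node_comm rank 0),
     * so hand the data from root to the local leader first */
    if (comm->rank == root) {
        coll_ret = MPIC_Send(buffer, count, datatype, 0, MPIR_BCAST_TAG,
                             comm->node_comm, errflag);
        MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
    } else if (comm->node_comm->rank == 0) {
        coll_ret = MPIC_Recv(buffer, count, datatype, intra_root, MPIR_BCAST_TAG,
                             comm->node_comm, MPI_STATUS_IGNORE);
        MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
    }
}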

@dycz0fx (Contributor, Author) replied:

I think this will make the code clearer. I copied this part of the code from composition alpha; would you like me to change composition alpha as well? Actually, this composition delta is basically the same as composition alpha; it just moves the data swap before the intra-node bcast.

@hzhou (Contributor) commented Mar 1, 2024:

Yes. We should clean up some of the technical debt as we add new code. But make separate commits if you do so.

@dycz0fx (Contributor, Author) replied:

Fixed the code in both composition alpha and delta.

}
#endif
}

Contributor (review comment on the diff hunk above):

Group the next two blocks of code under

if (comm->node_roots_comm) {
    /* bcast in node_roots_comm */
    int inter_root = MPIR_Get_internode_rank(comm, root);
    int my_rank = comm->node_roots_comm->rank;
    if (my_rank == inter_root) {
        ...
    } else {
        ...
    }
}

But shouldn't MPIDI_NM_mpi_bcast take care of buffer swap anyway?
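
(Roughly, the grouping being suggested could look like the sketch below. This is only a sketch: the staging-buffer handling is indicated by comments rather than real code, since that part lives elsewhere in the diff, and the MPIDI_NM_mpi_bcast arguments are assumed to match the existing inter-node bcast call in this composition.)

if (comm->node_roots_comm != NULL) {
    /* inter-node step: only the node leaders participate */
    int inter_root = MPIR_Get_internode_rank(comm, root);
    if (comm->node_roots_comm->rank == inter_root) {
        /* root's leader: stage the (possibly GPU-resident) user buffer
         * before broadcasting -- hypothetical step, taken from the
         * allocation code above this hunk */
    } else {
        /* other leaders: make sure the received data ends up where the
         * following intra-node bcast expects it -- likewise hypothetical */
    }
    coll_ret = MPIDI_NM_mpi_bcast(buffer, count, datatype, inter_root,
                                  comm->node_roots_comm, errflag);
    MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
}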

@dycz0fx (Contributor, Author) replied:

Do you mean grouping the buffer allocation and the inter-node bcast?
I can try to use MPIDI_NM_mpi_bcast to take care of the buffer swap; the performance should be similar.

@dycz0fx (Contributor, Author) added:

I think doing the explicit data swap, as in the current code, is better than using MPIDI_NM_mpi_bcast to take care of the buffer swap, since it makes it easier to see the difference between composition alpha and composition delta.

@dycz0fx force-pushed the inter_coll branch 3 times, most recently from 404681e to b383049, March 6, 2024 20:35
@dycz0fx (Contributor, Author) commented Mar 6, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 7, 2024

test:mpich/ch4/gpu

@dycz0fx force-pushed the inter_coll branch 2 times, most recently from b8275e6 to 2045cc4, March 8, 2024 19:04
@dycz0fx (Contributor, Author) commented Mar 8, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 11, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 19, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 28, 2024

test:mpich/ch4/gpu

Add composition delta for bcast that can utilize the direct links between
the GPUs in the same node. This composition delta is basically the same as
composition alpha; it just moves the data swap before the intra-node bcast.

When the root is not the local node leader, the data needs to be sent
from the root to the local leader. Rewrite that part of the code to make
it clearer.
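
(A schematic of the ordering difference between the two compositions described above. Only the outline is shown; the reading of the "data swap" as the copy between the host staging buffer and the user GPU buffer is inferred from the discussion, not copied from the diff.)

/* composition alpha (existing):
 *   1. root -> local leader handoff   (MPIC_Send / MPIC_Recv on node_comm)
 *   2. bcast among node leaders       (inter-node, on node_roots_comm)
 *   3. intra-node bcast               (on node_comm)
 *   4. data swap back into the user (GPU) buffer
 *
 * composition delta (this PR):
 *   1. root -> local leader handoff
 *   2. bcast among node leaders
 *   3. data swap back into the user (GPU) buffer on each leader
 *   4. intra-node bcast on the user (GPU) buffers, so the node-local step
 *      can use the direct GPU-to-GPU links
 */
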
@abrooks98 (Contributor) commented:

test:mpich/ch4/gpu

@abrooks98 requested a review from hzhou May 2, 2024 14:43
@abrooks98 (Contributor) commented:

Ready for next round of reviews @hzhou

@hzhou (Contributor) left a review comment:

LGTM

Will merge after tests

@hzhou (Contributor) commented May 2, 2024

Having an issue with CUDA running out of memory. I don't think it is related to this PR. Will merge and deal with any issues afterward.

@hzhou merged commit 5ae17c4 into pmodels:main May 2, 2024
4 of 6 checks passed
Intel Work automation moved this from In Progress to Done May 2, 2024