coll: add a new bcast composition #6781

Merged: 2 commits merged into pmodels:main from the inter_coll branch on May 2, 2024
Conversation

@dycz0fx (Contributor) commented Nov 6, 2023

Pull Request Description

Add composition delta for bcast that can utilize the direct links between the GPUs in the same node.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@dycz0fx marked this pull request as ready for review November 15, 2023 19:16
@abrooks98 added this to In Progress in Intel Work via automation Jan 2, 2024
@dycz0fx force-pushed the inter_coll branch 2 times, most recently from 86267ca to 762b5a3, February 27, 2024 21:27
@dycz0fx (Contributor, Author) commented Feb 29, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 1, 2024

test:mpich/ch4/gpu

coll_ret = MPIC_Recv(buffer, count, datatype, MPIR_Get_intranode_rank(comm, root),
                     MPIR_BCAST_TAG, comm->node_comm, MPI_STATUS_IGNORE);
MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);

Contributor (review comment on the diff hunk above):

I find it difficult to sort out the branches. How about --

int intra_root = MPIR_Get_intranode_rank(comm, root);
if (intra_root != -1 && intra_root != 0) {
    /* root send message to local leader (node_comm rank 0) */
    if (comm->rank == root) {
        [MPIC_Send]
    } else {
        [MPIC_Recv]
    }
}

?
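
(For reference, a fleshed-out version of this suggestion might look like the sketch below. It assumes the MPIC_Send arguments mirror the MPIC_Recv call in the diff hunk above, that coll_ret, mpi_errno, and errflag are the locals already in scope in this composition, and that the receive should be posted only by the local leader, node_comm rank 0.)

int intra_root = MPIR_Get_intranode_rank(comm, root);
if (intra_root != -1 && intra_root != 0) {
    /* root is on this node but is not the local leader (node_comm rank 0),
     * so hand the data from root to the local leader first */
    if (comm->rank == root) {
        coll_ret = MPIC_Send(buffer, count, datatype, 0, MPIR_BCAST_TAG,
                             comm->node_comm, errflag);
        MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
    } else if (comm->node_comm->rank == 0) {
        coll_ret = MPIC_Recv(buffer, count, datatype, intra_root, MPIR_BCAST_TAG,
                             comm->node_comm, MPI_STATUS_IGNORE);
        MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
    }
}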

@dycz0fx (Contributor, Author) replied:

I think this will make the code clearer. I copied this part of the code from composition alpha; would you like me to change composition alpha as well? Actually, this composition delta is basically the same as composition alpha; it just moves the data swap before the intra-node bcast.

@hzhou (Contributor) commented Mar 1, 2024:

Yes. We should clean up some of the technical debt as we add new code. But make separate commits if you do so.

@dycz0fx (Contributor, Author) replied:

Fixed the code in both composition alpha and delta.

}
#endif
}

Contributor (review comment on the diff hunk above):

Group the next two blocks of code under

if (comm->node_roots_comm) {
    /* bcast in node_roots_comm */
    int inter_root = MPIR_Get_internode_rank(comm, root);
    int my_rank = comm->node_roots_comm->rank;
    if (my_rank == inter_root) {
        ...
    } else {
        ...
    }
}

But shouldn't MPIDI_NM_mpi_bcast take care of buffer swap anyway?
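
(Roughly, the grouping being suggested could look like the sketch below. This is only a sketch: the staging-buffer handling is indicated by comments rather than real code, since that part lives elsewhere in the diff, and the MPIDI_NM_mpi_bcast arguments are assumed to match the existing inter-node bcast call in this composition.)

if (comm->node_roots_comm != NULL) {
    /* inter-node step: only the node leaders participate */
    int inter_root = MPIR_Get_internode_rank(comm, root);
    if (comm->node_roots_comm->rank == inter_root) {
        /* root's leader: stage the (possibly GPU-resident) user buffer
         * before broadcasting -- hypothetical step, taken from the
         * allocation code above this hunk */
    } else {
        /* other leaders: make sure the received data ends up where the
         * following intra-node bcast expects it -- likewise hypothetical */
    }
    coll_ret = MPIDI_NM_mpi_bcast(buffer, count, datatype, inter_root,
                                  comm->node_roots_comm, errflag);
    MPIR_ERR_COLL_CHECKANDCONT(coll_ret, errflag, mpi_errno);
}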

@dycz0fx (Contributor, Author) replied:

Do you mean grouping the buffer allocation and the inter-node bcast?
I can try to use MPIDI_NM_mpi_bcast to take care of the buffer swap; the performance should be similar.

@dycz0fx (Contributor, Author) added:

I think doing the explicit data swap, as in the current code, is better than using MPIDI_NM_mpi_bcast to take care of the buffer swap, since it makes it easier to see the difference between composition alpha and composition delta.

@dycz0fx force-pushed the inter_coll branch 3 times, most recently from 404681e to b383049, March 6, 2024 20:35
@dycz0fx (Contributor, Author) commented Mar 6, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 7, 2024

test:mpich/ch4/gpu

@dycz0fx force-pushed the inter_coll branch 2 times, most recently from b8275e6 to 2045cc4, March 8, 2024 19:04
@dycz0fx (Contributor, Author) commented Mar 8, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 11, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 19, 2024

test:mpich/ch4/gpu

@dycz0fx (Contributor, Author) commented Mar 28, 2024

test:mpich/ch4/gpu

Add composition delta for bcast that can utilize the direct links between
the GPUs in the same node. This composition delta is basically the same as
composition alpha; it just moves the data swap before the intra-node bcast.

When the root is not the local node leader, the data needs to be sent
from the root to the local leader. Rewrite that part of the code to make
it clearer.
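
(A schematic of the ordering difference between the two compositions described above. Only the outline is shown; the reading of the "data swap" as the copy between the host staging buffer and the user GPU buffer is inferred from the discussion, not copied from the diff.)

/* composition alpha (existing):
 *   1. root -> local leader handoff   (MPIC_Send / MPIC_Recv on node_comm)
 *   2. bcast among node leaders       (inter-node, on node_roots_comm)
 *   3. intra-node bcast               (on node_comm)
 *   4. data swap back into the user (GPU) buffer
 *
 * composition delta (this PR):
 *   1. root -> local leader handoff
 *   2. bcast among node leaders
 *   3. data swap back into the user (GPU) buffer on each leader
 *   4. intra-node bcast on the user (GPU) buffers, so the node-local step
 *      can use the direct GPU-to-GPU links
 */
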
@abrooks98 (Contributor) commented:

test:mpich/ch4/gpu

@abrooks98 requested a review from hzhou May 2, 2024 14:43
@abrooks98 (Contributor) commented:

Ready for next round of reviews @hzhou

@hzhou (Contributor) left a review comment:

LGTM

Will merge after tests

@hzhou (Contributor) commented May 2, 2024

Having an issue with CUDA running out of memory. I don't think it is related to this PR. Will merge and deal with any issues afterward.

@hzhou merged commit 5ae17c4 into pmodels:main May 2, 2024
4 of 6 checks passed
Intel Work automation moved this from In Progress to Done May 2, 2024