
coll/han: Add alltoall algorithm #12387

Open · wants to merge 3 commits into main
Conversation

@lrbison (Contributor) commented Mar 1, 2024

Add two Alltoall algorithms to coll/han. Both algorithms use the same
communication pattern: each rank on one host is assigned a single
partner on a remote host and vice versa. The rank then collects all
the data its partner will need to receive from its host, sends it in
one large send, likewise receives its own data in one large recv, and
then cycles to the next host. (A sketch of this pairing follows the list below.)

The two algorithms are:

  • mca_coll_han_alltoall_using_allgather: the gather of local data is done once, so each rank has a copy of all local data. Only recommended for small message sizes.
  • mca_coll_han_alltoall_using_smsc: ranks use the smsc module to direct-map local memory before copying into a packed send buffer. Currently only the XPMEM-based smsc module supports this operation.
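
For illustration, a minimal sketch of the per-round pairing described above, assuming a block layout of ranks across hosts (global rank = node * ranks_per_node + on-node rank); the actual round ordering and partner mapping in the PR may differ:

```c
/* Sketch only: per-round partner selection for the pattern described above.
 * Assumes global rank = node * ranks_per_node + low_rank; the real code may
 * order rounds or map partners differently. */
static inline int partner_for_round(int my_node, int low_rank,
                                    int ranks_per_node, int num_nodes,
                                    int round)
{
    /* Visit one remote node per round, every remote node exactly once. */
    int partner_node = (my_node + 1 + round) % num_nodes;
    /* Pair with the rank holding the same on-node index on that node. */
    return partner_node * ranks_per_node + low_rank;
}
```

In each round a rank exchanges one large message with this partner, containing every local rank's block destined for that partner.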

@lrbison (Contributor, Author) commented Mar 8, 2024

Some motivating data collected on 2K ranks (32 hpc7g):

Message size    Han-smsc        Tuned
           1       302.2        372.5
           2       308.9        441.5
           4       308.4        436.5
           8       331.9        666.0
          16       351.9        814.5
          32       333.3       1703.2
          64       435.5      15613.4
         128      1248.6      27603.3
         256      1866.1      31895.6
         512      3421.5      37748.5
        1024      6306.1      30685.0
        2048     12603.2      38110.2
        4096     24445.4      37636.9
        8192     50277.3      47228.5
       16384     97058.5      93082.9
       32768    192255.5     185665.3
       65536    392959.9     370929.0
      131072    795386.1     741670.5
      262144   1592377.8    1531173.9

@devreal (Contributor) left a comment


LGTM, just two minor comments

Comment on lines 515 to 519
    /* OPAL_THREAD_LOCK(&XXX); */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
    }
    /* OPAL_THREAD_UNLOCK(&XXX); */
@devreal (Contributor):
I think there should be a lock with a double check. Something like this:

Suggested change, replacing:

    /* OPAL_THREAD_LOCK(&XXX); */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
    }
    /* OPAL_THREAD_UNLOCK(&XXX); */

with:

    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        OPAL_THREAD_LOCK(&XXX);
        if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
            proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
        }
        OPAL_THREAD_UNLOCK(&XXX);
    }

@lrbison (Contributor, Author):

I stole this code from bml, which just uses a bml lock. I suppose I could do the same with an smsc lock; I'll just need to create one.

@lrbison (Contributor, Author):

@devreal I implemented this with a new lock scoped to the han module.

Although han is currently the only module to use the smsc endpoint, that might not be true in the future. I feel the lock should really live in ompi_proc_t, where it could apply to all get_endpoint calls, but that pattern isn't followed elsewhere and I'm hesitant to introduce it, so here we are with a new han lock.

Thoughts?
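
For reference, a rough sketch of what a module-scoped lock around the double-checked endpoint lookup could look like; the lock and helper names below are hypothetical and are not the names used in this PR:

```c
/* Hypothetical module-scoped mutex; illustrative only. */
static opal_mutex_t han_smsc_endpoint_lock = OPAL_MUTEX_STATIC_INIT;

static void han_ensure_smsc_endpoint(ompi_proc_t *proc)
{
    /* Double-checked locking: only take the lock when the endpoint has
     * not been created yet, then re-check under the lock. */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        OPAL_THREAD_LOCK(&han_smsc_endpoint_lock);
        if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
            proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] =
                mca_smsc->get_endpoint(&proc->super);
        }
        OPAL_THREAD_UNLOCK(&han_smsc_endpoint_lock);
    }
}
```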

int aint_size;
int bytes_to_gather;
int reg_data_size = 0;
PMPI_Type_size( MPI_AINT, &aint_size);
Contributor:
Why not just sizeof(MPI_Aint) here?
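
To illustrate the reviewer's point: MPI_Aint is the C type that the MPI_AINT datatype describes, so the runtime query can be replaced by a compile-time constant.

```c
int aint_size;
PMPI_Type_size(MPI_AINT, &aint_size);   /* runtime query, as in the PR */
size_t aint_size_c = sizeof(MPI_Aint);  /* equivalent compile-time value */
```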

ompi/mca/coll/han/coll_han_alltoall.c (outdated review comment, resolved)
goto error;
}

for (int jloop=0; jloop<nrounds+fanout; jloop++) {
Contributor:

How much performance benefit do we get by overlapping the intranode and internode communication?

@lrbison (Contributor, Author):

Not as much as I'd hoped. My goal wasn't so much to overlap internode and intranode communication as to pipeline the internode communication. You commented on the "using allgather" variant, which doesn't perform well at large message sizes. My observation was that for large message sizes in the smsc variant, having two or perhaps three rounds in flight at once saved a little time. However, there is an additional penalty to pay: you'll notice that for fanout=1 in the smsc variant I don't allocate a send_bounce buffer at all and just re-use the application buffer. This sidesteps the need for the transport to perform ibv_reg_mr on a temporary buffer that might change on every call. By re-using the application buffer, the registration cache happily re-uses previous registrations every time, and this typically outweighs the fanout improvements.

If we had the ability to re-use a buffer pool of registered memory, the fanout might be more helpful.
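
To make the fanout discussion concrete, here is an illustrative-only skeleton of the pipelined round loop; start_round and wait_round are hypothetical stand-ins for packing plus the nonblocking exchange, not functions from this PR:

```c
/* Keep up to `fanout` rounds in flight: start round jloop, and complete
 * round (jloop - fanout), which has had the longest time to progress. */
for (int jloop = 0; jloop < nrounds + fanout; jloop++) {
    if (jloop < nrounds) {
        int slot = jloop % fanout;          /* bounce-buffer slot for this round */
        start_round(jloop, slot);           /* hypothetical helper */
    }
    if (jloop >= fanout) {
        int done = jloop - fanout;
        wait_round(done, done % fanout);    /* hypothetical helper */
    }
}
```

With fanout=1 this degenerates to the strictly serial case, which is where the bounce buffer can be skipped and the application buffer re-used, keeping registration-cache hits.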

ompi/mca/coll/han/coll_han_alltoall.c (review comment, resolved)
@wenduwan (Contributor):

@lrbison On a high level, I am a bit bothered by the fact that we are implementing point-to-point communication directly in HAN. It breaches the encapsulation in HAN, where other collective algorithms reuse other coll modules. In other words, HAN implicitly composes hierarchical collectives out of basic collectives.

Related to encapsulation, I have been thinking about the benefits of this new algorithm. I think the biggest innovation is separating inter-node from intra-node communication, which significantly reduces network load. The intra-node offload is only a secondary concern, and we might get away with the PML instead of smsc. IMO the new XHC module is a better home for the smsc work, and later we can adopt XHC in HAN.

@lrbison (Contributor, Author) commented Mar 20, 2024

@wenduwan Yes, I understand your concern about the encapsulation. The problem is that alltoall is not easily or efficiently composed of smaller operations. By the nature of ALL to all, it is difficult to subdivide the problem hierarchically.

I did think of three ways to subdivide:

  • Each local node does an allgather, then sends some portion of the data off-node. This is implemented in this PR as alltoall_using_allgather, but it's a terrible idea for anything but the smallest message sizes due to the amount of wasted data movement (and extra memory buffers).
  • Similar to the above, but using gather. We no longer waste data, but now we have to do many, many gathers, one set per remote node. Depending on the configuration, you probably also need a scatter on the remote side, and a transpose operation somewhere. This is also slow.

In both of those cases the thing limiting performance is actually the local communication. Finally, there is one more case:

  • Alltoall can be composed of smaller, local alltoallv operations. This has the potential to be the right way forward; however, there are two problems: (1) an efficient, smsc-based alltoallv hasn't been implemented yet (I plan to do so using the pattern I've written above), and (2) the setup for any smsc-based collective requires sharing the base addresses of all ranks' buffers. In my implementation here we pay that cost up front with one on-node allgather call and re-use the buffer addresses for each "round" of network communication. If instead we called alltoallv independently in each round, it would have to pay that cost every time. Conversely, if we shove all communication into one round, we require very large temporary memory buffers and it starts to look more like the allgather version.

@lrbison (Contributor, Author) commented Mar 21, 2024

Looks like CI has a real failure in test_cco_buf_inter.TestCCOBufInter.testAlltoall. This is particularly surprising to me since I doubt the CI setup has XPMEM installed, so I would expect it to simply fall back to the previous alltoall algorithm.

I will investigate, and also re-run the test with xpmem on my own cluster.

@lrbison (Contributor, Author) commented Mar 21, 2024

The failure is a symptom of the OB1 PML's recent problems with INTER communicators and is unrelated to this change; see Issue 12407. Ignoring that unrelated problem for a moment, I think this change is ready for review again.

@wenduwan (Contributor):

About smsc: I remember that it does not always work for accelerators, so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be using accelerator memory while others are not.

I was debating with myself whether we should leave smsc out of this PR. Would that cause a dramatic performance penalty? In other words, is smsc essential to achieving a reasonable/significant improvement over the current alltoall?

@lrbison (Contributor, Author) commented Mar 25, 2024

> would leaving smsc out of this PR cause a dramatic performance penalty?

Yes. The primary advantage of this PR is that we avoid an extra copy and lots of extra synchronization by exposing all the send buffers to all other on-node ranks via mapped memory from the SMSC module (specifically, an SMSC module that implements mapped memory, which currently means only XPMEM).

> so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be using accelerator memory while others are not.

Hm, good point. Technically we could achieve this with a node-local collective, as long as the fallback implemented the same off-node pattern, but in the short term a comm-wide allreduce would probably be needed (roughly as sketched below). Let me experiment with this.
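
A minimal sketch of that comm-wide consensus, assuming each rank can answer the question locally; buffers_are_host_memory() is a hypothetical helper, not an existing OMPI function:

```c
/* All ranks must agree before taking the smsc path; a single logical-AND
 * allreduce over the communicator gives that consensus. */
int local_ok  = buffers_are_host_memory(sendbuf, recvbuf);  /* hypothetical check */
int global_ok = 0;
MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_LAND, comm);
if (!global_ok) {
    /* fall back to the previously selected alltoall implementation */
}
```

Inside a coll component this would use the communicator's underlying allreduce rather than the public MPI_Allreduce, but the shape of the check is the same.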

@wenduwan (Contributor) commented Mar 25, 2024

> The primary advantage of this PR is that we avoid an extra copy

I have a different take on this. In my opinion the greatest value proposition of this PR is that we avoid the expensive cost associated with high network incast and outcast. In the context of HAN, every collective will benefit from smsc, but I doubt that is the foremost advantage. I imagine we can claim a bigger win if we can devise a reusable smsc pattern for allreduce, bcast, etc. too.

@lrbison (Contributor, Author) commented Mar 29, 2024

I realized I simply sidestepped all intermediate datatypes and their conversion in this PR, which is going to cause lots of failures. I'm marking this as a draft while I fix that. I will:

  • Have followers pack data into the leader's buffer (the reverse of the current situation).
  • Add synchronization to account for the fact that the leader doesn't know when followers are done packing.

@lrbison lrbison marked this pull request as draft March 29, 2024 18:03
Increase coll:han:get_algorithm verbosity level from 1 to 30, to avoid
flooding the terminal at any verbosity level.  Thirty seems to be used for
most of the other han dynamic-selection prints.

Signed-off-by: Luke Robison <lrbison@amazon.com>
This will allow HAN collectives to check for and use SMSC methods to
direct-map peer memory during on-node communication.

Signed-off-by: Luke Robison <lrbison@amazon.com>
Add Alltoall algorithm to coll/han.  Each rank on one host is assigned
a single partner on a remote host and vice versa.  Then the rank
collects all the data its partner will need to receive from its host,
and sends it in one large send, and likewise receives its data in one
large recv, then cycles to the next host.

This algorithm is only selected when the SMSC component has the ability
to direct-map peer memory, which currently only the XPMEM module provides.

Signed-off-by: Luke Robison <lrbison@amazon.com>
@lrbison (Contributor, Author) commented Apr 6, 2024

I have completed a significant update to this PR. The changes corrected glaring issues for DDT and device memory, while largely retaining the original performance.

  • The function properly uses convertors now, and properly detects non-contiguous and device-based memory (see the sketch after this list).
  • In cases where packing is required, each low comm decides collectively to pack its local datatypes into the SMSC-exposed bounce buffer of another low rank. This, however, requires a few extra barriers, so it is slower.
  • In cases where packing is not required, the function uses the algorithm I posted in this PR originally, where each rank packs data from the SMSC-exposed send buffers of other low ranks into its own bounce buffer. This is the fast path.
  • I removed alltoall_using_allgather. Its performance was never better than the smsc version, and only rarely better than tuned, and then only for small messages. For this reason I decided it was easier to remove it than to fix up its convertor usage for no benefit.
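
As referenced in the first bullet, a simplified sketch of the path selection; opal_convertor_need_buffers() is the existing OPAL check for datatypes that cannot be sent in place, while is_device_memory() stands in for the accelerator check and is hypothetical:

```c
/* Decide between the fast path (peers direct-map this rank's send buffer
 * via smsc) and the slower pack path (extra barriers).  Sketch only. */
int need_pack = opal_convertor_need_buffers(&convertor) ||  /* non-contiguous datatype */
                is_device_memory(sendbuf);                  /* hypothetical device check */
if (!need_pack) {
    /* fast path: each rank copies directly out of the smsc-mapped send
     * buffers of its node peers into its own bounce buffer */
} else {
    /* pack path: ranks pack their local datatypes into another low rank's
     * smsc-exposed bounce buffer, with extra barriers for synchronization */
}
```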

I'm marking the PR ready for review again. Please take a look.

@lrbison lrbison marked this pull request as ready for review April 6, 2024 03:55
@lrbison (Contributor, Author) commented Apr 6, 2024

Updated latency results on 8x hpc7g on AWS with libfabric/EFA:

[chart: updated alltoall latency results, 8x hpc7g, libfabric/EFA]
