
coll/han: Add alltoall algorithm #12387

Open · wants to merge 3 commits into main
Conversation

@lrbison (Contributor) commented Mar 1, 2024

Add two Alltoall algorithms to coll/han. Both algorithms use the same
communication pattern: each rank on one host is assigned a single
partner on a remote host and vice versa. The rank then collects all
the data its partner will need to receive from its host, sends it in
one large send, likewise receives its own data in one large recv, and
then cycles to the next host. (A sketch of this pairing follows the list below.)

The two algorithms are:

  • mca_coll_han_alltoall_using_allgather: the gather of local data is done once, so each rank has a copy of all local data. Only recommended for small message sizes.
  • mca_coll_han_alltoall_using_smsc: ranks use the smsc module to direct-map local memory before copying into a packed send buffer. Currently only the XPMEM-based smsc module supports this operation.
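
For illustration, a minimal sketch of the per-round pairing described above, assuming a block layout of ranks across hosts (global rank = node * ranks_per_node + on-node rank); the actual round ordering and partner mapping in the PR may differ:

```c
/* Sketch only: per-round partner selection for the pattern described above.
 * Assumes global rank = node * ranks_per_node + low_rank; the real code may
 * order rounds or map partners differently. */
static inline int partner_for_round(int my_node, int low_rank,
                                    int ranks_per_node, int num_nodes,
                                    int round)
{
    /* Visit one remote node per round, every remote node exactly once. */
    int partner_node = (my_node + 1 + round) % num_nodes;
    /* Pair with the rank holding the same on-node index on that node. */
    return partner_node * ranks_per_node + low_rank;
}
```

In each round a rank exchanges one large message with this partner, containing every local rank's block destined for that partner.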

@lrbison (Contributor, Author) commented Mar 8, 2024

Some motivating data collected on 2K ranks (32 hpc7g):

Message size    Han-smsc        Tuned
           1       302.2        372.5
           2       308.9        441.5
           4       308.4        436.5
           8       331.9        666.0
          16       351.9        814.5
          32       333.3       1703.2
          64       435.5      15613.4
         128      1248.6      27603.3
         256      1866.1      31895.6
         512      3421.5      37748.5
        1024      6306.1      30685.0
        2048     12603.2      38110.2
        4096     24445.4      37636.9
        8192     50277.3      47228.5
       16384     97058.5      93082.9
       32768    192255.5     185665.3
       65536    392959.9     370929.0
      131072    795386.1     741670.5
      262144   1592377.8    1531173.9

@devreal (Contributor) left a comment


LGTM, just two minor comments

Comment on lines 515 to 519
    /* OPAL_THREAD_LOCK(&XXX); */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
    }
    /* OPAL_THREAD_UNLOCK(&XXX); */
@devreal (Contributor):
I think there should be a lock with a double check. Something like this:

Suggested change, replacing:

    /* OPAL_THREAD_LOCK(&XXX); */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
    }
    /* OPAL_THREAD_UNLOCK(&XXX); */

with:

    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        OPAL_THREAD_LOCK(&XXX);
        if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
            proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
        }
        OPAL_THREAD_UNLOCK(&XXX);
    }

@lrbison (Contributor, Author):

I stole this code from bml, which just uses a bml lock. I suppose I could do the same with an smsc lock; I'll just need to create one.

@lrbison (Contributor, Author):

@devreal I implemented this with a new lock scoped to the han module.

Although han is currently the only module to use the smsc endpoint, that might not be true in the future. I feel the lock should really live in ompi_proc_t, where it could apply to all get_endpoint calls, but that pattern isn't followed elsewhere and I'm hesitant to introduce it, so here we are with a new han lock.

Thoughts?
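
For reference, a rough sketch of what a module-scoped lock around the double-checked endpoint lookup could look like; the lock and helper names below are hypothetical and are not the names used in this PR:

```c
/* Hypothetical module-scoped mutex; illustrative only. */
static opal_mutex_t han_smsc_endpoint_lock = OPAL_MUTEX_STATIC_INIT;

static void han_ensure_smsc_endpoint(ompi_proc_t *proc)
{
    /* Double-checked locking: only take the lock when the endpoint has
     * not been created yet, then re-check under the lock. */
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        OPAL_THREAD_LOCK(&han_smsc_endpoint_lock);
        if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
            proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] =
                mca_smsc->get_endpoint(&proc->super);
        }
        OPAL_THREAD_UNLOCK(&han_smsc_endpoint_lock);
    }
}
```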

int aint_size;
int bytes_to_gather;
int reg_data_size = 0;
PMPI_Type_size( MPI_AINT, &aint_size);
Contributor:
Why not just sizeof(MPI_Aint) here?
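
To illustrate the reviewer's point: MPI_Aint is the C type that the MPI_AINT datatype describes, so the runtime query can be replaced by a compile-time constant.

```c
int aint_size;
PMPI_Type_size(MPI_AINT, &aint_size);   /* runtime query, as in the PR */
size_t aint_size_c = sizeof(MPI_Aint);  /* equivalent compile-time value */
```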

ompi/mca/coll/han/coll_han_alltoall.c (outdated review comment, resolved)
goto error;
}

for (int jloop=0; jloop<nrounds+fanout; jloop++) {
Contributor:

How much performance benefit do we get by overlapping the intranode and internode communication?

@lrbison (Contributor, Author):

Not as much as I'd hoped. My goal wasn't so much to overlap internode and intranode communication as to pipeline the internode communication. You commented on the "using allgather" variant, which doesn't perform well at large message sizes. My observation was that for large message sizes in the smsc variant, having two or perhaps three rounds in flight at once saved a little time. However, there is an additional penalty to pay: you'll notice that for fanout=1 in the smsc variant I don't allocate a send_bounce buffer at all and just re-use the application buffer. This sidesteps the need for the transport to perform ibv_reg_mr on a temporary buffer that might change on every call. By re-using the application buffer, the registration cache happily re-uses previous registrations every time, and this typically outweighs the fanout improvements.

If we had the ability to re-use a buffer pool of registered memory, the fanout might be more helpful.
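
To make the fanout discussion concrete, here is an illustrative-only skeleton of the pipelined round loop; start_round and wait_round are hypothetical stand-ins for packing plus the nonblocking exchange, not functions from this PR:

```c
/* Keep up to `fanout` rounds in flight: start round jloop, and complete
 * round (jloop - fanout), which has had the longest time to progress. */
for (int jloop = 0; jloop < nrounds + fanout; jloop++) {
    if (jloop < nrounds) {
        int slot = jloop % fanout;          /* bounce-buffer slot for this round */
        start_round(jloop, slot);           /* hypothetical helper */
    }
    if (jloop >= fanout) {
        int done = jloop - fanout;
        wait_round(done, done % fanout);    /* hypothetical helper */
    }
}
```

With fanout=1 this degenerates to the strictly serial case, which is where the bounce buffer can be skipped and the application buffer re-used, keeping registration-cache hits.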

ompi/mca/coll/han/coll_han_alltoall.c (review comment, resolved)
@wenduwan (Contributor):

@lrbison On a high level, I am a bit bothered by the fact that we are implementing point-to-point communication directly in HAN. It breaches the encapsulation in HAN, where other collective algorithms reuse other coll modules. In other words, HAN implicitly composes hierarchical collectives out of basic collectives.

Related to encapsulation, I have been thinking about the benefits of this new algorithm. I think the biggest innovation is separating inter-node from intra-node communication, which significantly reduces network load. The intra-node offload is only a secondary concern, and we might get away with the PML instead of smsc. IMO the new XHC module is a better home for the smsc work, and later we can adopt XHC in HAN.

@lrbison (Contributor, Author) commented Mar 20, 2024

@wenduwan Yes, I understand your concern about the encapsulation. The problem is that alltoall is not easily or efficiently composed of smaller operations. By the nature of ALL to all, it is difficult to subdivide the problem hierarchically.

I did think of three ways to subdivide:

  • Each local node does an allgather, then sends some portion of the data off-node. This is implemented in this PR as alltoall_using_allgather, but it's a terrible idea for anything but the smallest message sizes due to the amount of wasted data movement (and extra memory buffers).
  • Similar to the above, but using gather. We no longer waste data, but now we have to do many, many gathers, one set per remote node. Depending on the configuration, you probably also need a scatter on the remote side, and a transpose operation somewhere. This is also slow.

In both of those cases the thing limiting performance is actually the local communication. Finally, there is one more case:

  • Alltoall can be composed of smaller, local alltoallv operations. This has the potential to be the right way forward; however, there are two problems: (1) an efficient, smsc-based alltoallv hasn't been implemented yet (I plan to do so using the pattern I've written above), and (2) the setup for any smsc-based collective requires sharing the base addresses of all ranks' buffers. In my implementation here we pay that cost up front with one on-node allgather call and re-use the buffer addresses for each "round" of network communication. If instead we called alltoallv independently in each round, it would have to pay that cost every time. Conversely, if we shove all communication into one round, we require very large temporary memory buffers and it starts to look more like the allgather version.

@lrbison (Contributor, Author) commented Mar 21, 2024

Looks like CI has a real failure in test_cco_buf_inter.TestCCOBufInter.testAlltoall. This is particularly surprising to me since I doubt the CI setup has XPMEM installed, so I would expect it to simply fall back to the previous alltoall algorithm.

I will investigate, and also re-run the test with xpmem on my own cluster.

@lrbison (Contributor, Author) commented Mar 21, 2024

The failure is a symptom of the OB1 PML's recent problems with INTER communicators and is unrelated to this change; see Issue 12407. Ignoring that unrelated problem for a moment, I think this change is ready for review again.

@wenduwan (Contributor):

About smsc: I remember that it does not always work for accelerators, so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be using accelerator memory while others are not.

I was debating with myself whether we should leave smsc out of this PR. Would that cause a dramatic performance penalty? In other words, is smsc essential to achieving a reasonable/significant improvement over the current alltoall?

@lrbison (Contributor, Author) commented Mar 25, 2024

> would leaving smsc out of this PR cause a dramatic performance penalty?

Yes. The primary advantage of this PR is that we avoid an extra copy and lots of extra synchronization by exposing all the send buffers to all other on-node ranks via mapped memory from the SMSC module (specifically, an SMSC module that implements mapped memory, which currently means only XPMEM).

> so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be using accelerator memory while others are not.

Hm, good point. Technically we could achieve this with a node-local collective, as long as the fallback implemented the same off-node pattern, but in the short term a comm-wide allreduce would probably be needed (roughly as sketched below). Let me experiment with this.
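
A minimal sketch of that comm-wide consensus, assuming each rank can answer the question locally; buffers_are_host_memory() is a hypothetical helper, not an existing OMPI function:

```c
/* All ranks must agree before taking the smsc path; a single logical-AND
 * allreduce over the communicator gives that consensus. */
int local_ok  = buffers_are_host_memory(sendbuf, recvbuf);  /* hypothetical check */
int global_ok = 0;
MPI_Allreduce(&local_ok, &global_ok, 1, MPI_INT, MPI_LAND, comm);
if (!global_ok) {
    /* fall back to the previously selected alltoall implementation */
}
```

Inside a coll component this would use the communicator's underlying allreduce rather than the public MPI_Allreduce, but the shape of the check is the same.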

@wenduwan (Contributor) commented Mar 25, 2024

> The primary advantage of this PR is that we avoid an extra copy

I have a different take on this. In my opinion the greatest value proposition of this PR is that we avoid the expensive cost associated with high network incast and outcast. In the context of HAN, every collective will benefit from smsc, but I doubt that is the foremost advantage. I imagine we can claim a bigger win if we can devise a reusable smsc pattern for allreduce, bcast, etc. too.

@lrbison (Contributor, Author) commented Mar 29, 2024

I realized I simply sidestepped all intermediate datatypes and their conversion in this PR, which is going to cause lots of failures. I'm marking this as a draft while I fix that. I will:

  • Have followers pack data into the leader's buffer (the reverse of the current situation).
  • Add synchronization to account for the fact that the leader doesn't know when followers are done packing.

@lrbison lrbison marked this pull request as draft March 29, 2024 18:03
Increase coll:han:get_algorithm verbosity level from 1 to 30, to avoid
flooding the terminal at any verbosity level.  Thirty seems to be used for
most of the other han dynamic-selection prints.

Signed-off-by: Luke Robison <lrbison@amazon.com>
This will allow HAN collectives to check for and use SMSC methods to
direct-map peer memory during on-node communication.

Signed-off-by: Luke Robison <lrbison@amazon.com>
Add Alltoall algorithm to coll/han.  Each rank on one host is assigned
a single partner on a remote host and vice versa.  Then the rank
collects all the data its partner will need to receive from its host,
and sends it in one large send, and likewise receives its data in one
large recv, then cycles to the next host.

This algorithm is only selected when the SMSC component has the ability
to direct-map peer memory, which currently only the XPMEM module provides.

Signed-off-by: Luke Robison <lrbison@amazon.com>
@lrbison (Contributor, Author) commented Apr 6, 2024

I have completed a significant update to this PR. The changes corrected glaring issues for DDT and device memory, while largely retaining the original performance.

  • The function properly uses convertors now, and properly detects non-contiguous and device-based memory (see the sketch after this list).
  • In cases where packing is required, each low comm decides collectively to pack its local datatypes into the SMSC-exposed bounce buffer of another low rank. This, however, requires a few extra barriers, so it is slower.
  • In cases where packing is not required, the function uses the algorithm I posted in this PR originally, where each rank packs data from the SMSC-exposed send buffers of other low ranks into its own bounce buffer. This is the fast path.
  • I removed alltoall_using_allgather. Its performance was never better than the smsc version, and only rarely better than tuned, and then only for small messages. For this reason I decided it was easier to remove it than to fix up its convertor usage for no benefit.
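
As referenced in the first bullet, a simplified sketch of the path selection; opal_convertor_need_buffers() is the existing OPAL check for datatypes that cannot be sent in place, while is_device_memory() stands in for the accelerator check and is hypothetical:

```c
/* Decide between the fast path (peers direct-map this rank's send buffer
 * via smsc) and the slower pack path (extra barriers).  Sketch only. */
int need_pack = opal_convertor_need_buffers(&convertor) ||  /* non-contiguous datatype */
                is_device_memory(sendbuf);                  /* hypothetical device check */
if (!need_pack) {
    /* fast path: each rank copies directly out of the smsc-mapped send
     * buffers of its node peers into its own bounce buffer */
} else {
    /* pack path: ranks pack their local datatypes into another low rank's
     * smsc-exposed bounce buffer, with extra barriers for synchronization */
}
```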

I'm marking the PR ready for review again. Please take a look.

@lrbison lrbison marked this pull request as ready for review April 6, 2024 03:55
@lrbison (Contributor, Author) commented Apr 6, 2024

Updated latency results on 8x hpc7g on AWS with libfabric/EFA:

[chart: updated alltoall latency results, 8x hpc7g, libfabric/EFA]
