coll/han: Add alltoall algorithm #12387
base: main
Conversation
Some motivating data collected on 2K ranks (32 hpc7g): [results table not preserved in this extract]
LGTM, just two minor comments
ompi/mca/coll/han/coll_han.h
Outdated
/* OPAL_THREAD_LOCK(&XXX); */
if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
    proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
}
/* OPAL_THREAD_UNLOCK(&XXX); */
I think there should be a lock with a double check. Something like this:
if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
    OPAL_THREAD_LOCK(&XXX);
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] = mca_smsc->get_endpoint(&proc->super);
    }
    OPAL_THREAD_UNLOCK(&XXX);
}
I stole this code from bml, which just uses a bml lock. I suppose I could do the same with an smsc lock; I'll just need to create one.
@devreal I implemented this with a new lock scoped to the han module.
Although han is currently the only module that uses the smsc endpoint, that may not stay true. I feel like the lock should really live in ompi_proc_t and apply to all get_endpoint calls, but that pattern isn't followed elsewhere and I'm hesitant to introduce it, so here we are with a new han lock.
Thoughts?
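For reference, a minimal sketch of the double-checked pattern with a han-scoped lock, assuming a mutex field (here called `endpoint_lock`) added to the han component; not the final code:

```c
/* Sketch: lazily resolve the peer's smsc endpoint under a han-owned lock.
 * "endpoint_lock" is an assumed opal_mutex_t added to mca_coll_han_component. */
static inline void *han_get_smsc_endpoint(ompi_proc_t *proc)
{
    if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
        OPAL_THREAD_LOCK(&mca_coll_han_component.endpoint_lock);
        /* re-check: another thread may have filled it while we waited */
        if (NULL == proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC]) {
            proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC] =
                mca_smsc->get_endpoint(&proc->super);
        }
        OPAL_THREAD_UNLOCK(&mca_coll_han_component.endpoint_lock);
    }
    return proc->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_SMSC];
}
```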
int aint_size;
int bytes_to_gather;
int reg_data_size = 0;
PMPI_Type_size(MPI_AINT, &aint_size);
Why not just `sizeof(MPI_Aint)` here?
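That would collapse the runtime query above to a single constant expression, e.g.:

```c
/* MPI_AINT's size equals sizeof(MPI_Aint), a compile-time constant,
 * so no PMPI_Type_size call is needed. */
int aint_size = (int) sizeof(MPI_Aint);
```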
    goto error;
}

for (int jloop = 0; jloop < nrounds + fanout; jloop++) {
How much performance benefit by overlapping the intranode and internode communication?
Not as much as I'd hoped. My goal wasn't so much to overlap internode and intranode communication as to pipeline the internode communication. You commented on the "using allgather" variant, which doesn't perform well at large message sizes. My observation was that for large message sizes in the smsc variant, having two or perhaps three rounds in flight at once saved a little bit of time. However, there is an additional penalty to pay: you'll notice that for fanout=1 in the smsc variant I don't allocate a send_bounce buffer at all and just re-use the application buffer. This sidesteps the need for the transport to perform ibv_reg_mr on a temporary buffer, which might change on every call. By re-using the application buffer, the registration cache happily re-uses previous registrations every time, and that typically outweighs the fanout improvements.
If we had an ability to re-use a buffer pool of registered memory, the fanout might be more helpful.
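To sketch that idea (nothing in this PR implements it, and all names here are hypothetical): a small pool of long-lived bounce buffers would keep addresses stable across calls, so the transport's registration cache keeps hitting:

```c
#include <stdlib.h>

/* Hypothetical bounce-buffer pool: each slot keeps its allocation across
 * calls, so the transport only (re)registers a buffer when it grows. */
#define HAN_BOUNCE_POOL_SLOTS 4

typedef struct {
    void  *buf[HAN_BOUNCE_POOL_SLOTS];
    size_t len[HAN_BOUNCE_POOL_SLOTS];
} han_bounce_pool_t;

static void *han_bounce_get(han_bounce_pool_t *pool, int slot, size_t size)
{
    if (pool->len[slot] < size) {
        free(pool->buf[slot]);           /* drop the old, too-small buffer */
        pool->buf[slot] = malloc(size);  /* address stays stable until growth */
        pool->len[slot] = (NULL != pool->buf[slot]) ? size : 0;
    }
    return pool->buf[slot];
}
```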
@lrbison On a high level, I am a bit bothered by the fact that we are implementing point-to-point communication directly in HAN. It breaches HAN's encapsulation, where other collective algorithms reuse other coll modules; in other words, HAN implicitly composes hierarchical collectives out of basic collectives. Related to encapsulation, I have been thinking about the benefits of this new algorithm. I think the biggest innovation is separating inter-node from intra-node communication, which significantly reduces network load. The intra-node offload is only a secondary concern, and we might get away with the PML instead of smsc. IMO the new XHC module is a better home for the smsc stuff, and later we can adopt XHC in HAN.
@wenduwan Yes, I understand your concern about the encapsulation. The problem is that alltoall is not easily or efficiently composed of smaller operations. By the nature of ALL to all, it is difficult to hierarchically subdivide the problem. I did think of three ways to subdivide:

In both of those cases above, the thing limiting performance is actually the local communication. Finally, there is one more case:
Looks like CI has a failure. I will investigate, and also re-try the test including xpmem on my own cluster.
The failure is a symptom of the OB1 PML code's recent problems with INTER communicators, unrelated to this change; see Issue 12407. Ignoring that unrelated problem for a moment, I think this change is ready for review again.
About smsc - I remember that it does not always work for accelerators, so should we add a check to unselect this algorithm for device memory? It might be tricky, since it requires another consensus within the global communicator: some ranks might be on an accelerator while others are not. I was debating with myself whether we should leave smsc out of this PR - would that cause a dramatic performance penalty? In other words, is smsc essential to achieving a reasonable/significant improvement over the current alltoall?
Yes. The primary advantage of this PR is that we avoid an extra copy and lots of extra synchronization by exposing all the send buffers to all other on-node ranks via mapped memory from the SMSC module (specifically, an SMSC module that implements mapped memory, which is currently only XPMEM).
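For the curious, the mapped-memory path looks roughly like the sketch below; the parameter names are placeholders for state the algorithm already tracks, the flags argument is assumed to be 0 for a plain read, and the smsc entry-point shapes are approximated from current Open MPI sources:

```c
#include <string.h>

/* Sketch: pack directly out of a peer's send buffer via smsc mapped memory.
 * Requires an smsc module advertising the map feature (currently xpmem). */
static int han_pack_from_peer(mca_smsc_endpoint_t *endpoint,
                              void *peer_send_buf, size_t peer_size,
                              void *packed_send_buf)
{
    void *local_ptr = NULL;
    void *map_ctx = mca_smsc->map_peer_region(endpoint, 0 /* flags (assumed) */,
                                              peer_send_buf, peer_size,
                                              &local_ptr);
    if (NULL == map_ctx) {
        return OMPI_ERROR;
    }
    /* Copy the peer's data into our packed bounce buffer without any
     * explicit synchronization message from the peer. */
    memcpy(packed_send_buf, local_ptr, peer_size);
    mca_smsc->unmap_peer_region(map_ctx);
    return OMPI_SUCCESS;
}
```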
Hm, good point. Technically we could achieve this with a node-local collective, as long as the fallback implemented the same off-node pattern, but in the short term a comm-wide allreduce would probably be needed. Let me experiment with this.
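A minimal sketch of what that comm-wide agreement might look like, assuming the smsc feature flag keeps its current shape; the device-memory predicate is purely illustrative and the coll interface signature is approximated:

```c
/* Sketch: one-shot agreement on whether every rank can take the smsc path. */
int my_ok = (NULL != mca_smsc
             && (mca_smsc->features & MCA_SMSC_FEATURE_CAN_MAP)
             && !my_buffers_are_device_memory) ? 1 : 0;  /* hypothetical flag */
int all_ok = 0;

/* MPI_MIN over 0/1 flags acts as a logical AND across the communicator. */
comm->c_coll->coll_allreduce(&my_ok, &all_ok, 1, MPI_INT, MPI_MIN, comm,
                             comm->c_coll->coll_allreduce_module);
if (0 == all_ok) {
    /* fall back to the previously-selected alltoall implementation */
}
```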
I have a different take on this. In my opinion the greatest value proposition of this PR is that we avoid the expensive costs associated with high network incast and outcast. In the context of HAN, every collective will benefit from smsc, but I doubt that is the foremost advantage. I imagine we can claim a bigger win if we can devise a reusable smsc pattern for allreduce, bcast, etc. too.
I realized I just sidestepped all intermediary datatypes and their conversion in this PR, which is going to cause lots of failures. I'm marking this as draft while I fix that. I will:
Increase coll:han:get_algorithm verbosity level from 1 to 30, to avoid flooding the terminal at any verbosity level. Thirty seems to be the level used for most of the other han dynamic-selection prints. Signed-off-by: Luke Robison <lrbison@amazon.com>
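For context, a han selection-time print at that level looks roughly like the following; the output-handle field name is recalled from the han component and may differ, and `algorithm_id` is a placeholder:

```c
opal_output_verbose(30, mca_coll_han_component.han_output,
                    "coll:han:get_algorithm: selected alltoall algorithm %d",
                    algorithm_id);
```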
This will allow HAN collectives to check for and use SMSC methods to direct-map peer memory during on-node communication. Signed-off-by: Luke Robison <lrbison@amazon.com>
Add Alltoall algorithm to coll/han. Each rank on one host is assigned a single partner on a remote host and vice versa. Then each rank collects all the data its partner will need to receive from its host, sends it in one large send, likewise receives its data in one large recv, and then cycles to the next host. This algorithm is only selected when the SMSC component has the ability to direct-map peer memory, which only exists for the XPMEM module. Signed-off-by: Luke Robison <lrbison@amazon.com>
I have completed a significant update to this PR. The changes corrected glaring issues for DDT and device memory, while largely retaining the original performance.
I'm marking the PR ready for review again. Please take a look.
Add two Alltoall algorithms to coll/han. Both algorithms use the same
communication pattern. Each rank on one host is assigned a single
partner on a remote host and vice versa. Then the rank collects all
the data its partner will need to receive from its host, sends it
in one large send, likewise receives its data in one large recv,
then cycles to the next host.
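A rough sketch of that pairing pattern, purely for illustration (variable names are not from the patch; assumes an equal number of ranks per node):

```c
/* Illustrative pairing: "ppn" ranks per node, comm_size divisible by ppn.
 * Each round pairs this rank with the rank holding the same local index
 * on one remote node. */
int nodes      = comm_size / ppn;
int my_node    = rank / ppn;
int local_rank = rank % ppn;

for (int round = 1; round < nodes; round++) {   /* round 0 is the local node */
    int peer_node = (my_node + round) % nodes;  /* rotate over remote hosts */
    int partner   = peer_node * ppn + local_rank;
    /* 1. gather everything this node must send to partner's node
     * 2. one large send to partner, one large recv from partner
     * 3. distribute the received block to the local ranks */
}
```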
The two algorithms are:
- "using allgather": local data is first shared via an allgather,
  and each rank has a copy of all local data. Only recommended for
  small message sizes.
- "using smsc": ranks use the smsc component to
  direct-map local memory before copying into a packed send buffer.
  Currently only the XPMEM-based smsc module supports this operation.