New collectives component: XPMEM-based Hierarchical Collectives (XHC) #11418

Merged
merged 2 commits into open-mpi:main from coll_xhc on May 2, 2024

Conversation

@gkatev (Contributor) commented Feb 16, 2023

Hello,
This PR proposes the XPMEM-based Hierarchical Collectives (XHC) component.

XHC implements optimized intra-node (only) collectives according to a topology-aware n-level hierarchy, targeting especially nodes with high core counts and elaborate topologies (but not limited to them). Supported topological features include NUMA nodes, CPU sockets, and shared caches (detected via hwloc, through ompi's internal process-to-process distance mappings). The design is centered around communication using XPMEM, but Bcast also supports CMA and KNEM via opal/smsc. Currently, Bcast, Allreduce and Barrier are supported; Reduce is partially implemented (and disabled by default).

XHC is also combinable with coll/HAN for multi-node runs.

Besides the main commit that adds the component, this PR includes two auxiliary commits:

  1. The 2nd commit, an allegedly mostly-trivial one, adds support in coll/HAN for selecting XHC as the component to use for the intra-comm.

  2. The 3rd commit, which is a temporary hack, allows XHC's partially-implemented MPI_Reduce to be used for HAN's Allreduce. XHC's reduce is (for now) implemented as a sub-case of allreduce and requires that the rbuf param be valid on all ranks. I will understand if this commit is not desirable :-) (the reason for its existence is the allegedly increased performance potential).

Below are some benchmarks from various nodes and operations (attached graphs):

Allreduce: allreduce-ampere, allreduce-dp-cn, allreduce-dp-esb, allreduce-taos, allreduce-vader

Broadcast: bcast-ampere, bcast-taos, bcast-titan, bcast-vader

Barrier: barrier-ampere, barrier-taos, barrier-titan

HAN+XHC Allreduce: dp-cn-allreduce-48x, dp-dam-allreduce-10x, dp-esb-allreduce-10x

HAN+XHC Broadcast: dp-cn-bcast-48x, dp-dam-bcast-10x, dp-esb-bcast-10x

HAN+XHC Barrier: dp-cn-barrier-48x

@ompiteam-bot

Can one of the admins verify this patch?

@bosilca (Member) commented Feb 16, 2023

ok to test.

@bosilca (Member) commented Feb 16, 2023

Your barrier graphs are missing the legends or titles needed to understand what they are showing. Different numbers of processes, I assume?

@gkatev (Contributor, Author) commented Feb 16, 2023

Ah, sorry, something went wrong there; I fixed them and updated the post. It's 3 different nodes (and yes, different process counts), plus one multi-node run across 48 nodes.

@bwbarrett (Member)

I might be missing something, but rather than the changes to HAN to support Reduce, why not save the previous Reduce function and call it if the parameters aren't sufficient for your implementation? An example of doing this is the sync collective, which doesn't actually implement any algorithms, but simply calls a barrier after calling the underlying algorithm for the non-synchronizing collectives.
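
For context, the fallback pattern being described amounts to roughly the following -- a hypothetical sketch using the classic coll-module reduce signature; the module/helper names are placeholders, not actual OMPI or xhc code:

    #include <stdbool.h>
    #include "ompi/mca/coll/coll.h"   /* mca_coll_base_module_t and friends */

    /* Placeholders for the component's own logic (illustrative only): */
    static bool params_ok_for_fast_path(void *rbuf, int root,
                                        struct ompi_communicator_t *comm);
    static int my_fast_reduce(const void *sbuf, void *rbuf, int count,
                              struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                              int root, struct ompi_communicator_t *comm,
                              mca_coll_base_module_t *module);

    /* Hypothetical module that remembers the previously selected component's
     * reduce entry point (same idea as coll/sync). */
    typedef struct {
        mca_coll_base_module_t super;
        mca_coll_base_module_reduce_fn_t previous_reduce;
        mca_coll_base_module_t *previous_reduce_module;
    } my_fallback_module_t;

    static int my_reduce(const void *sbuf, void *rbuf, int count,
                         struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                         int root, struct ompi_communicator_t *comm,
                         mca_coll_base_module_t *module) {
        my_fallback_module_t *m = (my_fallback_module_t *) module;

        /* If this rank's parameters aren't sufficient for the fast path
         * (e.g. no usable rbuf), delegate to the saved implementation. */
        if (!params_ok_for_fast_path(rbuf, root, comm)) {
            return m->previous_reduce(sbuf, rbuf, count, dtype, op, root,
                                      comm, m->previous_reduce_module);
        }

        return my_fast_reduce(sbuf, rbuf, count, dtype, op, root, comm, module);
    }
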

@gkatev (Contributor, Author) commented Feb 17, 2023

I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).
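
A contrived but standard-conforming illustration of the problem (not code from the PR): rbuf is only significant at the root, so a non-NULL rbuf on other ranks may point to memory that is far too small to be used as scratch space.

    #include <mpi.h>
    #include <stdlib.h>

    void reduce_example(MPI_Comm comm, const double *sbuf, int count) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        double *rbuf = NULL;
        char too_small;   /* never written by MPI on non-root ranks */

        if (rank == 0) {
            rbuf = malloc(count * sizeof(double));
        }

        /* Passing &too_small on non-roots is legal, which is why a component
         * cannot treat "rbuf != NULL" as "rbuf is usable memory". */
        MPI_Reduce(sbuf, (rank == 0) ? (void *) rbuf : (void *) &too_small,
                   count, MPI_DOUBLE, MPI_SUM, 0, comm);

        if (rank == 0) {
            free(rbuf);
        }
    }
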

In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

@bwbarrett (Member)

> I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).

I think you're right; that's a bummer. I'm not a huge fan of the coupling between xhc and han, especially to the level in the last patch, but maybe it is unavoidable.

> In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

Other than the HAN/xhc coupling, that makes sense. The rbuf definition in the MPI standard is rather unfortunate, but such is life.

@bosilca (Member) left a review comment

This is a big PR; it takes time to review and to understand how it interacts with the rest of the OMPI internals. But here are some comments for you.

int return_code = OMPI_SUCCESS;
int ret;

errno = 0;
@bosilca (Member):

?

@gkatev (Contributor, Author):

What is the question here? Why errno is set to 0? If so, it's for error reporting: in case something fails, I show what errno indicated. In the past I've had trouble where errno was set to non-zero outside the code in question, and when the code failed in a way that did not alter errno, the stale value of errno misleadingly hinted at another error. I don't have strong feelings about it -- let me know if it would be best to remove it.

@bosilca (Member):

We report errors via the return code, not via errno. Moreover, why set it to success?

@gkatev (Contributor, Author) commented Feb 20, 2023:

I also report init errors via a help text -- perhaps that could/should go away. In the olden times before smsc, xhc_init also handled xpmem initialization, so failures there were even more common due to xpmem misconfigurations, and it was helpful to be explicitly notified of them. So the help text there is in part a remnant that could potentially be removed.

Or maybe it should only appear when e.g. the priority is explicitly set to non-zero. Or maybe something like that would be nice in coll/base -- a mechanism to inform the user that something went wrong with something that they explicitly requested. Anyway.

Setting errno to 0 brings it to a known state. Suppose that some previously running code encounters an error, e.g. a failed open for a non-existent file, but does not abort because of it. My init may then fail for some unrelated reason, but errno will misleadingly indicate that the failure is related to "No such file or directory". (I'm not 100% sure on errno conventions and whether it's OK to reset it like that.)
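
A self-contained toy program illustrating the stale-errno scenario described above (illustrative only, not xhc code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Earlier, unrelated code fails and leaves errno set to ENOENT... */
        int fd = open("/definitely/does/not/exist", O_RDONLY);
        if (fd >= 0) {
            close(fd);
        }

        /* ...later, a failure path that never touched errno reports it anyway. */
        int init_ok = 0;   /* pretend initialization failed for another reason */
        if (!init_ok) {
            /* Without an "errno = 0" beforehand, this misleadingly prints
             * "No such file or directory" for a completely unrelated failure. */
            fprintf(stderr, "init failed: %s\n",
                    errno != 0 ? strerror(errno) : "(errno not set)");
        }
        return 0;
    }
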

return return_code;
}

void xhc_deinit(mca_coll_xhc_module_t *module) {
@bosilca (Member):

we tend to call these xhc_fini, and they should either follow the naming scheme or be static.

@gkatev (Contributor, Author) commented Feb 20, 2023:

OK, I will adjust it. There are also a number of other functions that are not static and begin with xhc_, so I'll fix those as well.

@gkatev (Contributor, Author):

I renamed it to xhc_fini to match the other coll components. I also fixed the naming of all non-static functions.

For the functions that are non-static because they are called across xhc files, but are only intended to be used internally, I added some macros that omit the mca_coll_ prefix, so as to keep the names short and wieldy, and to offer a slight hint at their internal nature.
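
For illustration, the aliasing pattern amounts to something like this (hypothetical header and macro body; the actual names in the patch may differ):

    /* In an internal header: the exported symbol keeps the full mca_coll_xhc_
     * prefix required by the framework's naming conventions, while a short
     * alias is used inside the component's own source files. */
    void mca_coll_xhc_fini(mca_coll_xhc_module_t *module);

    /* Short, internal-use alias */
    #define xhc_fini(module) mca_coll_xhc_fini(module)
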

int return_code = OMPI_SUCCESS;
int ret;

comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));
@bosilca (Member):

This is ugly, please don't initialize a variable in a function call argument list!

@gkatev (Contributor, Author):

I don't have too strong feelings about it, but I think it's rather nice; it is compact and groups together both the allocation of the memory and the assignment of the variable tracking its size. And it has all its parentheses! Or is there a correctness concern here?
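
For reference, the two styles side by side (illustrative fragment; assumes stdlib.h and xhc_comm_t are in scope):

    xhc_comm_t *comms;
    int comms_size;

    /* Compact form from the patch: the size bookkeeping and the allocation
     * happen in a single statement. */
    comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));

    /* Conventional form requested in review: assign first, then allocate. */
    comms_size = 5;
    comms = malloc(comms_size * sizeof(xhc_comm_t));
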

@gkatev (Contributor, Author) commented Feb 22, 2023

Thanks for the review and the initial comments. I addressed most of them with code changes. I added them as a new commit, as I imagined that would make keeping track of the improvements easier (?) (to be squashed before merging).

@jjhursey (Member)

bot:ibm:retest

@bosilca (Member) commented Mar 31, 2023

Finally done with the review and 👍. However, I would like to revisit the discussion above about the rbuf on reduce for non-root ranks. I understand the interest, but I do not like the implementation. I understand the issue arises from the MPI standard itself, but I would like to propose an OMPI-specific and sensible solution: in our upper-level implementation we switch the rbuf coming from the user to NULL on all non-root processes, to indicate that we don't know if its use is safe. Thus, if a component receives a non-NULL buffer, it indicates that some other part of OMPI has allocated the memory and it is therefore safe to use. This will have no impact on existing collective components and will allow this component to remove the hack with the root. Thoughts?
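
A rough sketch of what this convention could look like at the upper (MPI API) level -- a paraphrase of the proposal, not actual patch code:

    /* Sketch of the MPI-level reduce path: sendbuf/recvbuf/count/datatype/op/
     * root/comm are the arguments the user passed to MPI_Reduce. Non-root
     * processes pass NULL down to the selected coll component. */
    void *coll_rbuf = (ompi_comm_rank(comm) == root) ? recvbuf : NULL;

    /* Any component that now sees a non-NULL rbuf can trust that some OMPI
     * layer (e.g. HAN) provided real, usable memory on purpose. */
    int err = comm->c_coll->coll_reduce(sendbuf, coll_rbuf, count, datatype,
                                        op, root, comm,
                                        comm->c_coll->coll_reduce_module);
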

@bwbarrett (Member)

@bosilca I really like that solution.

@gkatev (Contributor, Author) commented Apr 1, 2023

I like it! I'll implement/test it on the XHC side and update the PR

@gkatev (Contributor, Author) commented Apr 1, 2023

The only caveat is that the root, for whom sbuf and rbuf are both non-NULL, won't know whether rbuf is present for other ranks or not. Looks like communication between root and at least one other rank is necessary in order for the root to determine whether to use the fallback reduce or not.

It's certainly possible and better than the other hack, so I'll think it over.

Part of the problem is the incomplete implementation of reduce. In the future, the decision shouldn't be whether to fall back to another component or not (based on the presence of rbuf), but rather whether to allocate a temporary buffer or not for non-roots. The latter does not require any rbuf-related action on the root's side.

@bosilca (Member) commented Apr 2, 2023

I see, you need more than local knowledge about the buffers. My patch will definitely not provide that; it just gives us all (i.e., implementers of collective algorithms) better knowledge of what we can use locally.

Let me think a little about the buffer management; the cleaner approach would be a well-defined internal API for management of temporary buffers. If you don't mind, send me a message (email or Slack) about the requirements you would need from such an API.

@gkatev (Contributor, Author) commented May 5, 2023

I got rid of the hacky approach for the HAN allreduce. XHC's reduce now keeps a sticky internal buffer that it manually allocates and grows in size when necessary. If rbuf is NULL (and thanks to #11552, a scenario where rbuf is non-NULL but is not valid is not possible), the internal buffer is used instead. During HAN's allreduce, in XHC's reduce, the rbuf will be present for all ranks, and in this case an internal rbuf won't need to be allocated. The only case in which XHC's reduce falls back to another component now is when root != 0, but this only requires local knowledge to determine.
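
Roughly, the grow-on-demand scratch buffer amounts to something like this (hypothetical names, simplified; not the literal xhc code):

    #include <stdlib.h>

    /* Persistent ("sticky") scratch buffer kept by the module, grown on demand. */
    typedef struct {
        void  *buf;    /* lazily allocated scratch space  */
        size_t size;   /* current capacity in bytes       */
    } sticky_buf_t;

    static void *sticky_buf_get(sticky_buf_t *sb, size_t needed) {
        if (sb->size < needed) {
            void *p = realloc(sb->buf, needed);
            if (NULL == p) {
                return NULL;
            }
            sb->buf = p;
            sb->size = needed;
        }
        return sb->buf;
    }

    /* In reduce: use the caller's rbuf if one was provided (guaranteed usable
     * after #11552), otherwise fall back to the module's scratch buffer. */
    static void *pick_reduce_dst(sticky_buf_t *sb, void *rbuf, size_t bytes) {
        return (NULL != rbuf) ? rbuf : sticky_buf_get(sb, bytes);
    }
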

This PR should ideally be merged after #11552. See also #11650.

I also applied a bunch of misc fixes and a small new feature for the barrier's root rank (it should all be in github's comparison).

@gkatev (Contributor, Author) commented Jan 29, 2024

Hello @bosilca, and others, is now a good time to look into completing the integration of this PR? From what I understand, the only prerequisites are #11552 and #11650, correct? Can I assist in any way, e.g. with testing or code? I could look into putting together a PR to address the issue identified in #11650 (?). I'll also bring this PR up to speed with the latest main, and test to make sure everything still works.

@edgargabriel (Member) commented Feb 9, 2024

I performed some tests and measurements with this component. I think the performance improvements are very significant, and we should try to merge this in the near future, before we branch for the next release. (Short-message allreduce was the only area where the xhc component was slower than tuned, but I would suspect that this could be tuned as well.) I can also provide the raw data instead of graphs if people prefer.

Performance graphs (attached): xhc_osu_bcast_64, xhc_osu_reduce_64, xhc_osu_allreduce_64

@gkatev force-pushed the coll_xhc branch 2 times, most recently from 89d8897 to cbeff9a (March 4, 2024)
@gkatev (Contributor, Author) commented Apr 3, 2024

With #11552 and #11650 completed, I believe we are now ready to merge this? Is there something else pending?

Thanks for the benchmarks Edgar, I'll take a look regarding the short-message allreduce. It's also possible that it has already been improved/fixed in upcoming versions; I might ask you at some point to repeat the test!

@gkatev force-pushed the coll_xhc branch 3 times, most recently from 9508b41 to 8c032f2 (April 3, 2024)
@gkatev (Contributor, Author) commented Apr 3, 2024

There were some mpi4py CI failures, with the tests timing out (327, 328). They actually went away after I re-ran the CI twice. Kind of mysterious, since none of the changes introduced here are expected to be exercised at all unless specifically activated. I'll try to see if I can reproduce the failures on my machine. In the meantime, any insights on the mpi4py CI and whether the failures are related to this PR would be welcome...

@hppritcha (Member)
Wrt the mpi4py failures, they look to be related to intermittent problems we are seeing with intercommunicators. They are highly unlikely to be triggered by the work in this PR.

@wenduwan (Contributor) commented May 1, 2024

@gkatev We should have brought this to your attention earlier.

During our developer meeting last week, we had a constructive discussion about this PR. The consensus is that the change is rather safe, as it does not alter default collective behaviors. However, from a code maintenance perspective, the community is severely under-staffed (we are all volunteers). We want to understand if your sponsor would be committed to maintaining XHC in the short to long term. This includes fixing bugs, as well as reviewing PRs from other contributors.

@bosilca (Member) commented May 1, 2024

@gkatev I sent you an email earlier this week about the topics mentioned above but have not received an answer yet.

@wenduwan (Contributor) commented May 1, 2024

@bosilca Sorry I didn't know that.

@bosilca (Member) commented May 1, 2024

No problem; my emails either bounced back (the address from the XHC paper) or acted as a black hole (his GitHub email). Let's hope a direct poke at @gkatev works better ;)

@gkatev (Contributor, Author) commented May 1, 2024

Hi, sorry for the delay in responding, I just sent a reply. (short answer: yes, we are able to support this work as necessary!)

(PS the email from the paper will stop bouncing quite soon...)

@bosilca (Member) commented May 1, 2024

Please revise this PR to account for the merge of the init/fini PR (#12429)

gkatev added 2 commits May 1, 2024 22:47
XHC implements hierarchical and topology-aware intra-node
collectives, using XPMEM for inter-process communication.
See the README for more information.

Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
@gkatev (Contributor, Author) commented May 1, 2024

ready

@bosilca bosilca merged commit 37a7896 into open-mpi:main May 2, 2024
14 checks passed
@gkatev gkatev deleted the coll_xhc branch May 2, 2024 15:24