New collectives component: XPMEM-based Hierarchical Collectives (XHC) #11418

Merged
merged 2 commits into open-mpi:main from coll_xhc on May 2, 2024

Conversation

@gkatev (Contributor) commented Feb 16, 2023

Hello,
This PR proposes the XPMEM-based Hierarchical Collectives (XHC) component.

XHC implements optimized intra-node (only) collectives according to a topology-aware n-level hierarchy, targeting especially nodes with high core counts and elaborate topologies (but not limited to them). Supported topological features include NUMA nodes, CPU sockets, and shared caches (detected via hwloc, through ompi's internal process-to-process distance mappings). The design is centered around communication using XPMEM, but Bcast also supports CMA and KNEM via opal/smsc. Currently, Bcast, Allreduce and Barrier are supported; Reduce is partially implemented (and disabled by default).

XHC is also combinable with coll/HAN for multi-node runs.

Besides the main commit that adds the component, this PR includes two auxiliary commits:

  1. The 2nd commit, an allegedly mostly-trivial one, adds support in coll/HAN for selecting XHC as the component to use for the intra-comm.

  2. The 3rd commit, which is a temporary hack, allows XHC's partially-implemented MPI_Reduce to be used for HAN's Allreduce. XHC's reduce is (for now) implemented as a sub-case of allreduce and requires that the rbuf param be valid on all ranks. I will understand if this commit is not desirable :-) (the reason for its existence is the allegedly increased performance potential).

Below are some benchmarks from various nodes and operations (attached graphs):

Allreduce: allreduce-ampere, allreduce-dp-cn, allreduce-dp-esb, allreduce-taos, allreduce-vader

Broadcast: bcast-ampere, bcast-taos, bcast-titan, bcast-vader

Barrier: barrier-ampere, barrier-taos, barrier-titan

HAN+XHC Allreduce: dp-cn-allreduce-48x, dp-dam-allreduce-10x, dp-esb-allreduce-10x

HAN+XHC Broadcast: dp-cn-bcast-48x, dp-dam-bcast-10x, dp-esb-bcast-10x

HAN+XHC Barrier: dp-cn-barrier-48x

@ompiteam-bot

Can one of the admins verify this patch?

@bosilca (Member) commented Feb 16, 2023

ok to test.

@bosilca (Member) commented Feb 16, 2023

Your barrier graphs are missing the legends or titles needed to understand what they are showing. Different numbers of processes, I assume?

@gkatev (Contributor, Author) commented Feb 16, 2023

Ah, sorry, something went wrong there; I fixed them and updated the post. It's 3 different nodes (and yes, different process counts), plus one multi-node run across 48 nodes.

@bwbarrett (Member)

I might be missing something, but rather than the changes to HAN to support Reduce, why not save the previous Reduce function and call it if the parameters aren't sufficient for your implementation? An example of doing this is the sync collective, which doesn't actually implement any algorithms, but simply calls a barrier after calling the underlying algorithm for the non-synchronizing collectives.
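
For context, the fallback pattern being described amounts to roughly the following -- a hypothetical sketch using the classic coll-module reduce signature; the module/helper names are placeholders, not actual OMPI or xhc code:

    #include <stdbool.h>
    #include "ompi/mca/coll/coll.h"   /* mca_coll_base_module_t and friends */

    /* Placeholders for the component's own logic (illustrative only): */
    static bool params_ok_for_fast_path(void *rbuf, int root,
                                        struct ompi_communicator_t *comm);
    static int my_fast_reduce(const void *sbuf, void *rbuf, int count,
                              struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                              int root, struct ompi_communicator_t *comm,
                              mca_coll_base_module_t *module);

    /* Hypothetical module that remembers the previously selected component's
     * reduce entry point (same idea as coll/sync). */
    typedef struct {
        mca_coll_base_module_t super;
        mca_coll_base_module_reduce_fn_t previous_reduce;
        mca_coll_base_module_t *previous_reduce_module;
    } my_fallback_module_t;

    static int my_reduce(const void *sbuf, void *rbuf, int count,
                         struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                         int root, struct ompi_communicator_t *comm,
                         mca_coll_base_module_t *module) {
        my_fallback_module_t *m = (my_fallback_module_t *) module;

        /* If this rank's parameters aren't sufficient for the fast path
         * (e.g. no usable rbuf), delegate to the saved implementation. */
        if (!params_ok_for_fast_path(rbuf, root, comm)) {
            return m->previous_reduce(sbuf, rbuf, count, dtype, op, root,
                                      comm, m->previous_reduce_module);
        }

        return my_fast_reduce(sbuf, rbuf, count, dtype, op, root, comm, module);
    }
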

@gkatev (Contributor, Author) commented Feb 17, 2023

I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).
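
A contrived but standard-conforming illustration of the problem (not code from the PR): rbuf is only significant at the root, so a non-NULL rbuf on other ranks may point to memory that is far too small to be used as scratch space.

    #include <mpi.h>
    #include <stdlib.h>

    void reduce_example(MPI_Comm comm, const double *sbuf, int count) {
        int rank;
        MPI_Comm_rank(comm, &rank);

        double *rbuf = NULL;
        char too_small;   /* never written by MPI on non-root ranks */

        if (rank == 0) {
            rbuf = malloc(count * sizeof(double));
        }

        /* Passing &too_small on non-roots is legal, which is why a component
         * cannot treat "rbuf != NULL" as "rbuf is usable memory". */
        MPI_Reduce(sbuf, (rank == 0) ? (void *) rbuf : (void *) &too_small,
                   count, MPI_DOUBLE, MPI_SUM, 0, comm);

        if (rank == 0) {
            free(rbuf);
        }
    }
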

In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

@bwbarrett (Member)

> I could do that with the root parameter, but from what I understand not with the recvbuf one. Even if it is non-NULL, I can't assume that it points to valid and appropriately sized memory. And I'd need to know that this is the case for all ranks (or at least the leaders if I recall correctly). Unless the standard says it has to be either NULL or fully valid (??).

I think you're right; that's a bummer. I'm not a huge fan of the coupling between xhc and han, especially to the level in the last patch, but maybe it is unavoidable.

> In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by xhc's reduce when it requests a temporary buffer to use.

Other than the HAN/xhc coupling, that makes sense. The rbuf definition in the MPI standard is rather unfortunate, but such is life.

@bosilca (Member) left a review comment

This is a big PR; it takes time to review and to understand how it interacts with the rest of the OMPI internals. But here are some comments for you.

int return_code = OMPI_SUCCESS;
int ret;

errno = 0;
@bosilca (Member):

?

@gkatev (Contributor, Author):

What is the question here? Why errno is set to 0? If so, it's for error reporting: in case something fails, I show what errno indicated. In the past I've had trouble where errno was set to non-zero outside the code in question, and when the code failed in a way that did not alter errno, the stale value of errno misleadingly hinted at another error. I don't have strong feelings about it -- let me know if it would be best to remove it.

@bosilca (Member):

We report errors via the return code, not via errno. Moreover, why set it to success?

@gkatev (Contributor, Author) commented Feb 20, 2023:

I also report init errors via a help text -- perhaps that could/should go away. In the olden times before smsc, xhc_init also handled xpmem initialization, so failures there were even more common due to xpmem misconfigurations, and it was helpful to be explicitly notified of them. So the help text there is in part a remnant that could potentially be removed.

Or maybe it should only appear when e.g. the priority is explicitly set to non-zero. Or maybe something like that would be nice in coll/base -- a mechanism to inform the user that something went wrong with something that they explicitly requested. Anyway.

Setting errno to 0 brings it to a known state. Suppose that some previously running code encounters an error, e.g. a failed open for a non-existent file, but does not abort because of it. My init may then fail for some unrelated reason, but errno will misleadingly indicate that the failure is related to "No such file or directory". (I'm not 100% sure on errno conventions and whether it's OK to reset it like that.)
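
A self-contained toy program illustrating the stale-errno scenario described above (illustrative only, not xhc code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Earlier, unrelated code fails and leaves errno set to ENOENT... */
        int fd = open("/definitely/does/not/exist", O_RDONLY);
        if (fd >= 0) {
            close(fd);
        }

        /* ...later, a failure path that never touched errno reports it anyway. */
        int init_ok = 0;   /* pretend initialization failed for another reason */
        if (!init_ok) {
            /* Without an "errno = 0" beforehand, this misleadingly prints
             * "No such file or directory" for a completely unrelated failure. */
            fprintf(stderr, "init failed: %s\n",
                    errno != 0 ? strerror(errno) : "(errno not set)");
        }
        return 0;
    }
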

return return_code;
}

void xhc_deinit(mca_coll_xhc_module_t *module) {
@bosilca (Member):

we tend to call these xhc_fini, and they should either follow the naming scheme or be static.

@gkatev (Contributor, Author) commented Feb 20, 2023:

OK, I will adjust it. There are also a number of other functions that are not static and begin with xhc_, so I'll fix those as well.

@gkatev (Contributor, Author):

I renamed it to xhc_fini to match the other coll components. I also fixed the naming of all non-static functions.

For the functions that are non-static because they are called across xhc files, but are only intended to be used internally, I added some macros that omit the mca_coll_ prefix, so as to keep the names short and wieldy, and to offer a slight hint at their internal nature.
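
For illustration, the aliasing pattern amounts to something like this (hypothetical header and macro body; the actual names in the patch may differ):

    /* In an internal header: the exported symbol keeps the full mca_coll_xhc_
     * prefix required by the framework's naming conventions, while a short
     * alias is used inside the component's own source files. */
    void mca_coll_xhc_fini(mca_coll_xhc_module_t *module);

    /* Short, internal-use alias */
    #define xhc_fini(module) mca_coll_xhc_fini(module)
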

int return_code = OMPI_SUCCESS;
int ret;

comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));
@bosilca (Member):

This is ugly, please don't initialize a variable in a function call argument list!

@gkatev (Contributor, Author):

I don't have too strong feelings about it, but I think it's rather nice; it is compact and groups together both the allocation of the memory and the assignment of the variable tracking its size. And it has all its parentheses! Or is there a correctness concern here?
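
For reference, the two styles side by side (illustrative fragment; assumes stdlib.h and xhc_comm_t are in scope):

    xhc_comm_t *comms;
    int comms_size;

    /* Compact form from the patch: the size bookkeeping and the allocation
     * happen in a single statement. */
    comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));

    /* Conventional form requested in review: assign first, then allocate. */
    comms_size = 5;
    comms = malloc(comms_size * sizeof(xhc_comm_t));
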

@gkatev (Contributor, Author) commented Feb 22, 2023

Thanks for the review and the initial comments. I addressed most of them with code changes. I added them as a new commit, as I imagined that would make keeping track of the improvements easier (?) (to be squashed before merging).

@jjhursey (Member)

bot:ibm:retest

@bosilca (Member) commented Mar 31, 2023

Finally done with the review and 👍. However, I would like to revisit the discussion above about the rbuf on reduce for non-root ranks. I understand the interest, but I do not like the implementation. I understand the issue arises from the MPI standard itself, but I would like to propose an OMPI-specific and sensible solution: in our upper-level implementation we switch the rbuf coming from the user to NULL on all non-root processes, to indicate that we don't know if its use is safe. Thus, if a component receives a non-NULL buffer, it indicates that some other part of OMPI has allocated the memory and it is therefore safe to use. This will have no impact on existing collective components and will allow this component to remove the hack with the root. Thoughts?
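
A rough sketch of what this convention could look like at the upper (MPI API) level -- a paraphrase of the proposal, not actual patch code:

    /* Sketch of the MPI-level reduce path: sendbuf/recvbuf/count/datatype/op/
     * root/comm are the arguments the user passed to MPI_Reduce. Non-root
     * processes pass NULL down to the selected coll component. */
    void *coll_rbuf = (ompi_comm_rank(comm) == root) ? recvbuf : NULL;

    /* Any component that now sees a non-NULL rbuf can trust that some OMPI
     * layer (e.g. HAN) provided real, usable memory on purpose. */
    int err = comm->c_coll->coll_reduce(sendbuf, coll_rbuf, count, datatype,
                                        op, root, comm,
                                        comm->c_coll->coll_reduce_module);
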

@bwbarrett (Member)

@bosilca I really like that solution.

@gkatev (Contributor, Author) commented Apr 1, 2023

I like it! I'll implement/test it on the XHC side and update the PR

@gkatev (Contributor, Author) commented Apr 1, 2023

The only caveat is that the root, for whom sbuf and rbuf are both non-NULL, won't know whether rbuf is present for other ranks or not. Looks like communication between root and at least one other rank is necessary in order for the root to determine whether to use the fallback reduce or not.

It's certainly possible and better than the other hack, so I'll think it over.

Part of the problem is the incomplete implementation of reduce. In the future, the decision shouldn't be whether to fall back to another component or not (based on the presence of rbuf), but rather whether to allocate a temporary buffer or not for non-roots. The latter does not require any rbuf-related action on the root's side.

@bosilca (Member) commented Apr 2, 2023

I see, you need more than local knowledge about the buffers. My patch will definitely not provide that; it just gives us all (i.e., implementers of collective algorithms) better knowledge of what we can use locally.

Let me think a little about the buffer management; the cleaner approach would be a well-defined internal API for management of temporary buffers. If you don't mind, send me a message (email or Slack) about the requirements you would need from such an API.

@gkatev (Contributor, Author) commented May 5, 2023

I got rid of the hacky approach for the HAN allreduce. XHC's reduce now keeps a sticky internal buffer that it manually allocates and grows in size when necessary. If rbuf is NULL (and thanks to #11552, a scenario where rbuf is non-NULL but is not valid is not possible), the internal buffer is used instead. During HAN's allreduce, in XHC's reduce, the rbuf will be present for all ranks, and in this case an internal rbuf won't need to be allocated. The only case in which XHC's reduce falls back to another component now is when root != 0, but this only requires local knowledge to determine.
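
Roughly, the grow-on-demand scratch buffer amounts to something like this (hypothetical names, simplified; not the literal xhc code):

    #include <stdlib.h>

    /* Persistent ("sticky") scratch buffer kept by the module, grown on demand. */
    typedef struct {
        void  *buf;    /* lazily allocated scratch space  */
        size_t size;   /* current capacity in bytes       */
    } sticky_buf_t;

    static void *sticky_buf_get(sticky_buf_t *sb, size_t needed) {
        if (sb->size < needed) {
            void *p = realloc(sb->buf, needed);
            if (NULL == p) {
                return NULL;
            }
            sb->buf = p;
            sb->size = needed;
        }
        return sb->buf;
    }

    /* In reduce: use the caller's rbuf if one was provided (guaranteed usable
     * after #11552), otherwise fall back to the module's scratch buffer. */
    static void *pick_reduce_dst(sticky_buf_t *sb, void *rbuf, size_t bytes) {
        return (NULL != rbuf) ? rbuf : sticky_buf_get(sb, bytes);
    }
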

This PR should ideally be merged after #11552. See also #11650.

I also applied a bunch of misc fixes and a small new feature for the barrier's root rank (it should all be in github's comparison).

@gkatev (Contributor, Author) commented Jan 29, 2024

Hello @bosilca, and others, is now a good time to look into completing the integration of this PR? From what I understand, the only prerequisites are #11552 and #11650, correct? Can I assist in any way, e.g. with testing or code? I could look into putting together a PR to address the issue identified in #11650 (?). I'll also bring this PR up to speed with the latest main, and test to make sure everything still works.

@edgargabriel (Member) commented Feb 9, 2024

I performed some tests and measurements with this component. I think the performance improvements are very significant, and we should try to merge this in the near future, before we branch for the next release. (Short-message allreduce was the only area where the xhc component was slower than tuned, but I would suspect that this could be tuned as well.) I can also provide the raw data instead of graphs if people prefer.

Performance graphs (attached): xhc_osu_bcast_64, xhc_osu_reduce_64, xhc_osu_allreduce_64

@gkatev force-pushed the coll_xhc branch 2 times, most recently from 89d8897 to cbeff9a (March 4, 2024)
@gkatev (Contributor, Author) commented Apr 3, 2024

With #11552 and #11650 completed, I believe we are now ready to merge this? Is there something else pending?

Thanks for the benchmarks Edgar, I'll take a look regarding the short-message allreduce. It's also possible that it has already been improved/fixed in upcoming versions; I might ask you at some point to repeat the test!

@gkatev force-pushed the coll_xhc branch 3 times, most recently from 9508b41 to 8c032f2 (April 3, 2024)
@gkatev (Contributor, Author) commented Apr 3, 2024

There were some mpi4py CI failures, with the tests timing out (327, 328). They actually went away after I re-ran the CI twice. Kind of mysterious, since none of the changes introduced here are expected to be exercised at all unless specifically activated. I'll try to see if I can reproduce the failures on my machine. In the meantime, any insights on the mpi4py CI and whether the failures are related to this PR would be welcome...

@hppritcha (Member)
Wrt the mpi4py failures, they look to be related to intermittent problems we are seeing with intercommunicators. They are highly unlikely to be triggered by the work in this PR.

@wenduwan (Contributor) commented May 1, 2024

@gkatev We should have brought this to your attention earlier.

During our developer meeting last week, we had a constructive discussion about this PR. The consensus is that the change is rather safe, as it does not alter default collective behaviors. However, from a code maintenance perspective, the community is severely under-staffed (we are all volunteers). We want to understand if your sponsor would be committed to maintaining XHC in the short to long term. This includes fixing bugs, as well as reviewing PRs from other contributors.

@bosilca (Member) commented May 1, 2024

@gkatev I sent you an email earlier this week about the topics mentioned above but have not received an answer yet.

@wenduwan (Contributor) commented May 1, 2024

@bosilca Sorry I didn't know that.

@bosilca (Member) commented May 1, 2024

No problem; my emails either bounced back (the address from the XHC paper) or acted as a black hole (his GitHub email). Let's hope a direct poke at @gkatev works better ;)

@gkatev (Contributor, Author) commented May 1, 2024

Hi, sorry for the delay in responding, I just sent a reply. (short answer: yes, we are able to support this work as necessary!)

(PS the email from the paper will stop bouncing quite soon...)

@bosilca (Member) commented May 1, 2024

Please revise this PR to account for the merge of the init/fini PR (#12429)

gkatev added 2 commits May 1, 2024 22:47
XHC implements hierarchical and topology-aware intra-node
collectives, using XPMEM for inter-process communication.
See the README for more information.

Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
@gkatev (Contributor, Author) commented May 1, 2024

ready

@bosilca bosilca merged commit 37a7896 into open-mpi:main May 2, 2024
14 checks passed
@gkatev gkatev deleted the coll_xhc branch May 2, 2024 15:24