Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the acoll component #12484

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

amd-nithyavs
Copy link

This PR introduces "acoll", a high-performant collective component that is optimized for communications within a single node of AMD EPYC CPUs. It mainly uses subcommunicators based on l3cache or numa to reduce cross-cache or cross-numa accesses. The supported collectives include Bcast, Allreduce, Gather, Reduce, Barrier, Allgather.

OSU micro-benchmarks were run on 2-socket AMD EPYC 9654 96-Core Processor with 4 NUMA domains per socket, with a total of 192 cores per node, on top of commit bb7ecde.
Average percentage latency reduction over "tuned" across 32, 64, 96, 128, 192 ranks over message sizes of 8 bytes to 8 MB (varied in powers of 2):

  • Allreduce: 37.7%
  • Bcast: 39.4%
  • Gather: 27.5%

Sample graphs:

Allreduce
allreduce

Bcast
bcast

Gather
gather

@bosilca
Copy link
Member

bosilca commented Apr 22, 2024

How this compares with #10470 ?

@edgargabriel
Copy link
Member

edgargabriel commented Apr 22, 2024

How this compares with #10470 ?

We can discuss it at the meeting. Part of the goal of filing the pr was to give people the ability to have a look at it ahead of the meeting if they want/can.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have plan to add alltoall(v) to acoll?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are planning to add alltoall to acoll next.

* chosen, further decides if [ring|lin] allgather is to be used.
*
*/
static inline void coll_allgather_decision_fixed(int size, size_t total_dsize, int sg_size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you shed some lights on how to choose which methods for other intel/amd achitectures? you might also want some utility to let the user to adjust the decisions according for other systems.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our testing has been mostly focused on Zen architectures, we will soon test on other architectures. We do not have the utility/config option to override decisions, we will plan to add it.

@@ -0,0 +1,15 @@
Copyright (c) 2023-2024 Advanced Micro Devices, Inc. All rights
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you thought about what needs to be done to extend this for multiple nodes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the APIs (like bcast, barrier, allgather) support multi-node case. However, it is not extensively tested for multi-node, we will test them and extend other APIs also to multi-node.

/*
* rd_allgather_sub
*
* Function: Uses recursive doubling based allgather for the group.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you compared the performance of other methods, besides recursive doubling?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, acoll/allgather chooses among recursive doubling, ring and linear based on process count and message sizes.

}

/* This barrier is needed to prevent random hangs */
err = ompi_coll_base_barrier_intra_tree(comm, module);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the barrier is needed here? This barrier will also add cost to small message allgather.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is removed now.

if (sbuf != MPI_IN_PLACE)
memcpy(tmp_rbuf, sbuf, my_count_size * dsize);
} else {
ompi_3buff_op_reduce(op, (char *) data->xpmem_saddr[0] + chunk * rank * dsize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the 3 operator reduce function to maintain the order?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was a bit faster than copying the chunks first and then reducing later in the following "for" loop.

@bosilca
Copy link
Member

bosilca commented Apr 29, 2024

Please rebase to current main to get rid of the mpi4py failure.

@wenduwan
Copy link
Contributor

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

@hppritcha
Copy link
Member

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

@mshanthagit
Copy link

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

@hppritcha we did have issues after rebase. Have fixed the issues, will update the PR soon. Thanks.

@mshanthagit
Copy link

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

The updated PR (yet to be pushed) will fix this issue. Thanks.

@amd-nithyavs
Copy link
Author

I tested the PR in AWS CI. I'm seeing assertion errors with --enable-debug

# [ pairs: 18 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
osu_mbw_mr: coll_acoll_allgather.c:391: mca_coll_acoll_allgather: Assertion `subc->local_r_comm != NULL' failed.

You can try osu_mbw_mr.

The issue is fixed in the updated PR.

@amd-nithyavs
Copy link
Author

@amd-nithyavs could you rebase this PR to see if that clears up the mpi4py CI failure?

We have updated the PR, it passes the mpi4py tests.

@wenduwan
Copy link
Contributor

wenduwan commented May 8, 2024

Running AWS CI

@wenduwan
Copy link
Contributor

wenduwan commented May 8, 2024

@amd-nithyavs I noticed that the PR is currently split into 3 commits. Please squash them before merging.

@wenduwan
Copy link
Contributor

wenduwan commented May 8, 2024

Passed AWS CI. Note that we don't test with xpmem.

Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

2f7c5e2: Merge latest of local ompiv5

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@amd-nithyavs
Copy link
Author

@wenduwan We have rebased to the latest and squashed the commits.

Copy link
Contributor

@wenduwan wenduwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed a couple files. Haven't looked into the algorithm themselves yet. Will pick up later.

@@ -0,0 +1,11 @@
Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, the copyright notice format in ompi is

Copyright (c) YYYY[-YYYY] Entity etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change in the updated patch

AC_DEFUN([MCA_ompi_coll_acoll_CONFIG],[
AC_CONFIG_FILES([ompi/mca/coll/acoll/Makefile])
# ToDo: Check for a proper way to pass args 1 and 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain what this TODO means? Positional arguments can be accesses in shell style, e.g. $1 $2.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove TODO. We want acoll to build with or without xpmem, this comment was to check if there is a better way to than " OPAL_CHECK_XPMEM([coll_acoll], [should_build=1], [should_build=1])". Passing $1 $2 wouldn't build acoll if xpmem is not present.

return temp_ptr;
}

static inline void coll_acoll_free(coll_acoll_reserve_mem_t *reserve_mem_ptr, void *ptr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the function name could be more specific.

Also I don't quite understand coll_acoll_reserve_mem_t. It looks to me that each instance tracks a single memory allocation. Since coll_acoll_reserve_mem_t.reserve_mem already points to the allocation, why does the caller have to pass in the ptr again?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use coll_acoll_reserve_mem_t to track the use of a pre-allocated memory. However, when the requested size during coll_acoll_malloc() is greater than that of the pre-allocated memory, we allocate new memory, which needs to be then freed during coll_acoll_free(). Hence the need to pass ptr.

}
}

static inline int log_sg_bcast_intra(void *buff, int count, struct ompi_datatype_t *datatype,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the *sg_bcast_intra functions are only used in allgather, so should they be moved to allgather file?

I don't have a strong opinion, but it caught my eye that we are doing actual collectives in utils - that's unusual.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will move to allgather.

if ((false == reserve_mem_ptr->reserve_mem_allocate)
|| (false == reserve_mem_ptr->reserve_mem_in_use)) {
if (NULL != ptr) {
free(ptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the NULL check should be moved to the top of the function. But please see my comment about coll_acoll_reserve_mem_t - maybe we don't need ptr at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the reply to the previous comment on coll_acoll_reserve_mem_t. Since ptr will be the same as reserve_mem (the pre-allocated memory) when the memory size needed is less than or equal to that of the pre-allocated memory, the null check should be inside to ensure it is the case corresponding to a new buffer.

ret = -1;
goto error_hndl;
}
sprintf(rc_name, "acoll_%d_%d_%d", cid, rank, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit If we care about the name length then we should consider snprintf instead of sprintf


data->rcache[i] = mca_rcache_base_module_create("grdma", NULL, &rcache_element);
if (data->rcache[i] == NULL) {
printf("Error in rcache create\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to clean up printf in this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

Comment on lines 24 to 25
# include "opal/mca/rcache/base/base.h"
# include <xpmem.h>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit I don't think we need to indent includes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will fix it.


END_C_DECLS

#define MCA_COLL_ACOLL_MAX_CID 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment about CID. I'm not sure if it is intended to be used in collectives.

}
#endif

void mca_coll_acoll_barrier(coll_acoll_data_t *data, int offset, int *group, int gp_size, int rank,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to be at the wrong file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rename the function, this is not a generic MPI barrier function, it is used internallly for small message allreduce only

acoll is a collective component optimized for AMD "Zen"-based
processors. It supports Bcast, Allreduce, Reduce, Barrier, Gather and
Allgather APIs.

Signed-off-by: Nithya V S <Nithya.VS@amd.com>
@mshanthagit
Copy link

@wenduwan we have addressed the comments and incorporated some big count related changes as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants