Offload reduction operations to accelerator devices #12318

Draft: wants to merge 74 commits into base: main
Commits (74):
35ff1da  Initial draft of CUDA device support for ops (Mar 3, 2023)
b7e6f89  First working version of CUDA op support (Mar 14, 2023)
164388a  Update copyright header (Mar 14, 2023)
d8110ac  Fix minor bugs to get osu_allreduce working (Mar 14, 2023)
f609127  cuMemAllocAsync is supported since CUDA 11.2.0 (Mar 15, 2023)
8ae3dac  coll/base/allreduce: Condition device allocation on op/dtype support (Mar 15, 2023)
655948f  Make sure the device op callbacks are zero-initialized (Mar 19, 2023)
7cdc828  Be more graceful when creating a context and stream (Mar 19, 2023)
bdb16a1  fix wrong call to memset (devreal, Mar 31, 2023)
5934f43  Add detector for cudart (devreal, Mar 31, 2023)
c2c3d0e  Add CUDA stream-based allocator and memory pools (devreal, Apr 5, 2023)
5df449c  Don't memset the CUDA op component, we need the version (devreal, Apr 6, 2023)
812d068  Set the memory pool release threshold (devreal, Apr 6, 2023)
a688c84  Implement device-compatible allocator to cache coll temporaries (devreal, Apr 11, 2023)
bbd362d  Fix devicebucket allocator for larger sizes (devreal, Apr 12, 2023)
1fd6636  Fix the RDMA fallback protocol selection. (bosilca, Apr 14, 2023)
f2f0f2d  Stream-based reduction and ddt copy and 3buff cuda kernels, adopted f… (devreal, Apr 14, 2023)
8f5b503  Remove extra copies from allreduce redscat and ring (devreal, Apr 19, 2023)
1c68d17  Allow ops and memcpy on managed memory from the host (devreal, Apr 19, 2023)
70dde0f  reduce_local: add support for device memory (devreal, Apr 19, 2023)
e603bcc  Draft of ompi_op_select_device (devreal, Apr 19, 2023)
60dd446  Second draft of ompi_op_select_device (devreal, Apr 19, 2023)
c485ecf  Fix undefined symbols in cuda op component (devreal, Apr 27, 2023)
793863c  Fix off-by-one error in device-bucket allocator (devreal, Apr 28, 2023)
d2e8677  Heuristic to select op device based on element count (devreal, Apr 28, 2023)
cd7e578  init op_rocm, not compilable yet (May 2, 2023)
2ccaa87  implemented funcs in accelerator_rocm modules (May 3, 2023)
a6f1cce  add -I include path to Makefile (May 3, 2023)
ce0b88d  added rocm codes into test example (May 16, 2023)
ad420fe  fixed kernel launches in hip (May 16, 2023)
c3c3287  Make headers in reduce_local better parsable (devreal, Jun 27, 2023)
9674aae  CUDA: disable internal memory pool (seems broken) (devreal, Jun 27, 2023)
628c0f1  Op: minor comment correction (devreal, Jun 27, 2023)
251dac4  Reduce_local: set hip device during init (devreal, Jun 28, 2023)
7589d17  CUDA accelerator: fix compiler warnings (devreal, Jun 28, 2023)
ead6847  Device op: pass device to lower-level op to avoid recurring queries (devreal, Jun 28, 2023)
ee31b60  CUDA/ROCm: Fix vectorized ops and rocm integration (devreal, Jun 28, 2023)
9ab499a  Reduce_local: use OPAL defines to detect device support (devreal, Jun 28, 2023)
dbd855d  CUDA op: fix vectorized ops (devreal, Jun 28, 2023)
02120c9  Reduce: add vectors to cuda implementation (devreal, Jul 12, 2023)
7cdbe24  Allreduce: cleanup and minor fixes (devreal, Jul 18, 2023)
c7fe5f6  Add MCA op_[cuda|rocm]_max_num_[blocks|threads] (devreal, Jul 19, 2023)
42bd424  Fix the generation of "unsigned char" ops. (bosilca, Jul 19, 2023)
8e3d042  We need CXX17 for the CUDA ops. (bosilca, Jul 19, 2023)
7524f99  ROCM: add vectorization of some basic ops (devreal, Jul 20, 2023)
cfe8a5a  Device allocators: correctly handle non-zero ID single accelerator (devreal, Jul 20, 2023)
3bc7676  CUDA op: consistently name unsigned_long functions as ulong (devreal, Jul 20, 2023)
9c1da7e  ROCM op: remove debug output (devreal, Jul 20, 2023)
a20f671  Reduce_local test: correctly test for OPAL_CUDA_SUPPORT and OPAL_ROCM… (devreal, Jul 20, 2023)
97338db  More unsigned_long -> ulong fixes in CUDA and ROCm op (devreal, Jul 20, 2023)
541b8a0  Fix type in ulong conversion (devreal, Jul 20, 2023)
8cb2feb  Reduce_local: access only host-side memory in error message (devreal, Jul 20, 2023)
2996ba0  Make sure CUDA accelerator is initialized before querying number of d… (devreal, Jul 20, 2023)
246003f  Accelerator: provide peak bandwidth estimate (devreal, Jul 24, 2023)
6601484  accelerator/rocm: regular memory behaves like unified memory (devreal, Jul 24, 2023)
d0fe9a2  ROCM: add missing FUNC_FUNC_FN macro (devreal, Jul 24, 2023)
63b64a0  opal_datatype_accelerator_memcpy: determine device copy type (devreal, Jul 26, 2023)
5a29e13  accelerator rocm: fix global memcpy stream variable (devreal, Jul 26, 2023)
5c7c7a1  Thread base: fix missing include file (devreal, Jul 26, 2023)
76f00c4  Accelerator: Remove debug output (devreal, Jul 26, 2023)
56bcfee  Allreduce: don't copy inputs if data can be accessed from the host (devreal, Jul 26, 2023)
a1f089e  Be more careful when releasing temporary receive buffers (devreal, Nov 6, 2023)
33616e6  Remove debug output and dead code (devreal, Nov 6, 2023)
9da8b54  Bump max devicebucket allocator max size to 1GB (devreal, Nov 6, 2023)
93ded5e  accelerator/cuda: fix error message (devreal, Nov 6, 2023)
182e6fa  CUDA: Select compute capability 52 by default (devreal, Nov 6, 2023)
e5eb45f  Sqash const correctness warnings (devreal, Nov 7, 2023)
14a5372  Squash warnings about mismatched function pointer types (devreal, Nov 7, 2023)
1f63809  Squash printfs (devreal, Nov 7, 2023)
3d9f33a  Replace fprintf with show_help (devreal, Nov 7, 2023)
c878c4f  Squash compiler warnings (devreal, Nov 7, 2023)
1c6667d  Clean up cuda and rocm op codes (devreal, Nov 7, 2023)
7bb4b95  Minor tweak to CUDA op configury (devreal, Nov 7, 2023)
d1382c3  Fix rebase errors (devreal, Nov 8, 2023)
120 changes: 120 additions & 0 deletions config/opal_check_cudart.m4
@@ -0,0 +1,120 @@
dnl -*- autoconf -*-
dnl
dnl Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
dnl University Research and Technology
dnl Corporation. All rights reserved.
dnl Copyright (c) 2004-2005 The University of Tennessee and The University
dnl of Tennessee Research Foundation. All rights
dnl reserved.
dnl Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
dnl University of Stuttgart. All rights reserved.
dnl Copyright (c) 2004-2005 The Regents of the University of California.
dnl All rights reserved.
dnl Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
dnl Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
dnl Copyright (c) 2009 IBM Corporation. All rights reserved.
dnl Copyright (c) 2009 Los Alamos National Security, LLC. All rights
dnl reserved.
dnl Copyright (c) 2009-2011 Oak Ridge National Labs. All rights reserved.
dnl Copyright (c) 2011-2015 NVIDIA Corporation. All rights reserved.
dnl Copyright (c) 2015 Research Organization for Information Science
dnl and Technology (RIST). All rights reserved.
dnl Copyright (c) 2022 Amazon.com, Inc. or its affiliates. All Rights reserved.
dnl $COPYRIGHT$
dnl
dnl Additional copyrights may follow
dnl
dnl $HEADER$
dnl


# OPAL_CHECK_CUDART(prefix, [action-if-found], [action-if-not-found])
# --------------------------------------------------------
# check if CUDA runtime library support can be found. sets prefix_{CPPFLAGS,
# LDFLAGS, LIBS} as needed and runs action-if-found if there is
# support, otherwise executes action-if-not-found

#
# Check for CUDA support
#
AC_DEFUN([OPAL_CHECK_CUDART],[
    OPAL_VAR_SCOPE_PUSH([cudart_save_CPPFLAGS cudart_save_LDFLAGS cudart_save_LIBS])

    cudart_save_CPPFLAGS="$CPPFLAGS"
    cudart_save_LDFLAGS="$LDFLAGS"
    cudart_save_LIBS="$LIBS"

    #
    # Check to see if the user provided paths for CUDART
    #
    AC_ARG_WITH([cudart],
                [AS_HELP_STRING([--with-cudart=DIR],
                                [Path to the CUDA runtime library and header files])])
    AC_MSG_CHECKING([if --with-cudart is set])
    AC_ARG_WITH([cudart-libdir],
                [AS_HELP_STRING([--with-cudart-libdir=DIR],
                                [Search for CUDA runtime libraries in DIR])])

    ####################################
    #### Check for CUDA runtime library
    ####################################
    AS_IF([test "x$with_cudart" != "xno" || test "x$with_cudart" = "x"],
          [opal_check_cudart_happy=no
           AC_MSG_RESULT([not set (--with-cudart=$with_cudart)])],
          [AS_IF([test ! -d "$with_cudart"],
                 [AC_MSG_RESULT([not found])
                  AC_MSG_WARN([Directory $with_cudart not found])]
                 [AS_IF([test "x`ls $with_cudart/include/cuda_runtime.h 2> /dev/null`" = "x"]
                        [AC_MSG_RESULT([not found])
                         AC_MSG_WARN([Could not find cuda_runtime.h in $with_cudart/include])]
                        [opal_check_cudart_happy=yes
                         opal_cudart_incdir="$with_cudart/include"])])])

    AS_IF([test "$opal_check_cudart_happy" = "no" && test "$with_cudart" != "no"],
          [AC_PATH_PROG([nvcc_bin], [nvcc], ["not-found"])
           AS_IF([test "$nvcc_bin" = "not-found"],
                 [AC_MSG_WARN([Could not find nvcc binary])],
                 [nvcc_dirname=`AS_DIRNAME([$nvcc_bin])`
                  with_cudart=$nvcc_dirname/../
                  opal_cudart_incdir=$nvcc_dirname/../include
                  opal_check_cudart_happy=yes])]
          [])

    AS_IF([test x"$with_cudart_libdir" = "x"],
          [with_cudart_libdir=$with_cudart/lib64/]
          [])

    AS_IF([test "$opal_check_cudart_happy" = "yes"],
          [OAC_CHECK_PACKAGE([cudart],
                             [$1],
                             [cuda_runtime.h],
                             [cudart],
                             [cudaMalloc],
                             [opal_check_cudart_happy="yes"],
                             [opal_check_cudart_happy="no"])],
          [])

    AC_MSG_CHECKING([if have cuda runtime library support])
    if test "$opal_check_cudart_happy" = "yes"; then
        AC_MSG_RESULT([yes (-I$opal_cudart_incdir)])
        CUDART_SUPPORT=1
        common_cudart_CPPFLAGS="-I$opal_cudart_incdir"
        AC_SUBST([common_cudart_CPPFLAGS])
    else
        AC_MSG_RESULT([no])
        CUDART_SUPPORT=0
    fi

    OPAL_SUMMARY_ADD([Accelerators], [CUDART support], [], [$opal_check_cudart_happy])
    AM_CONDITIONAL([OPAL_cudart_support], [test "x$CUDART_SUPPORT" = "x1"])
    AC_DEFINE_UNQUOTED([OPAL_CUDART_SUPPORT],$CUDART_SUPPORT,
                       [Whether we have cuda runtime library support])

    CPPFLAGS=${cudart_save_CPPFLAGS}
    LDFLAGS=${cudart_save_LDFLAGS}
    LIBS=${cudart_save_LIBS}
    OPAL_VAR_SCOPE_POP
])dnl
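When the probe succeeds, the macro AC_DEFINEs OPAL_CUDART_SUPPORT to 1 and exports common_cudart_CPPFLAGS, so C sources can guard their use of the CUDA runtime at compile time. A minimal sketch of such a guard follows; the helper name and the host fallback policy are illustrative assumptions, not code from this PR:

    #include <stdlib.h>
    #include "opal_config.h"        /* provides OPAL_CUDART_SUPPORT (0 or 1) */

    #if OPAL_CUDART_SUPPORT
    #include <cuda_runtime.h>       /* found via common_cudart_CPPFLAGS */
    #endif

    /* Hypothetical helper: place collective scratch space on the device when
     * the CUDA runtime was detected at configure time, otherwise use the host. */
    static void *alloc_reduce_scratch(size_t size)
    {
    #if OPAL_CUDART_SUPPORT
        void *ptr = NULL;
        if (cudaSuccess != cudaMalloc(&ptr, size)) {
            return NULL;
        }
        return ptr;
    #else
        return malloc(size);
    #endif
    }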
16 changes: 12 additions & 4 deletions ompi/datatype/ompi_datatype.h
@@ -275,8 +275,9 @@ ompi_datatype_set_element_count( const ompi_datatype_t* type, size_t count, size
 }
 
 static inline int32_t
-ompi_datatype_copy_content_same_ddt( const ompi_datatype_t* type, size_t count,
-                                     char* pDestBuf, char* pSrcBuf )
+ompi_datatype_copy_content_same_ddt_stream( const ompi_datatype_t* type, size_t count,
+                                            char* pDestBuf, char* pSrcBuf,
+                                            opal_accelerator_stream_t *stream )
 {
     int32_t length, rc;
     ptrdiff_t extent;
@@ -285,8 +286,8 @@ ompi_datatype_copy_content_same_ddt( const ompi_datatype_t* type, size_t count,
     while( 0 != count ) {
         length = INT_MAX;
         if( ((size_t)length) > count ) length = (int32_t)count;
-        rc = opal_datatype_copy_content_same_ddt( &type->super, length,
-                                                  pDestBuf, pSrcBuf );
+        rc = opal_datatype_copy_content_same_ddt_stream( &type->super, length,
+                                                         pDestBuf, pSrcBuf, stream );
         if( 0 != rc ) return rc;
         pDestBuf += ((ptrdiff_t)length) * extent;
         pSrcBuf += ((ptrdiff_t)length) * extent;
@@ -295,6 +296,13 @@ ompi_datatype_copy_content_same_ddt( const ompi_datatype_t* type, size_t count,
     return 0;
 }
 
+static inline int32_t
+ompi_datatype_copy_content_same_ddt( const ompi_datatype_t* type, size_t count,
+                                     char* pDestBuf, char* pSrcBuf )
+{
+    return ompi_datatype_copy_content_same_ddt_stream(type, count, pDestBuf, pSrcBuf, NULL);
+}
+
 OMPI_DECLSPEC const ompi_datatype_t* ompi_datatype_match_size( int size, uint16_t datakind, uint16_t datalang );
 
 /*
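The stream-aware variant threads an opal_accelerator_stream_t through the existing INT_MAX-chunking loop, while the original ompi_datatype_copy_content_same_ddt entry point now simply forwards a NULL stream, leaving host-only callers unchanged. A hedged sketch of how a device-aware collective might choose between the two paths; the helper and its include choices are assumptions for illustration, not part of this diff:

    #include "ompi/datatype/ompi_datatype.h"
    #include "opal/mca/accelerator/accelerator.h"

    /* Hypothetical helper: copy 'count' elements of 'dtype' from 'src' to 'dst'.
     * If the caller already holds an accelerator stream (device buffers), the
     * copy is enqueued there so it orders after previously submitted reduction
     * kernels; otherwise it falls back to the pre-existing host entry point. */
    static inline int copy_reduction_buffer(const ompi_datatype_t *dtype, size_t count,
                                            char *dst, char *src,
                                            opal_accelerator_stream_t *stream)
    {
        if (NULL != stream) {
            return ompi_datatype_copy_content_same_ddt_stream(dtype, count, dst, src, stream);
        }
        return ompi_datatype_copy_content_same_ddt(dtype, count, dst, src);
    }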