Offload reduction operations to accelerator devices #12318

Draft
wants to merge 74 commits into
base: main

Commits on Nov 7, 2023

  1. Initial draft of CUDA device support for ops

    Signed-off-by: Joseph Schuchart <jschuchart@leconte.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    35ff1da
  2. First working version of CUDA op support

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    b7e6f89
  3. Update copyright header

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    164388a
  4. Fix minor bugs to get osu_allreduce working

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    d8110ac
  5. cuMemAllocAsync is supported since CUDA 11.2.0

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    f609127
  6. coll/base/allreduce: Condition device allocation on op/dtype support

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    8ae3dac
  7. Make sure the device op callbacks are zero-initialized

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    655948f
  8. Be more graceful when creating a context and stream

    Signed-off-by: Joseph Schuchart <jschuchart@xsdk.icl.utk.edu>
    Joseph Schuchart authored and devreal committed Nov 7, 2023
    7cdc828
  9. fix wrong call to memset

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    bdb16a1
  10. Add detector for cudart

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    5934f43
  11. Add CUDA stream-based allocator and memory pools

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    c2c3d0e
  12. Don't memset the CUDA op component, we need the version

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    5df449c
  13. Set the memory pool release threshold

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    812d068
  14. Implement device-compatible allocator to cache coll temporaries

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    a688c84
  15. Fix devicebucket allocator for larger sizes

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    bbd362d
  16. Fix the RDMA fallback protocol selection.

    If the target process is unable to execute an RDMA operation, it
    instructs the origin to change the communication protocol. When this
    happens, the origin must be informed to cancel all pending RDMA operations
    and release the rdma_frag.
    
    Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
    bosilca authored and devreal committed Nov 7, 2023
    1fd6636
  17. Stream-based reduction and ddt copy and 3buff cuda kernels, adopted for allreduce recursive doubling
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    f2f0f2d
  18. Remove extra copies from allreduce redscat and ring

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    8f5b503
  19. Allow ops and memcpy on managed memory from the host

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    1c68d17
  20. reduce_local: add support for device memory

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    70dde0f
  21. Draft of ompi_op_select_device

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    e603bcc
  22. Second draft of ompi_op_select_device

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    60dd446
  23. Fix undefined symbols in cuda op component

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    c485ecf
  24. Fix off-by-one error in device-bucket allocator

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    793863c
  25. Heuristic to select op device based on element count

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    d2e8677
  26. init op_rocm, not compilable yet

    Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
    Phuong Nguyen authored and devreal committed Nov 7, 2023
    cd7e578
  27. implemented funcs in accelerator_rocm modules

    Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
    Phuong Nguyen authored and devreal committed Nov 7, 2023
    2ccaa87
  28. add -I include path to Makefile

    Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
    Phuong Nguyen authored and devreal committed Nov 7, 2023
    a6f1cce
  29. added rocm codes into test example

    Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
    Phuong Nguyen authored and devreal committed Nov 7, 2023
    ce0b88d
  30. fixed kernel launches in hip

    Signed-off-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
    Phuong Nguyen authored and devreal committed Nov 7, 2023
    ad420fe
  31. Make headers in reduce_local better parsable

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    c3c3287
  32. CUDA: disable internal memory pool (seems broken)

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    9674aae
  33. Op: minor comment correction

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    628c0f1
  34. Reduce_local: set hip device during init

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    251dac4
  35. CUDA accelerator: fix compiler warnings

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    7589d17
  36. Device op: pass device to lower-level op to avoid recurring queries

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    ead6847
  37. CUDA/ROCm: Fix vectorized ops and rocm integration

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    ee31b60
  38. Reduce_local: use OPAL defines to detect device support

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    9ab499a
  39. CUDA op: fix vectorized ops

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    dbd855d
  40. Reduce: add vectors to cuda implementation

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    02120c9
  41. Allreduce: cleanup and minor fixes

    Replace ompi_op_reduce with ompi_op_reduce_stream(..., NULL) to avoid
    repeated checking for locality in ompi_op_reduce
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    7cdbe24
  42. Add MCA op_[cuda|rocm]_max_num_[blocks|threads]

    These variables allow users to limit the maximum number of blocks and
    threads per block in the reduction kernels. The implementation
    falls back to the device limit if that limit is lower.
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    c7fe5f6
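
The fallback rule described in commit 42 can be read as a simple clamp. The following is a minimal sketch in C, not the code from this PR: the function name `effective_limit` and the convention that a non-positive value means "no user limit" are assumptions made for illustration.

```c
/* Hypothetical helper: combine a user-requested maximum (for example from
 * an MCA variable) with the limit reported by the device.
 * Assumption: user_max <= 0 means the user did not set a limit. */
int effective_limit(int user_max, int device_max)
{
    if (user_max <= 0 || user_max > device_max) {
        /* No user limit, or the device limit is lower: use the device limit. */
        return device_max;
    }
    /* The user asked for less than the device allows: honor the request. */
    return user_max;
}
```

If the variables are registered under the names suggested by the commit title, a run could presumably be restricted with something like `mpirun --mca op_cuda_max_num_threads 256`, followed by the usual application arguments.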
  43. Fix the generation of "unsigned char" ops.

    Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
    bosilca authored and devreal committed Nov 7, 2023
    42bd424
  44. We need CXX17 for the CUDA ops.

    Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
    bosilca authored and devreal committed Nov 7, 2023
    8e3d042
  45. ROCM: add vectorization of some basic ops

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    7524f99
  46. Device allocators: correctly handle non-zero ID single accelerator

    The accelerator component may report the availability of a single accelerator
    whose ID is not zero.
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    cfe8a5a
  47. CUDA op: consistently name unsigned_long functions as ulong

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    3bc7676
  48. ROCM op: remove debug output

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    9c1da7e
  49. Reduce_local test: correctly test for OPAL_CUDA_SUPPORT and OPAL_ROCM_SUPPORT
    
    These macros are defined to either 1 or 0
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    a20f671
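
Because these macros are always defined (to 1 or 0), a check for mere existence gives the wrong answer when support is disabled. The sketch below shows the difference between `#ifdef` and `#if`. The `#define` values stand in for a configure result and are assumptions for the example, as are the helper names. The commit does not show which form the test used before the fix.

```c
/* Assumed configure result for this sketch: CUDA enabled, ROCm disabled. */
#define OPAL_CUDA_SUPPORT 1
#define OPAL_ROCM_SUPPORT 0

/* #ifdef only asks whether the macro exists, so it is true even for 0. */
#ifdef OPAL_ROCM_SUPPORT
#define ROCM_SEEN_BY_IFDEF 1
#else
#define ROCM_SEEN_BY_IFDEF 0
#endif

/* #if evaluates the macro's value, which is what a 1-or-0 macro needs. */
#if OPAL_ROCM_SUPPORT
#define ROCM_SEEN_BY_IF 1
#else
#define ROCM_SEEN_BY_IF 0
#endif

/* Hypothetical accessors so the two results can be inspected at run time. */
int rocm_enabled_ifdef(void) { return ROCM_SEEN_BY_IFDEF; }
int rocm_enabled_if(void)    { return ROCM_SEEN_BY_IF; }
int cuda_enabled_if(void)
{
#if OPAL_CUDA_SUPPORT
    return 1;
#else
    return 0;
#endif
}
```

With the values assumed here, the `#ifdef` variant reports ROCm as available although it is disabled, while the `#if` variant reports it correctly.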
  50. More unsigned_long -> ulong fixes in CUDA and ROCm op

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    97338db
  51. Fix type in ulong conversion

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    541b8a0
  52. Reduce_local: access only host-side memory in error message

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    8cb2feb
  53. Make sure CUDA accelerator is initialized before querying number of devices
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    2996ba0
  54. Accelerator: provide peak bandwidth estimate

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    246003f
  55. accelerator/rocm: regular memory behaves like unified memory

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    6601484
  56. ROCM: add missing FUNC_FUNC_FN macro

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    d0fe9a2
  57. opal_datatype_accelerator_memcpy: determine device copy type

    We know where source and target buffers are located, so pass the right
    transfer direction to the accelerator memcpy call.
    
    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    63b64a0
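
The idea in commit 57 is that once the location of both buffers is known, the copy direction follows directly. A minimal sketch in C, with a hypothetical enum and function name chosen for illustration (the actual type and constants used by the accelerator framework are not shown in this commit list):

```c
#include <stdbool.h>

/* Hypothetical transfer-direction type, for illustration only. */
typedef enum {
    XFER_HTOH, /* host to host */
    XFER_HTOD, /* host to device */
    XFER_DTOH, /* device to host */
    XFER_DTOD  /* device to device */
} xfer_dir_t;

/* Pick the direction from the known locations of destination and source,
 * instead of asking the runtime to work it out on every copy. */
xfer_dir_t pick_direction(bool dst_on_device, bool src_on_device)
{
    if (src_on_device) {
        return dst_on_device ? XFER_DTOD : XFER_DTOH;
    }
    return dst_on_device ? XFER_HTOD : XFER_HTOH;
}
```

The caller would then hand the result to the memcpy call in place of an unspecified direction.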
  58. accelerator rocm: fix global memcpy stream variable

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    5a29e13
  59. Thread base: fix missing include file

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    5c7c7a1
  60. Accelerator: Remove debug output

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    76f00c4
  61. Allreduce: don't copy inputs if data can be accessed from the host

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    56bcfee
  62. Be more careful when releasing temporary receive buffers

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    a1f089e
  63. Remove debug output and dead code

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    33616e6
  64. Bump max devicebucket allocator max size to 1GB

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    9da8b54
  65. accelerator/cuda: fix error message

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    93ded5e
  66. CUDA: Select compute capability 52 by default

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    182e6fa
  67. Squash const correctness warnings

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    e5eb45f
  68. Squash warnings about mismatched function pointer types

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    14a5372
  69. Squash printfs

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    1f63809
  70. Replace fprintf with show_help

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    3d9f33a
  71. Squash compiler warnings

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    c878c4f
  72. Clean up cuda and rocm op codes

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    1c6667d
  73. Minor tweak to CUDA op configury

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 7, 2023
    7bb4b95

Commits on Nov 8, 2023

  1. Fix rebase errors

    Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
    devreal committed Nov 8, 2023
    d1382c3