Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cuda::mr::cuda_async_memory_resource #1637

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

miscco
Copy link
Collaborator

@miscco miscco commented Apr 16, 2024

This implements two facilities for allocating stream order device accessible memory.

The first commit adds cuda_memory_pool, a wrapper around ::cudaMemPool_t that takes care of initialization of the buffer and its destruction.

The second commit adds cuda_async_memory_resource, is a leightweight async_resource. It utilizes either a self-owned cuda_memory_pool that is properly destroyed when it goes out of scope, or wraps a user provided ::cudaMemPool_t.

This allows interoperability between different libraries by retaining an existing pool

Fixes #1514

@miscco miscco requested review from a team as code owners April 16, 2024 08:33
@miscco miscco force-pushed the cuda_async_memory_resource branch 4 times, most recently from 6a353d8 to f3b0145 Compare April 16, 2024 15:37
@miscco miscco force-pushed the cuda_async_memory_resource branch from 8db9b27 to 1af8469 Compare April 18, 2024 06:33
@miscco miscco force-pushed the cuda_async_memory_resource branch from 7812271 to a782e09 Compare April 19, 2024 08:58
@miscco miscco force-pushed the cuda_async_memory_resource branch from c54d3e4 to c86c926 Compare April 22, 2024 06:49
Copy link
Contributor

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review -- just cuda_memory_pool so far.

Copy link
Contributor

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some typos.

@harrism
Copy link
Contributor

harrism commented Apr 23, 2024

I don't really love the mixed ownership / lifetime semantics of this dual-class approach. In RMM we took a dual class approach but with each class having different ownership / lifetime semantics. One class that owns its pool. And another that is a view/reference to a non-owned pool resource. If you take the RMM approach of avoiding mixed ownership semantics, then your current cuda_async_memory_resource, since it is such a thin wrapper and only decides whether to use the "raw" pool or not, becomes basically just a resource_ref.

Therefore, shouldn't we be able to have just one class that owns its pool? It could optionally take an argument to use the default cudaAsync pool instead of allocating one. But the non-owning version can just be replaced by a resource_ref?

If we do go with two classes, we should try to avoid mixed ownership/lifetime semantics.

Copy link
Contributor

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments.

@miscco
Copy link
Collaborator Author

miscco commented Apr 24, 2024

If you take the RMM approach of avoiding mixed ownership semantics, then your current cuda_async_memory_resource, since it is such a thin wrapper and only decides whether to use the "raw" pool or not, becomes basically just a resource_ref.

The issue is that cudaMemPool_t itself is not a resource. So we will in any way need a wrapper around it so that it satisfies the resource concept.

Therefore, shouldn't we be able to have just one class that owns its pool? It could optionally take an argument to use the default cudaAsync pool instead of allocating one. But the non-owning version can just be replaced by a resource_ref?

This would again bring conflicting ownership semantics. My first draft had this behavior and it turned out quite complex

Maybe we should turn what is currently the cuda_memory_pool into cuda_async_memory_resource and then rename the wrapper into something like cuda_mem_pool_resource ?

@harrism
Copy link
Contributor

harrism commented Apr 24, 2024

If you take the RMM approach of avoiding mixed ownership semantics, then your current cuda_async_memory_resource, since it is such a thin wrapper and only decides whether to use the "raw" pool or not, becomes basically just a resource_ref.

The issue is that cudaMemPool_t itself is not a resource. So we will in any way need a wrapper around it so that it satisfies the resource concept.

I don't understand why the thing that owns cudaMempool_t has to satisfy the resource concept. A mempool in the CUDA API doesn't itself allocate memory, it's just a handle. IMO there should be exactly one place where we call cudaMallocAsync and one place where we call cudaFreeAsync (both in the same class).

Therefore, shouldn't we be able to have just one class that owns its pool? It could optionally take an argument to use the default cudaAsync pool instead of allocating one. But the non-owning version can just be replaced by a resource_ref?

This would again bring conflicting ownership semantics. My first draft had this behavior and it turned out quite complex

Where is the conflicting ownership? If you have a cuda_memory_pool that is not a resource but owns a pool, and a cuda_async_memory_resource that takes a cuda_memory_pool (or uses the default) and doesn't own it, I see no ownership conflict. (This is slightly different from what I said above.)

Maybe we should turn what is currently the cuda_memory_pool into cuda_async_memory_resource and then rename the wrapper into something like cuda_mem_pool_resource ?

That naming doesn't make sense to me, sorry.

@miscco miscco force-pushed the cuda_async_memory_resource branch from 8c678e1 to a5ea93a Compare April 25, 2024 11:36
cuda_memory_pool(
const size_t initial_pool_size = 0,
const size_t release_threshold = 0,
const cudaMemAllocationHandleType __allocation_handle_type = cudaMemAllocationHandleType::cudaMemHandleTypeNone)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ensure that the constructor takes the device explicitly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to an API that explicitly takes the device id as a required argument.

I found that it is easy to incorrectly mix up the other arguments, so I encapsulated them into a helper struct.

@harrism does that look reasonable to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will indeed suppress clang-tidy's bugprone-easily-swappable-parameters warning, which is nice. However I often struggle with the struct solution to that problem as I don't see how this helps avoid mixing up the two integral arguments:

cuda::mr::cuda_memory_pool_properties props = {42, 20};

At least it requires the caller to look at the struct and think about it, though. And I think the alternative of having strong types for the initial size and release threshold is overkill.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is something where designated initializers shine

cuda::mr::cuda_memory_pool_properties props = { .initial_pool_size = 42, .release_threshold = 20};

#endif // no system header

// cudaMallocAsync was introduced in CTK 11.2
#if !defined(_CCCL_COMPILER_MSVC_2017) && !defined(_CCCL_CUDACC_BELOW_11_2)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gevtushenko We should move this to cudaNext to ensure that we get the design right

cuda_memory_pool(::cudaMemPool_t) = delete;

cuda_memory_pool(cuda_memory_pool const&) = delete;
cuda_memory_pool(cuda_memory_pool&&) = delete;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this solely because you want to avoid empty states in this class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This should always contain a valid cudaMemPool_t

@miscco miscco force-pushed the cuda_async_memory_resource branch from 5b0fc7e to a94230f Compare May 7, 2024 11:00
@miscco miscco requested a review from a team as a code owner May 7, 2024 11:53
Copy link
Contributor

github-actions bot commented May 7, 2024

🟩 CI Results [ Failed: 0 | Passed: 302 | Total: 302 ]
  • 🟩 Project cub [ Failed: 0 | Passed: 99 | Total: 99 ]

    🟩 cpu
      🟩 amd64 (0% Fail)              Failed:  0  -- Passed: 91  -- Total: 91 
      🟩 arm64 (0% Fail)              Failed:  0  -- Passed:  8  -- Total:  8 
    🟩 ctk
      🟩 11.1 (0% Fail)               Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 11.8 (0% Fail)               Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 12.4 (0% Fail)               Failed:  0  -- Passed: 81  -- Total: 81 
    🟩 cudacxx_full
      🟩 clang-cuda16 (0% Fail)       Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc11.1 (0% Fail)           Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 nvcc11.8 (0% Fail)           Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 nvcc12.4 (0% Fail)           Failed:  0  -- Passed: 79  -- Total: 79 
    🟩 cudacxx_name
      🟩 clang-cuda (0% Fail)         Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc (0% Fail)               Failed:  0  -- Passed: 97  -- Total: 97 
    🟩 cxx_full
      🟩 clang9 (0% Fail)             Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 clang10 (0% Fail)            Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 clang11 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang12 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang13 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang14 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang15 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang16 (0% Fail)            Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 gcc6 (0% Fail)               Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 gcc7 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc8 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc9 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc10 (0% Fail)              Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 gcc11 (0% Fail)              Failed:  0  -- Passed:  7  -- Total:  7 
      🟩 gcc12 (0% Fail)              Failed:  0  -- Passed: 16  -- Total: 16 
      🟩 Intel2023.2.0 (0% Fail)      Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC14.16 (0% Fail)          Failed:  0  -- Passed:  1  -- Total:  1 
      🟩 MSVC14.29 (0% Fail)          Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 MSVC14.39 (0% Fail)          Failed:  0  -- Passed:  3  -- Total:  3 
    🟩 cxx_name
      🟩 clang (0% Fail)              Failed:  0  -- Passed: 43  -- Total: 43 
      🟩 gcc (0% Fail)                Failed:  0  -- Passed: 47  -- Total: 47 
      🟩 Intel (0% Fail)              Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 gpu
      🟩 v100 (0% Fail)               Failed:  0  -- Passed: 99  -- Total: 99 
    🟩 jobs
      🟩 build (0% Fail)              Failed:  0  -- Passed: 91  -- Total: 91 
      🟩 test (0% Fail)               Failed:  0  -- Passed:  8  -- Total:  8 
    🟩 os
      🟩 ubuntu18.04 (0% Fail)        Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 ubuntu20.04 (0% Fail)        Failed:  0  -- Passed: 35  -- Total: 35 
      🟩 ubuntu22.04 (0% Fail)        Failed:  0  -- Passed: 44  -- Total: 44 
      🟩 windows2022 (0% Fail)        Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 sm
      🟩 60;70;80;90 (0% Fail)        Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 90a (0% Fail)                Failed:  0  -- Passed:  4  -- Total:  4 
    🟩 std
      🟩 11 (0% Fail)                 Failed:  0  -- Passed: 26  -- Total: 26 
      🟩 14 (0% Fail)                 Failed:  0  -- Passed: 29  -- Total: 29 
      🟩 17 (0% Fail)                 Failed:  0  -- Passed: 28  -- Total: 28 
      🟩 20 (0% Fail)                 Failed:  0  -- Passed: 16  -- Total: 16 
    
  • 🟩 Project thrust [ Failed: 0 | Passed: 99 | Total: 99 ]

    🟩 cpu
      🟩 amd64 (0% Fail)              Failed:  0  -- Passed: 91  -- Total: 91 
      🟩 arm64 (0% Fail)              Failed:  0  -- Passed:  8  -- Total:  8 
    🟩 ctk
      🟩 11.1 (0% Fail)               Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 11.8 (0% Fail)               Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 12.4 (0% Fail)               Failed:  0  -- Passed: 81  -- Total: 81 
    🟩 cudacxx_full
      🟩 clang-cuda16 (0% Fail)       Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc11.1 (0% Fail)           Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 nvcc11.8 (0% Fail)           Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 nvcc12.4 (0% Fail)           Failed:  0  -- Passed: 79  -- Total: 79 
    🟩 cudacxx_name
      🟩 clang-cuda (0% Fail)         Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc (0% Fail)               Failed:  0  -- Passed: 97  -- Total: 97 
    🟩 cxx_full
      🟩 clang9 (0% Fail)             Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 clang10 (0% Fail)            Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 clang11 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang12 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang13 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang14 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang15 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang16 (0% Fail)            Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 gcc6 (0% Fail)               Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 gcc7 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc8 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc9 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc10 (0% Fail)              Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 gcc11 (0% Fail)              Failed:  0  -- Passed:  7  -- Total:  7 
      🟩 gcc12 (0% Fail)              Failed:  0  -- Passed: 16  -- Total: 16 
      🟩 Intel2023.2.0 (0% Fail)      Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC14.16 (0% Fail)          Failed:  0  -- Passed:  1  -- Total:  1 
      🟩 MSVC14.29 (0% Fail)          Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 MSVC14.39 (0% Fail)          Failed:  0  -- Passed:  3  -- Total:  3 
    🟩 cxx_name
      🟩 clang (0% Fail)              Failed:  0  -- Passed: 43  -- Total: 43 
      🟩 gcc (0% Fail)                Failed:  0  -- Passed: 47  -- Total: 47 
      🟩 Intel (0% Fail)              Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 gpu
      🟩 v100 (0% Fail)               Failed:  0  -- Passed: 99  -- Total: 99 
    🟩 jobs
      🟩 build (0% Fail)              Failed:  0  -- Passed: 91  -- Total: 91 
      🟩 test (0% Fail)               Failed:  0  -- Passed:  8  -- Total:  8 
    🟩 os
      🟩 ubuntu18.04 (0% Fail)        Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 ubuntu20.04 (0% Fail)        Failed:  0  -- Passed: 35  -- Total: 35 
      🟩 ubuntu22.04 (0% Fail)        Failed:  0  -- Passed: 44  -- Total: 44 
      🟩 windows2022 (0% Fail)        Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 sm
      🟩 60;70;80;90 (0% Fail)        Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 90a (0% Fail)                Failed:  0  -- Passed:  4  -- Total:  4 
    🟩 std
      🟩 11 (0% Fail)                 Failed:  0  -- Passed: 26  -- Total: 26 
      🟩 14 (0% Fail)                 Failed:  0  -- Passed: 29  -- Total: 29 
      🟩 17 (0% Fail)                 Failed:  0  -- Passed: 28  -- Total: 28 
      🟩 20 (0% Fail)                 Failed:  0  -- Passed: 16  -- Total: 16 
    
  • 🟩 Project libcudacxx [ Failed: 0 | Passed: 104 | Total: 104 ]

    🟩 cpu
      🟩 amd64 (0% Fail)              Failed:  0  -- Passed: 96  -- Total: 96 
      🟩 arm64 (0% Fail)              Failed:  0  -- Passed:  8  -- Total:  8 
    🟩 ctk
      🟩 11.1 (0% Fail)               Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 11.8 (0% Fail)               Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 12.4 (0% Fail)               Failed:  0  -- Passed: 86  -- Total: 86 
    🟩 cudacxx_full
      🟩 clang-cuda16 (0% Fail)       Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc11.1 (0% Fail)           Failed:  0  -- Passed: 15  -- Total: 15 
      🟩 nvcc11.8 (0% Fail)           Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 nvcc12.4 (0% Fail)           Failed:  0  -- Passed: 84  -- Total: 84 
    🟩 cudacxx_name
      🟩 clang-cuda (0% Fail)         Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 nvcc (0% Fail)               Failed:  0  -- Passed: 102 -- Total: 102
    🟩 cxx_full
      🟩 clang9 (0% Fail)             Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 clang10 (0% Fail)            Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 clang11 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang12 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang13 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang14 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang15 (0% Fail)            Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 clang16 (0% Fail)            Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 gcc6 (0% Fail)               Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 gcc7 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc8 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc9 (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
      🟩 gcc10 (0% Fail)              Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 gcc11 (0% Fail)              Failed:  0  -- Passed:  7  -- Total:  7 
      🟩 gcc12 (0% Fail)              Failed:  0  -- Passed: 21  -- Total: 21 
      🟩 Intel2023.2.0 (0% Fail)      Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC14.16 (0% Fail)          Failed:  0  -- Passed:  1  -- Total:  1 
      🟩 MSVC14.29 (0% Fail)          Failed:  0  -- Passed:  2  -- Total:  2 
      🟩 MSVC14.39 (0% Fail)          Failed:  0  -- Passed:  3  -- Total:  3 
    🟩 cxx_name
      🟩 clang (0% Fail)              Failed:  0  -- Passed: 43  -- Total: 43 
      🟩 gcc (0% Fail)                Failed:  0  -- Passed: 52  -- Total: 52 
      🟩 Intel (0% Fail)              Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 MSVC (0% Fail)               Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 gpu
      🟩 v100 (0% Fail)               Failed:  0  -- Passed: 104 -- Total: 104
    🟩 jobs
      🟩 build (0% Fail)              Failed:  0  -- Passed: 91  -- Total: 91 
      🟩 nvrtc (0% Fail)              Failed:  0  -- Passed:  4  -- Total:  4 
      🟩 test (0% Fail)               Failed:  0  -- Passed:  8  -- Total:  8 
      🟩 verify_codegen (0% Fail)     Failed:  0  -- Passed:  1  -- Total:  1 
    🟩 os
      🟩 ubuntu18.04 (0% Fail)        Failed:  0  -- Passed: 14  -- Total: 14 
      🟩 ubuntu20.04 (0% Fail)        Failed:  0  -- Passed: 35  -- Total: 35 
      🟩 ubuntu22.04 (0% Fail)        Failed:  0  -- Passed: 49  -- Total: 49 
      🟩 windows2022 (0% Fail)        Failed:  0  -- Passed:  6  -- Total:  6 
    🟩 sm
      🟩 60;70;80;90 (0% Fail)        Failed:  0  -- Passed:  3  -- Total:  3 
      🟩 90a (0% Fail)                Failed:  0  -- Passed:  4  -- Total:  4 
    🟩 std
      🟩 11 (0% Fail)                 Failed:  0  -- Passed: 27  -- Total: 27 
      🟩 14 (0% Fail)                 Failed:  0  -- Passed: 30  -- Total: 30 
      🟩 17 (0% Fail)                 Failed:  0  -- Passed: 29  -- Total: 29 
      🟩 20 (0% Fail)                 Failed:  0  -- Passed: 17  -- Total: 17 
    

🏃‍ Runner counts (total jobs: 302)

# Runner
232 linux-amd64-cpu16
28 linux-amd64-gpu-v100-latest-1
24 linux-arm64-cpu16
18 windows-amd64-cpu16

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
CUB
Thrust

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[FEA]: Implement a memory_resource using cudaMallocAsync and cudaFreeAsync
3 participants