
CUDA/HIP Backend Refactor #839

Open
jeremylt opened this issue Nov 8, 2021 · 5 comments

@jeremylt
Member

jeremylt commented Nov 8, 2021

As a result of how they were designed, there is substantial code duplication in the CUDA backends, and because the HIP backends were derived from the CUDA backends, they inherited the same duplication.

We should make a PR, or series of PRs, deliberately designed to refactor these backends and reduce the code duplication across them.

  • What code could/should be combined?

  • Where and how do we need to allow for differences between the backends and platforms?

  • Where do we want to add tests to prevent regressions from aggressive or incorrect amalgamation of CUDA and HIP code?

  • What needs to be done to allow the code generation backends (gpu/cuda/gen and gpu/hip/gen) to share kernels from the other backends?

@tcew, @jedbrown, and anyone else I'm missing who is interested, please feel free to jump into this issue or the discussion with thoughts I'm overlooking.

@YohannDudouit
Contributor

YohannDudouit commented Nov 8, 2021

I would unify hip/kernels/* and cuda/kernels/* into gpu_common_kernels/* (or something similar); currently the code is duplicated. In the future the code could diverge, with HIP and CUDA having different implementations, but I think the design would be better if we abstracted away the fact that we target HIP or CUDA architectures. This would allow testing new implementations in a more modular way: we can have different implementations of the same algorithms living under gpu_common_kernels/*, and which one we use is chosen when loading the source. The implementations loaded can differ between HIP and CUDA, the purpose being to try different implementations for different scenarios. We could imagine loading different implementations based not only on HIP versus CUDA, but also on the number of quadrature points and degrees of freedom. We already know that no single approach performs best in all cases and that different implementations work better for different cases; this design could handle that as well.
Also, there are implementations that result in less register pressure. In certain applications, register pressure can become an issue when the QFunction gets big; this design would also allow changing the parallelization strategy to accommodate that kind of issue.
This design would potentially allow fusing the magma backend kernels too.
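
As a rough illustration of the load-time selection mechanism described above, here is a minimal sketch; GpuArch, InterpKernelSource, and the file paths are all hypothetical names for illustration, not existing libCEED API:

```cuda
// Hypothetical sketch: choose a kernel implementation when loading the
// source, based on the target architecture and the problem size.
typedef enum { GPU_ARCH_CUDA, GPU_ARCH_HIP } GpuArch;

static const char *InterpKernelSource(GpuArch arch, int num_qpts, int num_dofs) {
  (void)arch;  // HIP and CUDA could return different variants here if benchmarks justify it
  // Small elements: a one-thread-per-element variant keeps occupancy high.
  if (num_qpts * num_dofs <= 64)
    return "gpu_common_kernels/interp-thread-per-elem.h";
  // Larger elements: a one-thread-per-node variant reduces register pressure.
  return "gpu_common_kernels/interp-thread-per-node.h";
}
```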

On the topic of implementations that would be specific to HIP or CUDA, I am not aware of any.
Generalizing code to target either HIP or CUDA is relatively trivial; the architecture-specific keywords can easily be abstracted behind macros (CEED_DEVICE, CEED_HOST, CEED_HOST_DEVICE, CEED_MEM_SHARED, etc.).
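
For instance, a minimal sketch of such a macro header (the guard name and layout are assumptions, not libCEED's actual headers); since both nvcc and hipcc accept the same __device__/__host__/__shared__ keywords, one branch covers both:

```cuda
#ifndef CEED_GPU_DEFS_H
#define CEED_GPU_DEFS_H

// Both the CUDA and HIP device compilers define these keywords identically,
// so a single branch handles both device compilation paths.
#if defined(__CUDACC__) || defined(__HIPCC__)
#  define CEED_DEVICE      __device__
#  define CEED_HOST        __host__
#  define CEED_HOST_DEVICE __host__ __device__
#  define CEED_MEM_SHARED  __shared__
#else
// Plain host compilers see empty qualifiers so shared headers still parse.
#  define CEED_DEVICE
#  define CEED_HOST
#  define CEED_HOST_DEVICE
#  define CEED_MEM_SHARED
#endif

#endif  // CEED_GPU_DEFS_H
```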

@jeremylt
Member Author

jeremylt commented Nov 8, 2021

I think a smaller first step could be refactoring the code generation backends to share the kernels that other backends use. Currently there are some minor differences, but I don't know why those differences were added.

@YohannDudouit
Contributor

YohannDudouit commented Nov 8, 2021

This is a good point: if we gather the code, then we have to document the reasons for the differences in the same place.

My proposal above is not a "first step" but a goal. I guess the different tasks would be:

  • Gather common code between hip/cuda
  • Generalize code that differs between hip/cuda to be architecture agnostic
  • Generalize the gen interface to make sense with simplices (is it just the same as 1D tensor?)
  • Unify the interface of the ref and shared implementations with the gen interface; this would allow generating gen kernels from ref and shared functions, as well as from any other implementation idea that might come (see the sketch after this list)
  • Add a mechanism to pick and generate a specific implementation.
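
To make the interface-unification item concrete, here is a hedged sketch of what a shared, templated kernel might look like; the function name, template parameters, and the CeedScalar stand-in typedef are assumptions, not libCEED's actual interface:

```cuda
// Hypothetical unified 1D interpolation kernel: *-ref and *-shared could
// instantiate it directly, while *-gen could inline the same body into a
// fused operator kernel. Template parameters carry the compile-time sizes
// that the gen backend already bakes in.
typedef double CeedScalar;  // stand-in for libCEED's scalar type

template <int NUM_COMP, int P_1D, int Q_1D>
inline __device__ void Interp1D(const CeedScalar *__restrict__ u,
                                const CeedScalar *__restrict__ B,
                                CeedScalar *__restrict__ v) {
  // Contract the P_1D basis nodes down to Q_1D quadrature points per component.
  for (int comp = 0; comp < NUM_COMP; comp++) {
    for (int q = 0; q < Q_1D; q++) {
      CeedScalar sum = 0.0;
      for (int p = 0; p < P_1D; p++) sum += B[q * P_1D + p] * u[comp * P_1D + p];
      v[comp * Q_1D + q] = sum;
    }
  }
}
```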

@jeremylt
Member Author

jeremylt commented Nov 17, 2021

For the long-term health of these backends, I think we should do a cleanup and refactor in the near term. Combining kernels across the CUDA and HIP backends should come after this near-term refactor. I don't know enough about performance studies between CUDA and HIP to attempt combining pieces of these two backend 'families' myself, but I do know enough to refactor the backend design into something cleaner.

Proposed near term refactor roadmap:

PR 2

  • Tidy the mechanism by which *-shared and *-gen reach into *-ref for JiT, Ceed backend data, etc.

PR 3

  • Pull kernel source strings into header files (see the loading sketch after this roadmap)
  • Refactor *-ref and *-shared kernels to use templates for compatibility with *-gen

PR 4

  • Refactor *-gen to break Ceed*GenOperatorBuild into smaller pieces
  • Add simplex support to *-gen
  • Add collocated gradient support to *-gen
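
As a rough illustration of the first PR 3 item, a minimal sketch of loading a shared kernel header from disk for JiT compilation; LoadKernelSource is a hypothetical helper, not libCEED's actual API:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: read a shared kernel header into a buffer so that
// *-ref, *-shared, and *-gen can all hand the same source to NVRTC/hipRTC,
// instead of each backend duplicating the kernel as a string literal.
static char *LoadKernelSource(const char *path) {
  FILE *file = fopen(path, "rb");
  if (!file) return NULL;
  fseek(file, 0, SEEK_END);
  long size = ftell(file);
  fseek(file, 0, SEEK_SET);
  char *source = (char *)malloc((size_t)size + 1);
  if (!source || fread(source, 1, (size_t)size, file) != (size_t)size) {
    free(source);
    fclose(file);
    return NULL;
  }
  source[size] = '\0';  // NUL-terminate for the JiT compiler
  fclose(file);
  return source;        // caller frees after compilation
}
```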

@jeremylt
Member Author

I stalled out and focused on some Ratel work before wrapping up the final stage of this issue. @jedbrown I think this last stage of the GPU /gen backend refactor would let us most easily incorporate the new basis (including particles) work into these backends. Depending upon prioritization, I think this would be a good thing to try to make time for in the spring.
