Clang JIT CPU Backend #1239

Open
jeremylt opened this issue Jun 21, 2023 · 4 comments

Comments

@jeremylt
Member

Clang 16 now supports JIT. An interesting small project could be to create a /cpu/self/clang-jit backend that provides JITed tensor contraction kernels. If we see performance that is in the neighborhood of AVX or libXSMM, this could be a way to ship a faster CPU backend with fewer dependencies.

See Serac for reference:
https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/tests/jit/basic_jit.cpp
https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/include/JIT.hpp

(This repo comes from a member of Jamie Smith's team.)
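
For concreteness, here is a minimal sketch of one way such a backend could specialize a contraction at runtime: emit C source with the sizes baked in as literals, shell out to clang, and dlopen the result. This is a compile-and-load fallback rather than Clang's in-process JIT API, and the function names, kernel layout, and file paths are illustrative, not existing libCEED API:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <dlfcn.h>

using ContractFn = void (*)(const double *, const double *, double *);

// Generate, compile, and load a contraction kernel specialized to P, Q, num_comp
ContractFn build_contract_kernel(int P, int Q, int num_comp) {
  // Emit C source with the loop bounds baked in as integer literals
  std::string src =
    "void contract(const double *B, const double *u, double *v) {\n"
    "  for (int c = 0; c < " + std::to_string(num_comp) + "; c++)\n"
    "    for (int q = 0; q < " + std::to_string(Q) + "; q++) {\n"
    "      double sum = 0.0;\n"
    "      for (int p = 0; p < " + std::to_string(P) + "; p++)\n"
    "        sum += B[q*" + std::to_string(P) + " + p] * u[c*" + std::to_string(P) + " + p];\n"
    "      v[c*" + std::to_string(Q) + " + q] = sum;\n"
    "    }\n"
    "}\n";

  // Write the source and compile it to a shared object with clang
  std::FILE *f = std::fopen("/tmp/ceed_jit_kernel.c", "w");
  if (!f) return nullptr;
  std::fputs(src.c_str(), f);
  std::fclose(f);
  if (std::system("clang -O3 -march=native -shared -fPIC "
                  "/tmp/ceed_jit_kernel.c -o /tmp/ceed_jit_kernel.so")) return nullptr;

  // Load the specialized kernel and return a callable function pointer
  void *handle = dlopen("/tmp/ceed_jit_kernel.so", RTLD_NOW);
  if (!handle) return nullptr;
  return reinterpret_cast<ContractFn>(dlsym(handle, "contract"));
}
```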

@jedbrown
Member

Certainly interesting, but note that we only have a limited number of size combinations in the tensor contractions, so this is really a solution for the case where we find that compile-time constant sizes are a huge benefit and that we can't pare the combinatorial space down enough to do ahead-of-time specialization.

A different use might be to use JIT to build single-precision versions of select kernels.
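
A minimal sketch of what that could look like at the source level, assuming the kernel is parameterized on the scalar type so a JIT (or plain template instantiation) can stamp out a float variant; names and sizes are illustrative, not current libCEED API:

```cpp
// Contraction kernel parameterized on the scalar type and compile-time sizes
template <typename Scalar, int P, int Q>
void contract(const Scalar *B, const Scalar *u, Scalar *v) {
  for (int q = 0; q < Q; q++) {
    Scalar sum = 0;
    for (int p = 0; p < P; p++) sum += B[q*P + p] * u[p];
    v[q] = sum;
  }
}

// The same kernel instantiated in double and single precision
auto *contract_f64 = &contract<double, 8, 10>;
auto *contract_f32 = &contract<float, 8, 10>;
```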

@jeremylt
Member Author

Right, I'd expect that if we enumerated a bunch of kernels ahead of time across combinations of p, q, num_comp, and blocked/serial, we'd see the same performance, but that approach is intractable.

Regarding performance, my gut says such a backend would land between the AVX and LIBXSMM backends, but without requiring a user to build LIBXSMM, so we might get a little better performance in our upcoming Ratel + Enzyme container.

I agree that single-precision kernels would be an interesting avenue to explore too, since that would make it easier to get mixed-precision capabilities.

@jedbrown
Member

It's a low-effort test to see if specializing one particular size has much benefit: just drop in some integer literals and run a benchmark with matching sizes. If it's a lot faster, we can check whether specializing all the values is important or, say, just one matters. If it's about the same, we don't need to pursue the idea (at least until we learn more).
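
A hedged sketch of that test, assuming a simple 1D contraction and made-up sizes (P = 8, Q = 10); the real libCEED kernels are more involved, and a serious measurement would need a proper harness that keeps the compiler from folding the repeated calls:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Contraction with runtime loop bounds
static void contract_runtime(int P, int Q, const double *B, const double *u, double *v) {
  for (int q = 0; q < Q; q++) {
    double sum = 0.0;
    for (int p = 0; p < P; p++) sum += B[q*P + p] * u[p];
    v[q] = sum;
  }
}

// Same kernel with P = 8, Q = 10 dropped in as integer literals
static void contract_literal(const double *B, const double *u, double *v) {
  for (int q = 0; q < 10; q++) {
    double sum = 0.0;
    for (int p = 0; p < 8; p++) sum += B[q*8 + p] * u[p];
    v[q] = sum;
  }
}

int main() {
  const int P = 8, Q = 10, reps = 1000000;
  std::vector<double> B(Q*P, 1.0), u(P, 1.0), v(Q, 0.0);

  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; r++) contract_runtime(P, Q, B.data(), u.data(), v.data());
  auto t1 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; r++) contract_literal(B.data(), u.data(), v.data());
  auto t2 = std::chrono::steady_clock::now();

  std::printf("runtime sizes: %.3f ms, literal sizes: %.3f ms (checksum %f)\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count(),
              std::chrono::duration<double, std::milli>(t2 - t1).count(), v[0]);
  return 0;
}
```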

@jeremylt
Member Author

That's a good point. It's an easy test to run if someone finds time. I don't see this as a particular priority; half of why I created this issue was so we don't lose track of this option.
