
GEMM benchmark with CLBlast comparison #9

mratsim opened this issue Jul 13, 2019 · 3 comments

mratsim commented Jul 13, 2019

Hello team,

I read with interest the two papers you published on Futhark and the DNN implementation.

I'm currently researching autotuners for deep learning, and Futhark stands out for its introduction of Second-Order Array Combinators such as scanomap and redomap.
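
For instance (a minimal sketch of my own, not code from the papers), a map feeding a reduce is, as I understand it, fused by the compiler into a single redomap, so the intermediate array never materialises:

```futhark
-- Dot product as a map2 feeding a reduce; the compiler fuses the two
-- into one redomap, so the array of products is never materialised.
let dotprod [n] (xs: [n]f32) (ys: [n]f32): f32 =
  reduce (+) 0f32 (map2 (*) xs ys)
```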

I'd like to understand how Futhark fares against state-of-the-art matrix multiplication, which should be able to reach 90%+ of hardware utilisation. GEMM stresses compilers, autotuners and handwritten kernels alike, since it needs cache-aware (or cache-oblivious) algorithms, tiling, and careful management of register pressure.
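
For concreteness, the textbook formulation I would benchmark looks like this in Futhark (a sketch assuming f32 matrices; matmul and the parameter names are my own):

```futhark
-- Naive GEMM as nested SOACs: every output element is a dot product.
-- All tiling decisions are left to the compiler; nothing is hand-tuned.
let matmul [n][m][p] (a: [n][m]f32) (b: [m][p]f32): [n][p]f32 =
  map (\a_row -> map (\b_col -> reduce (+) 0f32 (map2 (*) a_row b_col))
                     (transpose b))
      a
```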

However, I cannot find a benchmark against a reference library like CLBlast.

athas (Member) commented Jul 13, 2019

Futhark does not fare particularly well against CLBlast (at least I'd not expect it to). This is based on only a little data: for the PPoPP'19 experiments we implemented matrix multiplication in Futhark and with cuBLAS. As I recall, cuBLAS was at least two times faster in its comfort zone.

We pretty much know why this is: Futhark only does block tiling in local memory, while cuBLAS (probably) also does register tiling. We verified this by manually applying register tiling to Futhark's generated code, which brought performance fairly close to cuBLAS. We roughly know how to implement register tiling, but it's a bunch of work, so we haven't gotten around to it yet. It's unlikely (and not really intended) that Futhark will ever outperform expertly hand-tuned code on a primitive-for-primitive level.
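
To illustrate the difference (my sketch below, not the compiler's actual transformation, which operates on the generated code): with register tiling, each thread computes a small tile of C, say 4×4, and keeps the accumulator in registers, so every loaded element of A and B is reused four times. Expressed at the source level, the per-thread work would look roughly like:

```futhark
-- Hypothetical per-thread work under 4x4 register tiling: accumulate
-- a 4x4 tile of C over the shared k dimension, reusing each a[k] and
-- b[k] four times from registers rather than reloading them.
let tile_4x4 [m] (a_rows: [4][m]f32) (b_cols: [4][m]f32): [4][4]f32 =
  loop acc = replicate 4 (replicate 4 0f32) for k < m do
    map2 (\a acc_row -> map2 (\b c -> c + a[k] * b[k]) b_cols acc_row)
         a_rows acc
```

Block tiling in local memory, by contrast, only improves reuse across the threads of a workgroup; the per-thread arithmetic intensity stays roughly the same.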

mratsim (Author) commented Jul 15, 2019

You can beat hand-tuned code, or at least reach 90% of its speed, with a high-level library:

  • Nvidia CUTLASS shows the high-level primitives necessary to reach handwritten cuBLAS performance.
  • Halide lets you write the algorithm in a high-level einsum-like DSL, with loop-schedule tuning kept separate.
  • Tiramisu similarly allows schedule tuning, but is polyhedral-based.

Even on CPU, I personally managed to reach BLAS speed in matrix multiplication without any assembly.

Obviously that approach does not compose well, and reproducing it in OpenCL or CUDA would take the same effort again, so I'm looking into alternative representations that abstract over loops; hence my inquiries into Futhark.

athas (Member) commented Jul 15, 2019

Sure, but all of those still involve machine-specific hand-tuning/scheduling (or search procedures). We don't have any current plans for Futhark to directly support that level of manual optimisation, but maybe if we can figure out a way that does not complicate the language as a whole... The Halide/Tiramisu approaches are definitely inspiring.

Currently, our main performance focus is on optimising composition and ensuring good application-level performance. Eventually it'd be fun to also directly compete at the level of individual primitives, but alas, time is finite!
