GELU Fusion with cuBLASLt (SLOWER because it only merges in FP16 mode, not BF16/FP32...) #338
It turns out that not only is cuBLASLt unable to fuse a BF16 GELU (or RELU) into a BF16 matmul, it also ends up with a strange kernel that is slower than our own standalone GELU kernel, because it does 2 writes per element (?!) rather than the expected 1.
As far as I can tell, this happens for both FP32 and BF16 in the GELU, RELU, GELU_AUX, and RELU_AUX epilogue modes. In theory cuBLASLt also supports GELU backwards, but I assume it would have similar problems, so I didn't try it.
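For reference, requesting one of these fused epilogues comes down to setting attributes on the cuBLASLt matmul descriptor before calling cublasLtMatmul. Below is a minimal sketch of that (not taken from this PR's diff; the helper names are hypothetical, the descriptor creation, matrix layouts, and the cublasLtMatmul call itself are assumed to happen elsewhere, and status checks are elided):

```c
#include <cublasLt.h>
#include <stdint.h>

// Sketch: ask cuBLASLt to apply GELU in the matmul epilogue.
// `desc` is assumed to be an already-created matmul descriptor
// (via cublasLtMatmulDescCreate) used by a later cublasLtMatmul call.
static void request_gelu_epilogue(cublasLtMatmulDesc_t desc) {
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
}

// Sketch: the GELU_AUX variant additionally writes the pre-GELU matmul
// output to an auxiliary buffer (`pre_gelu`, leading dimension `aux_ld`)
// so the backward pass can use it later.
static void request_gelu_aux_epilogue(cublasLtMatmulDesc_t desc,
                                      void* pre_gelu, int64_t aux_ld) {
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU_AUX;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                   &pre_gelu, sizeof(pre_gelu));
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                   &aux_ld, sizeof(aux_ld));
}
```

In FP16 this does produce a single fused kernel; in BF16/FP32 the epilogue is still accepted, but per the measurements above you end up with an extra kernel doing 2 writes per element instead of a true fusion.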
Obviously this shouldn't be merged, only posting this here for future reference...