GELU Fusion with cuBLASLt (SLOWER because it only merges in FP16 mode, not BF16/FP32...) #338

Draft · wants to merge 1 commit into master
Conversation

@ademeure (Contributor) commented on May 3, 2024

It turns out that not only is cuBLASLt unable to fuse a BF16 GELU (or RELU) epilogue into a BF16 matmul, it also falls back to a strange kernel that is slower than our own GELU kernel, because it performs 2 writes per element (?!) rather than the expected 1.

As far as I can tell, this happens for both FP32 and BF16 in the GELU, RELU, GELU_AUX, and RELU_AUX epilogue modes. In theory cuBLASLt also supports the GELU backward pass, but I assume it would have similar problems, so I didn't try it.
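For reference, a minimal sketch (not the exact code in this PR) of how a fused GELU epilogue is requested through the cuBLASLt API; the buffer and leading-dimension names (`pre_gelu_buffer`, `aux_ld`) are placeholders, and error checking is omitted:

```c
// Sketch: asking cuBLASLt to apply GELU in the matmul epilogue.
#include <cublasLt.h>
#include <stdint.h>

void set_gelu_epilogue(cublasLtMatmulDesc_t op_desc,
                       void*   pre_gelu_buffer, // where cuBLASLt stores pre-GELU values
                       int64_t aux_ld)          // leading dimension of that buffer
{
    // GELU_AUX applies GELU after the matmul and also writes out the
    // pre-activation values so the backward pass can use them later.
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_GELU_AUX;
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                   &pre_gelu_buffer, sizeof(pre_gelu_buffer));
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                   &aux_ld, sizeof(aux_ld));
}
```

Whether the epilogue actually gets fused into the matmul kernel is up to the heuristics/algorithms cuBLASLt selects; the observation in this PR is that for BF16 and FP32 it instead runs an extra (and slower) kernel.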

Obviously this shouldn't be merged; I'm only posting it here for future reference.

… properly for FP32/BF16, and it's actually non-fused and SLOWER!
ademeure marked this pull request as draft on May 3, 2024, 01:14