[Bug]: rocblas_gemm_ex with m==1 fp16 inputs/outputs f32 compute slower than a quite naive gemv kernel on MI100 #1425
Comments
Yeah, this has been an issue for a while: #1238
I updated the kernel from my reproducer; it now saturates memory bandwidth (unlike rocBLAS).
I see that @daineAMD replied to the other issue, so mentioning it here as well, in case that helps in any way.
It's also pretty silly, since just using the gemv kernels in these cases should be trivial. The suboptimality is of course also easily shown with rocBLAS's own tool:

```
rocblas-bench -f gemm_ex -m 1 -n 16192 -k 16192 --transposeA N --transposeB N -r s --compute_type s -i 50
rocblas-bench -f gemv -r s -m 16192 -n 16192 --lda 16192 -i 50
```
A rocblas_hgemv would also be great, since there is an opportunity here to use dual-issue.
Hi @Epliz, thanks for bringing this up. Yes, the disparity between gemm with m == 1/n == 1 and gemv has been brought up in the past, as noted by @IMbackK. Back when it was originally raised, it wasn't clear whether the best approach would be to redirect the gemm call to gemv (which has source kernels in rocBLAS) or to continue to gemm (which is handled within the Tensile library), since performance was somewhat of a mixed bag, and handling this on a case-by-case basis seemed infeasible. Regardless, it's good that this has been brought up again, and I'll discuss with the team what the best approach is. If we can get gemv to outperform gemm in every case, then the changes to redirect to gemv would be straightforward, but most of the work would lie in ensuring that gemv is faster. I'll keep you updated with any progress here. The request for rocblas_hgemv is noted as well.

Thanks,
Hi @daineAMD,

Thank you for the detailed comment on this matter and for:

Out of curiosity:
Hi @daineAMD,

Following up after a week. If not, can you please proceed with making gemm call gemv for those cases? If the rocBLAS team cannot tackle this task, would a pull request from my side potentially be merged? I can sign whatever contribution agreement you might need.
Hi @Epliz and @IMbackK, sorry for the delay.

Looking at my past notes, it looks like the areas of most concern were those where the incx parameter is large (with various exceptions), specifically gemm cases where (transA == transB == T && ldb >> 1) and (transA == transB == N && lda >> 1). Other cases where I'm seeing gemm perform better than gemv are small sizes, e.g.:

I have a ticket to investigate further to see if we can call gemv for the cases where it outperforms gemm and/or see what optimizations can be done for the current gemv to make this easier; I'll be looking at this in the coming weeks. You are free to take a look yourself and open a PR; you can take a look at the contributing guide if you're interested, but merging the PR will still take some time, as most of the work lies in ensuring no performance regressions.

Thanks again,
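To make the large-incx concern concrete: for an m == 1, transA == transB == N gemm, the single row of a column-major A(1 x k) is stored with stride lda, so the equivalent transposed gemv reads its vector x with incx == lda. The following is just a CPU reference sketch of that mapping (plain column-major loops, not rocBLAS internals):

```cpp
#include <cstddef>
#include <vector>

// Column-major m==1 gemm: C(1 x n) = A(1 x k) * B(k x n).
// Element A(0, p) lives at A[p * lda], i.e. the row of A is a strided vector.
std::vector<float> gemm_m1(const std::vector<float>& A, std::size_t lda,
                           const std::vector<float>& B, std::size_t ldb,
                           std::size_t n, std::size_t k) {
    std::vector<float> C(n, 0.0f);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t p = 0; p < k; ++p)
            C[j] += A[p * lda] * B[p + j * ldb];
    return C;
}

// Equivalent gemv: y = B^T * x, with x read at stride incx.
std::vector<float> gemv_t(const std::vector<float>& B, std::size_t ldb,
                          std::size_t k, std::size_t n,
                          const std::vector<float>& x, std::size_t incx) {
    std::vector<float> y(n, 0.0f);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t p = 0; p < k; ++p)
            y[j] += B[p + j * ldb] * x[p * incx];
    return y;
}
```

Calling `gemv_t` with `incx = lda` reproduces the gemm result exactly; the strided read of x when lda is large is presumably what makes some of these redirected cases slow on the GPU.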
Describe the bug
As described in the title, rocblas_gemm_ex seems quite suboptimal on MI100 when m == 1, the inputs/outputs are fp16, and the compute type is fp32.
A quite naive kernel I implemented beats it.
Causes ROCm/pytorch#1408 in PyTorch.
It makes LLM inference on Mistral 7B fp16 slower than it could easily be.
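For illustration only (this is a CPU sketch, not the attached reproducer): a naive gemv of the kind described computes each output element as an independent fp32-accumulated dot product, which maps naturally onto the GPU as one workgroup per output element. The fp16 loads are shown as plain floats here, since this runs on the host:

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of a naive m==1 gemv: out[j] = sum_p x[p] * B[p + j*ldb].
// On a GPU, the outer loop over j would map to workgroups and the inner
// reduction over p would be split across the threads of each workgroup;
// inputs would be fp16 and the accumulator fp32 (shown as float here).
std::vector<float> naive_gemv(const std::vector<float>& x,
                              const std::vector<float>& B,
                              std::size_t k, std::size_t n, std::size_t ldb) {
    std::vector<float> out(n);
    for (std::size_t j = 0; j < n; ++j) {    // one workgroup per output
        float acc = 0.0f;                    // fp32 accumulator
        for (std::size_t p = 0; p < k; ++p)  // threads share this reduction
            acc += x[p] * B[p + j * ldb];
        out[j] = acc;
    }
    return out;
}
```

Because every B element is touched exactly once, even this simple decomposition can approach memory-bandwidth-bound performance when the per-workgroup reduction is done well.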
To Reproduce
Here is a C++ reproducer:
Expected behavior
It should be at least as fast as my naive kernel.
But running the above, I get:
Environment
environment.txt
Additional context
EDIT: replaced the originally included kernel with a better one.