What is the expected behavior
GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
What actually happens
GEMM performs much worse than GEMV.
How to reproduce
For simplicity I'm using CuPy's wrappers, but these just call the underlying rocBLAS functions after appropriate argument checks. I have a script gemm.py:
import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

u = cupy.random.random((1, 10000))
V = cupy.random.random((10, 10000))
out = cupy.empty((1, 10))

# GEMM path: out = u @ V.T, i.e. (1 x 10000) @ (10000 x 10) -> (1 x 10)
for _ in range(100):
    cupy.cublas.gemm(CUBLAS_OP_N, CUBLAS_OP_T, u, V, alpha=1, beta=0, out=out)

# GEMV path: the same computation, out[0] = V @ u[0]
for _ in range(100):
    cupy.cublas.gemv(CUBLAS_OP_N, 1, V, u[0], 0, out[0])
and run the script under rocprof to collect kernel timings:
rocprof --stats python gemm.py
This shows two distinct kernels, each called 100 times. The gemv kernel takes around 12.5 us per call, but the gemm kernel takes around 1273 us (roughly 100x slower) to perform the same computation.
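The equivalence the report relies on is easy to verify: a GEMM whose output has a single row is, term for term, the same computation as a GEMV. A minimal sketch, using NumPy as a stand-in for the BLAS calls (same shapes as the script above):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random((1, 10000))   # 1 x k row vector
V = rng.random((10, 10000))  # n x k matrix

# GEMM form: out = u @ V.T, shape (1, 10)
out_gemm = u @ V.T

# GEMV form: y = V @ x with x = u[0], shape (10,)
out_gemv = V @ u[0]

# Identical results up to floating-point reduction order
assert np.allclose(out_gemm[0], out_gemv)
```

Since the two produce the same values, the ~100x gap can only come from kernel selection, not from the arithmetic itself.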
Hi @peterbell10,
Thanks for bringing this up. I am also seeing the slowdown in gemm compared to gemv using our rocblas-bench tool. I'll add this to my list and get back to you when I have some changes ready.
Just wanted to update this and let you know that we have changes in the works. The performance advantage of gemv over gemm with m == 1 || n == 1 isn't necessarily uniform across all architectures, matrix operations, and sizes, so this might take a little while longer to ensure performance improvements all around as expected.
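The dispatch the report suggests (and that cuBLAS reportedly performs) can be sketched as a thin routing layer in front of the GEMM kernel. This is a hypothetical illustration in NumPy, not rocBLAS's actual internals; the function name and structure are made up for the example:

```python
import numpy as np

def gemm_with_gemv_shortcut(A, B):
    """Hypothetical dispatch sketch: route degenerate GEMMs
    (m == 1 or n == 1) to the cheaper GEMV computation."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    if m == 1:
        # (1 x k) @ (k x n) is a GEMV on the transpose: B.T @ A[0]
        return (B.T @ A[0]).reshape(1, n)
    if n == 1:
        # (m x k) @ (k x 1) is a plain GEMV: A @ B[:, 0]
        return (A @ B[:, 0]).reshape(m, 1)
    return A @ B  # general GEMM path
```

As the comment above notes, the check is shape-only and costs nothing on the general path, which is why the issue frames it as a dispatch decision rather than a new kernel.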
* Add numerical checking helper to Level 3 rocBLAS
* Add a check to see if the input is const
* Enclose the TRSM kernel function in braces to invoke the destructor and release the handle memory
* Address review comments