
GEMM much slower than GEMV for multiplying column or row vectors #1238

Open
peterbell10 opened this issue Mar 20, 2022 · 2 comments

@peterbell10

What is the expected behavior

GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
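
To illustrate the idea (this is only a sketch, not rocBLAS's actual dispatch logic): a small wrapper written in terms of the same CuPy calls used in the reproduction below, which lowers this particular GEMM shape onto the equivalent GEMV. The name gemm_or_gemv and the shape handling are illustrative assumptions.

import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

def gemm_or_gemv(transa, transb, a, b, out, alpha=1.0, beta=0.0):
    # Hypothetical dispatch: fall back to GEMV when the GEMM output has m == 1.
    # Only the (N, T) transpose combination from this report is handled here.
    m = a.shape[1] if transa == CUBLAS_OP_T else a.shape[0]
    if m == 1 and transa == CUBLAS_OP_N and transb == CUBLAS_OP_T:
        # out (1 x n) = alpha * a (1 x k) @ b.T (k x n) + beta * out
        # is the same computation as the GEMV
        #   out[0] = alpha * (b @ a[0]) + beta * out[0]
        cupy.cublas.gemv(CUBLAS_OP_N, alpha, b, a[0], beta, out[0])
    else:
        cupy.cublas.gemm(transa, transb, a, b, alpha=alpha, beta=beta, out=out)
    return out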

What actually happens

GEMM performs much worse than GEMV.

How to reproduce

For simplicity I'm using CuPy's wrappers, but these just call the underlying rocBLAS functions after the appropriate argument checks. I have a script gemm.py:

import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

# u is a 1 x k row vector, V is an n x k matrix.
u = cupy.random.random((1, 10000))
V = cupy.random.random((10, 10000))
out = cupy.empty((1, 10))

# GEMM with m == 1: out = u @ V.T
for _ in range(100):
    cupy.cublas.gemm(CUBLAS_OP_N, CUBLAS_OP_T, u, V, alpha=1, beta=0, out=out)

# Equivalent GEMV: out[0] = V @ u[0]
for _ in range(100):
    cupy.cublas.gemv(CUBLAS_OP_N, 1, V, u[0], 0, out[0])

and run the script under rocprof to get kernel timings

rocprof --stats python gemm.py

This shows two distinct kernels, each called 100 times. The gemv kernel takes around 12.5 us per call, but the gemm kernel takes about 1273 us to perform the same computation, roughly 100x slower.

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"Cijk_Ailk_Bljk_DB_MT64x32x8_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS0_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW1_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW2_SNLL0_TT4_4_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1.kd",100,127322978,1273229,98.11602450966042
"void gemvt_kernel<false, 256, double, double, double, double>(int, int, double, long, double const*, long, int, long, double const*, long, int, long, double, long, double*, long, int, long) [clone .kd]",100,1246876,12468,0.9608518281476839

Environment

Hardware description
  GPU: AMD Vega 20
  CPU: AMD EPYC 7742 64-Core Processor
Software version
  ROCK: Not sure?
  ROCR: v4.3.1
  HCC: v4.3.21331-94fc2572
  Library: v4.3.1
@daineAMD
Contributor

Hi @peterbell10,
Thanks for bringing this up. I am also seeing the slowdown in gemm compared to gemv using our rocblas-bench tool. I'll add this to my list and get back to you when I have some changes ready.

@daineAMD
Contributor

Just wanted to update this and let you know that we have changes in the works. The performance advantage of gemv over gemm with m == 1 || n == 1 isn't consistent across all architectures, matrix operations, and sizes, so this may take a little while longer to ensure the performance improvements hold all around as expected.

mlse-lib-jenkins pushed a commit that referenced this issue May 11, 2022
* Add numerical checking helper to Level 3 rocBLAS

* Added check to see if the input is const

* Enclosed the kernel function of TRSM with brackets to invoke the destructor and release the handle memory

* Addressed the comments