
GEMM much slower than GEMV for multiplying column or row vectors #1238

Open
peterbell10 opened this issue Mar 20, 2022 · 2 comments

@peterbell10

What is the expected behavior

GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
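
To illustrate the idea (this is only a sketch, not rocBLAS's actual dispatch logic): a small wrapper written in terms of the same CuPy calls used in the reproduction below, which lowers this particular GEMM shape onto the equivalent GEMV. The name gemm_or_gemv and the shape handling are illustrative assumptions.

import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

def gemm_or_gemv(transa, transb, a, b, out, alpha=1.0, beta=0.0):
    # Hypothetical dispatch: fall back to GEMV when the GEMM output has m == 1.
    # Only the (N, T) transpose combination from this report is handled here.
    m = a.shape[1] if transa == CUBLAS_OP_T else a.shape[0]
    if m == 1 and transa == CUBLAS_OP_N and transb == CUBLAS_OP_T:
        # out (1 x n) = alpha * a (1 x k) @ b.T (k x n) + beta * out
        # is the same computation as the GEMV
        #   out[0] = alpha * (b @ a[0]) + beta * out[0]
        cupy.cublas.gemv(CUBLAS_OP_N, alpha, b, a[0], beta, out[0])
    else:
        cupy.cublas.gemm(transa, transb, a, b, alpha=alpha, beta=beta, out=out)
    return out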

What actually happens

GEMM performs much worse than GEMV.

How to reproduce

For simplicity I'm using CuPy's wrappers, but these just call the underlying rocBLAS functions after the appropriate argument checks. I have a script gemm.py:

import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

# u is a 1 x k row vector, V is an n x k matrix.
u = cupy.random.random((1, 10000))
V = cupy.random.random((10, 10000))
out = cupy.empty((1, 10))

# GEMM with m == 1: out = u @ V.T
for _ in range(100):
    cupy.cublas.gemm(CUBLAS_OP_N, CUBLAS_OP_T, u, V, alpha=1, beta=0, out=out)

# Equivalent GEMV: out[0] = V @ u[0]
for _ in range(100):
    cupy.cublas.gemv(CUBLAS_OP_N, 1, V, u[0], 0, out[0])

and run the script under rocprof to get kernel timings

rocprof --stats python gemm.py

This shows two distinct kernels, each called 100 times. The gemv kernel takes around 12.5 us per call, but the gemm kernel takes about 1273 us to perform the same computation, roughly 100x slower.

"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"Cijk_Ailk_Bljk_DB_MT64x32x8_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS0_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW1_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW2_SNLL0_TT4_4_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1.kd",100,127322978,1273229,98.11602450966042
"void gemvt_kernel<false, 256, double, double, double, double>(int, int, double, long, double const*, long, int, long, double const*, long, int, long, double, long, double*, long, int, long) [clone .kd]",100,1246876,12468,0.9608518281476839

Environment

Hardware description
  GPU: AMD Vega 20
  CPU: AMD EPYC 7742 64-Core Processor
Software version
  ROCK: Not sure?
  ROCR: v4.3.1
  HCC: v4.3.21331-94fc2572
  Library: v4.3.1
@daineAMD
Contributor

Hi @peterbell10,
Thanks for bringing this up. I am also seeing the slowdown in gemm compared to gemv using our rocblas-bench tool. I'll add this to my list and get back to you when I have some changes ready.

@daineAMD
Contributor

Just wanted to update this and let you know that we have changes in the works. The performance advantage of gemv over gemm with m == 1 || n == 1 isn't consistent across all architectures, matrix operations, and sizes, so this may take a little while longer to ensure the performance improvements hold all around as expected.

mlse-lib-jenkins pushed a commit that referenced this issue May 11, 2022
* Add numerical checking helper to Level 3 rocBLAS

* Added check to see if the input is const

* Enclosed the kernel function of TRSM with brackets to invoke the destructor and release the handle memory

* Addressed the comments