GEMM_AVX2

Introduction

Fast AVX2/FMA3 SGEMM and DGEMM subroutines for large matrices, written in C and assembly. After tuning, they can outperform Intel MKL (2019 Update 4), reaching more than 95% of theoretical peak performance in serial runs and more than 90% in parallel runs.

Interface in C

The following OpenMP-parallelized routines are provided:

In DGEMM.so:
void dgemm_(char *transa, char *transb, int *m, int *n, int *k, double *alpha, double *a, int *lda, double *b, int *ldb, double *beta, double *c, int *ldc);

In SGEMM.so:
void sgemm_(char *transa, char *transb, int *m, int *n, int *k, float *alpha, float *a, int *lda, float *b, int *ldb, float *beta, float *c, int *ldc);
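The signatures match the Fortran-style BLAS convention, in which every argument, including scalars, is passed by pointer. The sketch below illustrates that calling pattern. It assumes standard BLAS semantics (column-major storage and C = alpha * op(A) * op(B) + beta * C), which the README does not state explicitly. `ref_dgemm_` and `multiply_2x2` are hypothetical names: `ref_dgemm_` is a naive stand-in with the same argument list so the example compiles without DGEMM.so, and it is not the library's kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the same argument list as dgemm_.
 * Assumes standard BLAS semantics: column-major storage and
 * C = alpha * op(A) * op(B) + beta * C, where op(X) is X for 'N'/'n'
 * and the transpose of X otherwise. Naive triple loop, not the
 * library's optimized kernel. */
void ref_dgemm_(char *transa, char *transb, int *m, int *n, int *k,
                double *alpha, double *a, int *lda, double *b, int *ldb,
                double *beta, double *c, int *ldc)
{
    assert(*m >= 0 && *n >= 0 && *k >= 0);
    int ta = !(*transa == 'N' || *transa == 'n');
    int tb = !(*transb == 'N' || *transb == 'n');

    for (int j = 0; j < *n; ++j) {
        for (int i = 0; i < *m; ++i) {
            double acc = 0.0;
            for (int p = 0; p < *k; ++p) {
                /* Element (i, p) of op(A) and (p, j) of op(B). */
                double aip = ta ? a[(size_t)i * *lda + p]
                                : a[(size_t)p * *lda + i];
                double bpj = tb ? b[(size_t)p * *ldb + j]
                                : b[(size_t)j * *ldb + p];
                acc += aip * bpj;
            }
            double *cij = &c[(size_t)j * *ldc + i];
            /* With beta == 0, C is overwritten without being read. */
            *cij = *alpha * acc + (*beta == 0.0 ? 0.0 : *beta * *cij);
        }
    }
}

/* Computes C = A * B for 2x2 column-major matrices through the
 * pass-by-pointer interface. */
void multiply_2x2(double *a, double *b, double *c)
{
    char no_trans = 'N';
    int n = 2;                 /* m = n = k = lda = ldb = ldc = 2 */
    double one = 1.0, zero = 0.0;
    ref_dgemm_(&no_trans, &no_trans, &n, &n, &n,
               &one, a, &n, b, &n, &zero, c, &n);
}
```

When linking against DGEMM.so, the call inside `multiply_2x2` would target `dgemm_` with the same arguments. Note that the matrix [1 2; 3 4] is stored column by column as {1, 3, 2, 4}.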

How to tune

To tune the routines, edit "dgemm_tune.h" (for DGEMM) and "sgemm_tune.h" (for SGEMM), then rebuild the libraries. Benchmarking tools can be downloaded from my repository "GEMM_AVX2_FMA3".
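When comparing tuning settings, it can help to express a measured run time as a fraction of theoretical peak. The sketch below shows only the arithmetic and is not taken from the benchmarking tools. All function names are hypothetical, and the flops-per-cycle figure is an assumption about the CPU: 16 for double precision (two 256-bit FMA units x 4 lanes x 2 flops) and 32 for single precision on cores with two FMA units. The clock frequency actually sustained under AVX2 load may differ from the nominal value passed in.

```c
#include <assert.h>

/* Floating-point operations in one GEMM call: one multiply and one
 * add per (i, j, p) triple, i.e. 2*m*n*k. */
double gemm_flop_count(int m, int n, int k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Achieved rate in GFLOPS for a call that took `seconds` of wall time. */
double gemm_gflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    return gemm_flop_count(m, n, k) / seconds / 1e9;
}

/* Theoretical peak in GFLOPS. flops_per_cycle is an assumption about
 * the CPU (16 for double, 32 for single with two 256-bit FMA units). */
double peak_gflops(double ghz, int cores, int flops_per_cycle)
{
    return ghz * (double)cores * (double)flops_per_cycle;
}

/* Fraction of theoretical peak reached by the measured call. */
double gemm_efficiency(int m, int n, int k, double seconds,
                       double ghz, int cores, int flops_per_cycle)
{
    return gemm_gflops(m, n, k, seconds)
         / peak_gflops(ghz, cores, flops_per_cycle);
}
```

For example, a 1000 x 1000 x 1000 DGEMM that takes 0.05 s on one 3.0 GHz core runs at about 40 GFLOPS against a 48 GFLOPS peak, or roughly 83% efficiency.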

Comments:

Any optimizations to the GEMM code are welcome.

wjc404/GEMM_AVX2
