hipBLASLt is a library that provides general matrix-matrix operations through a flexible API and extends functionality beyond a traditional BLAS library
DBCSR: Distributed Block Compressed Sparse Row matrix library
Fast inference engine for Transformer models
Stretching GPU performance for GEMMs and tensor contractions.
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Tuned OpenCL BLAS
The simplest but fast implementation of matrix multiplication in CUDA.
Matrix Accelerator Generator for GeMM Operations based on SIGMA Architecture in CHISEL HDL
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
The fastest Tropical number matrix multiplication on GPU
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
Fast SGEMM emulation on Tensor Cores
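All of the projects above center on GEMM, the general matrix-matrix multiply from BLAS Level 3. As a point of reference for what these libraries optimize, here is the textbook operation C ← αAB + βC as a naive triple loop; this is an illustrative baseline sketch, not code from any of the repositories listed:

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM: C <- alpha * (A @ B) + beta * C.

    A is m x k, B is k x n, C is m x n, all as nested lists.
    Real libraries tile, vectorize, and parallelize this same loop nest.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C


# Example: plain matrix product (alpha = 1, beta = 0).
C = gemm(1.0, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 0.0, [[0, 0], [0, 0]])
# C is now [[19.0, 22.0], [43.0, 50.0]]
```

The interesting engineering in the repositories above lies in replacing this O(mnk) loop nest with blocked, vectorized, or tensor-core implementations while preserving exactly this semantics.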