NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix. We use it in our paper.
[Official Docs] [Official Download] [Official Github]
This file provides instructions for installing the cusparseLt 0.2.0 library (the version that worked on my device) and setting it up for use in Python.
- First, install cusparselt from Anaconda:
conda install cusparselt -c conda-forge -y
Alternatively, download cusparseLt 0.2.0 from the Official Download page and choose your target platform.
- Install spmm:
python setup.py install
- Check the hardware (cusparseLt is only supported on recent NVIDIA GPUs, e.g., A100 (Ampere) or H100 (Hopper)):
python cspmm/test.py
If everything is set up correctly, it should print "HARDWARE PASSED".
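If the test script fails, it can help to look at the GPU's compute capability directly. The sketch below assumes that cusparseLt needs compute capability 8.0 or higher (the Ampere generation); `MIN_CAPABILITY`, `meets_min_capability`, and `current_device_capability` are illustrative names and not part of spmm. PyTorch is only queried when it is installed and a CUDA device is visible.

```python
# Minimal sketch: compare the GPU compute capability against an assumed minimum.
# Assumption: cusparseLt needs compute capability >= 8.0 (Ampere generation).
MIN_CAPABILITY = (8, 0)


def meets_min_capability(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def current_device_capability():
    """Return (major, minor) of the current CUDA device, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_device_capability()
    if cap is None:
        print("No CUDA device visible from PyTorch")
    else:
        print(cap, "OK" if meets_min_capability(cap) else "below assumed minimum")
```

This only complements `python cspmm/test.py`; the test script remains the check that actually exercises the library.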
Please note that the cusparselt library is updated frequently, but this guide is still valid for now (last update: July 3, 2023).
We calculate the matrix multiplication as:
$$
C = A * B
$$
where $A$ is the sparse (pruned) matrix, $B$ is the other input matrix, and $C$ is the result.
Always import spmm after torch.
Only pass tensors to spmm that are on CUDA and contiguous.
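The second rule can be enforced with a small guard before each call. This is only a sketch: `check_spmm_input` is a hypothetical helper, not part of spmm. It relies on the `is_cuda` attribute and the `is_contiguous()` method that `torch.Tensor` provides.

```python
def check_spmm_input(tensor, name="tensor"):
    """Raise ValueError unless `tensor` is on CUDA and contiguous.

    Works with any object exposing `is_cuda` and `is_contiguous()`,
    which torch.Tensor does.
    """
    if not tensor.is_cuda:
        raise ValueError(f"{name} must be on CUDA; call .cuda() first")
    if not tensor.is_contiguous():
        raise ValueError(f"{name} must be contiguous; call .contiguous() first")
    return tensor
```

Because the helper returns its argument, it can wrap a tensor in place, for example `spmm.compressMatrix(0, check_spmm_input(A_prunned, "A_prunned"))`.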
initSpmmNum(int num): Allocate num memory slots for subsequent sparse matrix multiplications.
import torch
import spmm
spmm.initSpmmNum(4)
checkCusparseLt(): Check whether the hardware supports cusparselt.
import torch
import spmm
spmm.checkCusparseLt() # it should return 0
initSpmmDescriptor(int index, int num_batches, int num_A_rows, int num_A_cols, int lda, int num_B_rows, int num_B_cols, int ldb, int num_C_rows, int num_C_cols, int ldc): Initialize the sparse matrix descriptor. The function locates the memory by index; normally lda equals num_A_cols.
import torch
import spmm
spmm.initSpmmDescriptor(0, 128, 64, 64, 64, 64, 64, 64, 64, 64, 64)
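Writing eleven positional integers by hand is error-prone, so the argument list can be derived from the two input shapes. The sketch below uses a hypothetical helper, `descriptor_args`, which is not part of spmm. It assumes row-major storage, so that ldb and ldc follow the same rule the text gives for lda (leading dimension equals the number of columns).

```python
def descriptor_args(index, num_batches, a_shape, b_shape):
    """Build the 11 positional arguments for initSpmmDescriptor.

    Assumption: row-major storage, so each leading dimension equals
    the number of columns of its matrix.
    """
    a_rows, a_cols = a_shape
    b_rows, b_cols = b_shape
    if a_cols != b_rows:
        raise ValueError(f"inner dimensions differ: {a_cols} vs {b_rows}")
    c_rows, c_cols = a_rows, b_cols  # C = A * B has shape (a_rows, b_cols)
    return (
        index, num_batches,
        a_rows, a_cols, a_cols,  # A and lda
        b_rows, b_cols, b_cols,  # B and ldb
        c_rows, c_cols, c_cols,  # C and ldc
    )
```

With this helper, the call above would read `spmm.initSpmmDescriptor(*descriptor_args(0, 128, (64, 64), (64, 64)))`.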
pruneMatrix(int index, float* original_matrix, float* prunned_matrix): Automatically prune the dense matrix to a sparse matrix.
import torch
import spmm
A = torch.rand(64, 64).cuda().contiguous()
A_prunned = torch.rand(64, 64).cuda().contiguous()
spmm.pruneMatrix(0, A, A_prunned)
checkPrunned(int index, float* A_prunned): Check whether the sparse matrix meets the structural requirements. (See https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltprunealg-t for details. Note that the 2:4 structure only works for data of type half, bfloat16, or int8, so you need to change the input type in cspmm/cspmm_imple.cpp from float to half, or use a template function.)
import torch
import spmm
mask = ... # 1:2 or 2:4
A = ...
A_prunned = (A * mask).contiguous()
spmm.checkPrunned(0, A_prunned)
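The mask in the example above is left to the user. One common choice is magnitude-based pruning: in every group of m consecutive values, keep the n largest by absolute value. The sketch below works on plain nested lists so the rule is easy to inspect. `nm_mask` and `is_nm_sparse` are hypothetical helpers, not part of spmm, and grouping along rows is an assumption about the layout of A.

```python
def nm_mask(rows, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude values in every
    group of m consecutive elements of each row."""
    mask = []
    for row in rows:
        if len(row) % m != 0:
            raise ValueError("row length must be a multiple of m")
        mask_row = []
        for start in range(0, len(row), m):
            group = row[start:start + m]
            # Indices of the n entries with the largest absolute value.
            keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
            mask_row.extend(1 if i in keep else 0 for i in range(m))
        mask.append(mask_row)
    return mask


def is_nm_sparse(rows, n=2, m=4):
    """Return True if every group of m consecutive elements has at most n nonzeros."""
    for row in rows:
        if len(row) % m != 0:
            return False
        for start in range(0, len(row), m):
            if sum(1 for v in row[start:start + m] if v != 0) > n:
                return False
    return True
```

Use `n=1, m=2` for the 1:2 pattern and the defaults for 2:4. To apply the result to a tensor, the nested list can be converted with `torch.tensor(mask, dtype=A.dtype, device=A.device)` before computing `A * mask`.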
compressMatrix(int index, float* A_prunned): Compress the sparse matrix.
import torch
import spmm
...
spmm.compressMatrix(0, A_prunned)
spmm(int index, float* dB, float* dC): Perform the sparse matrix multiplication, where dB is the other input matrix and dC receives the result.
import torch
import spmm
...
spmm.spmm(0, B, C)
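When bringing up a new layer, it is worth comparing the sparse result against a plain dense product of the pruned matrix and B. The sketch below shows that comparison on small nested lists. `reference_matmul` and `max_abs_diff` are hypothetical helpers, not part of spmm, and the acceptable tolerance depends on the data type in use.

```python
def reference_matmul(a, b):
    """Dense C = A * B on nested lists; a is (m x k), b is (k x n)."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions differ")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two equally shaped matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))
```

On tensors, the equivalent check would be along the lines of `torch.allclose(C, A_prunned @ B, atol=...)`, assuming C uses the same layout as the dense product.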
Here is pseudocode for using spmm:
import torch
import spmm
spmm.checkCusparseLt()
spmm.initSpmmNum(n) # suppose you have n nn.Linear
for i in range(n):
    spmm.initSpmmDescriptor(i, ...)
    mask = cal_mask(...) # suppose you get the mask from your own function manually
    A_prunned = (A * mask).cuda().contiguous()
    spmm.checkPrunned(i, A_prunned)
    spmm.compressMatrix(i, A_prunned)
while "(receive an input)":
    spmm.spmm(...)