
Add suport for int8 matmul in afcuda #1656

Open
WilliamTambellini opened this issue Nov 30, 2016 · 20 comments · May be fixed by #3508
Comments

@WilliamTambellini
Contributor

I see these new features in cuda 8:
"Native FP16 and INT8 computation for deep learning and other workloads;" :
https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/
This feature request is to be able to use it through AF at least for matmul and arithmetic (+ - * /)
Thanks
WT.

@arcfide

arcfide commented Nov 30, 2016

I would also like to mention my own interest in having general support for int8 as a feature.

@pavanky pavanky added this to the v3.5.0 milestone Nov 30, 2016
@umar456
Member

umar456 commented Nov 30, 2016

This is a great idea, but we need to be careful to up/down-convert on certain hardware when adding this feature. For example, fp16 performance on compute_61 (non-Tesla Pascal) cards is absolutely abysmal. We will also need to support this on older hardware. I wonder how CUDA provides fallbacks in those cases.
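For illustration, a minimal sketch (plain CUDA runtime calls, not ArrayFire internals) of how a backend could detect whether the device has native int8 dot-product support (DP4A, compute capability 6.1+) and otherwise fall back to up-converted compute:

```cpp
// Hedged sketch: gate the native int8 path on device compute capability.
// DP4A (4-way int8 dot product with int32 accumulate) is available on sm_61 and newer.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    bool has_dp4a = (prop.major > 6) || (prop.major == 6 && prop.minor >= 1);
    std::printf("sm_%d%d: %s\n", prop.major, prop.minor,
                has_dp4a ? "native int8 dot product (DP4A) available"
                         : "up-convert and fall back to int32/fp32 compute");
    return 0;
}
```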

@pavanky
Member

pavanky commented Nov 30, 2016

AFAIK, there is no support for half-precision floating-point numbers in the C and C++ standards, so fp16 is going to be a bit tricky to support in a general-purpose manner.

int8 can be supported easily (although it is a bit tedious).
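As a side note on why the two differ: int8 already has a standard fixed-width type, while fp16 has no standard C++ type (as of C++14/17) and needs a library or compiler extension. A tiny illustration (plain C++, not ArrayFire code):

```cpp
// int8 is covered by the standard <cstdint> fixed-width types; fp16 is not.
#include <cstdint>
#include <type_traits>

static_assert(sizeof(std::int8_t) == 1, "int8_t is a 1-byte signed integer");
static_assert(std::is_signed<std::int8_t>::value, "int8_t is signed");
// No std::float16_t exists here; fp16 requires a library type (e.g. the 'half' library)
// or vendor intrinsics such as CUDA's __half, which is why fp16 is the harder of the two.

int main() { return 0; }
```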

@umar456
Member

umar456 commented Nov 30, 2016

Found the half library. It seems to be well documented. Probably need to run some tests on performance and compatibility with native types.
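For reference, a minimal sketch of what using that standalone half library could look like (header and type names assumed from its documentation; not ArrayFire code):

```cpp
// Hedged sketch using the header-only "half" library (half.hpp, namespace half_float).
#include "half.hpp"
#include <iostream>

int main() {
    using half_float::half;
    half a(1.5f), b(2.25f);                       // constructed from float, stored in 16 bits
    half c = a * b;                               // overloaded operators, result rounded to half
    std::cout << static_cast<float>(c) << "\n";   // convert back to float for printing
    return 0;
}
```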

@pavanky
Member

pavanky commented Nov 30, 2016

@umar456 Does Boost have anything similar?

@WilliamTambellini
Contributor Author

Would int8 be easier to implement than fp16, especially the int8 path of the CUDA 8 Pascal GPU backend?

@pavanky
Member

pavanky commented Nov 30, 2016

@WilliamTambellini Yes, int8 support is much easier to accomplish.

@WilliamTambellini
Contributor Author

OK, so we had better split this feature/issue in two: int8 and fp16.
Would you mind?

@shehzan10
Member

The issue with int8 (char) is that we already use it for b8. That is a distinction we will have to keep in mind.

@pavanky
Member

pavanky commented Nov 30, 2016

@shehzan10 int8_t is a different datatype that can be used for i8.
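A small illustration of that distinction (plain C++, not ArrayFire code):

```cpp
// int8_t is a signed integer type (signed char on common platforms) and is a distinct
// type from plain char, which b8 is currently backed by, so the two can coexist.
#include <cstdint>
#include <type_traits>

static_assert(!std::is_same<std::int8_t, char>::value,
              "int8_t is not the same type as plain char");
static_assert(sizeof(std::int8_t) == sizeof(char), "both are 1 byte wide");

int main() { return 0; }
```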

@WilliamTambellini WilliamTambellini changed the title Add suport for int8 and/or fp16 Add suport for int8 Nov 30, 2016
@WilliamTambellini
Contributor Author

OK, thanks.
WARNING: I have renamed this feature request to "int8" in order to limit its scope to int8 support. Anyone interested in fp16 should create another GitHub issue/ticket.
Cheers
W.

@pavanky pavanky mentioned this issue Dec 14, 2016
@WilliamTambellini
Contributor Author

I confirm our interest in int8, mainly via CUDA.
Could anyone summarize which parts of ArrayFire would need to be modified in order to take advantage of INT8 hardware acceleration?
Cheers

@pavanky
Member

pavanky commented Feb 3, 2017

@WilliamTambellini This involves changing a lot of files across the 3 backends. It is not entirely straightforward.

@arcfide

arcfide commented Feb 4, 2017

I'd just like to add my vote for int8 support.

@mlloreda mlloreda modified the milestones: v3.5.1, v3.5.0 May 22, 2017
@pavanky pavanky modified the milestones: v3.6.0, v3.5.1 Jun 16, 2017
@WilliamTambellini
Contributor Author

Seen here:
https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/
"cuBLAS is a GPU library for dense linear algebra— an implementation of BLAS, the Basic Linear Algebra Subroutines. cuBLAS has support for mixed precision in several matrix-matrix multiplication routines. cublasHgemm is a FP16 dense matrix-matrix multiply routine that uses FP16 for compute as well as for input and output. cublasSgemmEx() computes in FP32, but the input data can be FP32, FP16, or INT8, and the output can be FP32 or FP16. cublasGemm() is a new routine in CUDA 8 that allows specification of the computation precision, including INT8 computation (which uses DP4A)."

@umar456 Would it be possible to do a minimal implementation in AF that calls cublasGemm() when the af::array datatype is int8 or int16?
Thanks

@mlloreda mlloreda modified the milestones: v3.6.0, v3.7.0 Mar 1, 2018
@WilliamTambellini
Contributor Author

WilliamTambellini commented Jun 19, 2018

Attaching a minimal PoC (seen on the NVIDIA forum) of int8 matmul using cublasGemmEx:
int8cublas.cu.txt
TBC.
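Since the attachment itself is not inlined above, here is a hedged, self-contained sketch (not the attached file) of what an int8 matmul through cublasGemmEx can look like: int8 A/B inputs with int32 accumulation, which maps to DP4A on sm_61+ hardware. Dimensions and leading dimensions are kept multiples of 4 because the int8 path has alignment restrictions; check the cuBLAS docs of your toolkit for the exact rules.

```cpp
// Hedged sketch: C(int32) = A(int8) * B(int8) via cublasGemmEx, column-major, no transposes.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int m = 4, n = 4, k = 4;  // keep dims/leading dims multiples of 4 for the int8 path

    std::vector<int8_t>  hA(m * k, 1), hB(k * n, 2);
    std::vector<int32_t> hC(m * n, 0);

    int8_t  *dA = nullptr, *dB = nullptr;
    int32_t *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(int8_t));
    cudaMalloc(&dB, hB.size() * sizeof(int8_t));
    cudaMalloc(&dC, hC.size() * sizeof(int32_t));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(int8_t), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(int8_t), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const int32_t alpha = 1, beta = 0;  // alpha/beta are int32 when computing in 32-bit integer
    cublasStatus_t st = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                     &alpha,
                                     dA, CUDA_R_8I, m,    // int8 inputs
                                     dB, CUDA_R_8I, k,
                                     &beta,
                                     dC, CUDA_R_32I, m,   // int32 output
                                     CUDA_R_32I,          // compute type (CUBLAS_COMPUTE_32I in cuBLAS 11+)
                                     CUBLAS_GEMM_DFALT);  // default algorithm selection

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(int32_t), cudaMemcpyDeviceToHost);
    std::printf("status=%d, C[0]=%d (expected %d)\n", (int)st, (int)hC[0], 2 * k);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

On pre-sm_61 hardware the call may fail (e.g. CUBLAS_STATUS_NOT_SUPPORTED), which is where the up/down-conversion fallback discussed earlier in the thread would have to kick in.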

@WilliamTambellini WilliamTambellini changed the title Add suport for int8 Add suport for int8 matmul for the afcuda backend Jun 19, 2018
@WilliamTambellini
Contributor Author

@umar456 this one is not needed for 3.7.0 on my side.

@WilliamTambellini
Contributor Author

Could we please remove this one from the 3.7.0 scope?

@umar456 umar456 removed this from the v3.7.0 milestone Dec 18, 2019
@WilliamTambellini
Contributor Author

Any update on this?

@WilliamTambellini WilliamTambellini changed the title Add suport for int8 matmul for the afcuda backend Add suport for int8 matmul for afcuda Aug 29, 2023
@WilliamTambellini WilliamTambellini changed the title Add suport for int8 matmul for afcuda Add suport for int8 matmul in afcuda Aug 29, 2023
@WilliamTambellini
Contributor Author

(image attachment)

@verstatx verstatx linked a pull request Oct 4, 2023 that will close this issue