Support for fp16 #1673

Open · 3 of 4 tasks
pavanky opened this issue Dec 14, 2016 · 33 comments

@pavanky
Member

pavanky commented Dec 14, 2016

This is splitting off from the issue mentioned here:
#1656

The problems mentioned in the issue include:

  • no standard fp16 data type
  • performance issues for certain hardware supporting native fp16.
  • lack of library support.

These issues can be solved by:

  • defining a custom 16-bit floating point data type (similar to af_cfloat, af_cdouble)
  • using fp16 for storage only, but performing compute in fp32 (a conversion sketch follows at the end of this comment)
  • supporting only key functionality, enough to cover the most common use cases.

The types of functions that can easily be supported this way include:

  • All JIT functionality
  • All reductions (since they already take in different parameters for compute and storage)
  • All convolutions (same as above)
  • Matrix multiplication (using reductions or using a library like cuBLAS)
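
For illustration, a minimal sketch of the storage-only idea (hypothetical type and helper names, not ArrayFire code): values live in 16 bits, and every operation widens to fp32 before computing.

```cpp
// Minimal sketch of "fp16 for storage, fp32 for compute", assuming a custom
// 16-bit storage type (analogous in spirit to af_cfloat/af_cdouble). The
// conversions truncate the mantissa and skip subnormal/NaN handling for
// brevity; a production header would cover those cases and round properly.
#include <cstdint>
#include <cstring>

struct half_storage { std::uint16_t bits; };  // hypothetical storage-only type

inline half_storage float_to_half(float f) {
    std::uint32_t x; std::memcpy(&x, &f, sizeof x);
    std::uint16_t sign = static_cast<std::uint16_t>((x >> 16) & 0x8000u);
    std::int32_t  exp  = static_cast<std::int32_t>((x >> 23) & 0xFFu) - 127 + 15;
    std::uint16_t mant = static_cast<std::uint16_t>((x >> 13) & 0x03FFu);
    if (exp <= 0)  return {sign};                                        // underflow -> signed zero
    if (exp >= 31) return {static_cast<std::uint16_t>(sign | 0x7C00u)};  // overflow  -> +/- infinity
    return {static_cast<std::uint16_t>(sign | (exp << 10) | mant)};
}

inline float half_to_float(half_storage h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h.bits & 0x8000u) << 16;
    std::uint32_t exp  = (h.bits >> 10) & 0x1Fu;
    std::uint32_t mant = h.bits & 0x03FFu;
    std::uint32_t out;
    if (exp == 0)       out = sign;                                   // zero (subnormals dropped)
    else if (exp == 31) out = sign | 0x7F800000u | (mant << 13);      // infinity / NaN
    else                out = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &out, sizeof f); return f;
}

// The arithmetic itself runs in fp32; only the inputs and outputs are 16 bits wide.
inline half_storage add(half_storage a, half_storage b) {
    return float_to_half(half_to_float(a) + half_to_float(b));
}
```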
@pavanky
Member Author

pavanky commented Dec 14, 2016

@WilliamTambellini thoughts ?

@pavanky
Member Author

pavanky commented Dec 14, 2016

@arrayfire/core-devel thoughts ?

@9prady9
Member

9prady9 commented Dec 15, 2016

"using fp16 for storage only, but performing compute in fp32": this is what armclang and GCC for ARM do as well. They use a type __fp16 for storage only and promote to float in arithmetic expressions.

But I think it's good to use fp16 directly inside kernels (CUDA/OpenCL) in the cases where it performs well.

In the case of the CPU backend, we would probably have to define the arithmetic operations we want to support on the custom 16-bit floating point data type.
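
A tiny sketch of that storage-only behaviour, assuming an ARM toolchain (armclang, or GCC for ARM with IEEE __fp16 enabled):

```cpp
#include <cstdio>

// __fp16 is a storage-only type on these toolchains: values occupy 16 bits in
// memory, but both operands are promoted to float, so the multiply runs in fp32.
float scale(__fp16 a, __fp16 b) {
    return a * b;
}

int main() {
    __fp16 x = 1.5f, y = 2.25f;        // stored as 16-bit values
    std::printf("%f\n", scale(x, y));  // computed in fp32: prints 3.375000
    return 0;
}
```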

@pavanky
Member Author

pavanky commented Dec 15, 2016

@9prady9 We need to have consistent behavior for accuracy reasons. We can't promote to fp32 in one backend while using native fp16 in others.

We can perhaps, at a later point, have a flag that switches the behavior as needed.

@9prady9
Member

9prady9 commented Dec 15, 2016

I didn't mean using fp16 in one backend and not another. What I meant was that, if a given kernel performs well in both the CUDA and OpenCL backends, then we can use fp16 for that function's computation as well, instead of promoting to a 32-bit type.

@pavanky
Member Author

pavanky commented Dec 15, 2016

@9prady9 The problem isn't algorithm-specific; it has to do with the hardware.

In the other issue, @umar456 linked to the NVIDIA docs regarding performance. Everything other than the P100 has worse performance for fp16 compared to fp32. According to Wikipedia, the performance of fp16 and fp32 on current-generation AMD hardware is identical, so I assume they are performing some kind of upcasting in hardware.

Given that fp16 storage with fp32 compute is the only way we can get consistent performance and behavior, I think ArrayFire should default to using fp32 computation for fp16 arrays.

@shehzan10 added this to the timeless milestone Dec 29, 2016
@WilliamTambellini
Contributor

Hi guys, sorry for the delay. On our side, we are more interested in computation speed-up (the usual linear algebra) than in lowering memory usage, so I'm not interested in "using fp16 for storage only, but performing compute in fp32". This is simply because I'm mainly targeting recent NVIDIA GPUs with good fp16 compute speed-ups for neural network inference. Happy new year.

@pavanky
Member Author

pavanky commented Jan 4, 2017

@WilliamTambellini It can be enabled as an option that the user can toggle.

@WilliamTambellini
Contributor

WilliamTambellini commented Jan 20, 2017

Hi guys,
Considering:

  • we are mostly on recent NVIDIA technology (CUDA 8, P100, Titan X, GTX 10x0, i.e. GP10x-based...)
  • we are more interested in computation speed-up than in memory savings

I'm OK as long as there is a runtime option to force fp16 computation for fp16 arrays.
Would that be possible?
Cheers
W.
Ref:
https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

@pavanky
Member Author

pavanky commented Jan 20, 2017

@WilliamTambellini Definitely possible; that's what I was suggesting.
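
Purely as an illustration of the kind of runtime switch being discussed (hypothetical names, not an existing ArrayFire API):

```cpp
#include <atomic>

// Hypothetical process-wide switch between "promote f16 to fp32 for compute"
// (the proposed default) and "compute natively in fp16" on capable hardware.
enum class HalfComputeMode { PromoteToF32, NativeF16 };

static std::atomic<HalfComputeMode> g_half_compute{HalfComputeMode::PromoteToF32};

void setHalfComputeMode(HalfComputeMode mode) { g_half_compute.store(mode); }
HalfComputeMode halfComputeMode() { return g_half_compute.load(); }
```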

@CNugteren
Contributor

Regarding OpenCL: perhaps it's useful to know that the CLBlast OpenCL BLAS library already supports FP16. You could use it as a back-end (PR #1727 is already made), but you could also learn from its general structure of how FP16 was handled, or even just borrow its FP32-to-FP16 conversion header. Let me know if you have any questions.

@pavanky modified the milestone: timeless Jul 26, 2017
@velhaco20000

Hi, guys.
I am here because I decided to use ArrayFire instead of writing my own OpenCL code.
The future of heterogeneous computing is in the C/C++ standard, a solution similar to what you are working on here. But that solution will not be on the market before 2020-21.
That's why I've chosen this solution.
So, I just want to leave some feedback.
The only unsupported backend is Intel, and this is really not a problem.
Both AMD (Vega) and NVIDIA (some Pascal) support IEEE 754-2008, which defines the FP16 data type in section 3.6 (interchange formats).
For a developer, 8-bit operations (AMD) and 16-bit FP operations (AMD & NVIDIA) have tremendous value:
AI or any kind of clamped math, logical math, strings, etc.
Especially on Vega.
I believe the only reason to buy a Vega is packed math.
And that is a very good reason for the NVIDIA Titan Xp too (FP16).
I don't think any Intel CPU will implement FP16, beyond ARM doing it for power-saving reasons.
So, I believe a reasonable solution is to create an experimental mode, activated by the user, that generates a warning if the proper hardware is not found.
For other hardware, just use FP32 and print a warning about the performance penalty.
For me, just exposing the arrays is enough.
Both AMD and NVIDIA have BLAS libraries using FP16 (cuBLAS and rocBLAS).

@WilliamTambellini
Contributor

Hi @velhaco20000
Welcome to AF.
Please keep in mind that int8 support is covered by this one:
#1656
I'm not interested in fp16 support on Intel CPUs either, as I'm mainly focused on CUDA Pascal support.
@velhaco20000: do you have any plans to fork AF and implement fp16 yourself?
Cheers
W.

@velhaco20000

velhaco20000 commented Aug 29, 2017 via email

@pavanky
Member Author

pavanky commented Aug 29, 2017

@WilliamTambellini @velhaco20000 I am going to work on run-time compilation of most (if not all) CUDA kernels (similar to what is happening in OpenCL). This should make it easier to add support for FP16 and Int8. I am going to target this for the 3.6 release.

@pavanky
Member Author

pavanky commented Aug 29, 2017

@velhaco20000

The future of heterogeneous computing is in the C/C++ standard, a solution similar to what you are working on here. But that solution will not be on the market before 2020-21.

ArrayFire is going to evolve to incorporate this :) The standard libraries will probably not contain the breadth of functionality of ArrayFire, but the improvements to the C++ standard would mean a much smaller codebase in ArrayFire :)

@pavanky
Member Author

pavanky commented Aug 29, 2017

@velhaco20000

I think I could help tune OpenCL performance when I have my Vega.
Right now I am using a notebook with an NVIDIA part.

@WilliamTambellini
Contributor

Ok so looks like:

@pavanky @umar456 are there any remaining blocking points to move forward on this issue?
Cheers

@WilliamTambellini
Contributor

FYI: TensorFlow implementation of fp16:
tensorflow/tensorflow#1300

@WilliamTambellini
Contributor

Hello @pavanky
In order to move forward a little, have you created a GitHub issue for: "run-time compilation of most (if not all) CUDA kernels (similar to what is happening in OpenCL). This should make it easier to add support for FP16 and Int8. I am going to target this for the 3.6 release."?
Kind

@WilliamTambellini
Contributor

If it helps move this one forward a little: this link shows an example of half-precision fused multiply-add via CUDA:
https://github.com/parallel-forall/code-samples/tree/master/posts/mixed-precision

@kar-dim

kar-dim commented Aug 17, 2019

Hello, is half supported yet? I am using the latest version as of today, but there is no af::dtype::f16, so I assume not? Is there a way to use the afcl array constructor to read a half buffer from an OpenCL buffer into ArrayFire? I tried "s16" and other 16-bit types, but obviously this is wrong because as() converts the values; it is not a simple "blind bit-by-bit cast".

@umar456
Member

umar456 commented Aug 17, 2019

The f16 datatype is currently in the master branch. We haven't made a release with 16-bit floating point as of now. You will have to build ArrayFire to use it.

@kar-dim

kar-dim commented Aug 17, 2019

Do you know when the next binary release will be? Thanks.

@nsakharnykh

Hi, any update on fp16 support in release? Is there any timeline for this?

@9prady9
Member

9prady9 commented Jan 15, 2020

fp16 support for some functions/features will be available in the next release, 3.7. We are still ironing out some issues that are holding up the release at the moment. We are sorry for the extended wait; we will try to publish a release as soon as we can. We shall send a release email/post as soon as we do, and I shall update this issue as well once we make a release.

@WilliamTambellini
Contributor

Hello @nsakharnykh
On my side, one of the last issues blocking the release is probably this one:
#2701
For some reason, gemv for half is not implemented in cuBLAS. Would you know why?
Kind

@mtmd

mtmd commented Jan 27, 2020

@WilliamTambellini
Have you tried calling FP16 cublas<t>gemm() with n=1? That should address the issue.
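
A rough sketch of that workaround, assuming CUDA 9+ and cuBLAS (compile with nvcc and link cublas); error checking is omitted for brevity:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int m = 4, k = 3;  // y(m x 1) = A(m x k) * x(k x 1)
    std::vector<__half> hA(m * k, __float2half(1.0f));
    std::vector<__half> hx(k, __float2half(2.0f));
    std::vector<__half> hy(m);

    __half *dA, *dx, *dy;
    cudaMalloc(&dA, m * k * sizeof(__half));
    cudaMalloc(&dx, k * sizeof(__half));
    cudaMalloc(&dy, m * sizeof(__half));
    cudaMemcpy(dA, hA.data(), m * k * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), k * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    // GEMM with n == 1 computes C(m x 1) = A(m x k) * B(k x 1),
    // i.e. a half-precision matrix-vector product (cuBLAS is column-major).
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, /*n=*/1, k,
                &alpha, dA, m, dx, k, &beta, dy, m);

    cudaMemcpy(hy.data(), dy, m * sizeof(__half), cudaMemcpyDeviceToHost);
    for (int i = 0; i < m; ++i) std::printf("%f\n", __half2float(hy[i]));  // each entry: 6.0

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```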

@WilliamTambellini
Contributor

@pavanky This one could be closed now, I guess?

@9prady9
Member

9prady9 commented Mar 16, 2020

This is splitting off from the issue mentioned here:
#1656

The problems mentioned in the issue include:

* no standard fp16 data type
* performance issues for certain hardware supporting native fp16.
* lack of library support.

These issues can be solved by:

* defining a custom 16 bit floating point data type (similar to af_cfloat, af_cdouble)
* using fp16 for storage only, but performing compute in fp32.
* supporting only key functionality to support most general use cases.

The types of functions that can easily be supported this way include:

* [x] All JIT functionality
* [x] All reductions (since they already take in different parameters for compute and storage)
* [ ] All convolutions (same as above)
* [x] Matrix multiplication (using reductions or using a library like cuBLAS)

I have marked the finished items. convolve2NN and convolve2GradientNN are the only convolutions that support the f16 type; the standard convolutions don't have half support yet.
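
For anyone who wants to try it, a small usage sketch assuming an ArrayFire build from master with half support on the active device (only operations listed above as supporting f16 are used):

```cpp
#include <arrayfire.h>

int main() {
    af::array a = af::randu(256, 256);     // f32
    af::array b = af::randu(256, 256);

    af::array a16 = a.as(f16);             // store as half
    af::array b16 = b.as(f16);

    af::array c16 = af::matmul(a16, b16);  // matmul has f16 support per the list above
    af::array d16 = a16 * 2.0f + 1.0f;     // element-wise JIT ops also support f16

    af_print(c16(af::seq(3), af::seq(3)));
    af_print(d16(af::seq(3), af::seq(3)));
    return 0;
}
```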

@BA8F0D39

BA8F0D39 commented May 6, 2021

@9prady9
Which functions don't have f16 support?

@9prady9
Member

9prady9 commented May 7, 2021

@BA8F0D39 We don't have a single location where this is listed. However, I can say for sure that only signal processing and matrix algebra functions might have this support. JIT has half support, with the computations done in single precision. Image processing definitely doesn't have half support. That said, most functions that don't support the half type return the appropriate "type not supported" error code.
