Support for fp16 #1673

Open · 3 of 4 tasks
pavanky opened this issue Dec 14, 2016 · 33 comments

@pavanky
Member

pavanky commented Dec 14, 2016

This is splitting off from the issue mentioned here:
#1656

The problems mentioned in the issue include:

  • no standard fp16 data type
  • performance issues for certain hardware supporting native fp16.
  • lack of library support.

These issues can be solved by:

  • defining a custom 16-bit floating point data type (similar to af_cfloat, af_cdouble)
  • using fp16 for storage only, but performing compute in fp32 (a conversion sketch follows at the end of this comment)
  • supporting only key functionality, enough to cover the most common use cases.

The types of functions that can easily be supported this way include:

  • All JIT functionality
  • All reductions (since they already take in different parameters for compute and storage)
  • All convolutions (same as above)
  • Matrix multiplication (using reductions or using a library like cuBLAS)
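
For illustration, a minimal sketch of the storage-only idea (hypothetical type and helper names, not ArrayFire code): values live in 16 bits, and every operation widens to fp32 before computing.

```cpp
// Minimal sketch of "fp16 for storage, fp32 for compute", assuming a custom
// 16-bit storage type (analogous in spirit to af_cfloat/af_cdouble). The
// conversions truncate the mantissa and skip subnormal/NaN handling for
// brevity; a production header would cover those cases and round properly.
#include <cstdint>
#include <cstring>

struct half_storage { std::uint16_t bits; };  // hypothetical storage-only type

inline half_storage float_to_half(float f) {
    std::uint32_t x; std::memcpy(&x, &f, sizeof x);
    std::uint16_t sign = static_cast<std::uint16_t>((x >> 16) & 0x8000u);
    std::int32_t  exp  = static_cast<std::int32_t>((x >> 23) & 0xFFu) - 127 + 15;
    std::uint16_t mant = static_cast<std::uint16_t>((x >> 13) & 0x03FFu);
    if (exp <= 0)  return {sign};                                        // underflow -> signed zero
    if (exp >= 31) return {static_cast<std::uint16_t>(sign | 0x7C00u)};  // overflow  -> +/- infinity
    return {static_cast<std::uint16_t>(sign | (exp << 10) | mant)};
}

inline float half_to_float(half_storage h) {
    std::uint32_t sign = static_cast<std::uint32_t>(h.bits & 0x8000u) << 16;
    std::uint32_t exp  = (h.bits >> 10) & 0x1Fu;
    std::uint32_t mant = h.bits & 0x03FFu;
    std::uint32_t out;
    if (exp == 0)       out = sign;                                   // zero (subnormals dropped)
    else if (exp == 31) out = sign | 0x7F800000u | (mant << 13);      // infinity / NaN
    else                out = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &out, sizeof f); return f;
}

// The arithmetic itself runs in fp32; only the inputs and outputs are 16 bits wide.
inline half_storage add(half_storage a, half_storage b) {
    return float_to_half(half_to_float(a) + half_to_float(b));
}
```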
@pavanky
Member Author

pavanky commented Dec 14, 2016

@WilliamTambellini thoughts ?

@pavanky
Member Author

pavanky commented Dec 14, 2016

@arrayfire/core-devel thoughts ?

@9prady9
Member

9prady9 commented Dec 15, 2016

"using fp16 for storage only, but performing compute in fp32": this is what armclang and GCC for ARM do as well. They use a type __fp16 for storage only and promote to float in arithmetic expressions.

But I think it's good to use fp16 directly inside kernels (CUDA/OpenCL) in the cases where it performs well.

In the case of the CPU backend, we would probably have to define the arithmetic operations we want to support on the custom 16-bit floating point data type.
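
A tiny sketch of that storage-only behaviour, assuming an ARM toolchain (armclang, or GCC for ARM with IEEE __fp16 enabled):

```cpp
#include <cstdio>

// __fp16 is a storage-only type on these toolchains: values occupy 16 bits in
// memory, but both operands are promoted to float, so the multiply runs in fp32.
float scale(__fp16 a, __fp16 b) {
    return a * b;
}

int main() {
    __fp16 x = 1.5f, y = 2.25f;        // stored as 16-bit values
    std::printf("%f\n", scale(x, y));  // computed in fp32: prints 3.375000
    return 0;
}
```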

@pavanky
Member Author

pavanky commented Dec 15, 2016

@9prady9 We need to have consistent behavior for accuracy reasons. We can't promote to fp32 in one backend while using native fp16 in others.

We can perhaps, at a later point, have a flag that switches the behavior as needed.

@9prady9
Member

9prady9 commented Dec 15, 2016

I didn't mean using fp16 in one backend and not another. What I meant was that, if a given kernel performs well in both the CUDA and OpenCL backends, then we can use fp16 for that function's computation as well, instead of promoting to a 32-bit type.

@pavanky
Member Author

pavanky commented Dec 15, 2016

@9prady9 The problem isn't algorithm-specific; it has to do with the hardware.

In the other issue, @umar456 linked to the NVIDIA docs regarding performance. Everything other than the P100 has worse performance for fp16 compared to fp32. According to Wikipedia, the performance of fp16 and fp32 on current-generation AMD hardware is identical, so I assume they are performing some kind of upcasting in hardware.

Given that fp16 storage with fp32 compute is the only way we can get consistent performance and behavior, I think ArrayFire should default to using fp32 computation for fp16 arrays.

@shehzan10 added this to the timeless milestone Dec 29, 2016
@WilliamTambellini
Contributor

Hi guys, sorry for the delay. On our side, we are more interested in computation speed-up (the usual linear algebra) than in lowering memory usage, so I'm not interested in "using fp16 for storage only, but performing compute in fp32". This is simply because I'm mainly targeting recent NVIDIA GPUs with good fp16 compute speed-ups for neural network inference. Happy new year.

@pavanky
Member Author

pavanky commented Jan 4, 2017

@WilliamTambellini It can be enabled as an option that the user can toggle.

@WilliamTambellini
Contributor

WilliamTambellini commented Jan 20, 2017

Hi guys,
Considering:

  • we are mostly on recent NVIDIA technology (CUDA 8, P100, Titan X, GTX 10x0, i.e. GP10x-based...)
  • we are more interested in computation speed-up than in memory savings

I'm OK as long as there is a runtime option to force fp16 computation for fp16 arrays.
Would that be possible?
Cheers
W.
Ref:
https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/

@pavanky
Member Author

pavanky commented Jan 20, 2017

@WilliamTambellini Definitely possible; that's what I was suggesting.
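
Purely as an illustration of the kind of runtime switch being discussed (hypothetical names, not an existing ArrayFire API):

```cpp
#include <atomic>

// Hypothetical process-wide switch between "promote f16 to fp32 for compute"
// (the proposed default) and "compute natively in fp16" on capable hardware.
enum class HalfComputeMode { PromoteToF32, NativeF16 };

static std::atomic<HalfComputeMode> g_half_compute{HalfComputeMode::PromoteToF32};

void setHalfComputeMode(HalfComputeMode mode) { g_half_compute.store(mode); }
HalfComputeMode halfComputeMode() { return g_half_compute.load(); }
```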

@CNugteren
Contributor

Regarding OpenCL: perhaps it's useful to know that the CLBlast OpenCL BLAS library already supports FP16. You could use it as a back-end (PR #1727 is already made), but you could also learn from its general structure of how FP16 was handled, or even just borrow its FP32-to-FP16 conversion header. Let me know if you have any questions.

@pavanky modified the milestone: timeless Jul 26, 2017
@velhaco20000

Hi, guys.
I am here because I decided to use ArrayFire instead of writing my own OpenCL code.
The future of heterogeneous computing is in the C/C++ standard, a solution similar to what you are working on here. But that solution will not be on the market before 2020-21.
That's why I've chosen this solution.
So, I just want to leave some feedback.
The only unsupported backend is Intel, and this is really not a problem.
Both AMD (Vega) and NVIDIA (some Pascal) support IEEE 754-2008, which defines the FP16 data type in section 3.6 (interchange formats).
For a developer, 8-bit operations (AMD) and 16-bit FP operations (AMD & NVIDIA) have tremendous value:
AI or any kind of clamped math, logical math, strings, etc.
Especially on Vega.
I believe the only reason to buy a Vega is packed math.
And that is a very good reason for the NVIDIA Titan Xp too (FP16).
I don't think any Intel CPU will implement FP16, beyond ARM doing it for power-saving reasons.
So, I believe a reasonable solution is to create an experimental mode, activated by the user, that generates a warning if the proper hardware is not found.
For other hardware, just use FP32 and print a warning about the performance penalty.
For me, just exposing the arrays is enough.
Both AMD and NVIDIA have BLAS libraries using FP16 (cuBLAS and rocBLAS).

@WilliamTambellini
Contributor

Hi @velhaco20000
Welcome to AF.
Please keep in mind that int8 support is covered by this one:
#1656
I'm not interested in fp16 support on Intel CPUs either, as I'm mainly focused on CUDA Pascal support.
@velhaco20000: do you have any plans to fork AF and implement fp16 yourself?
Cheers
W.

@velhaco20000

velhaco20000 commented Aug 29, 2017 via email

@pavanky
Member Author

pavanky commented Aug 29, 2017

@WilliamTambellini @velhaco20000 I am going to work on run-time compilation of most (if not all) CUDA kernels (similar to what is happening in OpenCL). This should make it easier to add support for FP16 and Int8. I am going to target this for the 3.6 release.

@pavanky
Member Author

pavanky commented Aug 29, 2017

@velhaco20000

The future of heterogeneous computing is in the C/C++ standard, a solution similar to what you are working on here. But that solution will not be on the market before 2020-21.

ArrayFire is going to evolve to incorporate this :) The standard libraries will probably not contain the breadth of functionality of ArrayFire, but the improvements to the C++ standard would mean a much smaller codebase in ArrayFire :)

@pavanky
Member Author

pavanky commented Aug 29, 2017

@velhaco20000

I think I could help tune OpenCL performance when I have my Vega.
Right now I am using a notebook with an NVIDIA part.

@WilliamTambellini
Contributor

Ok so looks like:

@pavanky @umar456 are there any remaining blocking points to move forward on this issue?
Cheers

@WilliamTambellini
Contributor

FYI: TensorFlow implementation of fp16:
tensorflow/tensorflow#1300

@WilliamTambellini
Contributor

Hello @pavanky
In order to move forward a little, have you created a GitHub issue for: "run-time compilation of most (if not all) CUDA kernels (similar to what is happening in OpenCL). This should make it easier to add support for FP16 and Int8. I am going to target this for the 3.6 release."?
Kind

@WilliamTambellini
Contributor

If it helps move this one forward a little: this link shows an example of half-precision fused multiply-add via CUDA:
https://github.com/parallel-forall/code-samples/tree/master/posts/mixed-precision

@kar-dim

kar-dim commented Aug 17, 2019

Hello, is half supported yet? I am using the latest version as of today, but there is no af::dtype::f16, so I assume not? Is there a way to use the afcl array constructor to read a half buffer from an OpenCL buffer into ArrayFire? I tried "s16" and other 16-bit types, but obviously this is wrong because as() converts the values; it is not a simple "blind bit-by-bit cast".

@umar456
Member

umar456 commented Aug 17, 2019

The f16 datatype is currently in the master branch. We haven't made a release with 16-bit floating point as of now. You will have to build ArrayFire to use it.

@kar-dim

kar-dim commented Aug 17, 2019

Do you know when the next binary release will be? Thanks.

@nsakharnykh

Hi, any update on fp16 support in release? Is there any timeline for this?

@9prady9
Member

9prady9 commented Jan 15, 2020

fp16 support for some functions/features will be available in the next release, 3.7. We are still ironing out some issues that are holding up the release at the moment. We are sorry for the extended wait; we will try to publish a release as soon as we can. We shall send a release email/post as soon as we do, and I shall update this issue as well once we make a release.

@WilliamTambellini
Contributor

Hello @nsakharnykh
On my side, one of the last issues blocking the release is probably this one:
#2701
For some reason, gemv for half is not implemented in cuBLAS. Would you know why?
Kind

@mtmd

mtmd commented Jan 27, 2020

@WilliamTambellini
Have you tried calling FP16 cublas<t>gemm() with n=1? That should address the issue.
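
A rough sketch of that workaround, assuming CUDA 9+ and cuBLAS (compile with nvcc and link cublas); error checking is omitted for brevity:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int m = 4, k = 3;  // y(m x 1) = A(m x k) * x(k x 1)
    std::vector<__half> hA(m * k, __float2half(1.0f));
    std::vector<__half> hx(k, __float2half(2.0f));
    std::vector<__half> hy(m);

    __half *dA, *dx, *dy;
    cudaMalloc(&dA, m * k * sizeof(__half));
    cudaMalloc(&dx, k * sizeof(__half));
    cudaMalloc(&dy, m * sizeof(__half));
    cudaMemcpy(dA, hA.data(), m * k * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), k * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    // GEMM with n == 1 computes C(m x 1) = A(m x k) * B(k x 1),
    // i.e. a half-precision matrix-vector product (cuBLAS is column-major).
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, /*n=*/1, k,
                &alpha, dA, m, dx, k, &beta, dy, m);

    cudaMemcpy(hy.data(), dy, m * sizeof(__half), cudaMemcpyDeviceToHost);
    for (int i = 0; i < m; ++i) std::printf("%f\n", __half2float(hy[i]));  // each entry: 6.0

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```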

@WilliamTambellini
Contributor

@pavanky This one could be closed now, I guess?

@9prady9
Member

9prady9 commented Mar 16, 2020

This is splitting off from the issue mentioned here:
#1656

The problems mentioned in the issue include:

* no standard fp16 data type
* performance issues for certain hardware supporting native fp16.
* lack of library support.

These issues can be solved by:

* defining a custom 16 bit floating point data type (similar to af_cfloat, af_cdouble)
* using fp16 for storage only, but performing compute in fp32.
* supporting only key functionality to support most general use cases.

The types of functions that can easily be supported this way include:

* [x] All JIT functionality
* [x] All reductions (since they already take in different parameters for compute and storage)
* [ ] All convolutions (same as above)
* [x] Matrix multiplication (using reductions or using a library like cuBLAS)

I have marked the finished items. convolve2NN and convolve2GradientNN are the only convolutions that support the f16 type; the standard convolutions don't have half support yet.
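
For anyone who wants to try it, a small usage sketch assuming an ArrayFire build from master with half support on the active device (only operations listed above as supporting f16 are used):

```cpp
#include <arrayfire.h>

int main() {
    af::array a = af::randu(256, 256);     // f32
    af::array b = af::randu(256, 256);

    af::array a16 = a.as(f16);             // store as half
    af::array b16 = b.as(f16);

    af::array c16 = af::matmul(a16, b16);  // matmul has f16 support per the list above
    af::array d16 = a16 * 2.0f + 1.0f;     // element-wise JIT ops also support f16

    af_print(c16(af::seq(3), af::seq(3)));
    af_print(d16(af::seq(3), af::seq(3)));
    return 0;
}
```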

@BA8F0D39

BA8F0D39 commented May 6, 2021

@9prady9
Which functions don't have f16 support?

@9prady9
Member

9prady9 commented May 7, 2021

@BA8F0D39 We don't have a single location where this is listed. However, I can say for sure that only signal processing and matrix algebra functions might have this support. JIT has half support, with the computations done in single precision. Image processing definitely doesn't have half support. That said, most functions that don't support the half type return the appropriate "type not supported" error code.
