Float16 does not work #43

Open · Avmb opened this issue Jun 13, 2017 · 11 comments

Avmb (Collaborator) commented Jun 13, 2017

In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:

Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers...Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected

I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
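
For context on where the NaNs can come from: float16 has a very narrow dynamic range, so values like Adam's default epsilon (typically 1e-8) underflow to zero and large intermediates overflow to inf. A minimal NumPy sketch of the ranges involved (an illustration only, not the training code):

```python
import numpy as np

# float16 range: max ~65504, smallest normal ~6.1e-5
print(np.finfo(np.float16).max)   # ~65504
print(np.finfo(np.float16).tiny)  # ~6.1e-05

# A typical Adam epsilon of 1e-8 underflows to zero in float16,
# so increasing it only helps if the new value stays representable.
print(np.float16(1e-8))   # 0.0
print(np.float16(1e-4))   # still representable

# Overflow to inf, and inf - inf then yields the NaN seen in the log above.
x = np.float16(60000.0) * np.float16(2.0)
print(x)      # inf
print(x - x)  # nan
```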

emjotde (Collaborator) commented Jun 13, 2017

Hi, it might also not be worth it. If I'm not wrong, float16 is artificially capped to laughable performance in gamer hardware, e.g. the GTX 1080: roughly 30x slower GEMM. Not sure about the Titan X though.

kpu (Collaborator) commented Jun 13, 2017

We'll hopefully get access to some subset of Jade (22 DGX-1s, even though everybody lobbied them to buy normal Pascals) and Peta5 (P100s on PCI Express), and Azure has a private beta for Pascals. Totally worth it for those.

emjotde (Collaborator) commented Jun 13, 2017

Oh. In that case carry on :)

Avmb (Collaborator, Author) commented Jun 14, 2017

Using a learning rate 10 times smaller prevents the NaN, though I still get that strange warning (only during training).

In terms of speed, training is slightly faster on our machines; I will try to benchmark on a P100 if I get the chance. I didn't measure accuracy.

emjotde (Collaborator) commented Jun 14, 2017

Interesting. Thing is, it should not be faster: FP16 arithmetic is severely capped. We benchmarked cuBLAS hgemm vs. sgemm on a GTX 1080 once; it was slower by a factor of 28x. And from what I read, that's intentional.
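
If anyone wants to reproduce that kind of number on a specific card from within Theano, here is a rough timing sketch. It assumes the gpuarray backend is active and that T.dot for the given dtype actually dispatches to cuBLAS {h,s}gemm, which may not hold for float16 on every Theano version, so treat the result as indicative only:

```python
import time
import numpy as np
import theano
import theano.tensor as T

def gemm_time(dtype, n=2048, reps=20):
    # Compile a plain matrix product for the requested dtype.
    A = T.matrix('A', dtype=dtype)
    B = T.matrix('B', dtype=dtype)
    f = theano.function([A, B], T.dot(A, B))
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    f(a, b)  # warm-up: compilation and first kernel launch
    start = time.time()
    for _ in range(reps):
        f(a, b)  # copying the result back to the host forces synchronization
    return (time.time() - start) / reps

for dt in ('float32', 'float16'):
    print(dt, gemm_time(dt), 'seconds per 2048x2048 GEMM')
```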

Avmb (Collaborator, Author) commented Jun 14, 2017

Maybe Theano is doing something smart, or accidentally smart (like not using float16 for some Ops because it hasn't been implemented for them yet).

emjotde (Collaborator) commented Jun 14, 2017

Yeah, maybe on the CPU as well? Are float16 operations faster on our CPUs?

kpu (Collaborator) commented Jun 14, 2017

Current Intel CPUs have a float16 storage format but no float16 operations. So there's an instruction to read a 16-bit float and expand it to a 32-bit float, and then you do the usual multiply or add instruction.
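
In NumPy terms, the same "store in half, compute in single" pattern looks roughly like this (a conceptual sketch only; the mechanism kpu describes is the CPU's 16-to-32-bit load/convert instruction, not anything NumPy does explicitly):

```python
import numpy as np

# Keep the big parameter matrix in float16: this halves memory and bandwidth.
weights_fp16 = np.random.rand(4096, 4096).astype(np.float16)
x_fp32 = np.random.rand(4096).astype(np.float32)

# Widen to float32 right before the arithmetic, mirroring the
# "read 16-bit float, expand to 32-bit, then multiply/add" pattern.
y = np.dot(weights_fp16.astype(np.float32), x_fp32)
```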

hieuhoang commented Jun 15, 2017 via email

Avmb (Collaborator, Author) commented Jun 19, 2017

I ran more training benchmarks, including some on a Tesla P100 (thanks to Università di Pisa), and the result is that there is no noticeable speed difference between float32 and float16.
The Theano backend probably still does not properly exploit float16, and it does not even seem to handle it well in terms of numerical stability (I got NaNs for some hyperparameter settings).

As for the difference between the P100 and the TITAN X (Pascal), the TITAN X is actually equally fast or slightly faster, except when training with float64 (which is probably not very useful). I've tried full-size models (--dim_word 512 --dim 1024) and batch sizes up to 256, and still got roughly the same speed across the different machines.

hieuhoang commented

Feedback from my own work with fp16 in amun: when running on a P100 (Wilkes), it gives about a 20% speedup over fp32. Most of the speedup is in the large matrix multiplication at the output layer.

I'm about to try again to speed up the rest of the code (element-wise operations, etc.), which requires much more work.
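
A back-of-the-envelope FLOP count shows why the output-layer GEMM is where most of the time goes; the dimensions below are hypothetical (dim=1024, vocab=50000), not taken from the amun setup discussed here:

```python
# Rough per-token FLOP estimate for an RNN decoder (illustrative numbers only).
dim, vocab = 1024, 50000  # hypothetical hidden size and target vocabulary

# One GRU step: roughly 3 gates, each with two dim x dim matrix-vector products.
gru_flops = 3 * 2 * (2 * dim * dim)

# Output layer: one dim x vocab matrix-vector product feeding the softmax.
output_flops = 2 * dim * vocab

print('GRU step     : %.1f MFLOPs' % (gru_flops / 1e6))
print('output layer : %.1f MFLOPs' % (output_flops / 1e6))
print('output share : %.0f%%' % (100.0 * output_flops / (gru_flops + output_flops)))
```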
