Float16 does not work #43

Open · Avmb opened this issue Jun 13, 2017 · 11 comments

Avmb (Collaborator) commented Jun 13, 2017

In this branch, I removed all hardcoded references to float32 and I tried to train with float16, but it does not work:

Using cuDNN version 5105 on context None
Mapped name None to device cuda0: TITAN X (Pascal) (0000:02:00.0)
Loading data
Building model
Building sampler
Building f_init... Done
Building f_next.. Done
Building f_log_probs... Done
Computing gradient... Done
Building optimizers...Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
Done
Total compilation time: 198.4s
Optimization
Seen 846 samples
NaN detected

I've also tried increasing the epsilon in the Adam optimizer, but it doesn't solve the issue.
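
For context on where the NaNs can come from: float16 has a very narrow dynamic range, so values like Adam's default epsilon (typically 1e-8) underflow to zero and large intermediates overflow to inf. A minimal NumPy sketch of the ranges involved (an illustration only, not the training code):

```python
import numpy as np

# float16 range: max ~65504, smallest normal ~6.1e-5
print(np.finfo(np.float16).max)   # ~65504
print(np.finfo(np.float16).tiny)  # ~6.1e-05

# A typical Adam epsilon of 1e-8 underflows to zero in float16,
# so increasing it only helps if the new value stays representable.
print(np.float16(1e-8))   # 0.0
print(np.float16(1e-4))   # still representable

# Overflow to inf, and inf - inf then yields the NaN seen in the log above.
x = np.float16(60000.0) * np.float16(2.0)
print(x)      # inf
print(x - x)  # nan
```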

emjotde (Collaborator) commented Jun 13, 2017

Hi, it might also not be worth it. If I'm not wrong, float16 is artificially capped to laughable performance in gamer hardware, e.g. the GTX 1080: roughly 30x slower GEMM. Not sure about the Titan X though.

kpu (Collaborator) commented Jun 13, 2017

We'll hopefully get access to some subset of Jade (22 DGX-1s, even though everybody lobbied them to buy normal Pascals) and Peta5 (P100s on PCI Express), and Azure has a private beta for Pascals. Totally worth it for those.

emjotde (Collaborator) commented Jun 13, 2017

Oh. In that case carry on :)

Avmb (Collaborator, Author) commented Jun 14, 2017

Using a learning rate 10 times smaller prevents the NaN, though I still get that strange warning (only during training).

In terms of speed, training is slightly faster on our machines; I will try to benchmark on a P100 if I get the chance. I didn't measure accuracy.

emjotde (Collaborator) commented Jun 14, 2017

Interesting. Thing is, it should not be faster: FP16 arithmetic is severely capped. We benchmarked cuBLAS hgemm vs. sgemm on a GTX 1080 once; it was slower by a factor of 28x. And from what I read, that's intentional.
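
If anyone wants to reproduce that kind of number on a specific card from within Theano, here is a rough timing sketch. It assumes the gpuarray backend is active and that T.dot for the given dtype actually dispatches to cuBLAS {h,s}gemm, which may not hold for float16 on every Theano version, so treat the result as indicative only:

```python
import time
import numpy as np
import theano
import theano.tensor as T

def gemm_time(dtype, n=2048, reps=20):
    # Compile a plain matrix product for the requested dtype.
    A = T.matrix('A', dtype=dtype)
    B = T.matrix('B', dtype=dtype)
    f = theano.function([A, B], T.dot(A, B))
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    f(a, b)  # warm-up: compilation and first kernel launch
    start = time.time()
    for _ in range(reps):
        f(a, b)  # copying the result back to the host forces synchronization
    return (time.time() - start) / reps

for dt in ('float32', 'float16'):
    print(dt, gemm_time(dt), 'seconds per 2048x2048 GEMM')
```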

Avmb (Collaborator, Author) commented Jun 14, 2017

Maybe Theano is doing something smart, or accidentally smart (like not using float16 for some Ops because it hasn't been implemented for them yet).

emjotde (Collaborator) commented Jun 14, 2017

Yeah, maybe on the CPU as well? Are float16 operations faster on our CPUs?

kpu (Collaborator) commented Jun 14, 2017

Current Intel CPUs have a float16 storage format but no float16 operations. So there's an instruction to read a 16-bit float and expand it to a 32-bit float, and then you do the usual multiply or add instruction.
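
In NumPy terms, the same "store in half, compute in single" pattern looks roughly like this (a conceptual sketch only; the mechanism kpu describes is the CPU's 16-to-32-bit load/convert instruction, not anything NumPy does explicitly):

```python
import numpy as np

# Keep the big parameter matrix in float16: this halves memory and bandwidth.
weights_fp16 = np.random.rand(4096, 4096).astype(np.float16)
x_fp32 = np.random.rand(4096).astype(np.float32)

# Widen to float32 right before the arithmetic, mirroring the
# "read 16-bit float, expand to 32-bit, then multiply/add" pattern.
y = np.dot(weights_fp16.astype(np.float32), x_fp32)
```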

hieuhoang commented Jun 15, 2017 via email

Avmb (Collaborator, Author) commented Jun 19, 2017

I ran more training benchmarks, including some on a Tesla P100 (thanks to Università di Pisa), and the result is that there is no noticeable speed difference between float32 and float16.
The Theano backend probably still does not properly exploit float16, and it does not even seem to handle it well in terms of numerical stability (I got NaNs for some hyperparameter settings).

As for the difference between the P100 and the TITAN X (Pascal), the TITAN X is actually equally fast or slightly faster, except when training with float64 (which is probably not very useful). I've tried full-size models (--dim_word 512 --dim 1024) and batch sizes up to 256, and still got roughly the same speed across the different machines.

hieuhoang commented

Feedback from my own work with fp16 in amun: when running on a P100 (Wilkes), it gives about a 20% speedup over fp32. Most of the speedup is in the large matrix multiplication at the output layer.

I'm about to try again to speed up the rest of the code (element-wise operations, etc.), which requires much more work.
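
A back-of-the-envelope FLOP count shows why the output-layer GEMM is where most of the time goes; the dimensions below are hypothetical (dim=1024, vocab=50000), not taken from the amun setup discussed here:

```python
# Rough per-token FLOP estimate for an RNN decoder (illustrative numbers only).
dim, vocab = 1024, 50000  # hypothetical hidden size and target vocabulary

# One GRU step: roughly 3 gates, each with two dim x dim matrix-vector products.
gru_flops = 3 * 2 * (2 * dim * dim)

# Output layer: one dim x vocab matrix-vector product feeding the softmax.
output_flops = 2 * dim * vocab

print('GRU step     : %.1f MFLOPs' % (gru_flops / 1e6))
print('output layer : %.1f MFLOPs' % (output_flops / 1e6))
print('output share : %.0f%%' % (100.0 * output_flops / (gru_flops + output_flops)))
```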
