Adding CUDA support? #3
Hi,

Is there any plan to add CUDA support in the near future? It would be very useful for training medium-sized networks, and very attractive on platforms like the Tegra TK1. Libraries like Caffe and MXNet rely on too many other libraries, and resolving conflicts among those dependencies during installation can consume a lot of time.

Comments
I am thinking about CUDA. Perhaps simply replacing sgemm with cuBLAS and convolution with cuDNN alone could already give a significant performance boost. However, I can't promise when I will get to this issue.
Are there still plans for this? Would pull requests be considered? I assume something simple to start with would be to add another kad_sgemm_simple that wraps cublasSgemm, just like there is already the choice between an sgemm using CBLAS and the sgemm implemented within kautodiff.*. Something like the sketch below:
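A minimal sketch of such a wrapper, assuming the same contract as kann's kad_sgemm_simple (C += op(A) * op(B), row-major); the name kad_sgemm_simple_cuda is invented here and error checking is omitted:

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Hypothetical cuBLAS backend for kad_sgemm_simple: C += op(A) * op(B), row-major.
 * cuBLAS is column-major, so we compute C^T = op(B)^T * op(A)^T instead. */
void kad_sgemm_simple_cuda(int trans_A, int trans_B, int M, int N, int K,
                           const float *A, const float *B, float *C)
{
	static cublasHandle_t handle;
	static int initialized = 0;
	const float alpha = 1.0f, beta = 1.0f; /* beta=1 accumulates into C, like the CBLAS path */
	int lda = trans_A? M : K, ldb = trans_B? K : N;
	float *dA, *dB, *dC;
	if (!initialized) cublasCreate(&handle), initialized = 1;
	cudaMalloc((void**)&dA, (size_t)M * K * sizeof(float));
	cudaMalloc((void**)&dB, (size_t)K * N * sizeof(float));
	cudaMalloc((void**)&dC, (size_t)M * N * sizeof(float));
	cudaMemcpy(dA, A, (size_t)M * K * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dB, B, (size_t)K * N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dC, C, (size_t)M * N * sizeof(float), cudaMemcpyHostToDevice);
	cublasSgemm(handle, trans_B? CUBLAS_OP_T : CUBLAS_OP_N, trans_A? CUBLAS_OP_T : CUBLAS_OP_N,
	            N, M, K, &alpha, dB, ldb, dA, lda, &beta, dC, N);
	cudaMemcpy(C, dC, (size_t)M * N * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(dA), cudaFree(dB), cudaFree(dC);
}
```

A real backend would create the handle once at startup and reuse device buffers across calls; allocating and copying on every call, as above, is exactly where the transfer overhead discussed below comes from.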
Adding OpenCL support would also be useful, but I don't know of a GEMM implementation that is as straightforward as cublasSgemm. I know one of the objectives of kann is to be lightweight and portable, so adding this functionality might be too much? Especially since, for more complex models, the big frameworks would be a more natural choice.
Training and tuning a network is very convenient with one of the heavy libraries, but when deploying models on a CPU-only platform it would be very nice to take advantage of the integrated graphics card. If the main branch is kept simple and lightweight, it would still be nice to have CUDA and OpenCL support live in another fork, IMHO.
It would be interesting to hear what @attractivechaos thinks about this. I tested cublasSgemm vs cblas_sgemm and, as expected, you only start getting performance gains on very large matrices: there is too much overhead in data transfer between host and GPU. Of course, my implementation might not be the best. By the time your models are that large, maybe going with the big frameworks is better.
Dear contributor @blackball: well, besides that, Intel MKL doesn't run on ARM hardware.
Thank you all. I am not familiar with CUDA. I have heard that with CUDA, moving data between CPU and GPU can be costly, so I wonder how much speedup CUDA will deliver. The sgemm implementation in kann optionally uses SSE and is reasonably efficient: it is several times faster than most naive implementations (see my blog post). OpenBLAS et al. are about twice as fast, but their multiplication function alone may have more lines of code than the entire kann. Also, kann can optionally call the BLAS sgemm API, so you can link kann to MKL (kautodiff.c, lines 900 to 906 in f9cc24b).
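For reference, a sketch of what that CBLAS-backed wrapper looks like (reconstructed here, so it may not match the pinned commit exactly):

```c
#ifdef HAVE_CBLAS
#include <cblas.h>
/* C += op(A) * op(B) in row-major order, delegated to the linked CBLAS (e.g. MKL or OpenBLAS) */
void kad_sgemm_simple(int trans_A, int trans_B, int M, int N, int K, const float *A, const float *B, float *C)
{
	cblas_sgemm(CblasRowMajor, trans_A? CblasTrans : CblasNoTrans, trans_B? CblasTrans : CblasNoTrans,
	            M, N, K, 1.0f, A, trans_A? M : K, B, trans_B? K : N, 1.0f, C, N);
}
#endif
```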
Actually, the 2D CNN is probably the slowest part in kann. It would be good to call external libraries for that part. Unfortunately, kann follows Theano's shape convention, which is probably not used often these days; I am not sure which libraries support Theano's shape now.
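For concreteness, Theano's shape here means 4D input tensors laid out as (mini-batch, channel, height, width). A small illustration of an input node in that layout, modeled on kann's MNIST CNN example:

```c
#include "kann.h" /* pulls in kautodiff.h */

/* Build a CNN input node in the (mini-batch, channel, height, width) layout,
 * here for 28x28 grayscale images; the batch dimension is adjusted at runtime. */
kad_node_t *make_input(void)
{
	kad_node_t *t = kad_feed(4, 1, 1, 28, 28);
	t->ext_flag |= KANN_F_IN; /* mark the node as the network input */
	return t;
}
```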
Indeed, the transfer between host and device is massive (99% of the time on my system); the actual computation of sgemm is insanely fast. My system: GNU/Linux with gcc 9.3.0, Ryzen 3700X (Zen 2), GTX 1050 Ti, CUDA toolkit 11 with driver 460.39, OpenBLAS 0.3.8.

For OpenBLAS + kad_sgemm SSE:

For MKL sgemm + kad_sgemm SSE:
MKL performed quite poorly; I assume it's because I am using an AMD CPU, so screw Intel!!!! Once data lives on the GPU, computation is extraordinarily fast, but the overhead of data transfer makes using CUDA for matrix multiplication alone unfeasible. On the other hand, if all data (at least the data needed for the forward pass and backpropagation) already resides on the GPU, significantly larger networks could be trained. This could be achieved with something like the sketch below:
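A minimal, self-contained illustration of the idea (not kann code): upload the operands once, chain many sgemms on the device, and copy back only the final result, so the PCIe transfer cost is amortized over the whole computation:

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
	const int n = 1024, iters = 100;
	const float alpha = 1.0f, beta = 0.0f;
	size_t bytes = (size_t)n * n * sizeof(float);
	float *hA = (float*)calloc((size_t)n * n, sizeof(float));
	float *dA, *dB, *dC;
	cublasHandle_t handle;
	cublasCreate(&handle);
	cudaMalloc((void**)&dA, bytes);
	cudaMalloc((void**)&dB, bytes);
	cudaMalloc((void**)&dC, bytes);
	cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice); /* upload once */
	cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);
	for (int i = 0; i < iters; ++i) /* all intermediate results stay on the GPU */
		cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
	cudaMemcpy(hA, dC, bytes, cudaMemcpyDeviceToHost); /* download once */
	cudaFree(dA), cudaFree(dB), cudaFree(dC);
	cublasDestroy(handle);
	free(hA);
	return 0;
}
```

Compile with nvcc (or a C compiler plus -lcudart -lcublas). Paying one host-to-device round trip per multiplication instead is what makes the GEMM-only approach unprofitable on small matrices.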
Of course, this means a non-trivial amount of code would need to be added, defeating a core principle of KANN, which is to be small, simple, lean and mean.
Wow, OpenBLAS is 71 times faster than kann's own implementation. Several years ago it was only twice as fast on my even older machine. I need to revisit matrix multiplication on more recent CPUs at some point. Anyway, users have the option to link against OpenBLAS or other BLAS implementations with make CBLAS=/path/to/cblas/root. I have just added that to the Getting Started section of the README. Thanks!