
Adding CUDA support ? #3

Open
blackball opened this issue Apr 28, 2017 · 8 comments

Comments

@blackball
Contributor

blackball commented Apr 28, 2017

Hi,

Is there any plan to add CUDA support in the near future? It would be very useful for training medium-sized networks, and it would also be very attractive on platforms like the Tegra TK1. Libraries like Caffe and MXNet rely on too many other libraries, and resolving their conflicts during installation can sometimes consume too much time.

@attractivechaos
Owner

I am thinking about CUDA. Perhaps simply replacing sgemm with cuBLAS and the convolution with cuDNN could already give a significant performance boost. However, I can't promise when I will get to this issue.

@7PintsOfCherryGarcia

Are there still plans for this? Would pull requests be considered? I assume something simple to start with would be adding another kad_sgemm_simple that wraps cublasSgemm, just like there is already the option between an sgemm using BLAS and the sgemm implemented within kautodiff.*.

Something like:

```c
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#include "cublas_v2.h"
// other preprocessor statements needed for CUDA-C interaction
void kad_sgemm_simple(...)
{
    // GPU memory allocation
    // memcpy from host to GPU
    cublasSgemm(...);
    // memcpy from GPU to host
}
#endif
```

Adding OpenCL support would also be useful, but I don't know of a GEMM implementation that is as straightforward as cublasSgemm. I know one of the objectives of kann is to be lightweight and portable, so adding this functionality might be too much, especially when, for more complex models, the big frameworks would be the more natural choice.
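
Filling in that sketch, a minimal self-contained version of the wrapper might look like the following. This assumes the same kad_sgemm_simple signature and the accumulate-into-C semantics of the existing CBLAS path, omits all error checking, and creates a cuBLAS handle per call (in practice the handle should probably be created once and reused):

```c
#ifdef HAVE_CUDA
#include <cuda_runtime.h>
#include "cublas_v2.h"

/* Sketch only: row-major C(MxN) += op(A)(MxK) * op(B)(KxN), written for cuBLAS's
 * column-major convention as C^T = op(B)^T * op(A)^T. */
void kad_sgemm_simple(int trans_A, int trans_B, int M, int N, int K, const float *A, const float *B, float *C)
{
	const float alpha = 1.0f, beta = 1.0f; /* beta = 1 so the result accumulates into C, as in the CBLAS path */
	float *d_A, *d_B, *d_C;
	cublasHandle_t handle;
	cublasCreate(&handle);
	/* GPU memory allocation */
	cudaMalloc((void**)&d_A, (size_t)M * K * sizeof(float));
	cudaMalloc((void**)&d_B, (size_t)K * N * sizeof(float));
	cudaMalloc((void**)&d_C, (size_t)M * N * sizeof(float));
	/* memcpy from host to GPU (C as well, because it is accumulated into) */
	cudaMemcpy(d_A, A, (size_t)M * K * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_B, B, (size_t)K * N * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_C, C, (size_t)M * N * sizeof(float), cudaMemcpyHostToDevice);
	cublasSgemm(handle, trans_B? CUBLAS_OP_T : CUBLAS_OP_N, trans_A? CUBLAS_OP_T : CUBLAS_OP_N,
	            N, M, K, &alpha, d_B, trans_B? K : N, d_A, trans_A? M : K, &beta, d_C, N);
	/* memcpy from GPU back to host */
	cudaMemcpy(C, d_C, (size_t)M * N * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
	cublasDestroy(handle);
}
#endif
```

The swapped operand order and leading dimensions are the usual trick of computing the transposed product so that a row-major caller can use a column-major BLAS.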

@blackball
Contributor Author

@7PintsOfCherryGarcia

Training and tuning a network is very convenient with a heavy library. But when deploying models on a CPU-only platform, it would be very nice if we could take advantage of the integrated graphics card. If the main branch is kept simple and lightweight, it would still be nice to have the CUDA and OpenCL features live in another fork, IMHO.

@7PintsOfCherryGarcia

It would be interesting to hear what @attractivechaos thinks about this. I tested cublasSgemm vs. cblas_sgemm and, as expected, you only start getting performance gains on very large matrices: there is too much overhead from data transfer between host and GPU. Of course, my implementation might not be the best. By the time your models are that large, maybe going with the big frameworks is better.
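
As a rough way to see where the time goes, a small stand-alone program along these lines (arbitrary matrix sizes; this is only an illustration, not kann code and not the test file attached further down) times the host-to-device transfer, the cublasSgemm call, and the copy back separately:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

static double now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
	const int M = 5000, K = 2000, N = 3000;       /* arbitrary sizes for illustration */
	const float alpha = 1.0f, beta = 0.0f;
	float *A = (float*)calloc((size_t)M * K, sizeof(float));
	float *B = (float*)calloc((size_t)K * N, sizeof(float));
	float *C = (float*)calloc((size_t)M * N, sizeof(float));
	float *dA, *dB, *dC;
	cublasHandle_t h;
	double t0, t1, t2, t3;

	cublasCreate(&h);
	t0 = now_ns();
	/* allocation + host-to-device transfer */
	cudaMalloc((void**)&dA, (size_t)M * K * sizeof(float));
	cudaMalloc((void**)&dB, (size_t)K * N * sizeof(float));
	cudaMalloc((void**)&dC, (size_t)M * N * sizeof(float));
	cudaMemcpy(dA, A, (size_t)M * K * sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(dB, B, (size_t)K * N * sizeof(float), cudaMemcpyHostToDevice);
	t1 = now_ns();
	/* the multiplication itself: column-major C^T = B^T * A^T, i.e. row-major C = A * B */
	cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K, &alpha, dB, N, dA, K, &beta, dC, N);
	cudaDeviceSynchronize();                      /* cublasSgemm is asynchronous; wait before reading the clock */
	t2 = now_ns();
	/* device-to-host transfer + cleanup */
	cudaMemcpy(C, dC, (size_t)M * N * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(dA); cudaFree(dB); cudaFree(dC);
	t3 = now_ns();
	printf("to GPU %.0f ns, sgemm %.0f ns, from GPU %.0f ns\n", t1 - t0, t2 - t1, t3 - t2);

	cublasDestroy(h);
	free(A); free(B); free(C);
	return 0;
}
```

Splitting the timing like this makes it obvious how much of the total is pure host-device traffic rather than computation.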

@james34602

james34602 commented Mar 30, 2021

Dear contributor @blackball,
Simply use the MKL routines if MKL is detected in the compiler environment; their matrix-multiplication performance is a beast and essentially overhead-free.

That said, Intel MKL doesn't run on ARM hardware.
Calling the matrix-multiplication routine of a linear algebra library is in general better than a naive sgemm implementation.
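
As a sketch of what that could look like (HAVE_MKL here is a hypothetical macro, not something kann currently defines; kann's existing code path keys on HAVE_CBLAS), the header choice could be made at compile time while keeping the existing cblas_sgemm wrapper unchanged, since MKL declares the same CBLAS interface:

```c
/* Hypothetical sketch: choose the CBLAS provider by build flag. MKL's <mkl.h>
 * declares cblas_sgemm and the CblasRowMajor/CblasTrans enums, so the existing
 * kad_sgemm_simple wrapper body would work as-is. */
#if defined(HAVE_MKL)
#include <mkl.h>
#elif defined(HAVE_CBLAS)
#include <cblas.h>
#endif
```

Linking would then follow the usual MKL link line, as in the benchmark commands further down in this thread.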

@attractivechaos
Owner

attractivechaos commented Mar 31, 2021

Thank you all. I am not familiar with CUDA, but I have heard that moving data between the CPU and GPU can be costly, so I wonder how much speedup CUDA would actually deliver.

The sgemm implementation in kann optionally uses SSE and is reasonably efficient: it is several times faster than most naive implementations (see my blog post). OpenBLAS et al. are about twice as fast, but their multiplication function alone may have more lines of code than the entire kann. Also, kann can optionally call the BLAS sgemm API, so you can link kann to MKL:

kann/kautodiff.c, lines 900 to 906 in f9cc24b:

#ifdef HAVE_CBLAS
#include <cblas.h>
void kad_sgemm_simple(int trans_A, int trans_B, int M, int N, int K, const float *A, const float *B, float *C)
{
	cblas_sgemm(CblasRowMajor, trans_A? CblasTrans : CblasNoTrans, trans_B? CblasTrans : CblasNoTrans, M, N, K, 1.0f, A, trans_A? M : K, B, trans_B? K : N, 1.0f, C, N);
}
#else

Actually, the 2D CNN is probably the slowest part of kann, so it would be good to call external libraries for that part. Unfortunately, kann follows Theano's shape convention, which is probably not used often these days, and I am not sure which libraries still support it.

@7PintsOfCherryGarcia

Indeed, the transfer between host and device is massive (99% of the time on my system); the actual computation of the sgemm is insanely fast.
I have attached a file that performs sgemm between A (5000x2000) and B (3000x2000) with 1. cuBLAS, 2a. OpenBLAS, 2b. MKL, 3a. naive kad_sgemm, 3b. SSE kad_sgemm.

My system: GNU/Linux with gcc 9.3.0, a Ryzen 3700X, a GTX 1050 Ti, CUDA Toolkit 11 with driver 460.39, and OpenBLAS 0.3.8. Assuming OpenBLAS is installed in a system directory and can be located during linking, I get:
test.txt

For openblas + naive kad_sgemm:

gcc -Wall -Wextra -DOPENBLAS -I"PATH/TO/YOUR/CUDA"/include/ -L"PATH/TO/YOUR/CUDA"/lib/ -O2 -o test test.c  -lcudart -lcublas -lopenblas
./test $RANDOM
Starting routines
    CUBLAS:
            time for GPU allocation and transfer 222329394 ns
            time cublas matmut 35090 ns
            time GPU deallocation and transfer 54446963 ns
            cublas total time 276838378 ns
    OPENBLAS:
            openblas 73790051 ns
    kad_sgemm:
            kad_sgemm_simple 15404507226 ns

For openblas + kad_sgemm SSE:

gcc -Wall -Wextra -DOPENBLAS -DSSE -I"PATH/TO/YOUR/CUDA"/include/ -L"PATH/TO/YOUR/CUDA"/lib/ -O2 -o test test.c  -lcudart -lcublas -lopenblas
./test $RANDOM
Starting routines
    CUBLAS:
            time for GPU allocation and transfer 216707647 ns
            time cublas matmut 33340 ns
            time GPU deallocation and transfer 57922556 ns
            cublas total time 274691993 ns
    OPENBLAS:
            openblas 76218451 ns
    kad_sgemm:
            with SSE
            kad_sgemm_simple 5419872690 ns

For mkl sgemm + kad_sgemm SSE:

gcc -Wall -Wextra -DMKL -DMKLBLAS  -m64 -I"/PATH/TO/YOUR/CUDA"/include/ -I"PATH/TO/YOUR/MKL"/include -L"PATH/TO/YOUR/CUDA"/lib/ -L"/PATH/TO/YOUR/MKL"/lib/intel64  -O2 -o test test.c -lcudart -lcublas -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core
./test $RANDOM
Starting routines
    CUBLAS:
            time for GPU allocation and transfer 213338323 ns
            time cublas matmut 34549 ns
            time GPU deallocation and transfer 42650421 ns
            cublas total time 256048673 ns
    MKLBLAS:
            mklblas 761791581 ns
    kad_sgemm:
            with SSE
            kad_sgemm_simple 5486498838 ns    

MKL performed quite poorly; I assume it's because I am using an AMD CPU, so screw Intel!!!!

Once data lives on the GPU, computation is extraordinarily fast, but the transfer overhead makes using CUDA only for matrix multiplication unfeasible. On the other hand, if all the data (at least the data needed for the forward pass and backpropagation) already resides on the GPU, significantly larger networks could be trained. This could be achieved with something like:

kad_node_t *kann_new_leaf_array(int *offset, kad_node_p *par, uint8_t flag, float x0_01, int n_d, int32_t d[KAD_MAX_DIM])
{
    int i, len, off = offset && par? *offset : -1;
    kad_node_t *p;

    if (off >= 0 && par[off]) return par[(*offset)++];
    p = (kad_node_t*)calloc(1, sizeof(kad_node_t));
    p->n_d = n_d, p->flag = flag;
    memcpy(p->d, d, n_d * sizeof(int32_t));
    len = kad_len(p);
    //******************************************
    // Use CUDA runtime routines if the model will run on a GPU
#ifdef USE_CUDA
    cudaError_t cudaStat;
    cudaStat = cudaMalloc((void**)&p->x, len * sizeof(*p->x));
    // check for errors
    cudaStat = cudaMemset((void*)p->x, 0, len * sizeof(*p->x));
    // check for errors
#else
    // default to system memory
    p->x = (float*)calloc(len, sizeof(float));
#endif
    //******************************************
    if (p->n_d <= 1) {
        for (i = 0; i < len; ++i)
            p->x[i] = x0_01;
    } else {
        double sdev_inv;
        sdev_inv = 1.0 / sqrt((double)len / p->d[0]);
        for (i = 0; i < len; ++i)
            //******************************************
            // Deal with this: under USE_CUDA, p->x is device memory and cannot be written directly from the host
            p->x[i] = (float)(kad_drand_normal(0) * sdev_inv);
            //******************************************
    }
    if (off >= 0) par[off] = p, ++(*offset);
    return p;
}
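
One possible way to handle the part marked "Deal with this" (an assumption on my side, not something settled here) is to draw the random values into a temporary host buffer and push them to the device allocation in a single copy. The fragment below is meant as a drop-in replacement for that loop, reusing the names (p, len, i, sdev_inv, kad_drand_normal) from the sketch above:

```c
#ifdef USE_CUDA
	/* initialize on the host, then transfer to the device buffer in one cudaMemcpy */
	float *tmp = (float*)malloc(len * sizeof(float));
	for (i = 0; i < len; ++i)
		tmp[i] = (float)(kad_drand_normal(0) * sdev_inv);
	cudaMemcpy(p->x, tmp, len * sizeof(float), cudaMemcpyHostToDevice);
	free(tmp);
#else
	for (i = 0; i < len; ++i)
		p->x[i] = (float)(kad_drand_normal(0) * sdev_inv);
#endif
```

The same treatment would be needed for the x0_01 branch and anywhere else the host code writes into p->x directly.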

Of course, this means a non-trivial amount of code would need to be added, defeating a core principle of KANN, which is to stay small, simple, lean and mean.

@attractivechaos
Owner

attractivechaos commented Mar 31, 2021

Wow, OpenBLAS is 71 times faster than kann's implementation. Several years ago it was only about twice as fast on my (even older) machine, so I need to revisit matrix multiplication on more recent CPUs at some point. Anyway, users have the option to link against OpenBLAS or other BLAS implementations with

make CBLAS=/path/to/cblas/root

I have just added that to the Getting Started section of README. Thanks!
