Skip to content

tirumalnaidu/opencl-hls-cnn-accelerator

Repository files navigation

About

We designed a Neural Network Accelerator for Darknet Reference Model (which is 2.9 times faster than AlexNet and attains the same top-1 and top-5 performance as AlexNet but with 1/10th the parameters) for image classification on Imagenet Dataset.

Table of Contents

Board

Requirements

Files

  • pytorch_model - We used a CNN based on Darknet Framework. So, we had to implemented the model in PyTorch Framework to check the results and collect the model parameters
  • pyopencl_model - To simulate and verify the kernels we wrote in OpenCL, we used PyOpenCL package and it worked with same accuracy as PyTorch model and acheived about 20x speed than PyTorch model.
  • model - This folder contains the pre-trained model parameters of darknet reference model of each layer in seperate txt file.

CNN Architecture

Layer Filters Kernel Size Stride Pad Input Size Output Size
1 conv 16 3 x 3 1 1 256 x 256 x 3 256 x 256 x 16
2 max - 2 x 2 2 0 256 x 256 x 16 128 x 128 x 16
3 conv 32 3 x 3 1 1 128 x 128 x 16 128 x 128 x 32
4 max - 2 x 2 2 0 128 x 128 x 32 64 x 64 x 32
5 conv 64 3 x 3 1 1 64 x 64 x 32 64 x 64 x 64
6 max - 2 x 2 2 0 64 x 64 x 64 32 x 32 x 64
7 conv 128 3 x 3 1 1 32 x 32 x 64 32 x 32 x 128
8 max - 2 x 2 2 0 32 x 32 x 128 16 x 16 x 128
9 conv 256 3 x 3 1 1 16 x 16 x 128 16 x 16 x 256
10 max - 2 x 2 2 0 16 x 16 x 256 8 x 8 x 256
11 conv 512 3 x 3 1 1 8 x 8 x 256 8 x 8 x 512
12 max - 2 x 2 2 0 8 x 8 x 512 4 x 4 x 512
13 conv 1024 3 x 3 1 1 4 x 4 x 512 4 x 4 x 1024
14 avg - 4 x 4 1 0 4 x 4 x 1024 1 x 1 x 1024
15 conv 1000 1 x 1 1 0 1 x 1 x 1024 1 x 1 x 1000

Results

Conv 0  time: 35.898 ms                                                         
Conv 2  time: 79.748 ms                                                         
Conv 4  time: 79.439 ms                                                         
Conv 6  time: 79.442 ms                                                         
Conv 8  time: 79.418 ms                                                         
Conv 10 time: 79.411 ms                                                         
Conv 12 time: 79.404 ms                                                         
Conv 14 time: 17.319 ms                                                         
Total Convolution time: 530.079 ms

Batchnorm 0   time: 143.092 ms                                                  
Batchnorm 2   time: 73.007 ms                                                   
Batchnorm 4   time: 21.486 ms                                                   
Batchnorm 6   time: 5.504 ms                                                    
Batchnorm 8   time: 2.479 ms                                                    
Batchnorm 10  time: 1.259 ms                                                    
Batchnorm 12  time: 0.641 ms                                                    
Batchnorm 14  time: 0.052 ms                                                    
Total Batchnorm time: 247.520 ms   

Maxpool 1  time: 78.848 ms                                                      
Maxpool 3  time: 31.823 ms                                                      
Maxpool 5  time: 8.991 ms                                                       
Maxpool 7  time: 2.890 ms                                                       
Maxpool 9  time: 1.486 ms                                                       
Maxpool 11  time: 0.719 ms                                                      
Maxpool 13  time: 0.286 ms                                                      
Total Pooling time: 125.042 ms                                                  
                                                                                
Total Time: 902.642 ms                                                            
                                                                                
Label   : Egyptian cat                                                          
Accuracy: 35.796 % 

Resource Usage

Kernel ALUTs FFs RAMs DSPs
conv 27822 28705 144 58
batch norm 9949 12211 93 10
pool 8247 10211 36 24
conv1x1 12184 15087 102 17
Total 61872 (56%) 73190 (33%) 405 (79%) 109 (97%)

Planned Improvements

We can further improve the throughput of the accelerator by converting the model to fixed point (8-bit or 16-bit) and pipelining the accelerator by using Intel channels and pipes extension.

License

MIT License