Parameter_Readme.md

Systolic-CNN supports multiple CNN layer operations, namely convolution, LRN, average and max pooling, ReLU, and EltWise.

Convolution kernels

The convolution operation is divided into four kernels: three single-threaded kernels (memrd, mask_read, and memwrite) that read data from and write data to memory, and an autorun convolution kernel that performs the convolution itself. As the names suggest, memrd reads the input feature map, mask_read reads the weights, and memwrite writes the final results to memory.

Architectural parameters of device kernel

From a system architecture perspective, pe_num defines the number of PEs in the 1-D systolic array, which performs temporally parallel convolution in a deep pipeline. Each PE computes the convolution for a different OFM while sharing the same input feature map (IFM) data in a shifted fashion. Thus, pe_num also defines the parallelism of output feature map (OFM) generation.

reuse_fac defines the parallelism of the inner product (IP) units inside each PE, as well as how many times the same IFM data is reused by each PE for the convolution computation within the same OFM. Increasing reuse_fac improves computational throughput without changing the amount of off-chip memory access needed to read the IFMs, which relaxes the off-chip memory bandwidth requirement and improves off-chip memory bandwidth efficiency.

vec_fac defines the SIMD width of the partial IP computation between the weight vector and the IFM vector across vec_fac different channels inside each IP unit in each PE. Thus, vec_fac and reuse_fac also define the parallelism of IFM computation along the channel and the row dimension of the IFMs, respectively. In addition, the size of the shift-register-based IFM buffer is reuse_fac times vec_fac.

These three parameters allow users to efficiently explore the architecture design space to maximize the resource utilization of a given FPGA board, subject to the available off-chip memory bandwidth.

From an algorithmic perspective, pe_num, vec_fac, and reuse_fac can be interpreted as the unrolling factors of the for loops along the depth of the OFM, the depth (channel dimension) of the IFM, and the row dimension of the IFM, respectively. It should be noted that the system architecture of Systolic-CNN depends only on these three architectural parameters, which are completely invariant to the CNN model. This invariance is the key to enabling the run-time flexibility needed to handle dynamic workloads in a multi-tenancy edge computing environment.
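The loop-unrolling interpretation can be sketched in Python. This is an illustrative model, not the device code: the function names, the stride-1 convolution without padding, and the assumption that reuse_fac tiles the x direction within a row are all choices made for this sketch. The three innermost loops stand for work that the hardware would perform in parallel.

```python
def conv_reference(ifm, weights):
    """Naive convolution. ifm[c][y][x], weights[m][c][ky][kx]; stride 1, no padding."""
    C, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    ofm = [[[0] * OW for _ in range(OH)] for _ in range(M)]
    for m in range(M):
        for y in range(OH):
            for x in range(OW):
                ofm[m][y][x] = sum(
                    ifm[c][y + ky][x + kx] * weights[m][c][ky][kx]
                    for c in range(C) for ky in range(K) for kx in range(K))
    return ofm


def conv_tiled(ifm, weights, pe_num, vec_fac, reuse_fac):
    """Same convolution with the loop nest tiled by the three architectural parameters."""
    C, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    M, K = len(weights), len(weights[0][0])
    OH, OW = H - K + 1, W - K + 1
    ofm = [[[0] * OW for _ in range(OH)] for _ in range(M)]
    for m0 in range(0, M, pe_num):               # OFM depth, pe_num at a time
        for y in range(OH):
            for x0 in range(0, OW, reuse_fac):   # row positions, reuse_fac at a time
                for c0 in range(0, C, vec_fac):  # IFM channels, vec_fac at a time
                    for ky in range(K):
                        for kx in range(K):
                            # The three loops below model the unrolled hardware.
                            for pe in range(pe_num):
                                for r in range(reuse_fac):
                                    for v in range(vec_fac):
                                        m, x, c = m0 + pe, x0 + r, c0 + v
                                        if m < M and x < OW and c < C:
                                            ofm[m][y][x] += (ifm[c][y + ky][x + kx]
                                                             * weights[m][c][ky][kx])
    return ofm
```

Both functions produce the same output for any parameter values; only the order and grouping of the multiply-accumulate operations change, which is why the architecture can stay fixed across CNN models.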

Intel OpenCL channels

Intel OpenCL channels transfer the data read from external memory by the mask_read and memread kernels to the convolution kernel, and send the final results to the memwrite kernel. Channel depths are chosen based on the initial report generated by the OpenCL compiler and can be changed to observe the impact on resources and performance. Channels mainly consume RAM blocks. Struct-typed channels are used to transfer large amounts of data from one kernel to another.

Host kernel parameters

To generalize the convolution, all parameters are sent from the host side. Some parameters are common to all kernels and are explained below:

  1. Input_height : The output channel width divided by pe_num.

  2. Window : Number of times the weights need to be shifted along the x dimension of the image. It is given by the equation: window = ceil(image_x_dimension / (stride * reuse_fac))

  3. Window2 : Number of times the weights need to be shifted along the y dimension of the image. It is given by the equation: window2 = ceil(image_y_dimension / stride)

  4. Maskheight : Defines the y dimension of the weights. For a fully connected (FC) layer, maskheight is set to 1 so that the same kernels can be used as for convolution.

  5. Maskwidth : Defines the x dimension of the weights. For an FC layer, maskwidth is set to input_dimension / 16 so that the same kernels can be used as for convolution, where 16 is the vec_fac parameter, which is fixed in the given design.

  6. Kernel_width : Determined by the channel dimension of the input feature map and vec_fac. It is given by the equation: kernel_width = input_feature_map_channel_dimension / 16
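The equations in the list above can be collected into a small host-side helper. The sketch below is only an illustration: the function and argument names are hypothetical, the layer sizes in the usage example are made up, and it assumes the channel dimensions are already multiples of pe_num and vec_fac so that integer division is exact.

```python
import math

VEC_FAC = 16  # fixed to 16 in the described design


def common_conv_params(out_channels, in_channels, image_x, image_y,
                       stride, pe_num, reuse_fac):
    """Compute the parameters shared by all kernels for a convolution layer."""
    return {
        "input_height": out_channels // pe_num,
        "window": math.ceil(image_x / (stride * reuse_fac)),
        "window2": math.ceil(image_y / stride),
        "kernel_width": in_channels // VEC_FAC,
    }


def fc_mask_params(input_dimension):
    """Mask dimensions that let an FC layer reuse the convolution kernels."""
    return {"maskheight": 1, "maskwidth": input_dimension // VEC_FAC}


# Hypothetical layer: 64 output channels, 32 input channels, 28x28 image, stride 1.
params = common_conv_params(out_channels=64, in_channels=32, image_x=28, image_y=28,
                            stride=1, pe_num=16, reuse_fac=4)
```

For this hypothetical layer the helper yields input_height = 4, window = 7, window2 = 28, and kernel_width = 2.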

The following sections give detailed information about the parameters related to each kernel.

Memrd kernel features and parameters

The memrd kernel reads the input feature map for three major operations: convolution, FC, and max pooling. Data is always read along the channel dimension of the input feature map; that is, the 16 data values read from memory per cycle lie along the channel dimension, and this read window moves along the x dimension up to the maskwidth parameter. A shift-register-based buffer is used in the memrd kernel to reuse data that has already been read. The effective reduction in memory access is given by the equation: memory_access_reduction = pe_num * reuse_fac * vec_fac.

The only difference between FC and convolution is explained above in section 1.1.2. To distinguish pooling from convolution, the pool parameter is defined, which is 1 for a pooling layer and 0 for a convolution layer. The other parameters used are:

  1. InputWidth : It is the x dimension of the input feature map

  2. InputWidth2 : It is the y dimension of the input feature map

  3. Stride_conv : Determines the stride along the y dimension of the input feature map and is given by the equation: stride_conv = actual_stride

  4. Stride_conv1 : Determines the stride along the x dimension of the input feature map and is given by the equation: stride_conv1 = (actual_stride > reuse_fac) ? actual_stride : reuse_fac

  5. Window_check : This parameter is used to enable support for the grouped convolution and is given by the equation: window_check = input_channel_dimension/group_parameter
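The stride and grouping parameters above follow directly from the layer description. A minimal sketch, with a hypothetical function name and made-up layer values, assuming the input channel dimension divides evenly by the group parameter:

```python
def memrd_params(actual_stride, reuse_fac, input_channel_dimension, group_parameter=1):
    """Derive the memrd-specific parameters from the equations in the list above."""
    stride_conv = actual_stride
    # Along x the read window advances by at least reuse_fac positions.
    stride_conv1 = actual_stride if actual_stride > reuse_fac else reuse_fac
    window_check = input_channel_dimension // group_parameter
    return {
        "stride_conv": stride_conv,
        "stride_conv1": stride_conv1,
        "window_check": window_check,
    }


# Hypothetical grouped layer: stride 1, reuse_fac 4, 96 input channels in 2 groups.
grouped = memrd_params(actual_stride=1, reuse_fac=4,
                       input_channel_dimension=96, group_parameter=2)
```

Here stride_conv1 becomes 4 because reuse_fac exceeds the actual stride, and window_check becomes 48, the number of input channels in each group.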

The memrd kernel sends data only to the first convolution kernel or to the max_pooling kernel. Because the convolution adopts a 1-D systolic array architecture, a chain of channels passes the input feature map data from one convolution kernel to the next. This chain starts at the memrd kernel.

Mask_read kernel features and parameters

The mask_read kernel reads the weights from the memory banks and sends them to the processing elements that perform the convolution. For convolution, the number of weights required per cycle is given by the following equation: number_of_weights = vec_fac * pe_num. As mentioned earlier, vec_fac is fixed to 16 in our design.

Convolution is the more computationally expensive operation and is not limited by the available memory bandwidth. However, the same kernel is also used for the fully connected layer, which is memory-intensive, so the number of weights read from memory is limited to vec_fac * pe_num / 4. The weights stored for FC are quantized, and 4 weights are concatenated with each other on the host side. So, for example, when 4 sets of weights are read from memory by the device code, they can be converted into 16 different sets of weights, as shown in lines 459-481 of the code. These 16 sets of weights can then be transferred to 16 different PEs. This quantization process is used only for AlexNet, to mitigate the memory bandwidth limitation. For ResNet-50 and other larger CNNs we have not used quantization, although it can still be applied to them.

To differentiate between FC and convolution layers, maskHeight is fixed to 1 for FC, as in the memrd kernel. The parameter that enables the quantization logic is FC: it must be set to 1 for convolution and for unquantized FC, and to 0 for quantized FC. The other parameters used in the mask_read kernel, such as input_height, window, window2, maskwidth, and kernel_width, have the same definitions as in the memrd kernel.
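The idea of concatenating four quantized weights on the host and splitting them again on the device can be sketched as follows. The bit width is an assumption made for this example (the text does not state it), the function names are hypothetical, and the actual device logic is the one in the referenced code lines, not this sketch.

```python
def pack4(weights, bits=2):
    """Host side: concatenate four unsigned `bits`-wide weights into one word.

    The 2-bit default is an assumption for illustration only.
    """
    assert len(weights) == 4
    word = 0
    for i, w in enumerate(weights):
        assert 0 <= w < (1 << bits)
        word |= w << (i * bits)
    return word


def unpack4(word, bits=2):
    """Device side: split one packed word back into its four weights."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(4)]


def weights_read_per_cycle(vec_fac, pe_num, quantized_fc):
    """Number of weight words fetched from memory per cycle."""
    n = vec_fac * pe_num
    return n // 4 if quantized_fc else n
```

With vec_fac = 16 and pe_num = 16, a convolution layer fetches 256 weights per cycle while a quantized FC layer fetches 64 packed words, each of which expands to four weights.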

Convolution kernel features and parameters

The convolution kernels perform the MAC operations on the data received from the memrd and mask_read kernels and send the final result to the mem_write kernel. Each convolution kernel is an autorun kernel, so it starts on its own and does not need to be launched from the host side. It is replicated pe_num times using the num_compute_units attribute of the Intel OpenCL compiler, which generates the 16 processing elements (PEs) of the convolution kernel. Because our design uses a 1-D systolic array, each PE forwards the input feature map to the adjacent PE, while each PE receives a different set of weights. Each PE also uses an adder-tree structure to accumulate the final convolution result and send it to the mem_write kernel, as can be seen in lines 684-688 of the code. Every parameter that a PE requires to perform convolution or FC is sent through a channel from the memread kernel, so the definitions of parameters such as input_height, kernel_width, mask_width, and mask_height remain the same as in the memread kernel.
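The data movement described above can be modeled with a small cycle-by-cycle simulation. This is a behavioral sketch under simplifying assumptions, not the kernel code: the function names are hypothetical, each PE holds exactly one IFM vector per cycle, and channels are modeled as plain register shifts.

```python
def adder_tree(values):
    """Sum a list by pairwise reduction, mimicking a balanced adder tree."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad odd levels with a zero input
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0


def systolic_chain(ifm_vectors, pe_weights):
    """Simulate a 1-D chain of PEs.

    ifm_vectors: list of T IFM vectors entering the chain at PE 0.
    pe_weights[p][t]: weight vector that PE p applies to the t-th IFM vector.
    Returns one accumulated result per PE.
    """
    pe_num, total = len(pe_weights), len(ifm_vectors)
    regs = [None] * pe_num   # IFM vector currently held by each PE
    seen = [0] * pe_num      # number of vectors each PE has consumed
    acc = [0] * pe_num
    for cycle in range(total + pe_num - 1):
        # Each PE forwards its vector to the next PE; PE 0 reads a new one.
        for p in range(pe_num - 1, 0, -1):
            regs[p] = regs[p - 1]
        regs[0] = ifm_vectors[cycle] if cycle < total else None
        for p in range(pe_num):
            if regs[p] is not None:
                w = pe_weights[p][seen[p]]
                acc[p] += adder_tree([a * b for a, b in zip(regs[p], w)])
                seen[p] += 1
    return acc
```

PE p sees the t-th vector at cycle t + p, so the whole chain finishes after T + pe_num - 1 cycles while the IFM data is read from memory only once.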

Memwrite kernel features and parameters

The memwrite kernel receives the final convolution/FC results from the PEs. The total size of the results received by the memwrite kernel is given by the following equation: total_size_results = pe_num * reuse_fac * data_bit_size. Collecting the results from the different PEs generates a MUX. To reduce the size of this output MUX, the input data in the memread kernel is read along the channel dimension of the input feature map each cycle (more information is available in the thesis). Bias addition and the sum layer (EltWise) are performed in this kernel. The bias4 memory buffer is used for the sum layer, and the bias buffer is used for adding the bias to the convolution result. The other parameters used in the memwrite kernel determine the location where the output feature maps are stored in memory for use by the following layers. Those parameters are defined below:

  1. outputWidth : Determines the output dimension of the output feature maps.

  2. pad : Used if the next layer requires padding of the feature maps.

  3. stride_write : Each PE generates reuse_fac results, but because of the stride some of these results may not be usable. This parameter determines how many of the reuse_fac outputs are actually written to memory. It is equal to the stride of the given convolution layer.
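The result-size equation and the bias and EltWise steps can be illustrated with a short sketch. The function names, the order in which bias and the sum-layer operand are added, and the 8-bit data width in the usage example are assumptions made for illustration, not details taken from the kernel.

```python
def total_size_results(pe_num, reuse_fac, data_bit_size):
    """Total size in bits of the results collected from all PEs."""
    return pe_num * reuse_fac * data_bit_size


def postprocess(conv_results, bias, eltwise=None):
    """Add the bias and, for a sum (EltWise) layer, a second operand.

    `eltwise` models values read from the bias4 buffer; pass None to skip it.
    """
    out = [r + bias for r in conv_results]
    if eltwise is not None:
        out = [o + e for o, e in zip(out, eltwise)]
    return out


# Hypothetical configuration: 16 PEs, reuse_fac of 4, 8-bit data.
bits = total_size_results(pe_num=16, reuse_fac=4, data_bit_size=8)
```

For this hypothetical configuration the memwrite kernel collects 512 bits of results.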