
Conv2D_CFU

Conv2D acceleration using the CFU Playground framework, Mini-Project, Jan - May 2022

This repository holds the code and experiments for the project Conv2D Acceleration using the CFU Playground framework. It is a fork of:

CFU-Playground

Brief Description

The convolution operation accounts for a major chunk of the cycles spent in Deep Neural Networks, and accelerating it can reduce the inference time. A convolution exposes several aspects of parallelism, which are to be coupled with a good dataflow to gain the benefits of memory re-use. Several frameworks exist to accelerate the convolution operation, but most of them end up being designed in isolation, or tend to accelerate the whole network on hardware, where the CPU core merely handles the initial data transfer. CFU Playground enables the development of an accelerator in an integrated SoC environment, solving the storage and network bottlenecks that might arise when designing in isolation, while at the same time significant operations are still performed on the VexRiscv core. The accelerator, called a CFU (Custom Function Unit), is invoked from the TFLite kernels using macros, and since a given kernel can be re-used for multiple networks, this offers more flexibility in terms of hardware.
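To make the invocation path concrete, below is a minimal sketch of how a TFLite kernel hands work to a CFU. It assumes the cfu_op0(funct7, rs1, rs2) macro from the framework's cfu.h; the opcode mapping used here (funct7 = 0 to reset, 1 to multiply-accumulate) is hypothetical, since the real mapping is defined by each accelerator's gateware:

```c
#include <stdint.h>
#include "cfu.h"  // CFU Playground's software-side custom-instruction macros

// Hypothetical opcode mapping, for illustration only:
// funct7 == 0 resets the CFU accumulator, funct7 == 1 multiply-accumulates.
static inline int32_t cfu_dot(const int32_t *a, const int32_t *b, int n) {
  int32_t acc = cfu_op0(0, 0, 0);  // reset the CFU-side accumulator
  for (int i = 0; i < n; ++i) {
    acc = cfu_op0(1, a[i], b[i]);  // one custom instruction per step
  }
  return acc;
}
```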

Setting up the environment

The documentation of the framework provides clear guidelines to set up the environment. Most of the dependencies are open source; the only proprietary toolchain needed is Xilinx Vivado. The following link guides the user through building the environment:

Setup Guide

Getting Started and Software Baseline

A few examples from the source were implemented using Renode simulation, and these could help in getting started with the framework. The files are available at:

Renode Examples

However, as Renode is not cycle accurate, in practice either Verilator or an actual FPGA is to be used. A software baseline for an MNIST Neural Network was developed using the TFLite model, and it was profiled to identify the bottlenecks. It was further analysed to identify aspects of parallelism as well as possibilities for input data re-use. The code is made available at:

Software Baseline
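For context, the bottleneck the profiling points at has the shape below. This is a simplified sketch of a 3x3, stride-1 Conv2D (the layouts, and the omitted padding and quantisation details, are illustrative rather than the exact TFLite kernel); nearly all baseline cycles are spent in the innermost MAC statement:

```c
#include <stdint.h>

// Simplified 3x3, stride-1 Conv2D. Layouts assumed: input HWC,
// filter OHWI. Nearly all cycles go to the innermost MAC statement.
void conv2d_3x3(const int8_t *input, const int8_t *filter, int32_t *output,
                int in_h, int in_w, int in_depth, int out_depth) {
  const int out_h = in_h - 2, out_w = in_w - 2;
  for (int oy = 0; oy < out_h; ++oy)
    for (int ox = 0; ox < out_w; ++ox)
      for (int oc = 0; oc < out_depth; ++oc) {
        int32_t acc = 0;
        for (int ky = 0; ky < 3; ++ky)
          for (int kx = 0; kx < 3; ++kx)
            for (int ic = 0; ic < in_depth; ++ic)  // the bottleneck MACs
              acc += input[((oy + ky) * in_w + (ox + kx)) * in_depth + ic] *
                     filter[((oc * 3 + ky) * 3 + kx) * in_depth + ic];
        output[(oy * out_w + ox) * out_depth + oc] = acc;
      }
}
```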

CFU Hardware Accelerator

The hardware accelerator was developed based on the inferences drawn, and was optimised iteratively, covering aspects like the cache structure, the degree of data re-use and the amount of parallelism, until significant performance was obtained. The accelerator was placed and routed on a Nexys4 Artix-7 FPGA, and was successfully tested. The code for the hardware accelerator is available at:

Accelerator
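The key operation the CFU offloads can be summarised as a behavioural C sketch (not the actual gateware): one custom instruction consumes four int8 input values and four int8 filter values, packed into two 32-bit operands, and accumulates their products along the input depth:

```c
#include <stdint.h>

// Behavioural model of one SIMD MAC step: four int8 x int8
// multiply-accumulates per call, with operands packed into 32-bit
// words. The real CFU does this in gateware in a single instruction;
// this sketch only shows the arithmetic (input offsets omitted).
int32_t simd_mac(int32_t acc, uint32_t in_packed, uint32_t filt_packed) {
  for (int i = 0; i < 4; ++i) {
    int8_t in_val   = (int8_t)(in_packed   >> (8 * i));
    int8_t filt_val = (int8_t)(filt_packed >> (8 * i));
    acc += (int32_t)in_val * (int32_t)filt_val;
  }
  return acc;
}
```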

Results and Conclusions

  • The framework CFU-Playground was reviewed in contrast to other existing frameworks, and its particular advantages, such as the integrated SoC environment and the significant use of the CPU core beyond the initial data transfer, have been exploited in this work.

  • The Conv2D operation for the 3x3 case in an MNIST Neural Network was analysed, and the Software Baseline was set up to check for bottlenecks. The software baseline for the given network took 335 M cycles to execute on a VexRiscv core placed and routed on a Nexys4 Artix-7 FPGA. The Conv2D operations consumed 334 M of those cycles, and within Conv2D the MAC operations were the bottleneck, executing for 310 M cycles. The code was unrolled to reduce the loop overheads and gain from the spatial locality of the cache, which brought the cycle count down to 220 M cycles (see the unrolling sketch after this list).

  • Using methods for parallelism such as SIMD accumulation along the input depth, parallel computation of independent strides, and input data re-use both between strides and across multiple output channels, an accelerator was built. The integrated core, when synthesised, had a critical path of 9.446 ns in comparison to 8.785 ns for the bare core, indicating a minimal increase owing to the integrated SoC environment. The inference took 15 M cycles on this integrated core.

  • The cache of the core was slightly modified to lower the number of bytes per line, and on synthesis the critical path improved to 9.338 ns. The network took 13 M cycles to execute on this integrated core.

  • The overall speed-up obtained was 26x for the base network, and when tested for kernel re-use on a larger network the speed-up obtained was 33x. This assumes that the baseline and the integrated accelerator run at the same clock, which is reasonable since the difference in critical paths is minimal. Thus, the framework provides a better environment for the development of accelerators, and it was used here to accelerate the 3x3 Conv2D kernels of an MNIST Neural Network with a significant reduction in cycles.
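The unrolling mentioned in the second bullet looked roughly like the sketch below (the actual baseline code lives in the linked Software Baseline directory). The innermost depth loop is unrolled by four, assuming the input depth is a multiple of four, to cut loop overhead and exploit the spatial locality of consecutive int8 loads:

```c
#include <stdint.h>

// Innermost depth loop unrolled by 4 (in_depth assumed to be a
// multiple of 4): fewer branches per MAC, and sequential accesses
// that land in the same cache line.
int32_t dot_unrolled(const int8_t *in, const int8_t *filt, int in_depth) {
  int32_t acc = 0;
  for (int ic = 0; ic < in_depth; ic += 4) {
    acc += (int32_t)in[ic + 0] * filt[ic + 0];
    acc += (int32_t)in[ic + 1] * filt[ic + 1];
    acc += (int32_t)in[ic + 2] * filt[ic + 2];
    acc += (int32_t)in[ic + 3] * filt[ic + 3];
  }
  return acc;
}
```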

References
