A parametric RTL code generator of an efficient integer MxM Systolic Array implementation for Xilinx FPGAs.
This repository is an evil cousin to Libano's Systolic Array Generator, with error detection capabilities.
This repository is also part of an IEEE Transactions on Reliability paper that is currently under review.
In a systolic array, computation proceeds rhythmically: at every clock cycle, input data is pumped in and output data is pumped out. The term systolic is therefore a reference to the functioning of a biological heart[1].
There are a number of mathematical operations that can be implemented using systolic arrays, but the one in this project is a weight-stationary matrix multiplier. Nowadays, systolic arrays are the architectural core of state-of-the-art neural network accelerators, such as Google's TPU[2] and Xilinx's DPU[3].
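
As a rough illustration of the weight-stationary dataflow described above, the following Python sketch models such an array cycle by cycle. It is not the generated RTL: the function name `systolic_matmul`, the direction of data movement (activations left to right, partial sums top to bottom), and the input skewing are assumptions made for this example.

```python
def systolic_matmul(x, w):
    """Cycle-level model of an N x N weight-stationary array computing x @ w.

    x is a list of T rows with N values each, w is an N x N list of lists.
    PE (k, j) permanently holds w[k][j] (assumed mapping, for illustration).
    """
    n = len(w)
    t_rows = len(x)
    a_reg = [[0] * n for _ in range(n)]  # activation register of each PE
    p_reg = [[0] * n for _ in range(n)]  # partial-sum register of each PE
    y = [[0] * n for _ in range(t_rows)]

    # The last result leaves the bottom-right PE after T + 2N - 2 cycles.
    for cycle in range(t_rows + 2 * n - 2):
        new_a = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for k in range(n):
            for j in range(n):
                if j == 0:
                    # Array row k is fed x[t][k], delayed (skewed) by k cycles.
                    t = cycle - k
                    a_in = x[t][k] if 0 <= t < t_rows else 0
                else:
                    a_in = a_reg[k][j - 1]  # activation from the left neighbour
                p_in = p_reg[k - 1][j] if k > 0 else 0  # partial sum from above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * w[k][j]  # multiply-accumulate
        a_reg, p_reg = new_a, new_p

        # Column j of the bottom row emits y[t][j] at cycle t + (N - 1) + j.
        for j in range(n):
            t = cycle - (n - 1) - j
            if 0 <= t < t_rows:
                y[t][j] = p_reg[n - 1][j]
    return y


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this model the weights never move, which is what "weight stationary" refers to; only activations and partial sums travel between neighbouring PEs.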
This implementation uses an 8-bit integer representation for the inputs, which allows two multiplications to be executed simultaneously in a single DSP[4]. Furthermore, a time-multiplexing scheme is employed on the DSPs[5][6], allowing them to run twice as fast as the rest of the logic. Thus, overall, each DSP is able to execute four 8-bit integer multiplications per clock cycle of the surrounding logic. The adders responsible for accumulation are implemented with CLB[7][8] elements, such as LUTs and carry chains.
Hence, the Processing Elements (PEs) that constitute the array are multiply-accumulate (MAC) units.
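
The idea of executing two 8-bit multiplications in one DSP can be sketched in plain Python. This is a minimal model of the arithmetic only, assuming two signed 8-bit operands `a` and `b` that share a third signed 8-bit operand `c`, and assuming a field spacing of 18 bits (`SHIFT`). The function names are hypothetical, and the actual operand widths, port mapping, and correction logic in the generated RTL may differ.

```python
SHIFT = 18  # assumed spacing between the two packed fields (illustrative)


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def packed_dual_multiply(a, b, c):
    """Return (a * c, b * c) using a single wide multiplication."""
    packed = (a << SHIFT) + b  # both operands share one multiplier input
    product = packed * c       # equals ((a * c) << SHIFT) + (b * c)

    low = to_signed(product, SHIFT)  # lower field holds b * c
    # A negative lower product borrows 1 from the upper field,
    # so it is added back when extracting a * c.
    high = (product >> SHIFT) + (1 if low < 0 else 0)
    return high, low


if __name__ == "__main__":
    print(packed_dual_multiply(-128, 127, -128))
```

Combined with the double-rate DSP clock, two such packed products per cycle of the surrounding logic account for the four multiplications per DSP mentioned above.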
Given a systolic array of size NxN:
- DSPs: N² DSP48E1[5] or DSP48E2[6] (one per PE)
- Operations/Cycle: 8N² (N² PEs, 2x2xMUL + 4xADD per PE)
- Frequency: Depends mostly on the target device, but can also depend on N
  - 14x14 @ XC7Z020 @ 200MHz
  - 32x32 @ XCZU9 @ 300MHz
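
The operations-per-cycle formula translates into a peak throughput figure once a clock frequency is fixed. The helper names below are hypothetical, and the results are theoretical peaks derived from the formula and frequencies listed above, not measured performance.

```python
def ops_per_cycle(n):
    """Operations per cycle for an n x n array: 4 MUL + 4 ADD per PE."""
    return 8 * n * n


def peak_gops(n, freq_mhz):
    """Theoretical peak throughput in giga-operations per second."""
    return ops_per_cycle(n) * freq_mhz / 1000.0


if __name__ == "__main__":
    # The two configurations listed above.
    for n, freq_mhz in ((14, 200), (32, 300)):
        print(f"{n}x{n} @ {freq_mhz}MHz: {peak_gops(n, freq_mhz):.1f} GOPS peak")
```

Under these assumptions the 14x14 array peaks at about 313.6 GOPS and the 32x32 array at about 2457.6 GOPS.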
- : Relevant repository documentation.
- : Python script for generating RTL (edit 'settings.py', run 'main.py', import '/RTL/import_me/*').
- : OOC Vivado projects, scripts, and reports for synth/place/route of 14x14/32x32 arrays on 7000/US+.
- [1]H. T. Kung et al., "Systolic Arrays (for VLSI)"
- [2]N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit"
- [3]Xilinx, "Zynq DPU Product Guide"
- [4]M. Vestias et al., "Parallel Dot-Products for Deep Learning on FPGA"
- [5]Xilinx, "7 Series DSP48E1 Slice User Guide"
- [6]Xilinx, "UltraScale Architecture DSP Slice User Guide"
- [7]Xilinx, "7 Series Configurable Logic Block User Guide"
- [8]Xilinx, "UltraScale Architecture Configurable Logic Block User Guide"