

# **Exposing Hardware Building Blocks to Machine Learning Frameworks**

Thesis Submitted in partial fulfillment of the requirements of **BITS F421T Thesis**

By

**Yash Akhauri**

**ID No. (2016A8TS0142P)**

Under the supervision of

**Dr. Surekha Bhanot  
Dr. Yaman Umuroğlu  
Nicholas Fraser  
Michaela Blott**



**BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE PILANI, PILANI CAMUS**

01 December 2019

## CERTIFICATE

This is to certify that the Thesis entitled, **Exposing Hardware Building Blocks to Machine Learning Frameworks** and submitted by ID No. **2016A8TS0142P** in partial fulfillment of the requirement of BITS F421T Thesis embodies the work done by him under my supervision.

**Date:** 01 December 2019

Signature



Name Dr. Yaman Umuroğlu

Designation Machine Learning Researcher

# Exposing Hardware Building Blocks to Machine Learning Frameworks

Bachelor Thesis

Yash Akhauri

01st December 2019



Submitted in partial fulfillment of the requirements  
for the degree of Bachelor of Engineering

to the

Faculty of Electronics and Instrumentation  
at Birla Institute of Technology and Science

|              |                          |
|--------------|--------------------------|
| 1st Reviewer | Prof. Dr. Surekha Bhanot |
| 2nd Reviewer | Dr. Yaman Umuroglu       |
| 3rd Reviewer | Nicholas Fraser          |
| 4th Reviewer | Michaela Blott           |

#### IMPRINT

*Exposing Hardware Building Blocks to Machine Learning Frameworks*  
Copyright © 2019 by Yash Akhauri.  
All rights reserved. Printed in India.  
Published by the Birla Institute of Technology and Science.

#### COLOPHON

This thesis was typeset using  $\text{\LaTeX}$  and the `memoir` documentclass. It is based on Aaron Turon's thesis *Understanding and expressing scalable concurrency*<sup>1</sup>, itself a mixture of `classicthesis`<sup>2</sup> by André Miede and `tufte-latex`<sup>3</sup>, based on Edward Tufte's *Beautiful Evidence*.

The bibliography was processed by Biblatex. All graphics and plots are made with PGF/TikZ.

The body text is set 10/14pt (long primer) on a 26pc measure. The margin text is set 8/9pt (brevier) on a 12pc measure. Matthew Carter's Charter acts as both the text and display typeface. Monospaced text uses Jim Lyles's `Bitstream Vera Mono` ("Bera Mono").

<sup>1</sup><https://people.mpi-sws.org/~turon/turon-thesis.pdf>

<sup>2</sup><https://bitbucket.org/amiede/classicthesis/>

<sup>3</sup><https://github.com/Tufte-LaTeX/tufte-latex>

*If we knew what it was we were doing,  
it would not be called research, would it?*

—Albert Einstein



## Abstract

---

There are a plethora of applications that demand high throughput and low latency algorithms leveraging machine learning methods. This need for real time processing can be seen in industries ranging from developing neural network based pre-distortors [Tar+19] for enhanced mobile broadband to designing FPGA-based triggers [Dua+18b] in major scientific efforts by CERN for particle physics. In this thesis, we explore how niche domains can benefit vastly if we look at neurons as a unique boolean function of the form  $f : B^I \rightarrow B^O$ , where  $B = \{0, 1\}$ . We focus on how to design topologies that complement such a view of neurons, how to automate such a strategy of neural network design, and inference of such networks on Xilinx FPGAs.

Major hardware borne constraints arise when designing topologies that view neurons as unique boolean functions. Fundamentally, realizing such topologies on hardware asserts a strict limit on the 'fan-in' bits of a neuron due to the doubling of permutations possible with every increment in input bit-length. We address this limit by exploring different methods of implementing sparsity and explore activation quantization. Further, we develop a library that supports training a neural network with custom sparsity and quantization. This library also supports conversion of trained Sparse Quantized networks from PyTorch to VERILOG code which is then synthesized using Vivado, all of which is part of the LogicNet tool-flow. To aid faster prototyping, we also support calculation of the worst-case hardware cost of any given topology.

We hope that our insights into the behavior of extremely sparse quantized neural networks are of use to the research community and by extension allow people to use the LogicNet design flow to deploy highly efficient neural networks. We refer the reader to [Umu+20b] and [Umu+20a] to read more about this thesis.

[Tar+19] Tarver et al., *Design and Implementation of a Neural Network Based Predistorter for Enhanced Mobile Broadband*

[Dua+18b] Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics"

[Umu+20b] Umuroglu et al., *LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications*

[Umu+20a] Umuroglu et al., "High-Throughput DNN Inference with LogicNets"

# *Contents*

---

|                                                            |           |
|------------------------------------------------------------|-----------|
| ABSTRACT                                                   | v         |
| CONTENTS                                                   | vi        |
| LIST OF ALGORITHMS                                         | vii       |
| LIST OF FIGURES                                            | vii       |
| LIST OF TABLES                                             | viii      |
| <b>I BACKGROUND</b>                                        | <b>1</b>  |
| 1 INTRODUCTION                                             | 3         |
| 1.1 Field-Programmable Gated Arrays . . . . .              | 4         |
| 1.2 Deep Neural Networks . . . . .                         | 5         |
| 2 HARDWARE-SOFTWARE CODESIGN                               | 7         |
| 2.1 Fan-in, Neurons and LUTs. . . . .                      | 8         |
| 3 MAPPING NEURONS TO HARDWARE                              | 9         |
| 3.1 Sparsity and Quantization . . . . .                    | 10        |
| 3.2 Quantifying Hardware Costs . . . . .                   | 13        |
| 3.3 Research Recess I – Sparsity . . . . .                 | 13        |
| <b>II LOGICNET: A LIBRARY FOR MAPPING HBBS TO NEQS</b>     | <b>15</b> |
| 4 THE LOGICNET DESIGN FLOW AND LIBRARY                     | 17        |
| 4.1 Quantizer . . . . .                                    | 18        |
| 4.2 SparseLinear . . . . .                                 | 18        |
| 4.3 DenseQuantLinear . . . . .                             | 19        |
| 4.4 SparseConv . . . . .                                   | 19        |
| 5 DESIGN AUTOMATION                                        | 21        |
| 5.1 Truth Table Generation . . . . .                       | 21        |
| 5.2 VERILOG Code Generation . . . . .                      | 23        |
| 5.3 Logic Synthesis and Analytical LUT estimates . . . . . | 25        |
| 5.4 Timing Analysis . . . . .                              | 26        |
| 5.5 Research Recess II – Synthesis . . . . .               | 27        |
| 6 LOGICNET4HEP                                             | 29        |
| 7 MNIST                                                    | 35        |
| <b>III CONCLUDING REMARKS</b>                              | <b>39</b> |
| 8 CONCLUSION                                               | 41        |
| BIBLIOGRAPHY                                               | 43        |

## *List of Algorithms*

---

|   |                                            |    |
|---|--------------------------------------------|----|
| 1 | A Pruning Step for a L-Layer MLP . . . . . | 12 |
|---|--------------------------------------------|----|

## *List of Figures*

---

|     |                                                                                                                                                                                                                                                                                                                                                                      |    |
|-----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 1.1 | Example UltraScale+ layout . . . . .                                                                                                                                                                                                                                                                                                                                 | 3  |
| 1.2 | Illustration of a neuron. . . . .                                                                                                                                                                                                                                                                                                                                    | 4  |
| 1.3 | Illustrating Convolution. . . . .                                                                                                                                                                                                                                                                                                                                    | 4  |
| 2.1 | Mapping a 7:1 Look-up Table to 6:1 LUTs. . . . .                                                                                                                                                                                                                                                                                                                     | 8  |
| 2.2 | Mapping a 8:1 Look-up Table to 6:1 LUTs. . . . .                                                                                                                                                                                                                                                                                                                     | 8  |
| 3.1 | Sparsification per neuron. . . . .                                                                                                                                                                                                                                                                                                                                   | 10 |
| 3.2 | Training Pipeline. . . . .                                                                                                                                                                                                                                                                                                                                           | 10 |
| 3.3 | This figure demonstrates the per neuron connectivity for a 3-Layer Neural Network. . . . .                                                                                                                                                                                                                                                                           | 11 |
| 4.1 | The Design Flow for LogicNet . . . . .                                                                                                                                                                                                                                                                                                                               | 17 |
| 4.2 | QuantReLU and QuantHardTanh (1 bit) Respectively. Scaling Factors not depicted. . . . .                                                                                                                                                                                                                                                                              | 18 |
| 4.3 | Depthwise Separable Convolutions in LogicNet. . . . .                                                                                                                                                                                                                                                                                                                | 20 |
| 5.1 | VERILOG Code Generation Sub-Modules. . . . .                                                                                                                                                                                                                                                                                                                         | 22 |
| 6.1 | The LHC big data problem. . . . .                                                                                                                                                                                                                                                                                                                                    | 29 |
| 6.2 | Dimensions of the NAS method. . . . .                                                                                                                                                                                                                                                                                                                                | 30 |
| 6.3 | Discriminating between highly energetic (boosted) $q$ , $g$ , $W$ , $Z$ , $t$ initiated jets. Figure adapted from User: FPGA4HEP on GitHub.                                                                                                                                                                                                                          | 30 |
| 6.4 | Our PFGA4HEP Neural Network Architecture. . . . .                                                                                                                                                                                                                                                                                                                    | 31 |
| 6.5 | Performance of the Deep Neural Network Classifier. The Left figure shows the signal efficiency versus the mis-identification rate for $q$ , $W$ , $Z$ , $t$ , $g$ jet identification. The mis-identification rate is based on an equal admixture of the other non-signal jet types. On the right we show the Normalized Confusion Matrix for the classifier. . . . . | 31 |
| 6.6 | We train a model on the FPGA4HEP dataset and observe the classifiers Receiver Operating Characteristics with respect to the SoftMax Layer. . . . .                                                                                                                                                                                                                   | 32 |
| 6.7 | Increase in Accuracy with increase in LUT cost. . . . .                                                                                                                                                                                                                                                                                                              | 33 |
| 6.8 | Increase in Accuracy with increase in Bit-Width. . . . .                                                                                                                                                                                                                                                                                                             | 33 |
| 7.1 | Analytical LUT cost vs. Accuracy for a 3 Layer MLP on the MNIST Data-Set. . . . .                                                                                                                                                                                                                                                                                    | 36 |
| 7.2 | Accuracy vs. Bit-Width for a 3 Layer MLP on the MNIST Data-Set. 36                                                                                                                                                                                                                                                                                                   | 36 |

## *List of Tables*

---

|     |                                                               |    |
|-----|---------------------------------------------------------------|----|
| 1.1 | FPGA Resource Table . . . . .                                 | 4  |
| 2.1 | Static Mapping Cost . . . . .                                 | 8  |
| 5.1 | Size and Time estimates for VERILOG Generation . . . . .      | 23 |
| 5.2 | Analytical vs. True LUT cost. . . . .                         | 25 |
| 5.3 | Model Resource Costs . . . . .                                | 25 |
| 6.1 | FPGA4HEP Model List . . . . .                                 | 32 |
| 6.2 | FPGA4HEP Main Table . . . . .                                 | 32 |
| 6.3 | Iterative vs A-priori Fixed Sparsity . . . . .                | 34 |
| 7.1 | MNIST MLP Model List . . . . .                                | 36 |
| 7.2 | Accuracy vs Pruning Techniques . . . . .                      | 37 |
| 7.3 | Skip Connections on MLPs . . . . .                            | 37 |
| 7.4 | Bringing Convolutions to LogicNet . . . . .                   | 38 |
| 7.5 | Accuracy and LUT cost of CNNs on the MNIST data-set . . . . . | 38 |
| 7.6 | Skip Connections on MNIST Convolutional models. . . . .       | 38 |

## **Part I**

### **BACKGROUND**



# 1

## Introduction

Artificial Intelligence has transformed our industry in the recent past. With every iteration of every conference we hear a buzz around smaller and faster deployments of neural networks, along with a constant push in making the hardware that supports this boom more efficient. From companies investing millions in specialized accelerators (ASICs) for their specific application to hardware companies transforming their business strategy to keep up with the buzz, more (inference) for less (energy and time) seems to be the holy grail of the coming decade.

This thesis hopes to focus on a novel method of representing neural networks, that lends us advantages during inference. We hope to identify appropriate use cases for this method of developing neural network topologies, and gain insight into the behavior of aggressive sparsification and quantization of neural networks along with the ramifications of design decisions to hardware cost.

*“Nanos gigantum humeris insidentes.”*

—Bernard of Chartres

*“If I have seen further, it is by standing on the shoulders of giants.”*

—Isaac Newton



FIGURE 1.1: Example UltraScale+ layout



FIGURE 1.2: Illustration of a neuron.



FIGURE 1.3: Illustrating Convolution.

### 1.1 FIELD-PROGRAMMABLE GATED ARRAYS

The basic architecture of a Xilinx FPGA is a two-dimensional array of digital logic elements, that are grouped into Configurable Logic Blocks (CLBs). Each CLB is composed of flip-flops, Look-Up Tables (LUTs). These CLBs are connected together with programmable interconnects and switch matrices. Modern FPGAs have many improvements, one of which is larger memory blocks Block RAMs (BRAMs) and Ultra RAMs (URAMs). These larger memory elements are dense and fast, and have many applications. Modern FPGAs also have integrated arithmetic blocks (DSPs) which are capable of multiplication, addition/subtraction, and other logical functions. [Figure 1.1](#) illustrates the resource layout for an UltraScale+ FPGA. At a high level of abstraction, the FPGA device has vertical regions with an array of resources. A large proportion of the FPGA fabric is dedicated to general purpose logic. CLBs (Configurable Logic Blocks) that are composed of LUTs and Flip Flops. The I/O Blocks are arranged in banks and support a wide variety of interfacing standards. The DSP48E2s are Digital Signal Processing slices that support arithmetic, as well as can be used for barrel shifting, pattern detection, and much more.

From [Table 1.1](#), it becomes evident that we have far more LUTs than we have DSP slices. While a DSP slice (*DSP48E2*) has functionality ranging from a 48-bit logic unit to being a  $27 \times 18$  two's complement multiplier, our approach hopes to sparsify and quantize neural networks to an extent where it becomes more beneficial to configure these LUTs to map to neurons. One of FPGA's fundamental building block is a '*K* – LUT', which can perform any *K* input boolean operation. In this thesis, we study how niche domains can benefit vastly by viewing a neuron as an boolean function transform, which maps *X* input bits to *Y* output bits. This is particularly intriguing as we can convert a neuron to a configuration of such '*K* – LUTs' and achieve high throughput.

We propose utilizing the individual *K* – LUTs to perform arbitrary non-linear functions to a fixed number of activation bits. The advantage is that we can have full-precision weights, and the result is discovering a sparse topology with activation quantization.

TABLE 1.1: Resources available in Xilinx UltraScale<sup>TM</sup>FPGAs. Device names with a 'K' prefix refer to the Kintex<sup>®</sup>UltraScale<sup>TM</sup>FPGAs, names with a 'XCV' prefix refer to the Virtex<sup>®</sup>UltraScale<sup>TM</sup>FPGAs.

| Device  | CLB LUTs  | BRAMs (18Kb) | DSP Slices |
|---------|-----------|--------------|------------|
| KU025   | 145,440   | 720          | 1,152      |
| KU060   | 331,680   | 2,160        | 2,760      |
| XCVU065 | 358,080   | 2,520        | 600        |
| KU115   | 663,360   | 4,320        | 5,520      |
| XCVU440 | 2,532,960 | 5,040        | 2,880      |

## 1.2 DEEP NEURAL NETWORKS

A Deep Neural Network can be used for an array of tasks, from classification, segmentation, detection to actually generating new data after learning the semantics of the task. In our thesis, we focus on viewing each neuron as a boolean function, and limit ourselves to identification or classification tasks.

### ► LINEAR LAYERS

We can think of a neuron in a linear layer as a function which applies a linear transformation on an input of dimension  $I$  and results in an output of dimension  $O$ . We can further add a non-linearity to such a layer, like the ReLU activation function. In Figure 1.2, we demonstrate a neuron of fan-in of 3, and has 1 output. From a boolean function perspective, if  $x_i$  is the input, and  $y_0$  is the output, and both the input and output are of 16-bit fixed point data-types, then we can expect the boolean function ' $f$ ' to be  $f : B^{48} \rightarrow B^{16}$ , where  $B = \{0, 1\}$ . This would require a huge truth table, needing around  $4.50 \times 10^{15}$  bits of storage. It is easy to see that such a neuron is unfeasible to implement on any FPGA. We therefore need to aggressively prune the number of inputs to such a neuron, as well as the bit-width of the input and outputs. Later in this document we will describe the relation of input and output bits of a neuron with the hardware cost as well, which will aid us in design decisions.

### ► CONVOLUTIONAL LAYERS

Figure 1.3 represents a simple convolution operation, where the input image has 1 channel, and the filter depth is 1. It is natural to view these filters as neurons, that respond differently to different spatial features. When the filter correlates well with a region of the input image, the response in the corresponding output image is strong. Since the contents of the filter remain the same at inference, we can also represent a filter convolution with simple truth tables. Such a truth table would essentially represent the boolean function  $f : B^{144} \rightarrow B^{16}$ , where  $B = \{0, 1\}$ . This assumes that the Input and Output image pixels are all represented by 16-bit data-types. This is not a fair estimate of the dimensionality of the boolean function due to the fact that we are ignoring the number of input feature maps, which are considerably high in a typical convolutional neural network.

#### 1.2.1 Quantized Neural Networks

Quantized Neural Networks have been a very attractive avenue for researchers in the deep learning community due to the benefits it provides for inference. Many techniques to quantize neural networks have been proposed, which broadly can be classified as deterministic and stochastic quantization. There are three components that are generally quantized. The weights, activations and the gradient.

We see notable inference time improvements in performance when quantizing weight and activations, while the primary benefit of quantizing gradients is to save on communication cost in distributed training. It is worth noting that quantizing gradients is generally not advisable as high precision gra-

[Y.17] Y. Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”

[Wan+19] Wang et al., “LUTNet: Rethinking Inference in FPGA Soft Logic”

[Gha+18] Ghasemzadeh et al., “ReBNet: Residual binarized neural network”

[MT19] Murović and Trost, “Massively Parallel Combinational Binary Neural Networks for Edge Processing”

[Pra+18] Prabhu et al., “Deep expander networks: Efficient deep networks from graph theory”

[RK19] Robinett and Kepner, “RadiX-Net: Structured Sparse Matrices for Deep Neural Networks”

[Wor+19a] Wortsman et al., “Discovering Neural Wirings”

[SM02] Stanley and Miikkulainen, “Evolving neural networks through augmenting topologies”

dients are required to make the optimizer converge.

Ever since quantized neural networks have been explored, FPGA implementations of Neural Network accelerators have become increasingly better at inference with low precision bit-widths for both weights and activations. [Y.17] proposed BNN accelerators with a dataflow-style architecture where processing engines are instantiated for every layer. LUTs were heavily utilized to provide XNOR-Popcounts and accumulate operations and weights and intermediate activations were stored in BRAMs. This research also implemented a folding scheme to allow for larger networks to be mapped to smaller devices. [Wan+19] introduced LUTNet, that took a pruned version of ReBNet [Gha+18] in which some of the XNOR-Popcount operations are mapped more effectively to  $k$ -input LUTs. [MT19] implemented binarized networks which have been fully unrolled and implemented directly into LUTs of a small FPGA. This unrolling allowed the synthesis optimization tool to simplify significant portions of the compute logic.

### 1.2.2 Sparse Neural Networks

In sparse neural networks, each layer of neurons receive input from only a few connections from the previous layer of activation. In typical dense layers, every neuron receives input from every activation from the previous layer. We identify three key methods to sparsify Deep Neural Networks; A Priori Fixed Sparsity, Iterative Pruning and Learned Sparsity. In A Priori Fixed Sparsity, a network is initialized with a certain connectivity pattern, which remains fixed throughout training. Two examples of this approach are Deep Expander Networks [Pra+18] and RadiX-net [RK19]. Iterative pruning methods generally use the magnitude of weights to prune a dense network. Recent methods utilize other metrics for this task well. Learned Sparsity methods such as Discovering Neural Wirings [Wor+19a] and Neuroevolution of Augmented Topologies [SM02] are approaches that use some form of learning to identify important connections, establishing and pruning connections as desired.

# 2

## Hardware/Software Codesign

---

The process of achieving desired functionality on a system by exploring a design space and mapping tasks to resources within system constraints is referred to as Hardware/Software Codesign [Tei12].

### ► IMPLEMENTING NEURAL NETWORKS ON FPGAs

To aid our understanding of how a neural network is optimized for inference on an FPGA, we glance at the HLS-RFNoC workflow. This is an open source workflow for deploying a neural network for inference on an FPGA for RF signal processing [Kre19]. This implementation uses Vivado HLS to generate custom HDL for the neural network. HLS4ML [Tra+19] is used to create the neural-network C++ code used with Vivado HLS. The steps can be summarized as below:

- Prototype and train a neural network with industry standard machine learning framework.
- Use the HLS4ML C++ generation tool to create a starting point for running Xilinx HLS.
- Synthesize the HLS to evaluate resource usage. Once the network is implemented in fixed-point, run HLS synthesis to create HDL code.
- Module resulting from HDL generation is inserted into an RFNoC Compute Engine. A user testbench is then optionally created to stimulate the CE and validate its functionality.
- The Compute Engine is then built into an FPGA Image using the RFNoC image building workflow.

There are some key abstractions that may make generating an efficient run-time strategy difficult. The first being, there is no insight into how a specific prototype topology would map to an FPGA in terms of cost. The entry point for most FPGA vendor tools are Register Transfer Level (RTL) designs, expressed in VERILOG or VHDL [Umu18]. This brings about the second, the compiler is responsible for translating the neural network into the language of the instruction set, scheduling memory transfers on the target hardware. In the next chapter, we will further discuss how these problems manifest negatively in hardware/software codesign and discuss our proposed design methodology.

*“Because of the nature of Moore’s law, anything that an extremely clever graphics programmer can do at one point can be replicated by a merely competent programmer some number of years later.”*  
—John Carmack

[Tei12] Teich, “Hardware/Software Codesign: The Past, the Present, and Predicting the Future”

[Kre19] Kreinar, *RFNoC HLS NeuralNet*

[Tra+19] Tran et al., *hls-fpga-machine-learning/hls4ml: v0.1.5*

[Umu18] Umuroglu, “Accelerating Sparse Linear Algebra and Deep Neural Networks on Reconfigurable Platforms”

TABLE 2.1: Static Mapping Cost to 6:1 LUTs for Fan-In bits.

| Fan-In | Number of 6-LUTs | Truth table bits | LUT config bits | % utilized |
|--------|------------------|------------------|-----------------|------------|
| 6      | 1                | 64               | 64              | 100%       |
| 7      | 3                | 128              | 192             | 66.67%     |
| 8      | 5                | 256              | 320             | 80%        |
| 9      | 11               | 512              | 704             | 72.73%     |
| 10     | 21               | 1024             | 1344            | 76.19%     |
| 11     | 43               | 2048             | 2752            | 74.42%     |

## 2.1 FAN-IN, NEURONS AND LUTS.



FIGURE 2.1: Mapping a 7:1 Look-up Table to 6:1 LUTs.



FIGURE 2.2: Mapping a 8:1 Look-up Table to 6:1 LUTs.

One of the central concepts we must grasp to be able to delve into how we can effectively generate topologies in a Machine Learning framework which effectively map to FPGAs is how a Neuron with specific total Fan-In bits and output bits are mapped to LUTs. A brief mapping cost has been detailed in [Table 2.1](#). We deal with the mathematics behind it and attempt to give the reader a visual understanding of how the process is carried out.

### ► MAPPING TO 6:1 LUTS

[Figure 2.1](#) and [Figure 2.2](#) depict how we can map a 7:1 and 8:1 Truth Table to 6:1 LUTs respectively. We can extend similarly to larger input-output bit configurations, which is being discussed below in more detail.

### ► CLOSED FORM EQUATION FOR LUT COST

We attempt to develop a Closed Form equation for the theoretical LUT cost of implementing a specific Neuron with arbitrary Fan-in and Fan-out. The recursive form of the equation for the LUT cost of a neuron with  $N$  fan-in bits and  $M$  output bits is given by (2.1)

$$LUT_{N,M} = M \times \left( 2 \times \left( \frac{LUT_{N-1,M}}{M} \right) - (-1)^N \right) \quad (2.1)$$

This can be further simplified to (2.2).

$$LUT_{N,M} = M \times \left( \frac{LUT_{N-2,M}}{M} + 2^{N-6} \right) \quad (2.2)$$

Which in its closed form gives us (2.3). Note that these LUT costs will only be valid for hardware building blocks composed solely from 6:1 LUTs. Getting the minimum LUT cost from the combination of the wide array of memory types that an FPGA has (5:2 LUTs, 6:1 LUTs, different BRAM configurations) is a design space exploration problem. This elementary equation is integrated into the design flow to give the user a rough hardware cost heuristic. There are many ways to decrease this cost, which will be discussed later.

$$LUT_{N,M} = M \times \left( \frac{2^{N-4} - (-1)^N}{3} \right) \quad (2.3)$$

# 3

## *Mapping Neurons to Hardware*

Prototyping a neural network on a target hardware has a recurring cost, as often an engineer has no true estimate of the resources that will be required. This problem is further aggravated by the complex interaction of a topology and the underlying hardware, a compiler which translates the neural network to the target hardware typically yields inefficient deployment strategies.

In this thesis, we present a novel method of mapping neurons to hardware. This is done without the need for a custom accelerator architecture or a scheduler. If we turn back to our initial representation of a neuron as a boolean function  $f : B^{ip} \rightarrow B^{op}$  where  $B = \{0, 1\}$ , we can hope to represent such a neuron with a look up table which requires  $2^{ip} \times (op + ip)$  bits. Here, it becomes evident that in topology design, our most crucial design consideration should be to minimize  $ip$ , which is the number of input bits a neuron takes. However, as a neural network is essentially several layers of neurons placed sequentially, an implicit bound is also imposed upon  $op$ . While this may sound discouraging, a neural network is often heavily over-parametrized. The paper Deep Compression [Han+15] used pruning, trained quantization and huffman coding to reduce the storage requirements of their neural networks by about  $40\times$  without affecting their performance.



—Sneha Khandelwal

[Han+15] Han et al., “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding”

### ► PRIOR WORKS IN ACCELERATING NEURAL NETWORKS

There have been aggressive efforts in quantizing neural networks. Ever since Binary Neural Networks [Cou+16] delivered respectable performance, FPGA implementations of quantized neural networks have been booming as well. [Y. 17] proposed BNN accelerators with dataflow-style architectures (processing engineers instantiated at every layer). [Y. 17] also implemented a folding scheme for larger neural networks so that they can be mapped to smaller devices. LUTs are heavily utilized in this implementation. [Wan+19] introduced LUTNet, which is a LUT optimized compression scheme which achieves very high LUT density. [Abd+18] introduced POLYBiNN, which utilized 110k 6-input LUTs to achieve an accuracy of 97.18% on MNIST. POLYBiNN achieved 100M FPS on MNIST.

[Cou+16] Courbariaux et al., “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1”

[Y. 17] Y. Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”

[Wan+19] Wang et al., “LUTNet: Rethinking Inference in FPGA Soft Logic”

[Abd+18] Abdelsalam et al., “POLYBiNN: a scalable and efficient combinatorial inference engine for neural networks on FPGA”



FIGURE 3.1: Sparsification per neuron.

[Pra+18] Prabhu et al., “Deep expander networks: Efficient deep networks from graph theory”



FIGURE 3.2: Training Pipeline.

[DZ19] Dettmers and Zettlemoyer, *Sparse Networks from Scratch: Faster Training without Losing Performance* [DZ19]

### 3.1 SPARSITY AND QUANTIZATION

#### 3.1.1 Sparsity

In sparse neural networks, each neuron receives inputs from only a few connections of the previous layer. Often, it is beneficial to have regular sparsity, where pruning is done in a manner where neurons without value are pruned completely. This is not true for an LogicNet. We need to sparsify a network in a manner that every neuron maintains a small number of connections. This requirement has been demonstrated in [Figure 3.1](#). We identify three methods of Neural Network pruning: *A-Priori Fixed Sparsity*, *Iterative Pruning* and *Learned Sparsity*.

#### ► A-PRIORI FIXED SPARSITY

We adapt concepts developed in Deep Expander Graphs [Pra+18]. We use the concept of Random Expanders to explain our sparsity requirements.

**Random Expanders:** A random bipartite expander of degree  $D$  on two Vertex Sets  $U, V$  is a graph in which for every vertex  $v \in V$ , the  $D$  neighbours are chosen independently and randomly from  $U$ .

We want to construct a neural network which corresponds to a sparse graph, where each vertex has  $D$  neighbours. To extend this concept to convolutions, we need to view each set of convolutional kernels as one neuron, or as a vertex. This implies that for a convolutional filter set  $(1, C, H, W)$ , we can only have  $D$  non-zero weights.

#### ► ITERATIVE PRUNING

Learning the right connections is an iterative process, the pruning is done by zero-ing the weights with the smallest magnitude. Pruning followed by retraining is one iteration, and per neuron pruning decay rates are calculated such that the number of weights zeroed out by the end of training gives rise to a fixed sparsity, where every neuron has  $D$  neighbours. The training pipeline is presented in [Figure 3.2](#). Each iteration is greedy, and unlike the original implementation, we focus on pruning weights inside each neuron. This gives us a strict control over the connectivity of each neuron, consequently helping us in controlling the size of the truth table of the equivalent boolean function of the neuron.

#### ► SPARSE LEARNING

In the paper *Sparse Networks from Scratch: Faster Training without Losing Performance* [DZ19], there are three steps; (a) Re-distribution of weights, (b) Pruning of weights, (c) Re-growing weights. To explain further, we take a simple *MLP* model and index the layers as  $l$ , neurons as  $n$  and synapses as  $s$ . The authors maintain an ‘exponentially smoothed gradient’ for every weight in the *MLP*. This can be described by the equation below.

$$M_{l,n,s}^{t+1} = \alpha M_{l,n,s}^t + (1 - \alpha) \frac{\partial E}{\partial W_{l,n,s}}$$



FIGURE 3.3: This figure demonstrates the per neuron connectivity for a 3-Layer Neural Network.

#### ► RE-DISTRIBUTION OF WEIGHTS

The normalized mean of the element-wise momentum for all non-zero weights in each layer is taken, as described below.

$$M_l^{mean} = \frac{\sum_n \sum_s M_{l,n,s}}{\sum_l \sum_n \sum_s M_{l,n,s}}$$

The resulting proportions are the momentum magnitude contributions for each layer. This is then used to calculate the number of weights to be regrown. For a network with sparsity  $r$ , the Regrowth for each layer is calculated below.

$$Regrowth_l = n(Params) \times (1 - r) \times M_l^{mean}$$

#### Pruning

The authors prune a proportion  $p$  (prune rate) of the weights with the lowest magnitude in each layer.

#### Regrowth

Weights are re-grown by enabling gradient flow of zero-valued weights which have the largest momentum magnitude.

#### ► MODIFYING SPARSE MOMENTUM PRUNING

As detailed in the [Algorithm 1](#), we had to make some notable changes to the Sparse Learning Algorithm. [Figure 3.3](#) Attempts to measure the per-neuron fan-in for each neuron. Each division in the X-Axis represents a Neuron, and each division in the Y-Axis represents an incoming activation pixel. The dots represent which activation pixels each neuron is connected to. In the default sparse learning algorithm shown in [Figure 3.3](#), we can see that a few neurons have very high fan-in, while the others are pruned away. We desire a low, uniform per neuron fan in, which is achieved in the Modified Per Neuron Sparse Learning algorithm.

---

**Algorithm 1** A Pruning Step for a L-Layer MLP

---

**Input:** Layer  $i$  to  $l$  with: Momentum  $M_i$ ; Weight  $W_i$ ; binary  $Mask_i$ ; prune rate  $p$ ; Neuron Fan-in  $F$ ; Neuron Prune  $P1$ ; Neuron ReGrowth  $R1$ ;  
**Output:** Updated weight  $W^{t+1}$ ; TotalMomentum; TotalNonZero;

```

1 for  $i = 1$  to  $l$  do
2   MeanMomentum $_i \leftarrow$  MeanMomentum $_i + mean(abs(M_i[W_i \neq 0]))$ 
3   TotalMomentum  $\leftarrow$  TotalMomentum + MeanMomentum $_i$ 
4   NonZero $_i = \sum(W_i \neq 0)$ 
5   TotalNonZero  $\leftarrow$  TotalNonZero + NonZero $_i$ 
6 for  $i = 1$  to  $l$  do
7   for Neuron in Layer  $i$  do
8     Prune( $P1$ , Neuron)
9     ReGrow( $R1$ , Neuron)

```

---

## ► DISCUSSION

Our modification of the Sparse Learning Algorithm has some shortcomings. Due to the strict requirement of a fixed per-neuron Fan-in, the  $MeanMomentum_i$  has no redistribution utility in our implementation. This implementation simply prunes weights per neuron by magnitude, and re-grows weights per neuron by their momentum. It is important to note that, in the future it may be a worthwhile endeavor to create a better re-distribution strategy which allows for variable fan-in per neuron based on the  $MeanMomentum$  per neuron. In our testing, such a re-distribution strategy had adverse effect on hardware cost, with very little to no gain in accuracy. We leave these loose ends as part of the algorithm so that the reader is aware of all the variables that are still being tracked in our implementation.

3.1.2 *Quantization*

[Pn19] Pappalardo and nickfraser, *Xilinx/brevitas: Pretrained 4b ProxylessNAS Mobile14 r0*

To support quantization, we utilize the Machine Learning PyTorch Library Brevitas [Pn19]. Brevitas implements a set of building blocks to model a reduced precision hardware data-path at train time. Our methodology of design places no constraint on the weight data-type. We however, do have a constraint on the activation bit-width. We thus utilize the *QuantReLU* and *QuantHardTanh* layers to support Binary and uniform integer quantization of activations. In a quantization flow with integer uniform quantization for activations, the bit-width together with the sign determines the min and max integer values used for scaling and clamping. Since the range of *QuantReLU* output is  $QuantReLU(input) \in \mathbb{R}^+$ , we do not have to worry about the sign bit for scaling. The *QuantReLU* and *QuantHardTanh* return a QuantTensor (NamedTuple), which propagate the following information:

- **quant\_tensor:** The quantized tensor in dequantized representation (floating-point order of magnitude)
- **scale\_factor:** The scale factor of **quant\_tensor**
- **bit\_width:** The precision of **quant\_tensor** in bits.

### 3.2 QUANTIFYING HARDWARE COSTS

Now that we have discussed the ways we have tried to enforce sparsity and the quantization scheme being used, it is important to understand how sparsity and quantization work together to reduce the hardware cost for a neuron. We will deal with the implications of different sparsification and quantization methods on accuracy when we discuss the data-sets LogicNet was tested on.

### 3.3 RESEARCH RECESS I – SPARSITY

In the previous sections, we focused on methods to port neural networks of sizeable dimensions to the LogicNet topology. In spirit of research, a discourse on why some methods work, some don't could be of great value.

#### 3.3.1 *What are some methods to distribute non-zero weights?*

At first, this question may seem to have been answered already. While that is true to some extent, it is worth formalizing. The utility of exploring re-distribution methods is to find natural ways to make neurons adhere to specific fan-in constraints. In our A-Priori Fixed Sparsity methods, we have no focus on neuron importance. Our modified implementation of Sparse Learning algorithm simply clips off the same proportion of weights from each neuron. This may be very unnatural. We refer to the work by [Evc+19] and consider two weight distribution strategies. The original work also considers an Erdős-Rényi-Kernel method, which is an extension of Erdős-Rényi method which takes the kernel size into consideration.

- *Uniform:* The Sparsity of each layer is the same as the Total Sparsity.
- *Erdős-Rényi:* The sparsity of each layer scales with  $1 - \frac{n^{l-1} + nl}{n^{l-1} \times n^l}$  where  $n^l$  denotes the number of neurons at layer  $l$ . This method gives higher sparsity to layers with more parameters and lower sparsities to smaller ones.

[Evc+19] Evcı et al., *Rigging the Lottery: Making All Tickets Winners*

#### 3.3.2 *Ensembling and the Erdős-Rényi Method*

Ensembling has shown great benefits in our experiments, thus a way to take sparsity distribution in account and adjusting our model layer costs accordingly is an interesting exercise. We can benefit from the Erdős-Rényi method of weight redistribution, as it naturally aligns with our view of neural networks and allows them to scale a bit better. As the complexity of the LUT cost of a layer is  $O(2^B)$  ( $B$  is the fan-in in bits), if a layer with  $N_1$  neurons is made more sparse than a layer with  $N_2$  neurons where  $N_1 > N_2$ . The LUT cost of the neurons become proportional to  $2^{B_1}N_1$  and  $2^{B_2}N_2$ , note that as the sparsity of the larger layer is greater  $\mathbb{E}(B_2) > \mathbb{E}(B_1)$ . We thus allow ourselves to ensemble many of those smaller layers or the larger layers and balance the cost. The expression for the number of such smaller layers can be 'ensembled' can be provided by simply dividing the two LUT costs as to obtain  $N_2 \frac{2^{B_2-B_1}}{N_1}$ . The inverse is true if the larger layer requires lesser LUTs. We can also try adjusting the Erdős-Rényi method to take the exponential

nature of the cost of a layer into account. To take this idea further, we do not need to proportionally ensemble intermediate layers, rather we can ensemble intermediate layers of arbitrary sizes as long as it respects the above proportionality in cost. Which in itself is only in place to have balanced sparsity between layers. What utility redistributing weights to balance sparsity across layers provides is an interesting research question.

### 3.3.3 Is Sparsification Neural Architecture Search?

[Evc+19] Evci et al., *Rigging the Lottery: Making All Tickets Winners*

[Wor+19b] Wortsman et al., “Discovering Neural Wirings”

[Moc+18] Mocanu et al., “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science”

[Liu+19] Liu et al., *Sparse evolutionary Deep Learning with over one million artificial neurons on commodity hardware*

[Bel+17] Bellec et al., *Deep Rewiring: Training very sparse deep networks*

If we think more closely about what we are doing, it becomes evident that for all our models, the maximum size of sparse models is bounded by the largest dense model we start with. This was pointed out by [Evc+19] and while it may seem trivial, it implies this statement not only from a parameter masking perspective but more interestingly, from a topological perspective. Some methods of Sparsification can be ‘tagged’ as Neural Architecture Search. It may make sense to look more closely at more ‘esoteric’ methodologies of discovering topologies which are sparse. We may tailor topologies ourselves, as we attempted in the previous subsection on *Ensembling and the Erdős-Rényi Method*, or we can look at recent research in the field of discovering neural connections. Some of which are [Wor+19b] [Moc+18] [Liu+19] [Bel+17].

The Sparse Evolutionary Training algorithm essentially replaces fully connected layers with Sparse Connected layers having the Erdős-Rényi topology. For each training epoch, they take each bipartite Sparse Connected layer, removing a fraction of the smallest positive and largest negative weights. They also randomly add new connections in the same amount as previously removed.

We need an algorithm that not only converts a fully connected layer to a Sparse Bipartite Graph, but treats each neuron with certain fan-in as nodes in a graph with arbitrary connectivity (without hierarchy). Ideally, we would like to implement a neural architecture search which does not pay heed to the hierarchical structure of DNNs. The search space for such a task is too huge. Sparsification of each neuron leads to a significantly smaller search space, but the order of complexity remains the same.

One of the most interesting papers which support such an idea is by [Wor+19b]. In this paper, a set of real edges  $\eta$  is considered, with a set of *hallucinated* edges  $\xi_{hal} = V \times V \setminus \xi$ . The magnitude of a weight is strengthened if the gradient pushes the activation in alignment with a certain edge. Over iterations, if it is strengthened enough it can join the set of real edges from the hallucinated edges.

It is extremely important to conduct this ‘neural architecture search’ in a manner such that the search space is navigated effectively. If the neural network has no ‘hierarchy’, it becomes a much harder task. Trying to discover sparse topologies which are not hierarchical in nature is one of the most exciting future areas of work for this thesis.

## Part II

### LOGICNET: A LIBRARY FOR MAPPING HBBS TO NEQS



# 4

## The LogicNet Design Flow and Library

*“Premature optimization is the root of all evil”*  
—Donald Knuth

- ▶ **SYNOPSIS** This and the next chapter that follows hopes to delve into the specifics of the LogicNet design flow and the library. This chapter discusses how we can train efficient DNNs that map directly to FPGAs along with the layer types we support in this library, briefly touching the functionality provided. The next chapter focuses on the functions supported to aid the design automation process. The basic LogicNet Design Flow:

- Define Hardware Building Blocks (HBBs), Neuron Equivalents (NEQs) and their conversion procedure.
- Define a Deep Neural Network in the LogicNet library and train.
- Convert trained network of NEQs into a netlist of HBBs using the conversion procedure. (All steps of the conversion procedure are supported by the LogicNet library itself).
- Postprocess the netlist and synthesize to obtain a bitfile.



FIGURE 4.1: The Design Flow for LogicNet



FIGURE 4.2: QuantReLU and QuantHardTanh (1 bit) Respectively. Scaling Factors not depicted.

In LogicNet, every layer type has an implicit input Quantizer. The reason for this design choice is that input quantization is one of the most crucial aspects of an LogicNet, and output quantization is optional. The LUT cost increases linearly with the output bit-width of a neuron, whereas increases exponentially with the input fan-in bits. We lend the user some flexibility in output quantization, as it may be particularly important for the final layer.

#### 4.1 QUANTIZER

The Quantizer is a relatively simple module, which uses the Brevitas implementation of QuantHardTanh and QuantReLU. We can see clearly in [Figure 4.2](#) how QuantHardTanh and QuantReLU operate. The scaling factors are ignored so that we can discuss the bit-width with more ease. If a bit-width of 1 is provided to the Quantizer, the class constructor initializes QuantHardTanh. If we provide a greater bit-width, we use a Quantized ReLU. For instance, if a bit-width of 3 is provided, we can see that the QuantReLU will quantize any input to integer outputs from 0 to 8. This would also be then multiplied by a scaling factor.

The Quantizer returns a NamedTuple, which contains the Quantized Tensor in De-Quantized representation, with the scale-factor and the bit-width. [Code Listing 4.1](#) depicts how a Quantizer operates on an input tensor.

```

1 import torch
2 from fpganet.quantizer import Quantizer
3 quantizer = Quantizer(bit_width = 1, max_val = 1.61)
4 print(quantizer(torch.randn(10)))
5 >>> QuantTensor(tensor=[-1.6100, -1.6100, -1.6100,
6                         1.6100, -1.6100, -1.6100, -1.6100,  1.6100,
7                         1.6100,  1.6100],
8                         grad_fn=<DifferentiableGraphBackward>),
9                         scale=tensor(1.6100, grad_fn=<ClampMinBackward>),
10                        bit_width=tensor(1.))

```

[Listing 4.1](#): Example of a Quantizer

#### 4.2 SPARSELINEAR

The SparseLinear layer has an input Quantizer followed by a Linear Layer with a specific per-neuron fan in and Batch Normalization. The per-neuron fan-in is represented by a layer mask, and is initialized randomly when the layer is instantiated. It is important to note that in this implementation, the random sparsity is static and not learnable. For correct Truth Table generation, it expects us to give the next module in the forward pass. This is necessary to get the output bits a neuron produces for a specific input.

##### ► LUTs

The SparseLinear layer has a 'LUTs' attribute, that calculates the total number of LUTs that would be required when the class is instantiated. This functionality was added to aid design space exploration. The SparseLinear layer also has a method `getLUTs()` which returns the LUT cost of that layer.

► TRUTH TABLE GENERATION

The SparseLinear layer has a `generateTruthTable()` method. Once called, it assigns an `OrderedDict` to the `truthtable` attribute of `SparseLinear`. This `OrderedDict` contains the `TruthTable` for each neuron.

► TRUTH TABLE FUNCTIONAL VERIFICATION

The `SparseLinear` layer also supports forward pass which utilizes the `TruthTable`. This is important from a functional verification standpoint. The forward pass of the `SparseLinear` module has an argument `'use_table'`, which when set allows forward pass through that layer using the `truthtable` dictionary.

#### 4.3 DENSEQUANTLINEAR

The DenseQuantLinear layer has an Input Quantizer followed by a QuantLinear layer and Batch Normalization.

► LUTS

The DenseQuantLinear layer has a 'LUTS' attribute as well. The functionality is similar to `SparseLinear`, but the equation used is Equation (4.1). In this formula,  $n(O)$  and  $n(I)$  refer to the output feature count and input feature count respectively.  $BW_{in}$  and  $BW_{wt}$  refer to the input bit-width and weight bit-width respectively.

$$LUTS = n(O) \times (n(I) \times BW_{in} \times BW_{wt} \times 1.0699 + 10.779) \quad (4.1)$$

#### 4.4 SPARSECONV

Implementing Convolutions with a Sparsity that LogicNet can leverage is a tricky task. This is due to the fact that Convolutions typically have kernels with many channels, that are not independent. The LUT cost for a fully unfolded dense convolution is shown in Equation (4.2). Where  $outpix$  refers to the number of output pixels,  $oBits$  refer to output bits,  $n(OFM)$  refer to the number of output feature maps,  $n(IFM)$  refers to the number of Input Feature Maps,  $k$  refers to the kernel size (assuming square kernels) and  $iBits$  refers to the number of Input bits.

$$LUTs = outpix \times oBits \times n(OFM) * LUTCost(n(IFM) * k^2 * iBits) \quad (4.2)$$

The most effective way we could find to deal with this is to introduce 'Depthwise Separable Convolutions' with Input and Intermediate Quantizers and Sparsity into the LogicNet library. Figure 4.3 depicts how such a Convolution would be implemented in the LogicNet library.

We give the user the ability to control the input bit-width, intermediate bit-width, as well as the sparsity of the Depthwise Kernels and Pointwise Kernels. The `SparseConv` also has Batch Normalization in both stages. If the `first_layer` argument is set true for `SparseConv`, it checks if the input channel count is 1. If that is true, it sets the Kernel count of Depthwise stage equal to the number of Output Feature Maps. This is useful because as the kernels are very sparse, it becomes harder to extract useful information out of the input image with just 1 randomly initialized sparse 2D kernel.



FIGURE 4.3: Depthwise Separable Convolutions in LogicNet.

#### ► LUTs

By virtue of using Depthwise-Separable Convolution, our LUT cost is divided to two parts. The Depthwise and Pointwise LUT cost. Depthwise LUT Cost is given by Equation (4.3). Here,  $X_k$  refers to the Kernel Sparsity.

$$LUTs_{dw} = outpix \times obits \times n(OFM) \times LUTcost(X_k \times ibits) \quad (4.3)$$

Pointwise LUT Cost is given by Equation (4.4). Here,  $X_s$  refers to the Pointwise Sparsity.

$$LUTs_{pt} = outpix \times obits \times n(OFM) \times LUTcost(X_s \times ibits) \quad (4.4)$$

As this is a fully unfolded convolution, the LUT cost calculation requires the *outpix* and *n(OFM)*. Thus, the LUTS attribute is populated only after a forward pass is done on the Model.

#### ► TRUTH TABLE GENERATION

The SparseConv layer has a `generateTruthTable()` method. Once called, it assigns an `OrderedDict` to the `truthtable` attribute of `SparseConv`. The `OrderedDict` contains separate `OrderedDict` for the Depthwise and Pointwise stages, with proper indexing for the neurons.

#### ► TRUTH TABLE FUNCTIONAL VERIFICATION

The `SparseConv` layer also supports forward pass which utilizes the `TruthTable`. It is similar in nature to the `SparseLinear`. It is important to note that this is purely for Functional Verification, and is thus quite slow. It loops iteratively through every input window and uses the truth table for the forward pass.

```

1 rlen = int(1 + (x.shape[-2] - self.kernel_size)/self.stride)
2 clen = int(1 + (x.shape[-1] - self.kernel_size)/self.stride)
3 dw_out = torch.zeros(x.shape[0], self.inWCout if (self.first_layer and self.
   inWCin==1) else self.inWCin, rlen, clen)
4 for batch in range(x.shape[0]):
5     for h in range(rlen):
6         for w in range(clen):
7             vert_start = h*self.stride
8             vert_end = h*self.stride + self.kernel_size
9             horiz_start = w*self.stride
10            horiz_end = w*self.stride + self.kernel_size
11            submatrix = x[batch, :, vert_start:vert_end, horiz_start:
12                           horiz_end]
12            dw_out[batch, :, h, w] = lookupConvDW(self, self.truthtable['dw'],
13                                              submatrix, (self.first_layer and self.inWCin==1))

```

Listing 4.2: Implementing Truth Table Functional Verification

# 5

## Design Automation

---

### 5.1 TRUTH TABLE GENERATION

After the network has been trained, we need to generate the truth table for every neuron. Currently, we support a `generateTruthTable` method with the class, to allow the user to generate truth tables for only specific layers for inspection. In practice this is specially useful, as the model size increase. The truth table for a neuron with a fan-in of 6 bits requires  $2^6 = 64$  entries. However, in complicated data-sets it is impractical to keep the fan-in at only 6 bits. If we increase the number of bits for the fan-in of each neuron, the number of entries grow exponentially. For instance if the neuron has 20 input bits, it will have  $2^{20} = 1048576$  entries. Clearly, for such models we not only need to allow on the go calculation of the truth table for each layer, but also the truth table for each neuron. This will allow us to extract more parallelism out of the process. Support for per-neuron calculation of truth tables will be added soon.

#### ► STRUCTURE OF TRUTH TABLE

We can see in Listing 5.1, the structure of the truth table of a model which has 3 neurons fan-in of 3, with a bit-width of 1. The keys are the neuron IDs of the layer (in this case `linear1`). Once the key is accessed, we get a list of lists, the first list containing the binary input to a neuron, and the second containing the output for that specific input. While the first list does not have any utility as the input bits are innumerated in a fixed fashion, we will keep this functionality for now.

```
1 >>> print(model.linear1.truthtable)
2 OrderedDict([('0',
3     [[['000', '001', '010', '011', '100', '101', '110', '111'],
4      ['1', '1', '1', '0', '1', '0', '0', '0']]),
5     ('1',
6     [[['000', '001', '010', '011', '100', '101', '110', '111'],
7      ['1', '0', '1', '0', '1', '0', '1', '0']]),
8     ('2',
9     [[['000', '001', '010', '011', '100', '101', '110', '111'],
10      ['1', '0', '1', '0', '1', '0', '1', '0']]))
```

Listing 5.1: Structure of the Truth Table for a layer

“Nicht Kunst und Wissenschaft allein,  
Geduld will bei dem Werke sein.”  
—Johann Wolfgang von Goethe

“During human progress, every science is  
evolved out of its corresponding art.”  
—Herbert Spencer



FIGURE 5.1: VERILOG Code Generation Sub-Modules.

| Bits | File Size(MB) | Time (seconds) |
|------|---------------|----------------|
| 15   | 0.85          | 56             |
| 16   | 1.8           | 119            |
| 18   | 6.9           | 502            |
| 20   | 29.0          | 2022           |

TABLE 5.1: Rough estimates of the file size to store the truth table for 1 neuron with a specific fan-in in bits, and the time it takes to write the '.v' file.

## 5.2 VERILOG CODE GENERATION

The VERILOG generator has been integrated as a VerilogGenerator class. This class takes in a model, identifies all the layers that are 'SparseLinear'. While Truth Table Generation is supported for all topologies (Linear and Convolution), we do not currently support VERILOG Code Generation for 'DenseQuantLinear' and 'SparseConv'. The SparseConv implementation will need a lot of thought, as it needs to incorporate any form of folding that may be necessary. VERILOG Generator for a Fully-Unfolded SparseConv will be integrated into the library at a later time.

From the code sample below, it becomes evident that we do not use any LUT primitives when generating the VERILOG code for the network. We instead define the entire truth table, and leave it up to the Logic Synthesis tool to generate the optimal 'hardware building block'. It is a very interesting research question, to explore whether a more optimal solution is discovered by the synthesis tool or does it have any benefit to define primitives ourselves. Further, due to the nature of how the verilog code itself is generated, the 'file size' and the time to generate the truth table explodes exponentially as well (Table 5.1). While the current VERILOG generator is purely sequential, we hope to extend functionality to generate the codes in a more parallel fashion. This would allow us to test a wider array of topologies faster.

### ► VERILOG CODE SAMPLE

We attach a code sample for VERILOG code generated for the Neural Network in the previous section (Single Layer, 3 Neurons with each neuron having a fan in of 3 bits). Note that this code sample is for synthesis of a **purely combinational circuit**, no registers have been placed at the input or output. This library supports VERILOG code generation with intermediate registers, and will be discussed later. Figure 5.1 clearly shows how registers are implemented at the input and the intermediate layer.

```
1 module LogicNetModule (input [4:0] M0, output[2:0] M1);
2   LUTLayer0 LUTLayer0_inst (.M0(M0), .M1(M1));
3 endmodule
```

Listing 5.2: LogicNetModule.v

```
1 module LUTLayer0 (input [4:0] M0, output [2:0] M1
2 );
3 wire [2:0] inpWire0_0 = {M0[0], M0[2], M0[4]};
4 LUT_L0_N0 LUT_L0_N0_inst (.M0(inpWire0_0), .M1(M1[0:0]));
5
6 wire [2:0] inpWire0_1 = {M0[1], M0[2], M0[3]};
7 LUT_L0_N1 LUT_L0_N1_inst (.M0(inpWire0_1), .M1(M1[1:1]));
8
9 wire [2:0] inpWire0_2 = {M0[0], M0[1], M0[2]};
10 LUT_L0_N2 LUT_L0_N2_inst (.M0(inpWire0_2), .M1(M1[2:2]));
11
```

12 `endmodule`

Listing 5.3: LUTLayer0.v

```

1 module LUT_L0_N0 ( input [2:0] M0, output [0:0] M1 );
2   reg [0:0] M1;
3   always @ (M0) begin
4     case (M0)
5       3'd0: M1 = 1'b1;
6       3'd1: M1 = 1'b1;
7       3'd2: M1 = 1'b1;
8       3'd3: M1 = 1'b0;
9       3'd4: M1 = 1'b1;
10      3'd5: M1 = 1'b0;
11      3'd6: M1 = 1'b0;
12      3'd7: M1 = 1'b0;
13    endcase
14  end
15 endmodule

```

Listing 5.4: LUTL0N0.v

```

1 module LUT_L0_N1 ( input [2:0] M0, output [0:0] M1 );
2   reg [0:0] M1;
3   always @ (M0) begin
4     case (M0)
5       3'd0: M1 = 1'b1;
6       3'd1: M1 = 1'b0;
7       3'd2: M1 = 1'b1;
8       3'd3: M1 = 1'b0;
9       3'd4: M1 = 1'b1;
10      3'd5: M1 = 1'b0;
11      3'd6: M1 = 1'b1;
12      3'd7: M1 = 1'b0;
13    endcase
14  end
15 endmodule

```

Listing 5.5: LUTL0N1.v

```

1 module LUT_L0_N2 ( input [2:0] M0, output [0:0] M1 );
2   reg [0:0] M1;
3   always @ (M0) begin
4     case (M0)
5       3'd0: M1 = 1'b1;
6       3'd1: M1 = 1'b0;
7       3'd2: M1 = 1'b1;
8       3'd3: M1 = 1'b0;
9       3'd4: M1 = 1'b1;
10      3'd5: M1 = 1'b0;
11      3'd6: M1 = 1'b1;
12      3'd7: M1 = 1'b0;
13    endcase
14  end
15 endmodule

```

Listing 5.6: LUTL0N2.v

| Analytical LUT cost | LUTs After Synthesis | Reduction |
|---------------------|----------------------|-----------|
| 128                 | 80                   | 1.6 ×     |
| 272517              | 54336                | 5.01×     |
| 726180              | 76336                | 9.5×      |

TABLE 5.2: Comparing the Analytical LUT Cost formula with the results we actually get from synthesis of the network. Note that this is for a **purely combinational circuit** implementation of the Neural Network.

| Model |    |                    | Analytical LUTs |       |      |     |      | Resources |  |  |
|-------|----|--------------------|-----------------|-------|------|-----|------|-----------|--|--|
| X     | BW | HL                 |                 | LUT   | FF   | DSP | BRAM | WNS       |  |  |
| 3     | 2  | 64, 32, 32         | 266             | 132   | 160  | 0   | 0    | 4.04      |  |  |
| 5     | 2  | 64, 32, 32         | 5586            | 2132  | 393  | 0   | 0    | 2.54      |  |  |
| 3     | 4  | 64, 32, 32         | 45220           | 7117  | 3394 | 0   | 5    | 1.073     |  |  |
| 7     | 2  | 64, 32, 32         | 90706           | 22146 | 1533 | 0   | 0    | 1.073     |  |  |
| 4     | 3  | 64, 32, 32, 32, 32 | 50235           | 16338 | 1329 | 0   | 5    | 1.44      |  |  |

TABLE 5.3: Analyzing the result of synthesizing neural networks trained on the Jet Substructure Classification problem by FPGA4HEP using different Neuron Fan-Ins. The synthesized networks have registers between layers and at the input. Note that the clock target was 5ns.

### 5.3 LOGIC SYNTHESIS AND ANALYTICAL LUT ESTIMATES

Logic synthesis is a key stage in the computer aided design flow of a FPGA. It is composed of a series of optimizations to improve the performance of the design. To support logic synthesis, our tool-flow is fairly straight forward. Once the VERILOG Generator has generated all the codes for each neuron, we can simply import the *synthesize\_and\_get\_resource\_counts* function from *fpganet.synthesis*. Once called, it returns a list of all the hardware resource that are needed. It also generates a log file which contains a verbose description of the synthesized model.

Upon synthesizing a few neural networks on Vivado using the Logic Synthesis tool-flow, we observed that the Analytical LUT cost was an overestimation. This can be seen in [Table 5.2](#). In general, we observed that our Logic Synthesis tool optimized the implementation such that we only need a fraction of the actual estimate. This raises very interesting question about the topologies we can explore. We had restricted ourselves in topology exploration due to certain fan-ins being beyond the scope of fitting on any FPGA fabric.

The networks we have reported in later sections are all using our Analytical LUT Cost model, which is actually an overestimation as is evidenced by [Table 5.2](#). It seems that as the Analytical LUT cost increases, the reduction in the true LUT resource needed is more notable, thought this claim needs a lot more testing to be proven empirically.

This LUT costs presented in [Table 5.2](#) only present purely combinational implementation of the neural network. After observing the cost of such an implementation, we need to look at the how the resource cost behaves when we have registers at intermediate stages (between layers). We also need to conduct some timing analysis of the circuit generated by this process.

#### 5.3.1 Resource cost with and without Registers in circuit

As we see in [Figure 5.1](#), between every LUTLayer the activations are wired to intermediate registers, this is also true for the input of the LogicNetMod-

ule. In this section we previously looked at the LUT cost reduction after synthesis without any registers (a purely combinational circuit). The LUT cost reduction for a network without registers between intermediate layers could be greater, due to the fact that now the optimization problem does not have a hierarchy (across layers). This could also cause an increase in the synthesis time. We observed a noticeable decrease in synthesis time for a circuit with registers, but do not provide empirical evidence to back this observation.

#### 5.4 TIMING ANALYSIS

In order to do some timing analysis on a basic LogicNet topology, we synthesized a small network by creating a fully-pipelined VERILOG description. After running synthesis, placement and Routing using Vivado, we observed a resource usage of 150 LUTs (from the analytical cost of 212 LUTs), and a minimum clock period of 0.768 ns (frequency of 1.3 GHz), due to a very small circuit size. This indicates that there is potential for further reduction of resource footprint using logic minimization. Also note that the maximum clock frequency supported by the global clock network on the FPGA tested is 666 MHz, which leaves a large amount of slack for larger topologies. This actually stands true for more complicated topologies that we tested. These topologies are listed in [Table 5.3](#), the WNS stands for the 'Worst Negative Slack'. Slack is defined as the difference between the actual and the desired time for a timing path. A good FPGA design should have zero slack, if we have negative slack then the timing constraints we have specific for our design is not met. In such a scenario we can either relax our constraints or come up with a better design to meet these timing constraints we had initially specified. All the reported synthesis experiments in [Table 5.3](#) are with a clock target of 5ns. For the first model in the table, this results in a frequency of  $\frac{1000}{5-WNS} \approx 1042MHz$ . While it is evident that the frequency decreases as the LUT cost increases, we will also be conducting experiments with a more strict 1ns clock target in the future as well as more in-depth analysis of the LUT cost with and without registers.

It is also very interesting to note that in larger topologies, the logic synthesizer not only uses the LUTs, but also stores some neurons in the BRAMs. As we had stored truth tables themselves as primitives, it is really interesting to see what 'Hardware Building Blocks' are generated to store the Neurons. While this is interesting, as we make the clock target more strict, the synthesizer may choose to purely use LUTs. This is yet to be observed and more testing needs to be done to understand how the analytical LUTs differ from the true resource allocation of the synthesizer.

## 5.5 RESEARCH RECESS II – SYNTHESIS

The implications of [Table 5.2](#) to topology design is of great importance. We believe that there are two important questions we need to answer.

### 5.5.1 *Can we design a heuristic that aids LUT cost reduction for a neuron?*

In this library, we train neurons with static random connectivity. While this idea leads to sub-optimal accuracy, we are not too concerned with this as we can adopt any pruning technique and convert that into a mask-weight tensor and port that to a LogicNet model. An interesting question to ask is, is there a way to train each *neuron*, such that during synthesis the LUT cost is dramatically lower than the analytical LUT cost?

This question is unanswered in this thesis. However, in this section, we hope to delve into this to some extent and define a way we can incorporate this into the training procedure. Our LogicNet library supports truth table calculation. This can be a slow process, but if a model only has to be deployed for inference; we can propose a training scheme that does take this into account.

PyEDA is a python library for Electronic Design Automation. Logic minimization is known to be a NP-Complete problem. It supports Truth Table Minimization. Instead of minimizing expressions, we can use our Truth Table Generator and integrate that with PyEDA to get the minimized truth table. It is important to note that this will significantly increase the train time of the neural network. Thus, we keep it beyond the scope of this thesis. We still however, nudge the idea of creating a cost function which takes the truth table minimization into account. Studying the hyper-parameter setting for such a cost function and its effect on the accuracy would be a very interesting topic.

### 5.5.2 *Can we integrate congestion estimation in an FPGA with neural architecture search for non-layered topologies?*

There has been a lot of interest around congestion estimation for FPGAs. In a non-layered scenario, it could be really important to take congestion of an FPGA into account. The most efficient methods are based on fast-to-compute heuristics [[Swa+98](#)] but they often have poor estimates. [[Yea+07](#)] uses a global router during placement, which gives more accurate congestion estimates, but at the cost of runtime.

The aforementioned methods provide estimates during place-and-route. Above a certain complexity of topology design, we may need to develop a heuristic which allows us to give a congestion estimate for a certain topology. This is a very complicated research problem, but there has been some progress in this direction. [[Sam+19](#)] design directly in RTL component-level instantiations of DSP, BRAMs and URAMs and associated controller's for orchestrating data movement and manually leverage dedicated cascade interconnects for high frequency, nearest-neighbour data movement.

Given a netlist representation of a circuit, the elementary objective of congestion minimization is to minimize the total sum of wirelength for each net. Perhaps, it may also be a better endeavor to integrate certain design prin-

[[Swa+98](#)] Swartz et al., “A Fast Routability-driven Router for FPGAs”

[[Yea+07](#)] Yeager et al., “Congestion estimation and localization in FPGAs: a visual tool for interconnect prediction”

[[Sam+19](#)] Samajdar et al., “Scaling the Cascades: Interconnect-Aware FPGA Implementation of Machine Learning Problems”

ples into topology constraints instead of attempting to get a real congestion estimation. This remains in our opinion an interesting research problem. It is a hard problem to formalize, and may be very interesting to implement.

# 6

## LogicNet4HEP

The LHC (Large Hadron Collider) is the world's highest energy particle accelerator. Proton bunches collide at a frequency of 40 MHz, with data-rates in excess of hundreds of terabytes per second. The process of tagging and filtering data at real time is called triggering. Triggering helps reduce data to manageable levels for processing offline. At the CMS detector, this triggering is performed at two stages. The first stage typically uses custom hardware like ASICs or FPGAs [Dua+18a] to handle the initial data rates with latency in the range of a few hundred nanoseconds. The second level of triggering, known as High Level Triggering (HLT) uses commercial CPUs to process filtered data in software with longer processing times. We borrow some definitions from the work of Javier Duarte[Dua+18a] to formalize the task in this section.

Jets are collimated showers of particles that are created from the decay and hadronization of quarks  $q$  and gluons  $g$ . Due to the high collision energy at LHC, jet signatures of interest emerge from overlapping quark-initiated showers produced in the decays of heavy standard model particles. In this task, we focus on the task of classifying a jet as either a quark ( $q$ ), gluon ( $g$ ), W boson ( $W$ ), Z boson ( $Z$ ) or top quark ( $t$ ) jet.

### ► KEY METRICS

To create an optimal L1 trigger that satisfies the requirements of the target algorithm, we list the three key metrics considered by [Dua+18a].

- **Latency:** The total time (usually in units of "clocks") for a single forward pass through the neural network.
- **Initiation Interval:** The number of clock cycles before the neural network can accept a new input. Inference rate is inversely proportional to the initiation interval. This is of key importance as the data should be pipe-lined at the rate of the initiation interval.
- **Resource Usage:** This may encompass the counts of a range of components available on the target FPGA fabric. Some of which may be Block RAMs (BRAMs), Digital Signal Processing Blocks (DSPs), Registers, Lookup-Tables (LUTs).

*"The Europeans and the Americans are not throwing \$10 billion down this gigantic tube for nothing. We're exploring the very forefront of physics and cosmology with the Large Hadron Collider because we want to have a window on creation, we want to recreate a tiny piece of Genesis to unlock some of the greatest secrets of the universe."*  
—Michio Kaku

[Dua+18a] Duarte et al., “Fast inference of deep neural networks in FPGAs for particle physics”



FIGURE 6.1: The LHC big data problem.



#### ► MODELS

As LogicNet style topologies are quite unique in the heavy sparsity and quantization they desire, we had to do extensive topology exploration to gain insights into how we can get the most performant networks with the lowest resource cost.

#### ► NEURAL ARCHITECTURE SEARCH

We considered Neural Architecture Search to find efficient topologies. Neural Architecture Search is a method for automating the design of Neural Networks. There are many ways to define the search space, strategy and getting a performance estimation for evaluating a discovered architecture. In our case, it makes sense to treat neurons as individual blocks, and discovering not only the hierarchical connectivity, but aiming to discover completely unstructured (not layered) topologies. On top of this, we also have to define a search space carefully. Here, the search space is not only limited to the bit-width and connectivity per neuron but also the overall connectivity across layers. The Performance estimation would fairly straightforward, as we have estimates of the resource usage, as well as can easily give an account of the initiation interval and latency. The initiation interval for an LogicNet style topology is 1, and the latency is essentially equal to the number of intermediate activations that are generated, in a layered fashion. However, due to the combinatorial nature of this problem, we were too constrained to do a Neural Architecture Search. We aimed to prove that LogicNet style structures can deliver performance benefits. It is very much within the scope of future research to integrate NAS into the LogicNet library, and the library itself is being built with the same in mind.



FIGURE 6.2: Dimensions of the NAS method.

#### ► AXES OF EXPLORATION

As we are exploring LogicNet type topologies by hand, it becomes very important to specify the 'axes of exploration', and make observations along the same lines. For us, there were 3 primary axes of exploration. The Bit-Width of activation, the number of neurons per hidden layer, and the Fan-In (In Bits) per neuron. We have not attempted to explore variable fan-in for neurons with-in a layer. We do however give insights into variable fan-in for individual layers.

FIGURE 6.3: Discriminating between highly energetic (boosted)  $q$ ,  $g$ ,  $W$ ,  $Z$ ,  $t$  initiated jets. Figure adapted from User: FPGA4HEP on GitHub.



FIGURE 6.4: Our PFGA4HEP Neural Network Architecture.

#### ► CLASSIFICATION PERFORMANCE

To quantify the performance of the classifier we have made in Figure 6.4, we use the same metric used in [Dua+18a]. We use the AUC metric, or area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is given by the background rejection versus signal efficiency computed from sequential cuts on the classifier output, where background rejection is ( $1 - \text{background efficiency}$ ). The Expected AUC is the AUC achieved by a full 32-bit floating point inference of the neural network.

[Dua+18a] Duarte et al., “Fast inference of deep neural networks in FPGAs for particle physics”



FIGURE 6.5: Performance of the Deep Neural Network Classifier. The Left figure shows the signal efficiency versus the mis-identification rate for  $q$ ,  $W$ ,  $Z$ ,  $t$ ,  $g$  jet identification. The mis-identification rate is based on an equal admixture of the other non-signal jet types. On the right we show the Normalized Confusion Matrix for the classifier.

#### ► MODEL DESCRIPTIONS

It is important to know how we are labelling our models. We refer the users to Figure 6.4. Our models are described in Table 6.1. The 2<sup>nd</sup> (HL) column can be referred to as  $(\text{Neurons}_{\text{Layer}1}, \text{Neurons}_{\text{Layer}2}, \text{Neurons}_{\text{Layer}3})$ . The BW refers to the Bit-Width, with X indicating the fan-in of neurons. Note that this fan in is not in bits, but instead the number of ‘synapses’. The last layer is special, its bit-width and fan-in are very crucial. Thus, we have specific the fan-in of some models with  $X_{fc}$ . Where this was not been specified, the final layer would be dense.  $BW_{fc}$  refers to the bit-width of the output. This is important as lower bit-width and removal of the SoftMax layer generally results in degradation of the ROC curve. However, the extra cost of a SoftMax layer has been excluded from our analysis.

#### ► PERFORMANCE WITH AND WITHOUT SOFTMAX

In the previous section, we described our models and briefly touched upon the effect of SoftMax on ROC. Not only does the SoftMax make the training



FIGURE 6.6: We train a model on the FPGA4HEP dataset and observe the classifiers Receiver Operating Characteristics with respect to the SoftMax Layer.

TABLE 6.1: Model descriptions

| Model | HL            | BW | X | $X_{fc}$ | $BW_{fc}$ | LUTL1 | LUTL2 | LUTL3 | LUTL4 |
|-------|---------------|----|---|----------|-----------|-------|-------|-------|-------|
| A     | (64, 64, 64)  | 3  | 3 | -        | 3         | 2112  | 2112  | 2112  | 4125  |
| B     | (128, 64, 32) | 3  | 3 | -        | 3         | 4224  | 2112  | 1056  | 2090  |
| C     | (64, 32, 32)  | 2  | 3 | -        | 2         | 128   | 64    | 64    | 1415  |
| D     | (64, 32, 32)  | 2  | 5 | 6        | 4         | 2688  | 1344  | 1344  | 3400  |
| E     | (64, 64, 64)  | 2  | 4 | 4        | 4         | 640   | 640   | 640   | 200   |

TABLE 6.2: Accuracy and LUT costs of some explored models. Note that the LUTs is the estimate described by (2.3). An in-depth discussion of the closed form equation of cost versus the synthesized cost has been discussed before. Note that the g, q, W, Z, t are reported as Area Under Curve - Receiver Operating Characteristic.

| Model | g    | q    | W    | Z    | t    | Avg AUC-ROC | LUTs  | % FC  |
|-------|------|------|------|------|------|-------------|-------|-------|
| A     | 89.3 | 86.2 | 89.8 | 89.3 | 92.7 | 89.46       | 10461 | 39.43 |
| B     | 86.5 | 86.0 | 90.5 | 90.5 | 91.9 | 89.08       | 9482  | 22.04 |
| C     | 85.4 | 82.9 | 84.9 | 83.1 | 90.3 | 85.32       | 1671  | 84.68 |
| D     | 85.8 | 85.0 | 89.5 | 89.2 | 91.5 | 88.20       | 8776  | 38.74 |
| E     | 86.5 | 85.4 | 90.0 | 89.6 | 92.0 | 88.70       | 2120  | 9.43  |

better, it makes the classifier more robust to noise. While the effects of removing the SoftMax layer after training is not evident on the confusion matrix, upon graphing the ROC, it becomes important to inspect it more closely.

#### ► ROC - CURVE AND SOFTMAX

To test this more closely, we train a model in two ways. The first test is to train a neural network with output quantization, but without SoftMax. We then find its ROC and compare it with the ROC of the same trained model with a SoftMax appended after the final quantizer. The second test is



FIGURE 6.7: Increase in Accuracy with increase in LUT cost.



FIGURE 6.8: Increase in Accuracy with increase in Bit-Width.

to train a neural network with output quantization followed by a SoftMax. We will then find its ROC and compare it with the ROC of this same trained model but we will drop the final SoftMax layer. This will lend us insights as to whether there is any train time benefits of the final SoftMax layer, additionally also allow us to make claims on whether removing the SoftMax layer from a trained model causes detriments in the ROC of the classifier.

#### ► EXPLORING THE AXES

To gain further insights into the effects of architectural decisions on the accuracy and LUT costs, we did some elementary grid search. In Figure [Figure 6.7](#), it is fairly evident that increasing the LUT cost gives higher accuracies. We do however see a clear overlap in two distinct clusters of 'Average AUC-ROC', whereby some models that need very little lutts (as less as 2500 LUTs) perform as well as models requiring about 100000 LUTs. Further, we observe that there are some models that require a million LUT, but barely surpass the performance of a model need as little LUTs as 2500. Upon inspection, we discovered that these were models with unreasonably large final layers, or topologies where the per neuron fan-in in bits is too large, while focusing very little on the relationship between Bit-Width and Expand-Size for accuracy. This becomes more evident in Figure [Figure 6.8](#), where we see that while going from a Bit-Width of 1 to 2 almost undoubtedly reaps benefit. We see diminishing returns when going from bit-width of 2 to 3. While it is not conclusive that increasing bit-width consistently results in diminishing returns, it is still an important insight that we must be aware of

TABLE 6.3: This table depicts the accuracies for particular models trained with A-Priori Fixed Sparsity and Iterative Pruning techniques.

| Model | LUTs  | A-Priori Fixed Sparsity | Iterative Pruning |
|-------|-------|-------------------------|-------------------|
| A     | 2120  | 88.46                   | <b>88.7</b>       |
| B     | 54060 | 88.96                   | <b>89.56</b>      |
| C     | 59840 | <b>88.98</b>            | 88.92             |

the relation between Bit-Width and ExpandSizes. An unwise compromise on one may give sub-optimal results for the same number of LUTs.

#### ► PRUNING TECHNIQUES

As discussed in a previous chapter, we have explored many different pruning techniques. While iterative pruning gave the best accuracies, A-Priori Fixed Sparsity was the easiest to train. It thus becomes important to question what utility an Iterative Pruning strategy has over A-Priori Fixed Sparsity, and whether it has any notable benefits for topology exploration. It may be within reason to say that it is worth doing topology exploration using A-Priori Fixed Sparsity, and then taking the discovered topology and training it with Iterative Pruning. It is important to note that at this point of time, the LogicNet Library only supports A-Priori Fixed Sparsity. As we can see in [Table 6.3](#), there is a very marginal difference in the **Average Area Under Curve of the Receiving Operating Characteristic**. It thus may be to our benefit to leave topology exploration to A-Priori Fixed Random Sparsity, and use more advanced pruning technique once a suitable topology has been discovered. This claim however is up for further introspection.

# 7

## MNIST

---

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. This is a well known data-set, and is targeted by both MLPs and Convolutional Neural Network. We thus found this to be an excellent starting point for LogicNet. We do some topology exploration on this data-set, and study a wide array of architectural decisions in detail.

*“AI is a tool. The choice about how it gets deployed is ours.”*  
—Oren Etzioni

While exploring our search space, we began with MLPs on the MNIST data-set. To gain insights, we had to ascertain a few things which would make our search more effective. Some of the notable observations while attempting to gain insights on how to discover good topologies were as follows. Note that the observations below are specific to the MNIST data-set and only valid for MLPs.

- For MLPs, going from 2 hidden layers to 3 hidden layers still has benefits, i.e. the intermediate layers do not learn an identity function.
- In our tests, a drop-out at the input layer gives very slight benefits.
- Drop-out in the intermediate layers with A Priori Random Fixed Sparsity causes accuracy to deteriorate irrespective of the size of the hidden layer, or bit-width.
- ADAM and SGD as optimizers give very similar accuracy results.
- Increasing bit-width from 1 to 2 is beneficial, but not at the cost of neuron fan-in in terms of synapses.
- Increasing number of neurons in a hidden layer is beneficial up-to 1024, after which the utility diminishes.
- The last layer cannot have low per-neuron fan in. It causes serious detriments to accuracy of the network. For this reason unless mentioned otherwise; **all our experiments for MNIST report models with the final layer dense.**

To study how our Analytical LUT cost varies with accuracy, we trained an array of models on the MNIST dataset, all of which were 3 Layer MLPs. **Figure 7.1** Shows that there is consistently a lower bound in terms of LUTs for the same performance. It is worth noting that the Y axis is logarithmic,

| HL       | BW | X | LUTL1 | LUTL2 | LUTL3 | LUTL4  | LUTs   | Accuracy |
|----------|----|---|-------|-------|-------|--------|--------|----------|
| (512)×1  | 2  | 6 | 87k   | 43.9k | -     | -      | 131k   | 95.62    |
| (1024)×1 | 2  | 5 | 43k   | 87.7k | -     | -      | 130k   | 95.86    |
| (2048)×2 | 2  | 5 | 86k   | 175k  | -     | -      | 261k   | 94.77    |
| (512)×2  | 2  | 6 | 87k   | 87k   | 43.5k | -      | 217.6k | 96.58    |
| (1024)×2 | 2  | 5 | 43k   | 43k   | 87k   | -      | 173k   | 96.85    |
| (2048)×2 | 2  | 5 | 86k   | 86k   | 174k  | -      | 345.8k | 96.96    |
| (512)×3  | 2  | 6 | 87k   | 87k   | 87k   | 43.5k  | 304.6k | 96.53    |
| (1024)×3 | 2  | 5 | 43k   | 43k   | 43k   | 87k    | 215.9k | 96.92    |
| (2048)×3 | 2  | 5 | 86k   | 86k   | 86k   | 173.8k | 431.8k | 97.41    |

TABLE 7.1: This table describes the models in detail, it also gives the layerwise breakdown of LUTs (From the Analytical LUT Cost Model) and the accuracy. These models have been trained with Fixed A-Priori Sparsity, and is just for topology exploration.

which means that a poor choice of Neuron Fan-In with a sub-optimal Bit-Width and Connectivity configuration can be very detrimental.



FIGURE 7.1: Analytical LUT cost vs. Accuracy for a 3 Layer MLP on the MNIST Data-Set.



FIGURE 7.2: Accuracy vs. Bit-Width for a 3 Layer MLP on the MNIST Data-Set.

From Table 7.1, we gain further insights into the role of depth and width of the network for accuracy on MNIST. We see an upward trend in accuracy as we increase the depth of the neural network. Further, we can see the benefits of increasing the Bit-Width for a 3 Layer MLP on the MNIST data-set from Figure 7.2.

#### ► ITERATIVE AND LEARNED SPARSITY

In our tests, A-Priori Fixed Random Sparsity consistently gave poor accuracies when compared to Iterative and Learned Sparsity. This may be due to the fact

| Model | A-Priori Fixed Sparsity | Momentum Sparsity | Iterative Pruning |
|-------|-------------------------|-------------------|-------------------|
| A     | 97.32                   | 97.68             | <b>97.78</b>      |
| B     | 97.12                   | 97.39             | <b>97.75</b>      |
| C     | 96.58                   | 97.22             | <b>97.63</b>      |

| Model | No Skip | 1 Skip       | 2 Skips      |
|-------|---------|--------------|--------------|
| A     | 92.86   | 93.89        | <b>94.66</b> |
| B     | 94.47   | 95.03        | <b>95.18</b> |
| C     | 95.71   | <b>95.88</b> | 95.78        |
| D     | 94.67   | 95.24        | <b>95.51</b> |

TABLE 7.2: This table depicts the accuracies for models trained with A-Priori Fixed Sparsity, Momentum Sparsity and Iterative Pruning techniques.

TABLE 7.3: We observe the accuracy of 3 Layer MLPs with Skip Connections.

that the MNIST data-set is flattening an Image and thus has definite structure in the flattened input. Initializing random connectivity is thus detrimental, whereas iterative and learned sparsity methods are able to discover important connections. From [Table 7.2](#), we can be relatively confident that Iterative Pruning is the best strategy to train models with such sparsity. Momentum Sparsity gives better results than A-Priori Fixed Sparsity and requires almost the same amount of resources and time. Iterative Pruning takes about  $10\times$  longer to train, based on your pruning rates.

#### ► SKIP CONNECTIONS

Due to the nature of LogicNet, we can introduce 'Skip Connections' to a neural network. As long as the per neuron fan-in remains the same, the LUT cost remains the same. What influence this has on Place and Route and the complexity it introduces is beyond the scope of this thesis.

As evidenced by [Table 7.3](#), we see notable increase in accuracy with no overhead in LUT cost. Thus, this can be an important avenue to explore in the future.

#### ► CONVOLUTIONAL NEURAL NETWORKS

To get better performance, we need to use topologies which are tailored for images. An exploration of how such sparsity translated to computer vision tasks is very important. As discussed before, we use 'Sparse Depthwise Separable Convolution' to facilitate realistic mapping of convolutions to LUTs. We also use heavy quantization. [Table 7.4](#) Shows how accuracy is affected as we change the topology and its bit-precision and sparsity. We test the accuracy of 4 variants of the same topology. The 'FP' being the full precision model with vanilla convolutions, 'FP\_DW' being the full precision model but with Depthwise Separable Convolutions. 'FP\_X\_DW' further introduces sparsity in the depthwise separable convolution and finally 'QUANT\_X\_DW' quantizes the sparse depthwise separable convolutional model. [Table 7.4](#) seems to indicate that quantization causes the most notable degradation in accuracy.

#### ► CONVOLUTION SKIP CONNECTIONS

After testing skip connections on MLPs, it only made sense to explore what its scope is for convolutional neural networks. We present our findings similar to before, with 1 skip connection and 2 skip connections. The activation of

TABLE 7.4: Taking a specific model and observing its accuracy as we change the Pruning, Quantization and Convolution method.

| Model      | A     | B     | C     |
|------------|-------|-------|-------|
| FP         | 98.68 | 98.96 | 99.15 |
| FP_DW      | 98.34 | 97.7  | 98.6  |
| FP_X_DW    | 97.25 | 97.93 | 97.41 |
| QUANT_X_DW | 96.84 | 97.59 | 95.89 |

TABLE 7.5: This table describes the models in detail, it also gives the Analytical LUT Cost and the accuracy. These models have been trained with Fixed A-Priori Sparsity, and is just for topology exploration. BW Stands for Bit-Width, X stands for the ( $X_k$ ,  $X_s$ ).  $X_k$  is the kernel sparsity, and  $X_s$  is the sparsity of the pointwise operation in Depthwise Separable Convolution.

| Model | BW | X      | LUTs | Accuracy |
|-------|----|--------|------|----------|
| A     | 2  | (5,5)  | 145k | 95.87    |
| B     | 2  | (3,5)  | 345k | 97.47    |
| C     | 2  | (5, 4) | 769k | 97.59    |
| D     | 2  | (5, 6) | 420k | 97.57    |

the first convolutional layer and second convolutional layer is concatenated to the activation of the second convolutional layer and activation of the final convolutional layer respectively. From [Table 7.6](#), we see some accuracy improvements. We do not discuss these skip connections in too much detail as it is beyond the scope of this thesis. It is however a direction worth exploring.

TABLE 7.6: We observe the accuracy of a 3 Layer LogicNet style Convolutional Neural Network with Skip Connections.

| Model | No Skip | 1 Skip       | 2 Skips      |
|-------|---------|--------------|--------------|
| A     | 96.74   | 96.93        | <b>96.94</b> |
| B     | 97.08   | 97.56        | <b>97.57</b> |
| C     | 96.94   | <b>97.36</b> | 97.13        |

### Part III

#### CONCLUDING REMARKS



# 8

## Conclusion

---

The neurons in our brains do not *seem* to follow any given structure, rather they can be condensed to specialized graphs which are a far-cry from densely connected topologies we commonly use. We attempt to train extremely sparse structures, that can map directly to hardware building blocks in our FPGA. Much like we have taken jabs at trying to draw parallels between the neurons in our brains and the neurons in our machine learning framework, we take a jab at how to map neurons in these machine learning frameworks to hardware building blocks, we studied the interplay between quantization and sparsity. We developed a library to map hardware building blocks to neuron equivalents as well as used that library to synthesize neural networks. We begin by introducing basic concepts such as the rough architecture of an FPGA, and some of the basic layer types in a Neural Network. We also delve lightly into Quantized Neural Networks and Sparse Neural Networks. In Chapter 2, we hope to explain how neural networks are typically implemented on FPGAs. We use the HLS-RFNoC workflow as an implementation case study. We then delve into the Analytical LUT cost that we use as our pessimistic resource cost for the rest of the paper.

Chapter 3 focuses on the exact process of mapping neurons to aforementioned 'hardware building blocks'. We explore different ways of sparsification of neural networks and delve into interesting research questions on implementing sparsity in neural networks. Here, we learn that A-Priori Random Fixed sparsity is an effective method for exploring topologies but we need more powerful methods such as Sparse Momentum Learning or Iterative Pruning to get the best accuracy. We also draw parallels between Sparsification and Neural Architecture Search, which is an interesting way to unify different tangents of research on discovering performant neural wiring. We then move on to actually introducing to the reader the LogicNet design flow and particulars of the library itself. We explain the reasoning behind certain design decisions. Design automation is also discussed to some level, but in Chapter 5 we primarily focus on the components that are needed for design automation. We familiarize the reader with the truth tables we generate from a specific topology and delve in detail into how we implement our VERILOG code generator. We describe the modules and sub-structures that exist in the VERILOG code. This is followed by a very interesting study on how the resource cost after synthesis is much lower than the analytical

*"The desire to create is one of the deepest yearnings of the human soul."*  
—Dieter F. Uchtdorf

LUT cost model we had introduced previously. This opens up more costly design spaces we had not previously explored due to resource constraints. We also delve into how we can design heuristics for LUT Costs and Congestion Estimation and integrate that with the library in the future to discover topologies that push the gap between the analytical LUT estimate and the actual LUT cost further, and lends us with a less congested neural network. While we are unable to propose methods to integrate congestion estimation, we do give insight into how a future heuristic that aids LUT cost reduction could look like.

Chapter 6 focuses heavily on implementing LogicNet in a real world scenario. We choose Jet Substructure Classification as our target problem, as they have data-rates in excess of hundreds of terabytes per second and require triggers that have very low target latency and use custom hardware like FPGAs or ASICs. We are able to discover several topologies that give very good performance with a very low analytical LUT cost. Further, we gain insight into the important of SoftMax in the Jet Substructure Classification problem, and do some topology exploration. We also compare the accuracy of models pruned by different methods.

Chapter 7 introduces the well known data-set, MNIST. We provide the reader with insights into the behavior of LogicNet on this dataset for both MLPs and Convolutional Neural Networks. We study the interplay of Analytical LUT cost and accuracy, as well as the relation between accuracy and the bit-width. We study heterogeneous architectures with 'skip connections' for both MLPs and Convolution, and find some benefit to it. For MLPs, it is interesting to note that we gain benefits with skip connections without increase in hardware cost. We do not discuss the Place and Route effects of introducing skip connections in this thesis.

We believe that there are an array of problems which could find great use for the LogicNet library, and this thesis hopes to be of use when making architectural decisions. We hope to further continue this work and learn more about how to discover sparse topologies, and drive down the resource cost of such mapping of neural networks to an FPGA fabric.

## Bibliography

---

- [Abd+18] Ahmed M Abdelsalam, Ahmed Elsheikh, Jean-Pierre David, and JM Pierre Langlois. “POLYBiNN: a scalable and efficient combinatorial inference engine for neural networks on FPGA”. In: *2018 Conference on Design and Architectures for Signal and Image Processing (DASIP)*. IEEE. 2018, pp. 19–24 (cit. on p. 9).
- [Bel+17] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. *Deep Rewiring: Training very sparse deep networks*. 2017. arXiv: [1711.05136 \[cs.NE\]](https://arxiv.org/abs/1711.05136) (cit. on p. 14).
- [Cou+16] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1”. In: *arXiv preprint arXiv:1602.02830* (2016) (cit. on p. 9).
- [DZ19] Tim Dettmers and Luke Zettlemoyer. *Sparse Networks from Scratch: Faster Training without Losing Performance*. 2019. arXiv: [1907.04840 \[cs.LG\]](https://arxiv.org/abs/1907.04840) (cit. on p. 10).
- [Dua+18a] J. Duarte, S. Han, P. Harris, S. Jindariani, E. Kreinar, B. Kreis, J. Ngadiuba, M. Pierini, R. Rivera, N. Tran, and et al. “Fast inference of deep neural networks in FPGAs for particle physics”. In: *Journal of Instrumentation* 13.07 (2018), P07027–P07027. ISSN: 1748-0221. DOI: [10.1088/1748-0221/13/07/p07027](https://doi.org/10.1088/1748-0221/13/07/p07027). URL: <http://dx.doi.org/10.1088/1748-0221/13/07/p07027> (cit. on pp. 29, 31).
- [Dua+18b] Javier Duarte, Song Han, Philip Harris, Sergio Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, et al. “Fast inference of deep neural networks in FPGAs for particle physics”. In: *Journal of Instrumentation* 13.07 (2018), P07027 (cit. on p. v).
- [Evc+19] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. *Rigging the Lottery: Making All Tickets Winners*. 2019. arXiv: [1911.11134 \[cs.LG\]](https://arxiv.org/abs/1911.11134) (cit. on pp. 13, 14).
- [Gha+18] Mohammad Ghasemzadeh, Mohammad Samragh, and Farinaz Koushanfar. “ReBNet: Residual binarized neural network”. In: *2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. IEEE. 2018, pp. 57–64 (cit. on p. 6).
- [Han+15] Song Han, Huizi Mao, and William J Dally. “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding”. In: *arXiv preprint arXiv:1510.00149* (2015) (cit. on p. 9).
- [Kre19] EJ. Kreinar. *RFNoC HLS NeuralNet*. <https://github.com/Xilinx/RFNoC-HLS-NeuralNet>. 2019 (cit. on p. 7).
- [Liu+19] Shiwei Liu, Decebal Constantin Mocanu, Amarsagar Reddy Ramapuram Matavalam, Yulong Pei, and Mykola Pechenizkiy. *Sparse evolutionary Deep Learning with over one million artificial neurons on commodity hardware*. 2019. arXiv: [1901.09181 \[cs.NE\]](https://arxiv.org/abs/1901.09181) (cit. on p. 14).
- [Moc+18] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science”. In: *Nature Communications* 9.1 (2018). ISSN: 2041-1723. DOI: [10.1038/s41467-018-04316-3](https://doi.org/10.1038/s41467-018-04316-3). URL: <http://dx.doi.org/10.1038/s41467-018-04316-3> (cit. on p. 14).
- [MT19] Tadej Murovič and Andrej Trost. “Massively Parallel Combinational Binary Neural Networks for Edge Processing”. In: *Elektrotehniški Vestnik* 86.1/2 (2019), pp. 47–53 (cit. on p. 6).
- [Pn19] Alessandro Pappalardo and nickfraser. *Xilinx/brevitas: Pretrained 4b ProxylessNAS Mobile14 r0*. Version quant\_proxylessnas\_mobile14\_4b-r0. Nov. 2019. DOI: [10.5281/zenodo.3525102](https://doi.org/10.5281/zenodo.3525102). URL: <https://doi.org/10.5281/zenodo.3525102> (cit. on p. 12).
- [Pra+18] Ameya Prabhu, Girish Varma, and Anoop Namboodiri. “Deep expander networks: Efficient deep networks from graph theory”. In: *Proceedings of the European Conference on Computer Vision (ECCV)*. 2018, pp. 20–35 (cit. on pp. 6, 10).
- [RK19] Ryan A. Robinett and Jeremy Kepner. “RadiX-Net: Structured Sparse Matrices for Deep Neural Networks”. In: *CoRR* abs/1905.00416 (2019). arXiv: [1905.00416](https://arxiv.org/abs/1905.00416). URL: [http://arxiv.org/abs/1905.00416](https://arxiv.org/abs/1905.00416) (cit. on p. 6).
- [Sam+19] Ananda Samajdar, Tushar Garg, Tushar Krishna, and Nachiket Kapre. “Scaling the Cascades: Interconnect-Aware FPGA Implementation of Machine Learning Problems”. In: *2019 29th International Conference on Field Programmable Logic and Applications (FPL)* (2019), pp. 342–349 (cit. on p. 27).
- [SM02] Kenneth O Stanley and Risto Miikkulainen. “Evolving neural networks through augmenting topologies”. In: *Evolutionary computation* 10.2 (2002), pp. 99–127 (cit. on p. 6).

- [Swa+98] Jordan S. Swartz, Vaughn Betz, and Jonathan Rose. “A Fast Routability-driven Router for FPGAs”. In: *Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field Programmable Gate Arrays*. FPGA ’98. Monterey, California, USA: ACM, 1998, pp. 140–149. ISBN: 0-89791-978-5. DOI: [10.1145/275107.275134](https://doi.acm.org/10.1145/275107.275134). URL: <http://doi.acm.org/10.1145/275107.275134> (cit. on p. 27).
- [Tar+19] Chance Tarver, Alexios Balatsoukas-Stimming, and Joseph R. Cavallaro. *Design and Implementation of a Neural Network Based Predistorter for Enhanced Mobile Broadband*. 2019. arXiv: [1907.00766 \[eess.SP\]](https://arxiv.org/abs/1907.00766) (cit. on p. v).
- [Tei12] J. Teich. “Hardware/Software Codesign: The Past, the Present, and Predicting the Future”. In: *Proceedings of the IEEE 100.Special Centennial Issue* (2012), pp. 1411–1430. DOI: [10.1109/JPROC.2011.2182009](https://doi.ieeecomputersociety.org/10.1109/JPROC.2011.2182009) (cit. on p. 7).
- [Tra+19] Nhan Tran, Ben Kreis, Javier Duarte, vloncar, jngadiub, EJ Kreinar, thesp, Zhenbin Wu, and drankincms. *hls-fpga-machine-learning/hls4ml: v0.1.5*. Version v0.1.5. Aug. 2019. DOI: [10.5281/zenodo.3359214](https://doi.org/10.5281/zenodo.3359214). URL: <https://doi.org/10.5281/zenodo.3359214> (cit. on p. 7).
- [Umu18] Yaman Umuroglu. “Accelerating Sparse Linear Algebra and Deep Neural Networks on Reconfigurable Platforms”. In: (2018) (cit. on p. 7).
- [Umu+20a] Yaman Umuroglu, Yash Akhauri, Nicholas J. Fraser, and Michaela Blott. “High-Throughput DNN Inference with LogicNets”. In: IEEE. 2020 (cit. on p. v).
- [Umu+20b] Yaman Umuroglu, Yash Akhauri, Nicholas J. Fraser, and Michaela Blott. *LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications*. 2020. arXiv: [2004.03021 \[eess.SP\]](https://arxiv.org/abs/2004.03021) (cit. on p. v).
- [Wan+19] Erwei Wang, James J Davis, Peter YK Cheung, and George A Constantinides. “LUTNet: Rethinking Inference in FPGA Soft Logic”. In: *2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. IEEE. 2019, pp. 26–34 (cit. on pp. 6, 9).
- [Wor+19a] Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. “Discovering Neural Wirings”. In: *CoRR abs/1906.00586* (2019). arXiv: [1906.00586](https://arxiv.org/abs/1906.00586). URL: <http://arxiv.org/abs/1906.00586> (cit. on p. 6).
- [Wor+19b] Mitchell Wortsman, Ali Farhadi, and Mohammad Rastegari. “Discovering Neural Wirings”. In: *ArXiv abs/1906.00586* (2019) (cit. on p. 14).
- [Y. 17] Y. Umuroglu et al. “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference”. In: 2017. ISBN: 978-1-4503-4354-1. DOI: [10.1145/3020078.3021744](https://doi.acm.org/10.1145/3020078.3021744). URL: <http://doi.acm.org/10.1145/3020078.3021744> (cit. on pp. 6, 9).
- [Yea+07] David Yeager, Darius Chiu, and Guy Lemieux. “Congestion estimation and localization in FPGAs: a visual tool for interconnect prediction”. In: *SLIP*. 2007 (cit. on p. 27).